ML · 8 min read
Building Your First ML Pipeline
Learn how to build clean, reproducible ML pipelines using scikit-learn.
Sarah Chen
December 19, 2025
Stop writing messy ML code! Pipelines make your workflow clean, reproducible, and bug-free.
The Problem Without Pipelines
# Messy approach
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # Easy to forget!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)  # Easy to forget!
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
Problems:
- Easy to forget to transform test data
- Messy variable names
- Hard to do cross-validation correctly
- Error-prone when deploying
The Solution: Pipelines!
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# Clean approach
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
# One line to train
pipeline.fit(X_train, y_train)
# One line to predict
predictions = pipeline.predict(X_test)
All transformations are applied automatically in order!
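The fitted transformers also stay accessible by name, which is handy for debugging. A minimal sketch of inspecting what each step learned, assuming the pipeline above has already been fit:
# Inspect what each step learned during fit
print(pipeline.named_steps['imputer'].statistics_)  # per-column fill values
print(pipeline.named_steps['scaler'].mean_)         # per-column means used for scaling
print(pipeline.named_steps['model'].coef_)          # coefficients of the fitted LogisticRegression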
Creating Pipelines
Basic Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('step1_name', Transformer1()),
    ('step2_name', Transformer2()),
    ('model', Classifier())
])
Using make_pipeline (automatic step names)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression()
)
# Steps named automatically: simpleimputer, standardscaler, logisticregression
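Those generated names are what you use when setting nested parameters later on. A quick sketch, assuming the make_pipeline object above:
# The auto-generated step names double as parameter prefixes
print(list(pipeline.named_steps))               # ['simpleimputer', 'standardscaler', 'logisticregression']
pipeline.set_params(logisticregression__C=0.5)  # tweak the model through the pipeline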
Handling Different Column Types
Real data has mixed types. Use ColumnTransformer:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define column groups
numerical_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender', 'education']
# Different preprocessing for different types
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numerical_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])
# Full pipeline
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
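After fitting, it's worth checking what the preprocessor actually produces, since one-hot encoding expands each categorical column into several. A small sketch, assuming a recent scikit-learn (roughly 1.1+) where every step supports get_feature_names_out:
# Columns coming out of the preprocessor (names shown are illustrative)
feature_names = pipeline.named_steps['preprocessing'].get_feature_names_out()
print(len(feature_names))   # numerical columns + one column per category level
print(feature_names[:5])    # e.g. ['num__age', 'num__income', 'num__score', 'cat__city_NYC', ...]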
Cross-Validation with Pipelines
from sklearn.model_selection import cross_val_score
# Cross-validation is now SAFE:
# preprocessing is fitted only on the training portion of each fold!
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} ± {scores.std():.3f}")
Without a pipeline, preprocessing might leak information between folds!
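To make the difference concrete, here is a sketch of the leaky version next to the safe one, assuming a purely numerical X for simplicity: the leaky version fits the scaler on all of X up front, so the validation folds have already influenced the scaling statistics.
# Leaky: the scaler has already seen every fold before cross-validation starts
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
# Safe: the scaler is re-fitted on the training portion of each fold
safe_pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
safe_scores = cross_val_score(safe_pipeline, X, y, cv=5)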
Hyperparameter Tuning with Pipelines
Access nested parameters with double underscore:
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='liblinear'))  # liblinear supports both l1 and l2 penalties
])
# Parameter names: stepname__parameter
param_grid = {
    'model__C': [0.1, 1, 10],
    'model__penalty': ['l1', 'l2']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
Complete Example
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
# Identify column types
numerical = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical = X.select_dtypes(include=['object']).columns.tolist()
# Build preprocessing
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', num_transformer, numerical),
    ('cat', cat_transformer, categorical)
])
# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100))
])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Train and evaluate
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")
Saving and Loading Pipelines
import joblib
# Save entire pipeline (preprocessing + model)
joblib.dump(pipeline, 'my_model_pipeline.pkl')
# Load later
loaded_pipeline = joblib.load('my_model_pipeline.pkl')
# Ready to predict! New data must contain the same columns the pipeline was trained on.
new_data = pd.DataFrame({'age': [30], 'income': [50000], 'city': ['NYC']})  # ...plus any other training columns
prediction = loaded_pipeline.predict(new_data)
Pipeline Best Practices
- Include ALL preprocessing in the pipeline
- Use ColumnTransformer for mixed data types
- Use cross_val_score for reliable evaluation
- Save the whole pipeline for deployment
- Name your steps clearly
Benefits Summary
| Benefit | Why It Matters |
|---|---|
| No data leakage | Preprocessing fitted only on training data |
| Reproducible | Same steps every time |
| Easy deployment | Save/load complete workflow |
| Clean code | No messy intermediate variables |
| Safe CV | Each fold preprocessed correctly |
Key Takeaway
Pipelines aren't optional—they're essential for professional ML work. They prevent bugs, ensure reproducibility, and make your code cleaner.
Start every project with a pipeline. You'll thank yourself later!
#Machine Learning #Pipeline #Scikit-learn #Beginner