ML · 8 min read

Building Your First ML Pipeline

Learn how to build clean, reproducible ML pipelines using scikit-learn.

Sarah Chen
December 19, 2025


Stop writing messy ML code! Pipelines make your workflow clean, reproducible, and far less error-prone.

The Problem Without Pipelines

# Messy approach
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # Easy to forget!

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

Problems:

  • Easy to forget to transform test data
  • Messy variable names
  • Hard to do cross-validation correctly
  • Error-prone when deploying

The Solution: Pipelines!

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Clean approach
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# One line to train
pipeline.fit(X_train, y_train)

# One line to predict
predictions = pipeline.predict(X_test)

All transformations are applied automatically in order!
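
Once fitted, every step stays accessible by name through named_steps, which makes it easy to verify what each transformer learned. A minimal sketch, assuming the pipeline above has already been fitted:

# Inspect individual fitted steps by name
print(pipeline.named_steps['imputer'].statistics_)  # per-column medians learned on X_train
print(pipeline.named_steps['scaler'].mean_)         # per-feature means learned on X_train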

Creating Pipelines

Basic Pipeline

from sklearn.pipeline import Pipeline

# Each step is a (name, estimator) pair; every step except the last
# must be a transformer, and the final step can be any estimator
pipeline = Pipeline([
    ('step1_name', Transformer1()),  # placeholder transformer
    ('step2_name', Transformer2()),  # placeholder transformer
    ('model', Classifier())          # placeholder final estimator
])

Using make_pipeline (Auto-names)

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression()
)
# Steps named automatically: simpleimputer, standardscaler, logisticregression
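
Those auto-generated names matter later, for example when addressing nested parameters. A quick sketch:

print(list(pipeline.named_steps))  # ['simpleimputer', 'standardscaler', 'logisticregression']
pipeline.set_params(logisticregression__C=0.5)  # reference nested params via the auto name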

Handling Different Column Types

Real data has mixed types. Use ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define column groups
numerical_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender', 'education']

# Different preprocessing for different types
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numerical_features),
    
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
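
Since one-hot encoding expands each categorical column into several, it's worth checking what the preprocessor actually produced. A sketch, assuming scikit-learn 1.0+ (where ColumnTransformer gained get_feature_names_out):

# Names of the transformed columns, e.g. 'num__age', 'cat__city_NYC'
print(pipeline.named_steps['preprocessing'].get_feature_names_out())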

Cross-Validation with Pipelines

from sklearn.model_selection import cross_val_score

# Cross-validation is now SAFE
# Preprocessing is done correctly for each fold!
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} ± {scores.std():.3f}")

Without a pipeline, preprocessing might leak information between folds!
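
To make the leak concrete, compare fitting a scaler on all of X up front against letting the pipeline refit it inside each fold. A minimal sketch, assuming X here is fully numeric with no missing values:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Leaky: the scaler is fitted on ALL rows, so the validation folds
# influence the statistics used to transform the training folds
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safe: the scaler is refitted on only the training rows of each fold
safe_scores = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5)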

Hyperparameter Tuning with Pipelines

Access nested parameters with double underscore:

from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='liblinear'))  # liblinear supports both l1 and l2
])

# Parameter names: stepname__parameter
param_grid = {
    'model__C': [0.1, 1, 10],
    'model__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

Complete Example

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Identify column types
numerical = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical = X.select_dtypes(include=['object']).columns.tolist()

# Build preprocessing
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_transformer, numerical),
    ('cat', cat_transformer, categorical)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Scores: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train and evaluate
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")

Saving and Loading Pipelines

import joblib

# Save entire pipeline (preprocessing + model)
joblib.dump(pipeline, 'my_model_pipeline.pkl')

# Load later
loaded_pipeline = joblib.load('my_model_pipeline.pkl')

# Ready to predict! New data must contain the same columns the pipeline was trained on
new_data = pd.DataFrame({'age': [30], 'income': [50000], 'city': ['NYC']})
prediction = loaded_pipeline.predict(new_data)
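
One caveat: pickled pipelines are not guaranteed to load correctly across scikit-learn versions, so it's worth recording the version used for training:

import sklearn
print(sklearn.__version__)  # record this alongside the saved .pkl file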

Pipeline Best Practices

  1. Include ALL preprocessing in the pipeline
  2. Use ColumnTransformer for mixed data types
  3. Use cross_val_score for reliable evaluation
  4. Save the whole pipeline for deployment
  5. Name your steps clearly
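
Clear step names also pay off when inspecting pipelines visually. In a notebook, scikit-learn (0.23+) can render the pipeline as a diagram; a minimal sketch:

from sklearn import set_config
set_config(display='diagram')

pipeline  # displaying the object in a notebook cell renders the diagram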

Benefits Summary

  • No data leakage: preprocessing is fitted only on training data
  • Reproducible: same steps every time
  • Easy deployment: save/load the complete workflow
  • Clean code: no messy intermediate variables
  • Safe CV: each fold is preprocessed correctly

Key Takeaway

Pipelines aren't optional—they're essential for professional ML work. They prevent bugs, ensure reproducibility, and make your code cleaner.

Start every project with a pipeline. You'll thank yourself later!

Tags: Machine Learning, Pipeline, Scikit-learn, Beginner