
# Building Your First ML Pipeline

Learn how to build clean, reproducible ML pipelines using scikit-learn.

Sarah Chen
December 19, 2025


Stop writing messy ML code! Pipelines make your workflow clean, reproducible, and far less error-prone.

## The Problem Without Pipelines

```python
# Messy approach
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # Easy to forget!

# Note: impute before scaling -- StandardScaler raises on missing values
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
```

Problems:

- Easy to forget to transform test data
- Messy variable names
- Hard to do cross-validation correctly
- Error-prone when deploying

## The Solution: Pipelines!

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Clean approach
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# One line to train
pipeline.fit(X_train, y_train)

# One line to predict
predictions = pipeline.predict(X_test)
```

All transformations are applied automatically in order!
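
Once fitted, the pipeline also keeps every fitted step around for inspection. A minimal sketch, reusing the `pipeline`, `X_train`, and `X_test` from above:

```python
# Access any fitted step by the name you gave it
print(pipeline.named_steps['scaler'].mean_)  # per-feature means learned on X_train

# Pipelines support slicing: everything except the final model acts as a
# transformer, which is handy for debugging preprocessing
X_test_prepared = pipeline[:-1].transform(X_test)
```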

## Creating Pipelines

### Basic Pipeline

```python
from sklearn.pipeline import Pipeline

# Template: replace the placeholder transformers and classifier
# with real estimators
pipeline = Pipeline([
    ('step1_name', Transformer1()),
    ('step2_name', Transformer2()),
    ('model', Classifier())
])
```

### Using make_pipeline (Auto-names)

```python
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression()
)
# Steps named automatically: simpleimputer, standardscaler, logisticregression
```
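
Those auto-generated names aren't just cosmetic: they're what you reference when tuning nested parameters (more on this below). A quick sketch:

```python
# The step names are the lowercased class names
print(list(pipeline.named_steps))
# ['simpleimputer', 'standardscaler', 'logisticregression']

# so a grid-search parameter for this pipeline would look like:
param_grid = {'logisticregression__C': [0.1, 1, 10]}
```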

## Handling Different Column Types

Real data has mixed types. Use **ColumnTransformer**:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column groups
numerical_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender', 'education']

# Different preprocessing for different types
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numerical_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```
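
If you want to see exactly which columns the preprocessor produces (the one-hot encoder expands each category into its own column), recent scikit-learn versions (1.1+) expose `get_feature_names_out()`. A small sketch against the fitted pipeline above:

```python
# Inspect the generated feature names after fitting
feature_names = pipeline.named_steps['preprocessing'].get_feature_names_out()
print(feature_names[:4])
# e.g. ['num__age', 'num__income', 'num__score', 'cat__city_NYC']
# (the actual names depend on your data)
```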

## Cross-Validation with Pipelines

```python
from sklearn.model_selection import cross_val_score

# Cross-validation is now SAFE
# Preprocessing is done correctly for each fold!
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} ± {scores.std():.3f}")
```

Without a pipeline, preprocessing might leak information between folds!
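
To make the leak concrete, here is a minimal sketch of the two approaches side by side, assuming `X` is all-numeric so a bare `StandardScaler` applies:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# LEAKY: the scaler is fitted on ALL rows, so statistics from the
# validation folds influence how the training folds are scaled
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# SAFE: cross_val_score clones the pipeline and refits the scaler
# on the training portion of each fold only
safe_scores = cross_val_score(pipeline, X, y, cv=5)
```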

## Hyperparameter Tuning with Pipelines

Access nested parameters with double underscore:

```python
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # liblinear supports both the l1 and l2 penalties searched below
    # (the default lbfgs solver does not support l1)
    ('model', LogisticRegression(solver='liblinear'))
])

# Parameter names: stepname__parameter
param_grid = {
    'model__C': [0.1, 1, 10],
    'model__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```

## Complete Example

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Identify column types
numerical = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical = X.select_dtypes(include=['object']).columns.tolist()

# Build preprocessing
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_transformer, numerical),
    ('cat', cat_transformer, categorical)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100))
])

# Split data (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train and evaluate
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")
```
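
One optional extension: a single accuracy number hides per-class behavior, so it can be worth printing a classification report on the held-out set as well:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```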

## Saving and Loading Pipelines

```python
import joblib

# Save entire pipeline (preprocessing + model)
joblib.dump(pipeline, 'my_model_pipeline.pkl')

# Load later
loaded_pipeline = joblib.load('my_model_pipeline.pkl')

# Ready to predict! New data must have the same columns the
# pipeline was trained on.
new_data = pd.DataFrame({'age': [30], 'income': [50000], 'city': ['NYC']})
prediction = loaded_pipeline.predict(new_data)
```
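
One caveat: pickled pipelines aren't guaranteed to load across scikit-learn versions. A minimal sketch that stores the version alongside the model so you can at least detect a mismatch:

```python
import joblib
import sklearn

# Bundle the pipeline with the version that produced it
joblib.dump(
    {'pipeline': pipeline, 'sklearn_version': sklearn.__version__},
    'my_model_pipeline.pkl'
)

artifact = joblib.load('my_model_pipeline.pkl')
if artifact['sklearn_version'] != sklearn.__version__:
    print(f"Warning: saved with scikit-learn {artifact['sklearn_version']}, "
          f"running {sklearn.__version__}")
loaded_pipeline = artifact['pipeline']
```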

## Pipeline Best Practices

1. **Include ALL preprocessing** in the pipeline
2. **Use ColumnTransformer** for mixed data types
3. **Use cross_val_score** for reliable evaluation
4. **Save the whole pipeline** for deployment
5. **Name your steps** clearly

## Benefits Summary

| Benefit | Why It Matters |
|---------|----------------|
| No data leakage | Preprocessing fitted only on training data |
| Reproducible | Same steps every time |
| Easy deployment | Save/load complete workflow |
| Clean code | No messy intermediate variables |
| Safe CV | Each fold preprocessed correctly |

## Key Takeaway

Pipelines aren't optional—they're essential for professional ML work. They prevent bugs, ensure reproducibility, and make your code cleaner.

Start every project with a pipeline. You'll thank yourself later!

Tags: Machine Learning, Pipeline, Scikit-learn, Beginner