# Building Your First ML Pipeline
Learn how to build clean, reproducible ML pipelines using scikit-learn.
Stop writing messy ML code! Pipelines make your workflow clean, reproducible, and far less error-prone.
## The Problem Without Pipelines

```python
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Messy approach: every step is fitted and applied by hand
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Easy to forget!

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)

model = LogisticRegression()
model.fit(X_train_imputed, y_train)
predictions = model.predict(X_test_imputed)
```

Problems:

- Easy to forget to transform test data
- Messy variable names
- Hard to do cross-validation correctly
- Error-prone when deploying
## The Solution: Pipelines!

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Clean approach
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# One line to train
pipeline.fit(X_train, y_train)

# One line to predict
predictions = pipeline.predict(X_test)
```
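All transformations are applied automatically, in order. Calling `fit` fits each transformer on the training data and passes its output to the next step. Calling `predict` reuses those fitted transformers on the new data before the model predicts, so the test set is never forgotten or refitted.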
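To make the "applied automatically in order" claim concrete, here is a minimal sketch that compares a pipeline against the same steps chained by hand. The data is synthetic and purely illustrative (the array shapes, the seed, and the rule used to build `y` are assumptions, not part of the original example).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 100 rows, 3 features, some missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.integers(0, 100, size=10), 2] = np.nan  # knock out a few entries
X_train, X_test = X[:80], X[80:]
y_train = y[:80]

# Pipeline version: one fit, one predict.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_preds = pipeline.predict(X_test)

# Manual version: the same steps, fitted and applied by hand in the same order.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
model = LogisticRegression()
X_train_manual = scaler.fit_transform(imputer.fit_transform(X_train))
model.fit(X_train_manual, y_train)
manual_preds = model.predict(scaler.transform(imputer.transform(X_test)))

# Both routes should produce identical predictions.
predictions_match = np.array_equal(pipeline_preds, manual_preds)
print(predictions_match)
```

The pipeline does nothing magical: it runs the same `fit_transform` and `transform` calls, but it cannot skip or reorder them.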
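## Creating Pipelines

### Basic Pipeline

```python
from sklearn.pipeline import Pipeline

# Transformer1, Transformer2, and Classifier are placeholders:
# substitute real estimators such as StandardScaler or LogisticRegression.
pipeline = Pipeline([
    ('step1_name', Transformer1()),
    ('step2_name', Transformer2()),
    ('model', Classifier())
])
```

### Using make_pipeline (Auto-names)

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression()
)
# Steps named automatically: simpleimputer, standardscaler, logisticregression
```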
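The auto-generated names are the lowercased class names, and they matter because you use them to look up steps and to address parameters. A short sketch of how you might inspect them:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())

# pipeline.steps is a list of (name, estimator) tuples.
step_names = [name for name, _ in pipeline.steps]
print(step_names)  # ['simpleimputer', 'standardscaler', 'logisticregression']

# Individual steps can be retrieved by name.
scaler = pipeline.named_steps["standardscaler"]
print(type(scaler).__name__)  # StandardScaler
```

Explicit names via `Pipeline([...])` are usually easier to read in parameter grids, while `make_pipeline` is convenient for quick experiments.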
## Handling Different Column Types

Real data has mixed types. Use **ColumnTransformer**:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define column groups
numerical_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender', 'education']

# Different preprocessing for different types
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numerical_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```
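To see what the preprocessor actually produces, it can help to run it on a tiny table. The sketch below uses a made-up four-row DataFrame (the values and the reduced column set are assumptions for illustration only) and checks the shape of the transformed output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in every column.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51],
    "income": [40000, 52000, 61000, np.nan],
    "city": ["NYC", "LA", np.nan, "NYC"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

features = preprocessor.fit_transform(df)

# 2 scaled numeric columns + 3 one-hot columns (LA, NYC, missing).
print(features.shape)  # (4, 5)
```

The missing city becomes its own `missing` category rather than being dropped, which is why the categorical branch contributes three columns instead of two.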
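## Cross-Validation with Pipelines

```python
from sklearn.model_selection import cross_val_score

# Cross-validation is now SAFE
# Preprocessing is done correctly for each fold!
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} ± {scores.std():.3f}")
```

Without a pipeline, it is easy to fit a scaler or imputer on the full dataset before splitting, which leaks statistics from the validation folds into training and can inflate the scores. With a pipeline, every preprocessing step is refitted on the training portion of each fold.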
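One way to convince yourself that preprocessing is refitted per fold is to keep the fitted pipelines and inspect them. The sketch below is an illustration on synthetic data (the data, seed, and labelling rule are assumptions); it uses `cross_validate` with `return_estimator=True` and compares each fold's scaler mean against the mean of the full dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 200 rows, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
y = (X[:, 0] > 5.0).astype(int)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# return_estimator=True keeps the pipeline fitted on each training fold.
results = cross_validate(pipeline, X, y, cv=5, return_estimator=True)

full_data_mean = X.mean(axis=0)
fold_means = [est.named_steps["scaler"].mean_ for est in results["estimator"]]

print("Mean of full dataset:", full_data_mean)
for i, fold_mean in enumerate(fold_means):
    # Each scaler saw only that fold's training rows, so its mean differs.
    print(f"Fold {i} scaler mean:", fold_mean)
```

If the scaler had been fitted once on all of `X` before cross-validation, every fold would share the full-data mean, and the held-out rows would have influenced the preprocessing.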
## Hyperparameter Tuning with Pipelines

Access nested parameters with a double underscore:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # liblinear supports both 'l1' and 'l2';
    # the default lbfgs solver does not support 'l1'.
    ('model', LogisticRegression(solver='liblinear'))
])

# Parameter names: stepname__parameter
param_grid = {
    'model__C': [0.1, 1, 10],
    'model__penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
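If you are unsure which `stepname__parameter` names are available, the pipeline can list them for you. A small sketch using `get_params` and `set_params`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# get_params() returns every addressable name, including nested ones.
model_params = sorted(p for p in pipeline.get_params() if p.startswith("model__"))
print(model_params)

# The same names work with set_params, outside of any grid search.
pipeline.set_params(model__C=0.5, scaler__with_mean=False)
print(pipeline.named_steps["model"].C)              # 0.5
print(pipeline.named_steps["scaler"].with_mean)     # False
```

The exact list printed depends on your scikit-learn version, so treat it as a lookup tool rather than a fixed reference. A typo in a parameter name raises an error instead of being silently ignored.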
## Complete Example

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Identify column types
numerical = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical = X.select_dtypes(include=['object']).columns.tolist()

# Build preprocessing
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_transformer, numerical),
    ('cat', cat_transformer, categorical)
])

# Full pipeline (random_state fixed so reruns give the same forest)
pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split data (random_state fixed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Scores: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train and evaluate
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")
```
## Saving and Loading Pipelines

```python
import joblib
import pandas as pd

# Save entire pipeline (preprocessing + model)
joblib.dump(pipeline, 'my_model_pipeline.pkl')

# Load later
loaded_pipeline = joblib.load('my_model_pipeline.pkl')

# Ready to predict!
# new_data must contain every column the pipeline was fitted on;
# the three columns here are illustrative.
new_data = pd.DataFrame({'age': [30], 'income': [50000], 'city': ['NYC']})
prediction = loaded_pipeline.predict(new_data)
```
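A quick sanity check after saving is to reload the file and confirm that the loaded pipeline predicts exactly what the original does. The sketch below is a self-contained illustration on synthetic data, writing to a temporary directory (the data, the file name `pipeline.joblib`, and the two-step pipeline are assumptions for the example).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 50 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Round trip through disk inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The reloaded pipeline carries the fitted scaler and model with it.
same_predictions = np.array_equal(pipeline.predict(X), loaded_pipeline.predict(X))
print(same_predictions)
```

Two practical cautions: joblib files are pickle-based, so only load files you trust, and load them with the same scikit-learn version that saved them where possible.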
## Pipeline Best Practices

1. **Include ALL preprocessing** in the pipeline
2. **Use ColumnTransformer** for mixed data types
3. **Use cross_val_score** for reliable evaluation
4. **Save the whole pipeline** for deployment
5. **Name your steps** clearly
## Benefits Summary

| Benefit | Why It Matters |
|---------|---------------|
| No data leakage | Preprocessing fitted only on training data |
| Reproducible | Same steps every time |
| Easy deployment | Save/load complete workflow |
| Clean code | No messy intermediate variables |
| Safe CV | Each fold preprocessed correctly |
## Key Takeaway

Pipelines aren't optional; they're essential for professional ML work. They prevent a whole class of preprocessing bugs, make results reproducible, and keep your code cleaner.
Start every project with a pipeline. You'll thank yourself later!