# ML in Production: From Notebook to Deployment
Learn the essentials of deploying ML models to production, from saving models to serving predictions.
A model in a notebook is useless. Let's get it serving real predictions.
## The Production Pipeline
```
Training Pipeline:  Data → Preprocess → Train → Evaluate → Save Model

Inference Pipeline: Request → Load Model → Preprocess → Predict → Response
```
## Step 1: Save Your Model
```python
import joblib
import pickle

# Method 1: joblib (recommended for sklearn)
joblib.dump(model, 'model.joblib')
model = joblib.load('model.joblib')

# Method 2: pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save preprocessing too!
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(encoder, 'encoder.joblib')
```
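One gotcha with pickled models: they're tied to the library version that created them, and loading with a different scikit-learn version can fail or silently change behavior. A minimal sketch of one safeguard, bundling version metadata with the model (the bundle filename and dict keys are illustrative, not a standard convention):

```python
import joblib
import sklearn

# Sketch: save the model together with the sklearn version that produced it
artifact = {
    "model": model,
    "sklearn_version": sklearn.__version__,
}
joblib.dump(artifact, 'model_bundle.joblib')

# At load time, detect a mismatch before serving any predictions
bundle = joblib.load('model_bundle.joblib')
if bundle["sklearn_version"] != sklearn.__version__:
    print(f"Warning: trained with sklearn {bundle['sklearn_version']}, "
          f"serving with {sklearn.__version__}")
```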
## Step 2: Create Inference Pipeline
Keep preprocessing consistent:
```python
import joblib

class ModelPipeline:
    def __init__(self, model_path, scaler_path):
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path)

    def preprocess(self, data):
        # Same preprocessing as training
        return self.scaler.transform(data)

    def predict(self, data):
        processed = self.preprocess(data)
        return self.model.predict(processed)

    def predict_proba(self, data):
        processed = self.preprocess(data)
        return self.model.predict_proba(processed)
```
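A quick usage sketch, assuming the artifacts from Step 1 and a model trained on four features (the sample values are placeholders):

```python
import numpy as np

# Illustrative: load the saved artifacts and score a single row
pipeline = ModelPipeline('model.joblib', 'scaler.joblib')
sample = np.array([[5.1, 3.5, 1.4, 0.2]])  # shape (1, n_features)
print(pipeline.predict(sample))        # e.g. array([0])
print(pipeline.predict_proba(sample))  # per-class probabilities
```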
## Step 3: Create API Endpoint
Using FastAPI (recommended):
```python
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()

# Load model at startup
pipeline = ModelPipeline('model.joblib', 'scaler.joblib')

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability),
    )

# Health check
@app.get("/health")
def health():
    return {"status": "healthy"}
```
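Start the server with `uvicorn app:app --reload`, then smoke-test the endpoint. A minimal client sketch using the `requests` library (the feature values are placeholders):

```python
import requests

# Hypothetical smoke test against a locally running server
resp = requests.post(
    "http://localhost:8000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"prediction": 0, "probability": 0.97}

# The health check should answer too
print(requests.get("http://localhost:8000/health").json())
```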
## Step 4: Containerize with Docker
```dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.joblib .
COPY scaler.joblib .
COPY app.py .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
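The Dockerfile copies a `requirements.txt` that isn't shown above. For this app it would list at least the serving stack; a sketch (pin each package to the exact versions from your training environment so the container matches):

```text
fastapi
uvicorn
scikit-learn
joblib
numpy
```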
```bash
# Build and run
docker build -t ml-model .
docker run -p 8000:8000 ml-model
```
## Step 5: Monitor Your Model
```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.post("/predict")
def predict(request: PredictionRequest):
    start_time = datetime.now()
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()

    # Log for monitoring
    latency = (datetime.now() - start_time).total_seconds()
    logger.info(
        f"Prediction: {prediction}, Prob: {probability:.3f}, Latency: {latency:.3f}s"
    )

    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability),
    )
```
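If you'd rather not add timing code to every endpoint by hand, FastAPI (via Starlette) supports HTTP middleware. A sketch that logs latency for all routes at once:

```python
import time
from fastapi import Request

@app.middleware("http")
async def log_request_latency(request: Request, call_next):
    # Times every route, not just /predict
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info(f"{request.method} {request.url.path} took {elapsed_ms:.1f}ms")
    return response
```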
## Common Production Issues
### 1. Training-Serving Skew
**Problem:** Preprocessing differs between training and serving.
**Solution:** Use the exact same preprocessing code:
```python
# Save the entire pipeline
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier()),
])
full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'full_pipeline.joblib')
```
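With the whole pipeline in one artifact, the serving side no longer needs a separate scaler file. A sketch of the simplified serving logic (`predict_row` is an illustrative helper, not part of sklearn):

```python
import joblib
import numpy as np

# Load once at startup; preprocessing now travels with the model
full_pipeline = joblib.load('full_pipeline.joblib')

def predict_row(features):
    # The Pipeline applies StandardScaler internally before the classifier
    X = np.array(features).reshape(1, -1)
    return int(full_pipeline.predict(X)[0])
```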
### 2. Model Drift
**Problem:** Model performance degrades over time.
**Solution:** Monitor predictions and retrain:
```python
# Track prediction distribution
def log_prediction_stats(predictions):
    logger.info(f"Mean: {np.mean(predictions):.3f}")
    logger.info(f"Std: {np.std(predictions):.3f}")
    logger.info(f"Class distribution: {np.bincount(predictions)}")
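Logging summary stats is a start; to actually flag drift, compare the live distribution of a feature (or of the model's scores) against a training-time reference sample. A sketch using SciPy's two-sample Kolmogorov-Smirnov test (the `alpha` threshold is an assumption to tune, and `logger` is the one set up in Step 5):

```python
from scipy.stats import ks_2samp

def check_drift(reference, live, alpha=0.01):
    # reference: a sample of the quantity from training data
    # live: the same quantity collected from recent production traffic
    statistic, p_value = ks_2samp(reference, live)
    if p_value < alpha:
        logger.warning(f"Possible drift: KS={statistic:.3f}, p={p_value:.4f}")
    return p_value < alpha
```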
### 3. Latency Issues
**Problem:** Predictions are too slow for the request path.

**Solution:** Measure first, then optimize or switch to a lighter model:
```python
# Measure latency
import time

def benchmark_model(model, X_test, n_runs=100):
    times = []
    for _ in range(n_runs):
        start = time.time()
        model.predict(X_test[:1])
        times.append(time.time() - start)
    print(f"Mean latency: {np.mean(times)*1000:.2f}ms")
    print(f"P99 latency: {np.percentile(times, 99)*1000:.2f}ms")
```
## Production Checklist
- [ ] Model and preprocessing saved together
- [ ] API endpoint tested
- [ ] Input validation
- [ ] Error handling (see the sketch below)
- [ ] Health check endpoint
- [ ] Logging for monitoring
- [ ] Docker containerization
- [ ] Load testing done
- [ ] Rollback plan ready
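Two of those items are worth illustrating together. Pydantic already rejects malformed types; a hedged sketch on top of the Step 3 endpoint that adds a shape check and converts model failures into clean HTTP errors (the feature count `N_FEATURES` is an assumption you'd match to your training schema):

```python
from fastapi import HTTPException

N_FEATURES = 4  # assumption: must match the model's training schema

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # Input validation beyond type checking
    if len(request.features) != N_FEATURES:
        raise HTTPException(
            status_code=422,
            detail=f"Expected {N_FEATURES} features, got {len(request.features)}",
        )
    # Error handling: never let a model exception leak a stack trace
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = pipeline.predict(features)[0]
        probability = pipeline.predict_proba(features)[0].max()
    except Exception:
        logger.exception("Prediction failed")
        raise HTTPException(status_code=500, detail="Prediction failed")
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability),
    )
```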
## Key Takeaway
Production ML is about consistency and reliability. Save models with their preprocessing, create clean APIs, containerize for portability, and monitor everything. The best model is worthless if it can't serve predictions reliably. Start simple, add complexity only when needed!