# ML in Production: From Notebook to Deployment

Learn the essentials of deploying ML models to production, from saving models to serving predictions.

Sarah Chen
December 19, 2025

A model in a notebook is useless. Let's get it serving real predictions.

## The Production Pipeline

```
Training Pipeline:  Data → Preprocess → Train → Evaluate → Save Model

Inference Pipeline: Request → Load Model → Preprocess → Predict → Response
```

## Step 1: Save Your Model

```python
import joblib
import pickle

# Method 1: joblib (recommended for sklearn)
joblib.dump(model, 'model.joblib')
model = joblib.load('model.joblib')

# Method 2: pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save preprocessing too!
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(encoder, 'encoder.joblib')
```
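
One caveat: pickled models are generally only guaranteed to load under the same library versions they were trained with. A small sketch of recording those versions alongside the artifacts (the `metadata.json` filename is just an illustrative choice):

```python
import json
import sklearn
import joblib

# Record the training-time versions so the serving
# environment can be pinned to match.
metadata = {
    "sklearn_version": sklearn.__version__,
    "joblib_version": joblib.__version__,
}
with open("metadata.json", "w") as f:
    json.dump(metadata, f)
```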

## Step 2: Create an Inference Pipeline

Keep preprocessing consistent:

```python
import joblib

class ModelPipeline:
    def __init__(self, model_path, scaler_path):
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path)

    def preprocess(self, data):
        # Same preprocessing as training
        return self.scaler.transform(data)

    def predict(self, data):
        processed = self.preprocess(data)
        return self.model.predict(processed)

    def predict_proba(self, data):
        processed = self.preprocess(data)
        return self.model.predict_proba(processed)
```
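
For reference, a quick usage sketch at serving time, assuming the artifacts from Step 1 are on disk (the four feature values are placeholders for your model's actual input):

```python
import numpy as np

pipeline = ModelPipeline('model.joblib', 'scaler.joblib')

# One sample, same feature order as training
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
print(pipeline.predict(sample))        # e.g. array([0])
print(pipeline.predict_proba(sample))  # e.g. array([[0.97, 0.03]])
```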

## Step 3: Create an API Endpoint

Using FastAPI (recommended):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()

# Load model at startup (ModelPipeline is defined in Step 2)
pipeline = ModelPipeline('model.joblib', 'scaler.joblib')

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability)
    )

# Health check
@app.get("/health")
def health():
    return {"status": "healthy"}
```
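
To sanity-check the endpoint, hit it with any HTTP client. A sketch using `requests` (the feature values are placeholders):

```python
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
)
print(resp.status_code)  # 200
print(resp.json())       # e.g. {"prediction": 0, "probability": 0.97}
```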

## Step 4: Containerize with Docker

```dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.joblib .
COPY scaler.joblib .
COPY app.py .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

```bash
# Build and run
docker build -t ml-model .
docker run -p 8000:8000 ml-model
```

## Step 5: Monitor Your Model

```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Extends the /predict endpoint from Step 3 with latency logging
@app.post("/predict")
def predict(request: PredictionRequest):
    start_time = datetime.now()
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()

    # Log for monitoring
    latency = (datetime.now() - start_time).total_seconds()
    logger.info(f"Prediction: {prediction}, Prob: {probability:.3f}, Latency: {latency:.3f}s")

    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability)
    )
```

## Common Production Issues

### 1. Training-Serving Skew

**Problem:** Preprocessing differs between training and serving.

**Solution:** Use the exact same preprocessing code:

```python
# Save the entire pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'full_pipeline.joblib')
```
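
At serving time you then load and call a single artifact, and the scaler is applied automatically:

```python
full_pipeline = joblib.load('full_pipeline.joblib')
predictions = full_pipeline.predict(X_new)  # raw, unscaled features
```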

### 2. Model Drift

**Problem:** Model performance degrades over time.

**Solution:** Monitor predictions and retrain:

```python
# Track prediction distribution
def log_prediction_stats(predictions):
    logger.info(f"Mean: {np.mean(predictions):.3f}")
    logger.info(f"Std: {np.std(predictions):.3f}")
    logger.info(f"Class distribution: {np.bincount(predictions)}")
```
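
To turn those logs into an alert, one simple approach is to compare live statistics against a baseline captured at training time. A minimal sketch, assuming you stored the training-set prediction mean and std (the 3-sigma threshold is an arbitrary starting point):

```python
import logging
import numpy as np

logger = logging.getLogger(__name__)

def check_drift(predictions, baseline_mean, baseline_std, threshold=3.0):
    """Warn if the live prediction mean strays too far from the training baseline."""
    live_mean = np.mean(predictions)
    z = abs(live_mean - baseline_mean) / (baseline_std + 1e-9)
    if z > threshold:
        logger.warning(
            f"Possible drift: live mean {live_mean:.3f} vs baseline {baseline_mean:.3f}"
        )
    return z
```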

### 3. Latency Issues

**Problem:** Predictions are too slow for your latency budget.

**Solution:** Measure first, then optimize or use a lighter model:

```python
# Measure latency
import time
import numpy as np

def benchmark_model(model, X_test, n_runs=100):
    times = []
    for _ in range(n_runs):
        start = time.time()
        model.predict(X_test[:1])
        times.append(time.time() - start)
    print(f"Mean latency: {np.mean(times)*1000:.2f}ms")
    print(f"P99 latency: {np.percentile(times, 99)*1000:.2f}ms")
```

## Production Checklist

- [ ] Model and preprocessing saved together
- [ ] API endpoint tested
- [ ] Input validation
- [ ] Error handling (see the sketch after this list)
- [ ] Health check endpoint
- [ ] Logging for monitoring
- [ ] Docker containerization
- [ ] Load testing done
- [ ] Rollback plan ready
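
Two of those items deserve a concrete sketch. Pydantic already rejects wrong types, but you can also validate the feature count and convert unexpected failures into clean HTTP errors. This builds on the app from Step 3 (the expected count of 4 is a placeholder for your model's input size; `field_validator` is the Pydantic v2 API):

```python
from fastapi import HTTPException
from pydantic import BaseModel, field_validator

N_FEATURES = 4  # placeholder: set to your model's input size

class PredictionRequest(BaseModel):
    features: list[float]

    @field_validator("features")
    @classmethod
    def check_length(cls, v):
        if len(v) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(v)}")
        return v

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = pipeline.predict(features)[0]
        probability = pipeline.predict_proba(features)[0].max()
    except Exception as exc:
        # Surface model failures as a clean 500 instead of an unhandled crash
        raise HTTPException(status_code=500, detail=str(exc))
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability),
    )
```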

## Key Takeaway

Production ML is about consistency and reliability. Save models with their preprocessing, create clean APIs, containerize for portability, and monitor everything. The best model is worthless if it can't serve predictions reliably. Start simple, add complexity only when needed!

#MachineLearning #MLOps #Deployment #Production #Intermediate