ML · 10 min read

ML in Production: From Notebook to Deployment

Learn the essentials of deploying ML models to production, from saving models to serving predictions.

Sarah Chen
December 19, 2025

A model in a notebook is useless. Let's get it serving real predictions.

The Production Pipeline

Training Pipeline:
Data → Preprocess → Train → Evaluate → Save Model

Inference Pipeline:
Request → Load Model → Preprocess → Predict → Response

Step 1: Save Your Model

import joblib
import pickle

# Method 1: joblib (recommended for sklearn)
joblib.dump(model, 'model.joblib')
model = joblib.load('model.joblib')

# Method 2: pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save preprocessing too!
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(encoder, 'encoder.joblib')

Step 2: Create Inference Pipeline

Keep preprocessing consistent:

class ModelPipeline:
    def __init__(self, model_path, scaler_path):
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path)
    
    def preprocess(self, data):
        # Same preprocessing as training
        return self.scaler.transform(data)
    
    def predict(self, data):
        processed = self.preprocess(data)
        return self.model.predict(processed)
    
    def predict_proba(self, data):
        processed = self.preprocess(data)
        return self.model.predict_proba(processed)
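
A quick local smoke test before wiring this into an API. The feature values here are placeholders for whatever your model was actually trained on:

import numpy as np

pipeline = ModelPipeline('model.joblib', 'scaler.joblib')

# One sample, same feature order as during training
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
print(pipeline.predict(sample))
print(pipeline.predict_proba(sample))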

Step 3: Create API Endpoint

Using FastAPI (recommended):

from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()

# Load model at startup
pipeline = ModelPipeline('model.joblib', 'scaler.joblib')

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()
    
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability)
    )

# Health check
@app.get("/health")
def health():
    return {"status": "healthy"}

Step 4: Containerize with Docker

FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.joblib .
COPY scaler.joblib .
COPY app.py .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run the image:

docker build -t ml-model .
docker run -p 8000:8000 ml-model
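
The Dockerfile copies a requirements.txt sitting next to app.py. A minimal one for this service might look like the following; pin the exact versions you trained with rather than these illustrative ones:

fastapi
uvicorn[standard]
scikit-learn==1.3.2
joblib==1.3.2
numpy==1.26.4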

Step 5: Monitor Your Model

Log every prediction so you can track latency and output distributions over time. This handler replaces the one from Step 3:

import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.post("/predict")
def predict(request: PredictionRequest):
    start_time = datetime.now()
    
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()
    
    # Log for monitoring
    latency = (datetime.now() - start_time).total_seconds()
    logger.info(f"Prediction: {prediction}, Prob: {probability:.3f}, Latency: {latency:.3f}s")
    
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability)
    )

Common Production Issues

1. Training-Serving Skew

Problem: Preprocessing differs between training and serving.

Solution: Use the exact same preprocessing code:

# Save the entire pipeline (preprocessing + model as one artifact)
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'full_pipeline.joblib')
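
At serving time you then load a single artifact and call it directly on raw features; scaling happens inside the pipeline, so there is no separate scaler to keep in sync (X_new below stands in for incoming data):

full_pipeline = joblib.load('full_pipeline.joblib')

# Preprocessing and prediction in one call
prediction = full_pipeline.predict(X_new)
probability = full_pipeline.predict_proba(X_new).max(axis=1)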

2. Model Drift

Problem: Model performance degrades over time.

Solution: Monitor predictions and retrain:

# Track prediction distribution
def log_prediction_stats(predictions):
    logger.info(f"Mean: {np.mean(predictions):.3f}")
    logger.info(f"Std: {np.std(predictions):.3f}")
    logger.info(f"Class distribution: {np.bincount(predictions)}")

3. Latency Issues

Solution: Optimize or use lighter models:

# Measure latency
import time

def benchmark_model(model, X_test, n_runs=100):
    times = []
    for _ in range(n_runs):
        start = time.time()
        model.predict(X_test[:1])
        times.append(time.time() - start)
    
    print(f"Mean latency: {np.mean(times)*1000:.2f}ms")
    print(f"P99 latency: {np.percentile(times, 99)*1000:.2f}ms")

Production Checklist

  • Model and preprocessing saved together
  • API endpoint tested
  • Input validation (see the sketch below)
  • Error handling
  • Health check endpoint
  • Logging for monitoring
  • Docker containerization
  • Load testing done
  • Rollback plan ready
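
For the input-validation and error-handling items, here is a minimal sketch using Pydantic v2 field constraints and FastAPI's HTTPException; the bounds and error message are illustrative:

from fastapi import HTTPException
from pydantic import BaseModel, Field

class PredictionRequest(BaseModel):
    # Reject empty or oversized feature vectors before they reach the model
    features: list[float] = Field(..., min_length=1, max_length=100)

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = pipeline.predict(features)[0]
        probability = pipeline.predict_proba(features)[0].max()
    except Exception as exc:
        logger.exception("Prediction failed")
        raise HTTPException(status_code=400, detail="Prediction failed") from exc
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability)
    )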

Key Takeaway

Production ML is about consistency and reliability. Save models with their preprocessing, create clean APIs, containerize for portability, and monitor everything. The best model is worthless if it can't serve predictions reliably. Start simple, add complexity only when needed!

#MachineLearning #MLOps #Deployment #Production #Intermediate