ML · 10 min read
ML in Production: From Notebook to Deployment
Learn the essentials of deploying ML models to production, from saving models to serving predictions.
Sarah Chen
December 19, 2025
A model in a notebook is useless. Let's get it serving real predictions.
The Production Pipeline
Training Pipeline:
Data → Preprocess → Train → Evaluate → Save Model
Inference Pipeline:
Request → Load Model → Preprocess → Predict → Response
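Before worrying about serving, you need a trained model and its fitted preprocessor. Here's a minimal training sketch for context; the synthetic data and random-forest classifier are just stand-ins for whatever your project actually uses:
# Minimal training sketch: preprocess, train, evaluate (saving comes in Step 1)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(random_state=42).fit(scaler.transform(X_train), y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(scaler.transform(X_test))))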
Step 1: Save Your Model
import joblib
import pickle
# Method 1: joblib (recommended for sklearn)
joblib.dump(model, 'model.joblib')
model = joblib.load('model.joblib')
# Method 2: pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Save preprocessing too!
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(encoder, 'encoder.joblib')
Step 2: Create Inference Pipeline
Keep preprocessing consistent:
class ModelPipeline:
    def __init__(self, model_path, scaler_path):
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path)

    def preprocess(self, data):
        # Same preprocessing as training
        return self.scaler.transform(data)

    def predict(self, data):
        processed = self.preprocess(data)
        return self.model.predict(processed)

    def predict_proba(self, data):
        processed = self.preprocess(data)
        return self.model.predict_proba(processed)
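A quick usage check; the feature values here are placeholders:
import numpy as np

pipeline = ModelPipeline('model.joblib', 'scaler.joblib')
sample = np.array([[5.1, 3.5, 1.4, 0.2]])  # hypothetical feature vector
print(pipeline.predict(sample)[0], pipeline.predict_proba(sample)[0])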
Step 3: Create API Endpoint
Using FastAPI (recommended):
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
app = FastAPI()
# Load model at startup
pipeline = ModelPipeline('model.joblib', 'scaler.joblib')
class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = pipeline.predict(features)[0]
    probability = pipeline.predict_proba(features)[0].max()
    return PredictionResponse(
        prediction=int(prediction),
        probability=float(probability)
    )

# Health check
@app.get("/health")
def health():
    return {"status": "healthy"}
Step 4: Containerize with Docker
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.joblib .
COPY scaler.joblib .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# Build and run
docker build -t ml-model .
docker run -p 8000:8000 ml-model
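The Dockerfile copies a requirements.txt that isn't shown above; for this stack it would need at least the following (pin exact versions for reproducible builds):
fastapi
uvicorn[standard]
scikit-learn
joblib
numpy
pydantic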
Step 5: Monitor Your Model
import logging
from datetime import datetime
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.post("/predict")
def predict(request: PredictionRequest):
start_time = datetime.now()
features = np.array(request.features).reshape(1, -1)
prediction = pipeline.predict(features)[0]
probability = pipeline.predict_proba(features)[0].max()
# Log for monitoring
latency = (datetime.now() - start_time).total_seconds()
logger.info(f"Prediction: {prediction}, Prob: {probability:.3f}, Latency: {latency:.3f}s")
return PredictionResponse(
prediction=int(prediction),
probability=float(probability)
)
Common Production Issues
1. Training-Serving Skew
Problem: Preprocessing differs between training and serving.
Solution: Use the exact same preprocessing code:
# Save the entire pipeline (preprocessing + model as one artifact)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'full_pipeline.joblib')
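At serving time the full pipeline is a single artifact; scaling happens inside predict, so there is no separate scaler to keep in sync:
full_pipeline = joblib.load('full_pipeline.joblib')
prediction = full_pipeline.predict(raw_features)  # raw_features: unscaled input, shape (n_samples, n_features)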
2. Model Drift
Problem: Model performance degrades over time.
Solution: Monitor predictions and retrain:
# Track prediction distribution
def log_prediction_stats(predictions):
    logger.info(f"Mean: {np.mean(predictions):.3f}")
    logger.info(f"Std: {np.std(predictions):.3f}")
    logger.info(f"Class distribution: {np.bincount(predictions)}")
3. Latency Issues
Problem: Predictions are too slow for the request path.
Solution: Measure first, then optimize or switch to a lighter model:
# Measure latency
import time
def benchmark_model(model, X_test, n_runs=100):
    times = []
    for _ in range(n_runs):
        start = time.time()
        model.predict(X_test[:1])
        times.append(time.time() - start)
    print(f"Mean latency: {np.mean(times)*1000:.2f}ms")
    print(f"P99 latency: {np.percentile(times, 99)*1000:.2f}ms")
Production Checklist
- Model and preprocessing saved together
- API endpoint tested
- Input validation (see the sketch after this checklist)
- Error handling (see the sketch after this checklist)
- Health check endpoint
- Logging for monitoring
- Docker containerization
- Load testing done
- Rollback plan ready
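Two items on that list, input validation and error handling, aren't covered by the endpoint above. A minimal sketch (the expected feature count of 4 is a placeholder):
from fastapi import HTTPException

EXPECTED_FEATURES = 4  # placeholder: set to your model's actual feature count

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    if len(request.features) != EXPECTED_FEATURES:
        raise HTTPException(status_code=422,
                            detail=f"expected {EXPECTED_FEATURES} features, got {len(request.features)}")
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = pipeline.predict(features)[0]
        probability = pipeline.predict_proba(features)[0].max()
    except Exception:
        logger.exception("Prediction failed")
        raise HTTPException(status_code=500, detail="prediction failed")
    return PredictionResponse(prediction=int(prediction), probability=float(probability))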
Key Takeaway
Production ML is about consistency and reliability. Save models with their preprocessing, create clean APIs, containerize for portability, and monitor everything. The best model is worthless if it can't serve predictions reliably. Start simple, add complexity only when needed!
#Machine Learning #MLOps #Deployment #Production #Intermediate