A/B Testing for Machine Learning Models
Learn how to properly A/B test ML models before deploying to production.
Your new model has better offline metrics. But does it actually perform better with real users? A/B testing is how you find out.
Why A/B Test ML Models?
Offline metrics can be misleading:
- Test data might not represent production traffic
- User behavior might change in response to the new model
- The model might have side effects you didn't measure
A/B testing measures what really matters: business outcomes.
Basic A/B Test Setup
Users
  │
  ├── 50% → Model A (Control/Current)
  │            │
  │            └── Measure: clicks, conversions, revenue
  │
  └── 50% → Model B (Treatment/New)
               │
               └── Measure: clicks, conversions, revenue

Compare results → Statistical significance → Deploy winner
Key Metrics to Track
Primary metrics (what you're trying to improve):
- Click-through rate
- Conversion rate
- Revenue
- Engagement time
Guardrail metrics (metrics you don't want to hurt; see the sketch after this list):
- Page load time
- Error rate
- User complaints
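One concrete way to keep this distinction explicit is to record the metric roles in the experiment definition itself. A minimal sketch, with hypothetical metric names:

from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    name: str
    primary_metric: str                  # the metric that decides the test
    guardrail_metrics: tuple             # metrics that must not degrade
    traffic_split: tuple = (0.5, 0.5)    # control / treatment, as in the diagram above

config = ExperimentConfig(
    name='model_v2_test',
    primary_metric='conversion_rate',
    guardrail_metrics=('p95_latency_ms', 'error_rate', 'complaint_rate'),
)
print(config)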
Sample Size Calculation
Don't stop early! Calculate the required sample size up front:
from scipy import stats
import numpy as np
def calculate_sample_size(baseline_rate, minimum_effect, alpha=0.05, power=0.8):
    """
    baseline_rate: current conversion rate (e.g., 0.05 for 5%)
    minimum_effect: smallest improvement worth detecting (relative, e.g., 0.1 for 10%)
    """
    effect_size = baseline_rate * minimum_effect
    # Using normal approximation
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)
    p1 = baseline_rate
    p2 = baseline_rate + effect_size
    p_avg = (p1 + p2) / 2
    n = (2 * p_avg * (1 - p_avg) * (z_alpha + z_beta)**2) / effect_size**2
    return int(np.ceil(n))
# Example: 5% baseline, want to detect 10% relative improvement
sample_size = calculate_sample_size(0.05, 0.1)
print(f"Need {sample_size} users per group")
Running the Test
import hashlib
def assign_variant(user_id, experiment_name, variants=['A', 'B']):
    """Consistently assign a user to a variant."""
    hash_input = f"{user_id}_{experiment_name}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    variant_index = hash_value % len(variants)
    return variants[variant_index]
# Usage
variant = assign_variant(user_id='user123', experiment_name='model_v2_test')
if variant == 'A':
    prediction = model_a.predict(features)
else:
    prediction = model_b.predict(features)
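Before trusting the assignment in production, it's worth checking that it is deterministic and splits traffic roughly 50/50. A quick self-check with hypothetical user IDs:

# Sanity-check the hash-based assignment: stable per user, ~50/50 overall
counts = {'A': 0, 'B': 0}
for i in range(100_000):
    uid = f"user{i}"                                    # hypothetical user IDs
    v = assign_variant(uid, 'model_v2_test')
    assert v == assign_variant(uid, 'model_v2_test')    # same user, same variant
    counts[v] += 1
print(counts)  # expect roughly 50,000 in each group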
Analyzing Results
from scipy import stats
def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    # Conversion rates
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Lift
    lift = (treatment_rate - control_rate) / control_rate

    # Statistical test
    contingency_table = [
        [control_conversions, control_total - control_conversions],
        [treatment_conversions, treatment_total - treatment_conversions]
    ]
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    print(f"Control rate: {control_rate:.4f}")
    print(f"Treatment rate: {treatment_rate:.4f}")
    print(f"Lift: {lift:.2%}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {p_value < 0.05}")

    return {
        'control_rate': control_rate,
        'treatment_rate': treatment_rate,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
# Example
results = analyze_ab_test(
    control_conversions=500, control_total=10000,
    treatment_conversions=550, treatment_total=10000
)
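A p-value alone doesn't say how large the effect plausibly is. A confidence interval for the difference in conversion rates is a useful complement; here is a sketch using the normal approximation and the same example numbers:

import numpy as np
from scipy import stats

def diff_confidence_interval(control_conversions, control_total,
                             treatment_conversions, treatment_total,
                             confidence=0.95):
    """CI for (treatment rate - control rate), normal approximation."""
    p1 = control_conversions / control_total
    p2 = treatment_conversions / treatment_total
    se = np.sqrt(p1 * (1 - p1) / control_total + p2 * (1 - p2) / treatment_total)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(500, 10000, 550, 10000)
print(f"95% CI for the rate difference: [{low:.4f}, {high:.4f}]")

If the interval is wide enough to include both "no effect" and "big win", you probably need more data, even if the p-value squeaks under 0.05.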
Common Mistakes
1. Peeking and Stopping Early
Day 1: Treatment winning! Ship it? NO.
Day 3: Control winning! Stop? NO.
Day 7: Reached the planned sample size. Now analyze (see the simulation below).
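To see why, simulate A/A tests (both groups share the same true rate) and check for significance every day. This is a sketch with made-up traffic numbers; exact results vary by random seed:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, days, users_per_day, true_rate = 1000, 14, 500, 0.05
false_positives = 0

for _ in range(n_experiments):
    a_conv = b_conv = a_n = b_n = 0
    for _ in range(days):
        # Both groups draw from the SAME true rate: any "win" is a false positive
        a_conv += rng.binomial(users_per_day, true_rate)
        b_conv += rng.binomial(users_per_day, true_rate)
        a_n += users_per_day
        b_n += users_per_day
        table = [[a_conv, a_n - a_conv], [b_conv, b_n - b_conv]]
        _, p_value, _, _ = stats.chi2_contingency(table)
        if p_value < 0.05:  # peek daily and stop at the first "significant" result
            false_positives += 1
            break

# Typically well above the nominal 5% you'd expect without peeking
print(f"A/A 'wins' with daily peeking: {false_positives / n_experiments:.1%}")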
2. Wrong Randomization Unit
- Testing a recommendation model? Randomize by user, not by pageview
- The same user should always see the same variant
3. Not Accounting for Novelty Effect
- Users might engage more with anything new
- Run test long enough for novelty to wear off
4. Multiple Testing Problem
- Testing 10 metrics? Some will be "significant" by chance
- Use a Bonferroni correction or designate a single primary metric (see the sketch below)
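A minimal sketch of the Bonferroni approach, assuming you ended up with one p-value per metric from separate tests:

# Hypothetical p-values, one per metric
p_values = {'ctr': 0.03, 'conversion': 0.04, 'revenue': 0.20, 'latency': 0.01}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # Bonferroni: divide alpha by number of tests

for metric, p in p_values.items():
    print(f"{metric}: p={p:.3f} -> significant after correction: {p < adjusted_alpha}")

If you already use statsmodels, its multipletests function implements Bonferroni and less conservative corrections.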
Decision Framework
Is treatment significantly better?
├── Yes, and guardrails OK → Deploy treatment
├── Yes, but guardrails hurt → Investigate, maybe don't deploy
├── No difference → Keep control (simpler)
└── Significantly worse → Definitely keep control
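If you want this codified rather than left to judgment in the moment, the tree above translates directly into a small helper. A sketch that reuses the results dict from analyze_ab_test and assumes you compute a guardrails_ok flag yourself:

def decide(results, guardrails_ok):
    """Map A/B test results to a decision, mirroring the tree above.

    results: dict from analyze_ab_test (uses 'significant' and 'lift')
    guardrails_ok: bool, whether guardrail metrics stayed within bounds
    """
    if results['significant'] and results['lift'] > 0:
        return 'deploy treatment' if guardrails_ok else 'investigate guardrails first'
    if results['significant'] and results['lift'] < 0:
        return 'keep control (treatment is worse)'
    return 'keep control (no detectable difference)'

print(decide(results, guardrails_ok=True))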
Key Takeaway
A/B testing is the gold standard for evaluating ML models in production. Calculate the required sample size upfront, don't peek at results, and always track guardrail metrics. Offline improvements don't always translate to online gains, so let real users be the judge. Only deploy when you see a statistically significant improvement in the metrics that matter.