ML · 8 min read

A/B Testing for Machine Learning Models

Learn how to properly A/B test ML models before deploying to production.

Sarah Chen
December 19, 2025

Your new model has better offline metrics. But does it actually perform better with real users? A/B testing is how you find out.

Why A/B Test ML Models?

Offline metrics can be misleading:

- Test data might not represent production traffic
- User behavior might change over time
- There may be side effects you didn't measure

A/B testing measures what really matters: business outcomes.

Basic A/B Test Setup

```
Users
│
├── 50% → Model A (Control/Current)
│   │
│   └── Measure: clicks, conversions, revenue
│
└── 50% → Model B (Treatment/New)
    │
    └── Measure: clicks, conversions, revenue

Compare results → Statistical significance → Deploy winner
```

Key Metrics to Track

**Primary metrics** (what you're trying to improve):

- Click-through rate
- Conversion rate
- Revenue
- Engagement time

**Guardrail metrics** (don't want to hurt):

- Page load time
- Error rate
- User complaints
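One lightweight way to keep this distinction explicit is to encode it in the experiment definition itself. Here is a minimal sketch; the `ExperimentConfig` structure and the metric names are illustrative, not from any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    """Declares which metrics decide the test and which only guard against regressions."""
    name: str
    primary_metrics: list = field(default_factory=list)    # used for the ship/no-ship decision
    guardrail_metrics: list = field(default_factory=list)  # must not degrade significantly

config = ExperimentConfig(
    name="model_v2_test",
    primary_metrics=["conversion_rate", "revenue_per_user"],
    guardrail_metrics=["p95_latency_ms", "error_rate"],
)
```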

Sample Size Calculation

Don't stop early! Calculate required sample size:

```python
from scipy import stats
import numpy as np

def calculate_sample_size(baseline_rate, minimum_effect, alpha=0.05, power=0.8):
    """
    baseline_rate: current conversion rate (e.g., 0.05 for 5%)
    minimum_effect: smallest improvement worth detecting (relative, e.g., 0.1 for 10%)
    """
    effect_size = baseline_rate * minimum_effect

    # Using normal approximation for two proportions
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    p1 = baseline_rate
    p2 = baseline_rate + effect_size
    p_avg = (p1 + p2) / 2

    n = (2 * p_avg * (1 - p_avg) * (z_alpha + z_beta)**2) / effect_size**2
    return int(np.ceil(n))

# Example: 5% baseline, want to detect a 10% relative improvement
sample_size = calculate_sample_size(0.05, 0.1)
print(f"Need {sample_size} users per group")
```
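If you'd rather not hand-roll the formula, the number can be cross-checked with statsmodels' power analysis (assuming statsmodels is installed). It uses an arcsine-based effect size, so expect a result in the same ballpark rather than an exact match to the formula above:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size (Cohen's h) between the 5.5% target and the 5% baseline
effect = proportion_effectsize(0.055, 0.05)

# Required observations per group for alpha=0.05, power=0.8
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"statsmodels estimate: {int(round(n_per_group))} users per group")
```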

Running the Test

```python
import hashlib

def assign_variant(user_id, experiment_name, variants=('A', 'B')):
    """Consistently assign a user to a variant."""
    hash_input = f"{user_id}_{experiment_name}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    variant_index = hash_value % len(variants)
    return variants[variant_index]

# Usage
variant = assign_variant(user_id='user123', experiment_name='model_v2_test')
if variant == 'A':
    prediction = model_a.predict(features)
else:
    prediction = model_b.predict(features)
```
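Because assignment is a pure function of `user_id` and `experiment_name`, you can sanity-check two properties offline: the same user always lands in the same variant, and the split is roughly 50/50. A quick sketch, using synthetic user IDs purely for illustration:

```python
from collections import Counter

# Deterministic: the same user always gets the same variant
assert assign_variant('user123', 'model_v2_test') == assign_variant('user123', 'model_v2_test')

# Roughly balanced: hash-based assignment should split close to 50/50
counts = Counter(assign_variant(f"user{i}", 'model_v2_test') for i in range(100_000))
print(counts)  # expect roughly 50k per variant
```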

Analyzing Results

```python
from scipy import stats

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    # Conversion rates
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Lift
    lift = (treatment_rate - control_rate) / control_rate

    # Statistical test
    contingency_table = [
        [control_conversions, control_total - control_conversions],
        [treatment_conversions, treatment_total - treatment_conversions]
    ]
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    print(f"Control rate: {control_rate:.4f}")
    print(f"Treatment rate: {treatment_rate:.4f}")
    print(f"Lift: {lift:.2%}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {p_value < 0.05}")

    return {
        'control_rate': control_rate,
        'treatment_rate': treatment_rate,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05
    }

# Example
results = analyze_ab_test(
    control_conversions=500,
    control_total=10000,
    treatment_conversions=550,
    treatment_total=10000
)
```
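The chi-square test gives a p-value but no interval on the size of the effect. As an optional add-on (again assuming statsmodels is available), a two-proportion z-test plus a normal-approximation confidence interval on the absolute difference gives a feel for the plausible range of improvement:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

counts = np.array([550, 500])     # treatment, control conversions
nobs = np.array([10000, 10000])   # users per group

z_stat, p_value = proportions_ztest(counts, nobs)

# Normal-approximation 95% CI on the absolute difference in rates
p_t, p_c = counts / nobs
se = np.sqrt(p_t * (1 - p_t) / nobs[0] + p_c * (1 - p_c) / nobs[1])
diff = p_t - p_c
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print(f"Difference: {diff:.4f} ± {1.96 * se:.4f} (95% CI)")
```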

Common Mistakes

### 1. Peeking and Stopping Early

```
Day 1: Treatment winning! Ship it? NO.
Day 3: Control winning! Stop? NO.
Day 7: Reached sample size. Now analyze.
```
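One simple guard against peeking is to make the analysis refuse to run until each group has reached the pre-computed sample size. A minimal sketch building on `calculate_sample_size` and `analyze_ab_test` from above (raising an error vs. just logging is your call):

```python
def analyze_when_ready(control_conversions, control_total,
                       treatment_conversions, treatment_total,
                       required_per_group):
    """Only analyze once both groups hit the pre-registered sample size."""
    if min(control_total, treatment_total) < required_per_group:
        raise RuntimeError(
            f"Not enough data yet: need {required_per_group} per group, "
            f"have {control_total} control / {treatment_total} treatment."
        )
    return analyze_ab_test(control_conversions, control_total,
                           treatment_conversions, treatment_total)
```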

### 2. Wrong Randomization Unit

- Testing a recommendation model? Randomize by user, not by pageview
- Same user should always see same variant

### 3. Not Accounting for Novelty Effect

- Users might engage more with anything new
- Run the test long enough for the novelty to wear off
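One way to check for novelty is to compute lift per day and see whether it shrinks over time. A sketch assuming you have an assignment/outcome log in a pandas DataFrame with hypothetical columns `day`, `variant`, and `converted`:

```python
import pandas as pd

def lift_by_day(df: pd.DataFrame) -> pd.Series:
    """Daily lift of treatment over control; a steady decline suggests a novelty effect."""
    rates = df.groupby(['day', 'variant'])['converted'].mean().unstack('variant')
    return (rates['B'] - rates['A']) / rates['A']

# Usage (df has one row per user-day with the columns named above):
# print(lift_by_day(df))
```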

### 4. Multiple Testing Problem

- Testing 10 metrics? Some will be "significant" by chance
- Use Bonferroni correction or designate a primary metric
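For reference, here is the Bonferroni correction by hand (you can also use `statsmodels.stats.multitest.multipletests`, which implements this and less conservative alternatives):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Compare each p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Example: 10 metrics, only the first survives the corrected threshold of 0.005
p_values = [0.003, 0.04, 0.2, 0.5, 0.01, 0.3, 0.07, 0.6, 0.9, 0.049]
print(bonferroni_significant(p_values))
```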

Decision Framework

```
Is treatment significantly better?
├── Yes, and guardrails OK → Deploy treatment
├── Yes, but guardrails hurt → Investigate, maybe don't deploy
├── No difference → Keep control (simpler)
└── Significantly worse → Definitely keep control
```
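The same logic can be expressed as a small function over the dictionary returned by `analyze_ab_test`, plus a guardrail flag you compute separately. The `guardrails_ok` input is an assumption here: however you evaluate your guardrail metrics, reduce the outcome to a boolean before making the call.

```python
def decide(results, guardrails_ok):
    """Map A/B test results plus guardrail status to a deployment decision."""
    if results['significant'] and results['lift'] > 0:
        return ("Deploy treatment" if guardrails_ok
                else "Investigate guardrail regressions before deploying")
    if results['significant'] and results['lift'] < 0:
        return "Keep control (treatment is significantly worse)"
    return "Keep control (no significant difference; prefer the simpler setup)"

# Example, using the results computed earlier
print(decide(results, guardrails_ok=True))
```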

Key Takeaway

A/B testing is the gold standard for ML model evaluation. Calculate required sample size upfront, don't peek at results, and always track guardrail metrics. Offline improvements don't always translate to online gains - let real users be the judge. Only deploy when you have statistically significant improvement in metrics that matter.

#Machine Learning #A/B Testing #Model Deployment #Intermediate