ML · 8 min read

A/B Testing for Machine Learning Models

Learn how to properly A/B test ML models before deploying to production.

Sarah Chen
December 19, 2025

Your new model has better offline metrics. But does it actually perform better with real users? A/B testing is how you find out.

Why A/B Test ML Models?

Offline metrics can be misleading:

- Test data might not represent production traffic
- User behavior might change over time
- There may be side effects you didn't measure

A/B testing measures what really matters: business outcomes.

Basic A/B Test Setup

```
Users
│
├── 50% → Model A (Control/Current)
│   │
│   └── Measure: clicks, conversions, revenue
│
└── 50% → Model B (Treatment/New)
    │
    └── Measure: clicks, conversions, revenue

Compare results → Statistical significance → Deploy winner
```

Key Metrics to Track

**Primary metrics** (what you're trying to improve):

- Click-through rate
- Conversion rate
- Revenue
- Engagement time

**Guardrail metrics** (don't want to hurt):

- Page load time
- Error rate
- User complaints
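One lightweight way to keep this distinction explicit is to encode it in the experiment definition itself. Here is a minimal sketch; the `ExperimentConfig` structure and the metric names are illustrative, not from any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    """Declares which metrics decide the test and which only guard against regressions."""
    name: str
    primary_metrics: list = field(default_factory=list)    # used for the ship/no-ship decision
    guardrail_metrics: list = field(default_factory=list)  # must not degrade significantly

config = ExperimentConfig(
    name="model_v2_test",
    primary_metrics=["conversion_rate", "revenue_per_user"],
    guardrail_metrics=["p95_latency_ms", "error_rate"],
)
```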

Sample Size Calculation

Don't stop early! Calculate required sample size:

```python
from scipy import stats
import numpy as np

def calculate_sample_size(baseline_rate, minimum_effect, alpha=0.05, power=0.8):
    """
    baseline_rate: current conversion rate (e.g., 0.05 for 5%)
    minimum_effect: smallest improvement worth detecting (relative, e.g., 0.1 for 10%)
    """
    effect_size = baseline_rate * minimum_effect

    # Using normal approximation for two proportions
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    p1 = baseline_rate
    p2 = baseline_rate + effect_size
    p_avg = (p1 + p2) / 2

    n = (2 * p_avg * (1 - p_avg) * (z_alpha + z_beta)**2) / effect_size**2
    return int(np.ceil(n))

# Example: 5% baseline, want to detect a 10% relative improvement
sample_size = calculate_sample_size(0.05, 0.1)
print(f"Need {sample_size} users per group")
```
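If you'd rather not hand-roll the formula, the number can be cross-checked with statsmodels' power analysis (assuming statsmodels is installed). It uses an arcsine-based effect size, so expect a result in the same ballpark rather than an exact match to the formula above:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size (Cohen's h) between the 5.5% target and the 5% baseline
effect = proportion_effectsize(0.055, 0.05)

# Required observations per group for alpha=0.05, power=0.8
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"statsmodels estimate: {int(round(n_per_group))} users per group")
```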

Running the Test

```python
import hashlib

def assign_variant(user_id, experiment_name, variants=('A', 'B')):
    """Consistently assign a user to a variant."""
    hash_input = f"{user_id}_{experiment_name}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    variant_index = hash_value % len(variants)
    return variants[variant_index]

# Usage
variant = assign_variant(user_id='user123', experiment_name='model_v2_test')
if variant == 'A':
    prediction = model_a.predict(features)
else:
    prediction = model_b.predict(features)
```
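Because assignment is a pure function of `user_id` and `experiment_name`, you can sanity-check two properties offline: the same user always lands in the same variant, and the split is roughly 50/50. A quick sketch, using synthetic user IDs purely for illustration:

```python
from collections import Counter

# Deterministic: the same user always gets the same variant
assert assign_variant('user123', 'model_v2_test') == assign_variant('user123', 'model_v2_test')

# Roughly balanced: hash-based assignment should split close to 50/50
counts = Counter(assign_variant(f"user{i}", 'model_v2_test') for i in range(100_000))
print(counts)  # expect roughly 50k per variant
```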

Analyzing Results

```python
from scipy import stats

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    # Conversion rates
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Lift
    lift = (treatment_rate - control_rate) / control_rate

    # Statistical test
    contingency_table = [
        [control_conversions, control_total - control_conversions],
        [treatment_conversions, treatment_total - treatment_conversions]
    ]
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    print(f"Control rate: {control_rate:.4f}")
    print(f"Treatment rate: {treatment_rate:.4f}")
    print(f"Lift: {lift:.2%}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {p_value < 0.05}")

    return {
        'control_rate': control_rate,
        'treatment_rate': treatment_rate,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05
    }

# Example
results = analyze_ab_test(
    control_conversions=500,
    control_total=10000,
    treatment_conversions=550,
    treatment_total=10000
)
```
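The chi-square test gives a p-value but no interval on the size of the effect. As an optional add-on (again assuming statsmodels is available), a two-proportion z-test plus a normal-approximation confidence interval on the absolute difference gives a feel for the plausible range of improvement:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

counts = np.array([550, 500])     # treatment, control conversions
nobs = np.array([10000, 10000])   # users per group

z_stat, p_value = proportions_ztest(counts, nobs)

# Normal-approximation 95% CI on the absolute difference in rates
p_t, p_c = counts / nobs
se = np.sqrt(p_t * (1 - p_t) / nobs[0] + p_c * (1 - p_c) / nobs[1])
diff = p_t - p_c
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print(f"Difference: {diff:.4f} ± {1.96 * se:.4f} (95% CI)")
```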

Common Mistakes

### 1. Peeking and Stopping Early

```
Day 1: Treatment winning! Ship it? NO.
Day 3: Control winning! Stop? NO.
Day 7: Reached sample size. Now analyze.
```
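One simple guard against peeking is to make the analysis refuse to run until each group has reached the pre-computed sample size. A minimal sketch building on `calculate_sample_size` and `analyze_ab_test` from above (raising an error vs. just logging is your call):

```python
def analyze_when_ready(control_conversions, control_total,
                       treatment_conversions, treatment_total,
                       required_per_group):
    """Only analyze once both groups hit the pre-registered sample size."""
    if min(control_total, treatment_total) < required_per_group:
        raise RuntimeError(
            f"Not enough data yet: need {required_per_group} per group, "
            f"have {control_total} control / {treatment_total} treatment."
        )
    return analyze_ab_test(control_conversions, control_total,
                           treatment_conversions, treatment_total)
```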

### 2. Wrong Randomization Unit

- Testing a recommendation model? Randomize by user, not by pageview
- Same user should always see same variant

### 3. Not Accounting for Novelty Effect

- Users might engage more with anything new
- Run the test long enough for the novelty to wear off
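One way to check for novelty is to compute lift per day and see whether it shrinks over time. A sketch assuming you have an assignment/outcome log in a pandas DataFrame with hypothetical columns `day`, `variant`, and `converted`:

```python
import pandas as pd

def lift_by_day(df: pd.DataFrame) -> pd.Series:
    """Daily lift of treatment over control; a steady decline suggests a novelty effect."""
    rates = df.groupby(['day', 'variant'])['converted'].mean().unstack('variant')
    return (rates['B'] - rates['A']) / rates['A']

# Usage (df has one row per user-day with the columns named above):
# print(lift_by_day(df))
```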

### 4. Multiple Testing Problem

- Testing 10 metrics? Some will be "significant" by chance
- Use Bonferroni correction or designate a primary metric
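For reference, here is the Bonferroni correction by hand (you can also use `statsmodels.stats.multitest.multipletests`, which implements this and less conservative alternatives):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Compare each p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Example: 10 metrics, only the first survives the corrected threshold of 0.005
p_values = [0.003, 0.04, 0.2, 0.5, 0.01, 0.3, 0.07, 0.6, 0.9, 0.049]
print(bonferroni_significant(p_values))
```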

Decision Framework

```
Is treatment significantly better?
├── Yes, and guardrails OK → Deploy treatment
├── Yes, but guardrails hurt → Investigate, maybe don't deploy
├── No difference → Keep control (simpler)
└── Significantly worse → Definitely keep control
```
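The same logic can be expressed as a small function over the dictionary returned by `analyze_ab_test`, plus a guardrail flag you compute separately. The `guardrails_ok` input is an assumption here: however you evaluate your guardrail metrics, reduce the outcome to a boolean before making the call.

```python
def decide(results, guardrails_ok):
    """Map A/B test results plus guardrail status to a deployment decision."""
    if results['significant'] and results['lift'] > 0:
        return ("Deploy treatment" if guardrails_ok
                else "Investigate guardrail regressions before deploying")
    if results['significant'] and results['lift'] < 0:
        return "Keep control (treatment is significantly worse)"
    return "Keep control (no significant difference; prefer the simpler setup)"

# Example, using the results computed earlier
print(decide(results, guardrails_ok=True))
```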

Key Takeaway

A/B testing is the gold standard for ML model evaluation. Calculate required sample size upfront, don't peek at results, and always track guardrail metrics. Offline improvements don't always translate to online gains - let real users be the judge. Only deploy when you have statistically significant improvement in metrics that matter.

#Machine Learning #A/B Testing #Model Deployment #Intermediate