AI · 6 min read
A/B Testing for AI
Test AI models in production scientifically.
Robert Anderson
December 18, 2025
Test models properly.
What is A/B Testing?
Compare two versions to see which performs better.
Version A: Current model (baseline)
Version B: New model (challenger)
Why A/B Test AI?
Offline accuracy ≠ Real-world performance
Reasons:
- User behavior differs from what the training data captured
- Data drift: production data shifts over time
- Business metrics (clicks, revenue, retention) matter more than accuracy
Basic Setup
import random

def get_model_for_user(user_id):
    # Randomly assign 50% of traffic to each model
    if random.random() < 0.5:
        return 'model_a'  # Current model (baseline)
    else:
        return 'model_b'  # New model (challenger)

# Track results
results = {
    'model_a': {'shown': 0, 'clicked': 0},
    'model_b': {'shown': 0, 'clicked': 0}
}
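To round out the setup, here is one way that tracking dict might be updated as events arrive. The record_impression, record_click, and click_through_rate helpers are illustrative names for this sketch, not part of any library:

def record_impression(results, model):
    # Count that a recommendation from this model was shown
    results[model]['shown'] += 1

def record_click(results, model):
    # Count that the user clicked on it
    results[model]['clicked'] += 1

def click_through_rate(results, model):
    # Clicks divided by impressions for one model
    shown = results[model]['shown']
    return results[model]['clicked'] / shown if shown else 0.0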
Real Example - Recommendation System
from flask import Flask, request

app = Flask(__name__)

@app.route('/recommend')
def recommend():
    user_id = request.args.get('user_id')
    # Assign model (a sticky version of assign_model is sketched below;
    # model_a, model_b and log_experiment are assumed to exist elsewhere)
    model_version = assign_model(user_id)
    if model_version == 'A':
        recommendations = model_a.predict(user_id)
    else:
        recommendations = model_b.predict(user_id)
    # Log for analysis
    log_experiment(user_id, model_version, recommendations)
    return {'recommendations': recommendations}
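One detail worth noting: if assign_model just calls random.random() on every request, the same user can bounce between models. A common fix is to hash the user ID so assignment is sticky. A minimal sketch, where the 50/50 split and the experiment name 'rec_v2_test' are placeholders:

import hashlib

def assign_model(user_id, experiment='rec_v2_test'):
    # Hash user_id + experiment name so the split is stable per user
    # and independent across experiments
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return 'A' if bucket < 50 else 'B'  # 50/50 split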
Track Metrics
import pandas as pd

# Collect results
results_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'model': ['A', 'B', 'A', 'B', 'A', 'B'],
    'clicked': [1, 1, 0, 1, 1, 1]
})

# Calculate metrics
by_model = results_df.groupby('model')['clicked'].agg(['sum', 'count', 'mean'])
print(by_model)
#        sum  count      mean
# model
# A        2      3  0.666667
# B        3      3  1.000000
Statistical Significance
from scipy import stats

# Get data for both groups
a_clicks = results_df[results_df['model'] == 'A']['clicked']
b_clicks = results_df[results_df['model'] == 'B']['clicked']

# T-test
t_stat, p_value = stats.ttest_ind(a_clicks, b_clicks)
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("✅ Result is statistically significant!")
else:
    print("❌ Not enough evidence, need more data")
Sample Size Calculation
How many users needed?
import math
from statsmodels.stats.power import zt_ind_solve_power

# Calculate required sample size per group
n = zt_ind_solve_power(
    effect_size=0.2,   # Standardized effect size (Cohen's d); 0.2 is a "small" effect
    alpha=0.05,        # Significance level
    power=0.8          # Statistical power
)
print(f"Need {math.ceil(n)} users per group")
Multi-Armed Bandit
A smarter A/B test that shifts traffic toward the better model as it learns:
import random

class EpsilonGreedy:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {'A': 0, 'B': 0}
        self.values = {'A': 0.0, 'B': 0.0}

    def select(self):
        # Sometimes explore (random choice)
        if random.random() < self.epsilon:
            return random.choice(['A', 'B'])
        # Usually exploit (best average reward so far)
        return 'A' if self.values['A'] > self.values['B'] else 'B'

    def update(self, model, reward):
        self.counts[model] += 1
        n = self.counts[model]
        # Incrementally update the average reward
        self.values[model] = ((n - 1) / n) * self.values[model] + reward / n

# Use it (users and get_reward are placeholders for your traffic and feedback)
bandit = EpsilonGreedy()
for user in users:
    model = bandit.select()
    reward = get_reward(user, model)  # e.g. 1 if clicked, 0 otherwise
    bandit.update(model, reward)
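Epsilon-greedy keeps exploring at a fixed rate forever. A common refinement, sketched here as an optional alternative rather than part of the setup above, is Thompson sampling: draw a plausible CTR for each model from a Beta posterior and serve the winner of the draw, so exploration naturally shrinks as evidence accumulates. It plugs into the same select/update loop shown above:

import random

class ThompsonSampling:
    def __init__(self):
        # Beta(1, 1) prior: observed successes and failures per model
        self.successes = {'A': 1, 'B': 1}
        self.failures = {'A': 1, 'B': 1}

    def select(self):
        # Sample a plausible CTR for each model and pick the larger draw
        samples = {
            m: random.betavariate(self.successes[m], self.failures[m])
            for m in ('A', 'B')
        }
        return max(samples, key=samples.get)

    def update(self, model, reward):
        if reward:
            self.successes[model] += 1
        else:
            self.failures[model] += 1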
Monitoring Dashboard
import pandas as pd
import matplotlib.pyplot as plt

def plot_results(results_df):
    # Click-through rate per day (assumes results_df has a 'timestamp' column)
    results_df['date'] = pd.to_datetime(results_df['timestamp']).dt.date
    for model in ['A', 'B']:
        data = results_df[results_df['model'] == model]
        ctr = data.groupby('date')['clicked'].mean()
        plt.plot(ctr, label=f'Model {model}')
    plt.legend()
    plt.title('Click-Through Rate Over Time')
    plt.xlabel('Date')
    plt.ylabel('CTR')
    plt.show()
Best Practices
- Run for at least 1-2 weeks to cover weekly usage cycles
- Need at least ~1000 users (use the sample size calculation above)
- Check multiple metrics, not just the primary one
- Watch for the novelty effect (early lifts that fade)
- Document everything
Remember
- Offline metrics ≠ online performance
- Need statistical significance
- Monitor continuously
- Consider multi-armed bandits
#AI #Intermediate #Testing