
A/B Testing for AI

Test AI models in production scientifically.

Robert Anderson
December 18, 2025

Offline benchmarks only tell part of the story. The reliable way to know whether a new model actually helps is to test it against the current one on live traffic.

What is A/B Testing?

Compare two versions to see which performs better.

Version A: Current model (baseline)
Version B: New model (challenger)

Why A/B Test AI?

Offline accuracy ≠ Real-world performance

Reasons:

  • User behavior differs
  • Data drift
  • Business metrics matter more than accuracy

Basic Setup

import random

def get_model_for_user(user_id):
    # Randomly assign 50% to each model
    if random.random() < 0.5:
        return 'model_a'  # Current model
    else:
        return 'model_b'  # New model

# Track results
results = {
    'model_a': {'shown': 0, 'clicked': 0},
    'model_b': {'shown': 0, 'clicked': 0}
}
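The assignment above is re-rolled on every request, so the same user can bounce between models across visits. In production you usually want sticky assignment; a minimal sketch using a hash of the user ID (the helper name and `split` parameter are assumptions, not part of the setup above):

```python
import hashlib

def get_sticky_model(user_id, split=0.5):
    # Hash the user ID so the same user always lands in the same bucket
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # value in [0, 1)
    return 'model_a' if bucket < split else 'model_b'
```

Hash-based bucketing also makes the experiment reproducible: re-running the analysis assigns every user to the same arm.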

Real Example - Recommendation System

from flask import Flask, request

app = Flask(__name__)

@app.route('/recommend')
def recommend():
    user_id = request.args.get('user_id')
    
    # Assign model (same helper as in the basic setup)
    model_version = get_model_for_user(user_id)
    
    if model_version == 'model_a':
        recommendations = model_a.predict(user_id)
    else:
        recommendations = model_b.predict(user_id)
    
    # Log for analysis
    log_experiment(user_id, model_version, recommendations)
    
    return {'recommendations': recommendations}

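`log_experiment` above is left undefined; one hypothetical implementation appends each event as a JSON line for later analysis (the file path and field names are assumptions):

```python
import json
import time

def log_experiment(user_id, model_version, recommendations,
                   path='experiment_log.jsonl'):
    # One JSON object per line loads easily back into pandas later
    event = {
        'timestamp': time.time(),
        'user_id': user_id,
        'model': model_version,
        'recommendations': recommendations,
    }
    with open(path, 'a') as f:
        f.write(json.dumps(event) + '\n')
```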
Track Metrics

import pandas as pd

# Collect results
results_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'model': ['A', 'B', 'A', 'B', 'A', 'B'],
    'clicked': [1, 1, 0, 1, 1, 1]
})

# Calculate metrics
by_model = results_df.groupby('model')['clicked'].agg(['sum', 'count', 'mean'])

print(by_model)
#        sum  count      mean
# model
# A        2      3  0.666667
# B        3      3  1.000000

Statistical Significance

from scipy import stats

# Get data for both groups
a_clicks = results_df[results_df['model'] == 'A']['clicked']
b_clicks = results_df[results_df['model'] == 'B']['clicked']

# T-test
t_stat, p_value = stats.ttest_ind(a_clicks, b_clicks)

print(f"P-value: {p_value}")

if p_value < 0.05:
    print("✅ Result is statistically significant!")
else:
    print("❌ Not enough evidence, need more data")
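Because clicks are binary, a two-proportion z-test is the more standard choice than a t-test once you have real traffic volumes. A sketch using statsmodels (the counts are made up for illustration):

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts, far larger than the toy DataFrame above
clicks = [120, 160]        # model A, model B
impressions = [1000, 1000]

z_stat, p_value = proportions_ztest(clicks, impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```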

Sample Size Calculation

How many users needed?

from statsmodels.stats.power import zt_ind_solve_power

# Calculate required sample size
n = zt_ind_solve_power(
    effect_size=0.2,   # Standardized (Cohen's) effect size, not raw lift
    alpha=0.05,        # Significance level
    power=0.8          # Statistical power
)

print(f"Need {int(n)} users per group")
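To plan around concrete click rates instead of an abstract effect size, you can convert hypothetical baseline and target CTRs (10% → 12% here is an assumption) into Cohen's h with `proportion_effectsize`:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import zt_ind_solve_power

h = proportion_effectsize(0.12, 0.10)  # Cohen's h for 10% -> 12% CTR
n = zt_ind_solve_power(effect_size=h, alpha=0.05, power=0.8)
print(f"Need {int(n)} users per group")
```

Small absolute lifts on small base rates translate to tiny effect sizes, which is why CTR experiments often need thousands of users per arm.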

Multi-Armed Bandit

Smart A/B test that learns:

import random

class EpsilonGreedy:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {'A': 0, 'B': 0}
        self.values = {'A': 0, 'B': 0}
    
    def select(self):
        # Sometimes explore (random)
        if random.random() < self.epsilon:
            return random.choice(['A', 'B'])
        
        # Usually exploit (best so far)
        return 'A' if self.values['A'] > self.values['B'] else 'B'
    
    def update(self, model, reward):
        self.counts[model] += 1
        n = self.counts[model]
        
        # Update average reward
        self.values[model] = ((n - 1) / n) * self.values[model] + reward / n

# Try it on simulated traffic (hypothetical click-through rates)
random.seed(0)
true_ctr = {'A': 0.10, 'B': 0.15}

bandit = EpsilonGreedy()

for _ in range(10_000):
    model = bandit.select()
    reward = 1 if random.random() < true_ctr[model] else 0
    bandit.update(model, reward)

print(bandit.counts)  # most traffic ends up on the better arm

Monitoring Dashboard

import matplotlib.pyplot as plt

def plot_results(results_df):
    # Click-through rate per calendar day
    results_df['date'] = pd.to_datetime(results_df['timestamp']).dt.date
    
    for model in ['A', 'B']:
        data = results_df[results_df['model'] == model]
        ctr = data.groupby('date')['clicked'].mean()
        plt.plot(ctr, label=f'Model {model}')
    
    plt.legend()
    plt.title('Click-Through Rate Over Time')
    plt.xlabel('Date')
    plt.ylabel('CTR')
    plt.show()

Best Practices

  1. Run for at least 1-2 weeks
  2. Need 1000+ users minimum
  3. Check multiple metrics
  4. Watch for novelty effect
  5. Document everything

Remember

  • Offline metrics ≠ online performance
  • Need statistical significance
  • Monitor continuously
  • Consider multi-armed bandits
#AI #Intermediate #Testing