AI · 6 min read

A/B Testing for AI

Test AI models in production scientifically.

Robert Anderson
December 18, 2025

Offline benchmarks only tell part of the story. To know whether a new model is actually better, test it properly with real users.

What is A/B Testing?

A/B testing compares two versions with real users to see which one performs better.

- **Version A**: Current model (baseline)
- **Version B**: New model (challenger)

Why A/B Test AI?

Offline accuracy ≠ Real-world performance

**Reasons**:

- User behavior in production differs from offline test data
- Data drifts over time
- Business metrics matter more than raw accuracy

Basic Setup

```python
import random

def get_model_for_user(user_id):
    # Randomly assign 50% of users to each model
    if random.random() < 0.5:
        return 'model_a'  # Current model
    else:
        return 'model_b'  # New model

# Track results
results = {
    'model_a': {'shown': 0, 'clicked': 0},
    'model_b': {'shown': 0, 'clicked': 0}
}
```
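One detail the basic setup glosses over: purely random assignment per request means the same user can bounce between models on every visit. A common fix is to hash the user ID so each user lands in the same bucket consistently. A minimal sketch, assuming any stable hash will do:

```python
import hashlib

def get_model_for_user(user_id):
    # Hash the user ID into 0-99 so assignment is stable across requests
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    if bucket < 50:
        return 'model_a'  # Current model
    else:
        return 'model_b'  # New model
```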

Real Example - Recommendation System

```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/recommend')
def recommend():
    user_id = request.args.get('user_id')

    # Assign model
    model_version = assign_model(user_id)

    if model_version == 'A':
        recommendations = model_a.predict(user_id)
    else:
        recommendations = model_b.predict(user_id)

    # Log for analysis
    log_experiment(user_id, model_version, recommendations)

    return {'recommendations': recommendations}
```
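The `assign_model` and `log_experiment` helpers are assumed here rather than defined. `assign_model` could reuse the hash-based split sketched above (returning `'A'` or `'B'`), and `log_experiment` might simply append one record per request; a minimal sketch, with an illustrative file name and field set:

```python
import json
import time

def log_experiment(user_id, model_version, recommendations):
    # Append one JSON line per request so results can be analyzed later
    record = {
        'timestamp': time.time(),
        'user_id': user_id,
        'model': model_version,
        'recommendations': recommendations,
    }
    with open('experiment_log.jsonl', 'a') as f:
        f.write(json.dumps(record) + '\n')
```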

Track Metrics

```python
import pandas as pd

# Collect results
results_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'model': ['A', 'B', 'A', 'B', 'A', 'B'],
    'clicked': [1, 1, 0, 1, 1, 1]
})

# Calculate metrics by model
by_model = results_df.groupby('model')['clicked'].agg(['sum', 'count', 'mean'])

print(by_model)
#        sum  count  mean
# model
# A        2      3  0.67
# B        3      3  1.00
```

Statistical Significance

```python
from scipy import stats

# Get click data for both groups
a_clicks = results_df[results_df['model'] == 'A']['clicked']
b_clicks = results_df[results_df['model'] == 'B']['clicked']

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(a_clicks, b_clicks)

print(f"P-value: {p_value}")

if p_value < 0.05:
    print("✅ Result is statistically significant!")
else:
    print("❌ Not enough evidence, need more data")
```
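Since clicks are binary (0/1), a two-proportion z-test is arguably a better fit than a t-test; with larger samples the two give similar answers. A sketch of the same check using statsmodels (assuming it is installed):

```python
from statsmodels.stats.proportion import proportions_ztest

# Total clicks and total impressions for each model
clicks = [a_clicks.sum(), b_clicks.sum()]
impressions = [len(a_clicks), len(b_clicks)]

# Two-proportion z-test on the click-through rates
z_stat, p_value = proportions_ztest(clicks, impressions)
print(f"P-value: {p_value:.3f}")
```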

Sample Size Calculation

How many users do you need?

```python
from statsmodels.stats.power import zt_ind_solve_power

# Calculate required sample size per group
n = zt_ind_solve_power(
    effect_size=0.2,  # Expected improvement
    alpha=0.05,       # Significance level
    power=0.8         # Statistical power
)

print(f"Need {int(n)} users per group")
```

Multi-Armed Bandit

A multi-armed bandit is a smarter A/B test that shifts traffic toward the better model as it learns:

```python
import random

class EpsilonGreedy:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {'A': 0, 'B': 0}
        self.values = {'A': 0, 'B': 0}

    def select(self):
        # Sometimes explore (random choice)
        if random.random() < self.epsilon:
            return random.choice(['A', 'B'])
        # Usually exploit (best average reward so far)
        return 'A' if self.values['A'] > self.values['B'] else 'B'

    def update(self, model, reward):
        self.counts[model] += 1
        n = self.counts[model]
        # Update the running average reward
        self.values[model] = ((n - 1) / n) * self.values[model] + reward / n

# Use it
bandit = EpsilonGreedy()

for user in users:
    model = bandit.select()
    reward = get_reward(user, model)
    bandit.update(model, reward)
```
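Epsilon-greedy is the simplest bandit strategy. Thompson sampling is another popular option: it samples a plausible click-through rate for each model and routes traffic to whichever sample wins, so exploration fades naturally as evidence accumulates. A minimal Beta-Bernoulli sketch, not tied to any particular library:

```python
import numpy as np

class ThompsonSampling:
    def __init__(self):
        # Beta(1, 1) priors: success/failure counts per model
        self.successes = {'A': 1, 'B': 1}
        self.failures = {'A': 1, 'B': 1}

    def select(self):
        # Sample a plausible CTR for each model, pick the higher draw
        samples = {
            m: np.random.beta(self.successes[m], self.failures[m])
            for m in ('A', 'B')
        }
        return max(samples, key=samples.get)

    def update(self, model, reward):
        # Reward is 1 for a click, 0 otherwise
        if reward:
            self.successes[model] += 1
        else:
            self.failures[model] += 1
```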

Monitoring Dashboard

```python
import matplotlib.pyplot as plt

def plot_results(results_df):
    # Click-through rate over time
    results_df['date'] = pd.to_datetime(results_df['timestamp'])

    for model in ['A', 'B']:
        data = results_df[results_df['model'] == model]
        ctr = data.groupby('date')['clicked'].mean()
        plt.plot(ctr, label=f'Model {model}')

    plt.legend()
    plt.title('Click-Through Rate Over Time')
    plt.xlabel('Date')
    plt.ylabel('CTR')
    plt.show()
```

Best Practices

1. Run the test for at least 1-2 weeks
2. Need 1,000+ users minimum (a quick check for points 1-2 is sketched below)
3. Check multiple metrics, not just one
4. Watch for the novelty effect
5. Document everything
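As a rough illustration of the first two practices, you can gate the significance test behind a readiness check. This assumes the results DataFrame carries a `timestamp` column as in the dashboard example; the thresholds are the rules of thumb above, not hard requirements:

```python
def experiment_is_ready(results_df, min_days=14, min_users_per_group=1000):
    # Has the test run long enough, and does each group have enough users?
    dates = pd.to_datetime(results_df['timestamp'])
    days_running = (dates.max() - dates.min()).days
    group_sizes = results_df.groupby('model')['user_id'].nunique()
    return days_running >= min_days and group_sizes.min() >= min_users_per_group
```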

Remember

- Offline metrics ≠ online performance
- Need statistical significance
- Monitor continuously
- Consider multi-armed bandits

#AI #Intermediate #Testing