Every A/B testing tutorial assumes you have Netflix-scale traffic. "Just run the test until you reach statistical significance with p < 0.05." Great. With 1,500 DAU and a 5% baseline conversion rate, reaching significance for a 10% relative improvement requires 30,000 users per variant. That's 40 days. By then, three other things have changed and your test is contaminated anyway.
We needed a different approach. We needed to make decisions with 200-500 observations, not 30,000. Bayesian methods let us do this — not by lowering our standards, but by asking a fundamentally different question.
## Frequentist vs Bayesian: the actual difference
Frequentist A/B testing asks: "If there were no real difference between A and B, how likely is it that we'd see data this extreme?" That's the p-value. It's a statement about hypothetical repeated experiments, not about the probability that B is better than A.
Bayesian A/B testing asks: "Given the data we've observed, what is the probability that B is better than A?" This is the question you actually care about. And it can be answered with far fewer observations, because the answer is a probability statement about the conversion rates themselves, not a long-run error rate, so you don't need the sample sizes a frequentist power calculation demands.
## The Beta-Binomial model
For conversion rate tests (the most common type in consumer products), the math is surprisingly simple. Each variant's conversion rate is modeled as a Beta distribution. You start with a prior (we use Beta(1,1) — uniform, meaning "we know nothing"), observe successes and failures, and the posterior updates automatically.
```python
from scipy import stats

# Observed data
a_successes, a_trials = 47, 312  # variant A: 15.1%
b_successes, b_trials = 68, 308  # variant B: 22.1%

# Posterior distributions (Beta(1,1) prior + binomial likelihood)
a_posterior = stats.beta(1 + a_successes, 1 + a_trials - a_successes)
b_posterior = stats.beta(1 + b_successes, 1 + b_trials - b_successes)

# P(B > A) via Monte Carlo
samples = 100_000
a_samples = a_posterior.rvs(samples)
b_samples = b_posterior.rvs(samples)
p_b_wins = (b_samples > a_samples).mean()

# Result: P(B > A) = 0.987
# Decision: ship B (threshold: 0.95)
```
With just ~300 observations per variant, we get P(B > A) = 98.7%. That's not a frequentist "98.7% confidence level," which would mean something different (and more confusing). It literally means: given the data, there's a 98.7% probability that B converts better than A. That's a decision-quality answer.
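Monte Carlo estimates jitter slightly from run to run. If you want a deterministic number, the same probability can be computed by numerically integrating the product of B's posterior density and A's posterior CDF — a sketch using `scipy.integrate.quad`, with the posterior parameters from the example above:

```python
from scipy import stats
from scipy.integrate import quad

# Posteriors from the example: Beta(1 + successes, 1 + failures)
a_posterior = stats.beta(1 + 47, 1 + 312 - 47)
b_posterior = stats.beta(1 + 68, 1 + 308 - 68)

# P(B > A) = integral over [0, 1] of f_B(x) * F_A(x) dx
p_b_wins, _abserr = quad(lambda x: b_posterior.pdf(x) * a_posterior.cdf(x), 0, 1)
print(f"P(B > A) = {p_b_wins:.3f}")
```

Same answer as the sampling version, no random seed involved.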
## The decision framework
We don't use a single threshold. We use a decision matrix based on the magnitude of the difference and the confidence level:
| P(B > A) | Expected lift | Decision |
|---|---|---|
| > 95% | > 20% | Ship immediately |
| > 95% | 5-20% | Ship, monitor for 7 days |
| 80-95% | > 20% | Continue test, likely winner |
| 80-95% | 5-20% | Continue test, need more data |
| 50-80% | any | Inconclusive — redesign test |
| < 50% | any | A wins or no difference |
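The matrix drops straight into code. A minimal sketch (the function name is ours; thresholds mirror the table, with `lift` as the expected relative lift, e.g. `0.20` for 20%):

```python
def decide(p_b_wins, lift):
    """Map (P(B > A), expected relative lift) to an action per the decision matrix."""
    if p_b_wins > 0.95:
        return "ship immediately" if lift > 0.20 else "ship, monitor 7 days"
    if p_b_wins > 0.80:
        return "continue, likely winner" if lift > 0.20 else "continue, need more data"
    if p_b_wins > 0.50:
        return "inconclusive, redesign test"
    return "A wins or no difference"

print(decide(0.987, 0.46))  # the worked example above: ship immediately
```

Expected lift comes from the same posterior samples: `((b_samples - a_samples) / a_samples).mean()`.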
The key insight: if P(B > A) is stuck between 50-80% after 500 observations, the test is almost certainly underpowered for the effect size. The delta is too small to detect at our scale, and at our scale that means it's too small to matter. Kill the test and run a bigger swing.
## Sequential testing: checking every day without inflating error
The classic frequentist problem: if you check your test daily and stop when p < 0.05, your actual false positive rate inflates to 20-30%, because every peek is another chance to stop on noise (the optional-stopping problem). Bayesian methods don't have this problem. The posterior probability is valid at any point. You can check daily, hourly, or continuously.
We check every morning. If P(B > A) > 95% and the expected lift is meaningful, we ship. If it's been 7 days and we're stuck in the 60-80% range, we kill the test. This "always valid" property of Bayesian testing is what makes it practical for small teams. You don't need to pre-commit to a sample size. You don't need to resist the urge to peek. You just look at the posterior and make a decision.
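That daily loop is easy to simulate end to end. A sketch under made-up assumptions — true rates of 5% (A) vs 9% (B), 50 users per variant per day, a 7-day cap — none of which come from our real experiments:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def p_b_wins(a_s, a_n, b_s, b_n, draws=100_000):
    """P(B > A) under Beta(1,1) priors, via Monte Carlo."""
    a = stats.beta(1 + a_s, 1 + a_n - a_s).rvs(draws, random_state=rng)
    b = stats.beta(1 + b_s, 1 + b_n - b_s).rvs(draws, random_state=rng)
    return (b > a).mean()

a_s = a_n = b_s = b_n = 0
decision = "kill (7 days, still unclear)"
for day in range(1, 8):                        # one check per morning
    a_s += rng.binomial(50, 0.05); a_n += 50   # variant A: true rate 5%
    b_s += rng.binomial(50, 0.09); b_n += 50   # variant B: true rate 9%
    p = p_b_wins(a_s, a_n, b_s, b_n)
    if p > 0.95:                               # always-valid: stop on any day
        decision = f"ship B on day {day} (P={p:.2f})"
        break
print(decision)
```

Stopping the moment the threshold is crossed is exactly the behavior that would wreck a frequentist test, and exactly what the posterior lets you do.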
## Thompson Sampling for multi-variant tests
When we test more than 2 variants (which we do for pricing and paywall experiments), we use Thompson Sampling to allocate traffic. Instead of splitting traffic equally, each new user is assigned to the variant that a random sample from the posterior suggests is best.
In practice: variant B's posterior suggests it converts at 18% ± 3%. Variant C suggests 22% ± 5%. Variant D suggests 14% ± 4%. On each new user, we sample from each posterior and assign the user to whichever sample is highest. Over time, the best-performing variant naturally gets more traffic, which means it converges faster and we lose less revenue to bad variants.
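A sketch of that assignment rule. The posterior parameters below are illustrative, chosen to roughly match the 18%/22%/14% figures above, not taken from a real experiment:

```python
import numpy as np

rng = np.random.default_rng(42)

# Per-variant posterior Beta parameters: (1 + successes, 1 + failures)
posteriors = {
    "B": (19, 83),   # ~18% conversion
    "C": (23, 79),   # ~22% conversion
    "D": (15, 87),   # ~14% conversion
}

def assign_variant():
    """Thompson Sampling: one draw per posterior, assign to the highest draw."""
    draws = {name: rng.beta(a, b) for name, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

# Over many users, traffic concentrates on the likely winner (C)
counts = {name: 0 for name in posteriors}
for _ in range(10_000):
    counts[assign_variant()] += 1
print(counts)
```

Note that no variant's traffic ever drops to zero: while the posteriors overlap, weaker variants still get occasional draws, which is what keeps the algorithm from locking in early on noise.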
## What this looks like in production
Our testing pipeline runs as a daily cron job. It pulls the latest event data, updates the posteriors for all active experiments, computes P(B > A) and expected lift for each, and posts a summary to our team channel. The summary looks like this:
```
paywall-v3   day 4/7   P(B>A)=0.94   lift=+18%   → trending, 1 more day
trial-3day   day 6/7   P(B>A)=0.98   lift=+22%   → SHIP
cta-green    day 5/7   P(B>A)=0.61   lift=+3%    → inconclusive, kill
price-899    day 3/7   P(B>A)=0.88   lift=+31%   → promising, continue
```
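The report lines are cheap to generate. A minimal sketch of the formatter — the verdict thresholds here are simplified relative to the full decision matrix, and the function name is ours:

```python
def summary_line(name, day, total_days, p_b, lift):
    """One experiment -> one line for the daily team-channel summary."""
    if p_b > 0.95:
        verdict = "SHIP"
    elif p_b > 0.80:
        verdict = "promising, continue"
    elif day >= total_days:
        verdict = "inconclusive, kill"
    else:
        verdict = "inconclusive, keep watching"
    return f"{name:<12} day {day}/{total_days} P(B>A)={p_b:.2f} lift={lift:+.0%} -> {verdict}"

print(summary_line("trial-3day", 6, 7, 0.98, 0.22))
```

The cron job just maps this over every active experiment and posts the joined lines.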
No dashboards to check. No stats to interpret. The system tells us what to ship, what to kill, and what to keep running. The entire experimentation workflow takes 5 minutes of human attention per day.
If you're running a product with under 10,000 DAU and you're not experimenting because "we don't have enough traffic for A/B tests," you're using the wrong framework. Switch to Bayesian. Run bigger swings. Check daily. Ship or kill within a week. The cost of a wrong decision at small scale is low. The cost of not deciding is high.