Every A/B testing tutorial assumes you have Netflix-scale traffic. "Just run the test until you reach statistical significance with p < 0.05." Great. With 1,500 DAU and a 5% baseline conversion rate, reaching significance for a 10% relative improvement requires 30,000 users per variant. That's 40 days. By then, three other things have changed and your test is contaminated anyway.
We needed a different approach. We needed to make decisions with 200-500 observations, not 30,000. Bayesian methods let us do this — not by lowering our standards, but by asking a fundamentally different question.
## Frequentist vs Bayesian: the actual difference
Frequentist A/B testing asks: "If there were no real difference between A and B, how likely is it that we'd see data this extreme?" That's the p-value. It's a statement about hypothetical repeated experiments, not about the probability that B is better than A.
Bayesian A/B testing asks: "Given the data we've observed, what is the probability that B is better than A?" This is the question you actually care about. And it can be answered with far fewer observations, because the answer is a probability statement about the conversion rates themselves, not a long-run error rate, so you don't need the sample sizes a frequentist power calculation demands.
## The Beta-Binomial model
For conversion rate tests (the most common type in consumer products), the math is surprisingly simple. Each variant's conversion rate is modeled as a Beta distribution. You start with a prior (we use Beta(1,1) — uniform, meaning "we know nothing"), observe successes and failures, and the posterior updates automatically.
```python
from scipy import stats

# Observed data
a_successes, a_trials = 47, 312  # variant A: 15.1%
b_successes, b_trials = 68, 308  # variant B: 22.1%

# Posterior distributions (Beta(1,1) prior + binomial likelihood)
a_posterior = stats.beta(1 + a_successes, 1 + a_trials - a_successes)
b_posterior = stats.beta(1 + b_successes, 1 + b_trials - b_successes)

# P(B > A) via Monte Carlo
samples = 100_000
a_samples = a_posterior.rvs(samples)
b_samples = b_posterior.rvs(samples)
p_b_wins = (b_samples > a_samples).mean()

# Result: P(B > A) = 0.987
# Decision: ship B (threshold: 0.95)
```
With just ~300 observations per variant, we get P(B > A) = 98.7%. That's not a frequentist "98.7% confidence level," which would mean something different (and more confusing). It literally means: given the data, there's a 98.7% probability that B converts better than A. That's a decision-quality answer.
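Monte Carlo estimates jitter slightly from run to run. If you want a deterministic number, the same probability can be computed by numerically integrating the product of B's posterior density and A's posterior CDF — a sketch using `scipy.integrate.quad`, with the posterior parameters from the example above:

```python
from scipy import stats
from scipy.integrate import quad

# Posteriors from the example: Beta(1 + successes, 1 + failures)
a_posterior = stats.beta(1 + 47, 1 + 312 - 47)
b_posterior = stats.beta(1 + 68, 1 + 308 - 68)

# P(B > A) = integral over [0, 1] of f_B(x) * F_A(x) dx
p_b_wins, _abserr = quad(lambda x: b_posterior.pdf(x) * a_posterior.cdf(x), 0, 1)
print(f"P(B > A) = {p_b_wins:.3f}")
```

Same answer as the sampling version, no random seed involved.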
## The decision framework
We don't use a single threshold. We use a decision matrix based on the magnitude of the difference and the confidence level:
| P(B > A) | Expected lift | Decision |
|---|---|---|
| > 95% | > 20% | Ship immediately |
| > 95% | 5-20% | Ship, monitor for 7 days |
| 80-95% | > 20% | Continue test, likely winner |
| 80-95% | 5-20% | Continue test, need more data |
| 50-80% | any | Inconclusive — redesign test |
| < 50% | any | A wins or no difference |
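The matrix drops straight into code. A minimal sketch (the function name is ours; thresholds mirror the table, with `lift` as the expected relative lift, e.g. `0.20` for 20%):

```python
def decide(p_b_wins, lift):
    """Map (P(B > A), expected relative lift) to an action per the decision matrix."""
    if p_b_wins > 0.95:
        return "ship immediately" if lift > 0.20 else "ship, monitor 7 days"
    if p_b_wins > 0.80:
        return "continue, likely winner" if lift > 0.20 else "continue, need more data"
    if p_b_wins > 0.50:
        return "inconclusive, redesign test"
    return "A wins or no difference"

print(decide(0.987, 0.46))  # the worked example above: ship immediately
```

Expected lift comes from the same posterior samples: `((b_samples - a_samples) / a_samples).mean()`.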
The key insight: if P(B > A) is stuck between 50-80% after 500 observations, the test is almost certainly underpowered for the effect size. The delta is too small to detect at our scale, and at our scale that means it's too small to matter. Kill the test and run a bigger swing.
## Sequential testing: checking every day without inflating error
The classic frequentist problem: if you check your test daily and stop when p < 0.05, your actual false positive rate inflates to 20-30%, because every peek is another chance to stop on noise (the optional-stopping problem). Bayesian methods don't have this problem. The posterior probability is valid at any point. You can check daily, hourly, or continuously.
We check every morning. If P(B > A) > 95% and the expected lift is meaningful, we ship. If it's been 7 days and we're stuck in the 60-80% range, we kill the test. This "always valid" property of Bayesian testing is what makes it practical for small teams. You don't need to pre-commit to a sample size. You don't need to resist the urge to peek. You just look at the posterior and make a decision.
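That daily loop is easy to simulate end to end. A sketch under made-up assumptions — true rates of 5% (A) vs 9% (B), 50 users per variant per day, a 7-day cap — none of which come from our real experiments:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def p_b_wins(a_s, a_n, b_s, b_n, draws=100_000):
    """P(B > A) under Beta(1,1) priors, via Monte Carlo."""
    a = stats.beta(1 + a_s, 1 + a_n - a_s).rvs(draws, random_state=rng)
    b = stats.beta(1 + b_s, 1 + b_n - b_s).rvs(draws, random_state=rng)
    return (b > a).mean()

a_s = a_n = b_s = b_n = 0
decision = "kill (7 days, still unclear)"
for day in range(1, 8):                        # one check per morning
    a_s += rng.binomial(50, 0.05); a_n += 50   # variant A: true rate 5%
    b_s += rng.binomial(50, 0.09); b_n += 50   # variant B: true rate 9%
    p = p_b_wins(a_s, a_n, b_s, b_n)
    if p > 0.95:                               # always-valid: stop on any day
        decision = f"ship B on day {day} (P={p:.2f})"
        break
print(decision)
```

Stopping the moment the threshold is crossed is exactly the behavior that would wreck a frequentist test, and exactly what the posterior lets you do.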
## Thompson Sampling for multi-variant tests
When we test more than 2 variants (which we do for pricing and paywall experiments), we use Thompson Sampling to allocate traffic. Instead of splitting traffic equally, each new user is assigned to the variant that a random sample from the posterior suggests is best.
In practice: variant B's posterior suggests it converts at 18% ± 3%. Variant C suggests 22% ± 5%. Variant D suggests 14% ± 4%. On each new user, we sample from each posterior and assign the user to whichever sample is highest. Over time, the best-performing variant naturally gets more traffic, which means it converges faster and we lose less revenue to bad variants.
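A sketch of that assignment rule. The posterior parameters below are illustrative, chosen to roughly match the 18%/22%/14% figures above, not taken from a real experiment:

```python
import numpy as np

rng = np.random.default_rng(42)

# Per-variant posterior Beta parameters: (1 + successes, 1 + failures)
posteriors = {
    "B": (19, 83),   # ~18% conversion
    "C": (23, 79),   # ~22% conversion
    "D": (15, 87),   # ~14% conversion
}

def assign_variant():
    """Thompson Sampling: one draw per posterior, assign to the highest draw."""
    draws = {name: rng.beta(a, b) for name, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

# Over many users, traffic concentrates on the likely winner (C)
counts = {name: 0 for name in posteriors}
for _ in range(10_000):
    counts[assign_variant()] += 1
print(counts)
```

Note that no variant's traffic ever drops to zero: while the posteriors overlap, weaker variants still get occasional draws, which is what keeps the algorithm from locking in early on noise.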
## What this looks like in production
Our testing pipeline runs as a daily cron job. It pulls the latest event data, updates the posteriors for all active experiments, computes P(B > A) and expected lift for each, and posts a summary to our team channel. The summary looks like this:
```
paywall-v3   day 4/7   P(B>A)=0.94   lift=+18%   → trending, 1 more day
trial-3day   day 6/7   P(B>A)=0.98   lift=+22%   → SHIP
cta-green    day 5/7   P(B>A)=0.61   lift=+3%    → inconclusive, kill
price-899    day 3/7   P(B>A)=0.88   lift=+31%   → promising, continue
```
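The report lines are cheap to generate. A minimal sketch of the formatter — the verdict thresholds here are simplified relative to the full decision matrix, and the function name is ours:

```python
def summary_line(name, day, total_days, p_b, lift):
    """One experiment -> one line for the daily team-channel summary."""
    if p_b > 0.95:
        verdict = "SHIP"
    elif p_b > 0.80:
        verdict = "promising, continue"
    elif day >= total_days:
        verdict = "inconclusive, kill"
    else:
        verdict = "inconclusive, keep watching"
    return f"{name:<12} day {day}/{total_days} P(B>A)={p_b:.2f} lift={lift:+.0%} -> {verdict}"

print(summary_line("trial-3day", 6, 7, 0.98, 0.22))
```

The cron job just maps this over every active experiment and posts the joined lines.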
No dashboards to check. No stats to interpret. The system tells us what to ship, what to kill, and what to keep running. The entire experimentation workflow takes 5 minutes of human attention per day.
If you're running a product with under 10,000 DAU and you're not experimenting because "we don't have enough traffic for A/B tests," you're using the wrong framework. Switch to Bayesian. Run bigger swings. Check daily. Ship or kill within a week. The cost of a wrong decision at small scale is low. The cost of not deciding is high.