A/B Testing: Statistical Significance Explained
Statistical significance tells you whether an A/B test result is likely real or just noise. Here is what p-values, confidence intervals, and sample sizes really mean — without the textbook headache.
- 1. What "statistical significance" actually means
- 2. The 5% threshold (and why some teams use 1%)
- 3. Sample size: the part everyone underestimates
- 4. The peeking problem (and why it inflates false positives)
- 5. Confidence intervals beat p-values
- 6. Common mistakes that produce false wins
- 7. Bayesian A/B testing: the modern alternative
- 8. When A/B testing doesn't fit
- Next steps
Statistical significance answers a single question: is the difference I see between my A and B variants likely to be real, or is it just random noise? The conventional bar is a p-value below 0.05, which means that if there were truly no difference between the variants, a gap at least this large would show up less than 5% of the time. That bar is widely abused, frequently misunderstood, and yet still essential. This article explains p-values, confidence intervals, sample sizes, and the most common A/B testing mistakes, then shows how to use the A/B Test Significance Calculator to check your tests before shipping.
The cardinal rule: Decide your sample size before you start the test. Peeking and stopping early is the single most common way to declare false wins.
1. What "statistical significance" actually means
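Informally, statistical significance is often described as the probability that the difference you observed between two variants is not due to random chance. That shorthand is convenient, but it is not quite what a p-value measures.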
The technical statement is more constrained: the p-value is the probability of seeing a difference at least as large as the one observed, assuming there is actually no difference (the null hypothesis is true). A p-value of 0.03 doesn't mean "97% chance the new variant is better." It means "if there were no real effect, we'd see a difference this large only 3% of the time."
The practical interpretation for product teams: if your p-value is below 0.05, you can reasonably act on the result. Below 0.01, you can act confidently. Between 0.05 and 0.10, you have a signal worth keeping but not yet shipping.
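To make the definition concrete, here is a minimal sketch of how a p-value for a conversion test can be computed with a pooled two-proportion z-test. It uses only the Python standard library; the function name and the visitor counts are hypothetical, and the calculator referenced in this article may use a different method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis, the best estimate of the shared rate is the pooled one.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large if there were no real difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 500 of 10,000 visitors convert on A, 570 of 10,000 on B.
p_value = two_proportion_p_value(500, 10_000, 570, 10_000)
print(f"p-value: {p_value:.3f}")
```

With these made-up counts the result falls below 0.05, so it would clear the default bar but not the stricter 0.01 one. The sketch assumes at least one conversion overall; with zero conversions the standard error is zero and the division fails.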
2. The 5% threshold (and why some teams use 1%)
The 95% confidence (p < 0.05) bar comes from Ronald Fisher in the 1920s. It is a convention, not a law of nature.
| Confidence level | p-value | When to use |
|---|---|---|
| 90% | < 0.10 | Low-stakes UX changes, button colors |
| 95% | < 0.05 | Default for most product A/B tests |
| 99% | < 0.01 | Pricing changes, checkout flow, anything revenue-critical |
| 99.9% | < 0.001 | Algorithmic changes that affect everyone permanently |
Higher confidence requires larger samples, which means longer tests. Pick the level appropriate to the cost of being wrong.
3. Sample size: the part everyone underestimates
The single biggest source of A/B testing failure is underpowered tests — running with too few users to detect the effect you're hoping to see.
Required sample size depends on:
- Baseline conversion rate — lower baseline needs more sample
- Minimum detectable effect (MDE) — the smallest lift you care about
- Confidence level (1 − α, usually 95%)
- Statistical power (1 − β, usually 80%)
Rough sample sizes per variant at baseline 5%, 95% confidence (two-sided), 80% power:
| Minimum detectable lift | Sample needed per variant |
|---|---|
| 5% relative lift | ~122,000 |
| 10% relative lift | ~31,000 |
| 20% relative lift | ~8,200 |
| 50% relative lift | ~1,500 |
| 100% relative lift | ~430 |
If you are getting 1,000 visitors per variant per week and your baseline conversion is 5%, you can only reliably detect a ~30% relative lift in a 4-week test. Trying to declare a 10% lift "significant" with that sample is statistical fiction.
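The traffic example above can be checked with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: a two-sided test, equal traffic per variant, and an unpooled variance. The function name is hypothetical, and other calculators use slightly different formulas, so expect small differences.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline, 30% relative lift (5.0% -> 6.5%).
needed = sample_size_per_variant(baseline=0.05, relative_lift=0.30)
print(f"needed per variant: {needed}")
```

The result lands a little under 4,000 per variant, which is what four weeks at 1,000 visitors per variant per week provides. Halving the lift you want to detect roughly quadruples the requirement, because the lift enters the formula squared.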
4. The peeking problem (and why it inflates false positives)
If you check your test every day and stop the first time it crosses significance, you will get false positives constantly. The math: with a true effect of zero and daily peeking over 30 days, you have a roughly 30% chance of crossing the 0.05 line at least once through random fluctuation alone, about six times the nominal 5%.
The fixes:
- Pre-register the sample size and don't look until you hit it
- Use sequential testing methods designed for continuous monitoring (Bayesian methods, AGILE, Group Sequential Testing)
- Use a stricter threshold if you must peek (e.g., require p < 0.005 if you check daily)
Most teams don't use sophisticated methods. The cleanest practice is: compute sample size up front, run to that sample, then check once.
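The inflation from peeking is easy to see in a small simulation. The sketch below is an idealized model, not a replay of any real test: it assumes equal traffic every day and treats the running z-statistic as a normalized random walk under the null hypothesis. The function name and parameters are hypothetical. The exact figure depends on those assumptions; the point is how far it lands above the nominal 5%.

```python
import random
from math import sqrt


def false_positive_rate(looks, simulations=5000, z_crit=1.96, seed=0):
    """Share of no-effect tests that cross |z| > z_crit at any of `looks` checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        for day in range(1, looks + 1):
            # Each day adds an equal batch of data; under the null its
            # standardized contribution is a standard normal draw.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(day)  # z-statistic on all data so far
            if abs(z) > z_crit:
                hits += 1
                break  # the team stops here and declares a winner
    return hits / simulations


single_look = false_positive_rate(looks=1)
daily_peeks = false_positive_rate(looks=30)
print(f"one look: {single_look:.1%}, 30 daily looks: {daily_peeks:.1%}")
```

Checking once at the end keeps the false positive rate near 5%. Checking daily and stopping at the first crossing pushes it several times higher, even though every simulated test has no real effect.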
5. Confidence intervals beat p-values
A confidence interval gives you a range for the likely true effect, not just a yes/no on significance. Two examples:
- "Variant B converts 5.2% (95% CI: 4.8%–5.6%) vs Variant A 5.0% (95% CI: 4.6%–5.4%)" — overlap → no clear winner
- "Variant B converts 6.5% (95% CI: 6.1%–6.9%) vs Variant A 5.0% (95% CI: 4.6%–5.4%)" — no overlap → clear winner
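Reporting confidence intervals forces you to internalize the uncertainty. P-values alone hide it. One caveat on the overlap rule: non-overlapping intervals do imply a significant difference, but overlapping intervals do not by themselves rule one out. The sharper check is a confidence interval on the difference between the two rates, which is significant when it excludes zero.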
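A confidence interval on the difference between the two rates is often more direct than comparing two separate intervals. The sketch below uses the simple Wald (normal-approximation) interval; the function name is hypothetical, and the counts are invented to roughly match the two examples above at about 11,400 visitors per variant.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate_b - rate_a)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Hypothetical counts approximating the two examples above.
small_lift = diff_confidence_interval(570, 11_400, 593, 11_400)  # 5.0% vs 5.2%
large_lift = diff_confidence_interval(570, 11_400, 741, 11_400)  # 5.0% vs 6.5%
print(f"small lift: {small_lift[0]:+.4f} to {small_lift[1]:+.4f}")
print(f"large lift: {large_lift[0]:+.4f} to {large_lift[1]:+.4f}")
```

The first interval straddles zero, so the data are compatible with no lift at all. The second sits entirely above zero, and its lower bound tells you the smallest lift the data plausibly support.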
6. Common mistakes that produce false wins
1. Running multiple tests simultaneously without correction
If you run 20 A/B tests at 95% confidence simultaneously, on average 1 of them will be a false positive even if every variant is actually identical to control. Apply a Bonferroni correction (divide your threshold by the number of tests) or use a false discovery rate (FDR) procedure such as Benjamini-Hochberg, which is less conservative.
2. Slicing by segment after the test
"It wasn't significant overall but it was significant for mobile users on Tuesday" is data snooping. If you didn't pre-register the segment, you cannot claim significance for it.
3. Optimizing for the wrong metric
A test that lifts signup rate by 20% but tanks paid conversion is a loss, not a win. Always test downstream metrics where you can.
4. Ignoring the novelty effect
Users react to anything new for the first few weeks. Run tests long enough to capture steady-state behavior: at least 2 full weeks or until you reach your required sample size, whichever is longer.
5. Statistically significant but practically meaningless
With enough sample size, a 0.1% lift can be "statistically significant." Whether shipping that lift is worth the engineering and risk is a separate question. Always ask: does this effect size justify shipping?
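For the first mistake in this list, the arithmetic behind multiple-testing inflation is short enough to write out. The sketch assumes the tests are independent and that no variant has a real effect; the function names are hypothetical.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent no-effect tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the familywise rate at or below alpha."""
    return alpha / num_tests


num_tests = 20
print(f"expected false positives: {num_tests * 0.05:.1f}")
print(f"chance of at least one:   {familywise_error_rate(num_tests):.0%}")
print(f"Bonferroni threshold:     {bonferroni_threshold(num_tests):.4f}")
```

With 20 tests the expected number of false positives is 1, the chance of seeing at least one is about 64%, and the Bonferroni threshold drops to 0.0025 per test.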
7. Bayesian A/B testing: the modern alternative
Bayesian methods produce probabilities you can actually act on: "There is a 92% probability that Variant B beats Variant A by at least 5%." That is much easier to communicate than a p-value.
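Bayesian methods are also commonly presented as handling peeking more gracefully, since the posterior can be read at any time. That does not make them immune: stopping the moment a posterior threshold is first crossed can still inflate the false positive rate in the frequentist sense, so a minimum sample size or duration remains good practice. The other tradeoff is that they require a prior, and the math is less standardized across tools.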
If you have access to a platform with built-in Bayesian or sequential statistics (Statsig, Optimizely Stats Engine, GrowthBook), prefer it for product testing. If you're computing manually or with a basic calculator, stick with frequentist methods and pre-registered sample sizes.
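A statement like "probability that B beats A" can be approximated with a Beta-Binomial model and a few thousand random draws. This is a minimal sketch, not how any of the platforms named above work internally: it assumes a uniform Beta(1, 1) prior on each conversion rate, and the function name and counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, min_relative_lift=0.0,
                   draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a * (1 + min_relative_lift))."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1, 1) is a uniform prior; adding successes and failures
        # gives the posterior for each variant's conversion rate.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a * (1 + min_relative_lift):
            wins += 1
    return wins / draws


# Hypothetical counts: 500 of 10,000 convert on A, 570 of 10,000 on B.
any_lift = prob_b_beats_a(500, 10_000, 570, 10_000)
lift_5pct = prob_b_beats_a(500, 10_000, 570, 10_000, min_relative_lift=0.05)
print(f"P(B > A): {any_lift:.1%}, P(B beats A by 5%+): {lift_5pct:.1%}")
```

Asking for a minimum lift lowers the probability, which is the useful part: it ties the decision to an effect size worth shipping rather than to any difference at all.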
8. When A/B testing doesn't fit
A/B testing assumes you have enough traffic to reach statistically meaningful sample sizes in a reasonable time. For a small SaaS product with under 1,000 weekly users on the page being tested, traditional A/B testing is not the right tool. Alternatives:
- Qualitative research: 5 user interviews often beat a misleading low-power test
- Multi-armed bandit: better for low-traffic exploration
- Sequential decision-making: ship the change, monitor for harm, roll back if needed
- Holdouts: release to most users and hold out about 10% as a control over a longer period
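For the multi-armed bandit option, one common approach is Thompson sampling, which shifts traffic toward the better-looking variant as evidence accumulates. The simulation below is an illustrative sketch only: the true conversion rates are invented, the uniform Beta(1, 1) prior is an assumption, and the function name is hypothetical.

```python
import random


def thompson_sampling(true_rates, rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; returns visitors sent to each arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible rate for each arm from its posterior, then show
        # the visitor whichever arm drew the highest value.
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        # Simulate whether this visitor converts on the chosen arm.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: control converts at 5%, the variant at 10%.
pulls = thompson_sampling(true_rates=[0.05, 0.10])
print(f"visitors per arm: {pulls}")
```

Unlike a fixed 50/50 split, most visitors end up on the stronger arm, which limits the cost of exploring when traffic is scarce. The price is that the result is harder to read as a clean significance test.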
Next steps
Three practical steps:
- Before your next test: compute the required sample size in the A/B Test Significance Calculator. If you can't reach it in 4 weeks, redesign the test.
- After your next test: plug in the conversion counts and verify significance with the same calculator. Pair the p-value with the confidence interval.
- If your test affects acquisition channels, re-run channel CAC after a few weeks with the CAC Calculator. A "winning" test that quietly inflates CAC is not a win.
Then build the habit: every product change worth shipping is worth either testing properly or being honest that you're shipping on intuition.
Business & SaaS Disclaimer
This article is for educational purposes. Actual business performance varies based on many factors. SaaSCalcHub is not business or financial advice. Consult business advisors, CPAs, and consultants for your specific situation.
Last updated: Jun 3, 2026