A/B Testing: Statistical Significance Explained

Statistical significance tells you whether an A/B test result is likely real or just noise. Here is what p-values, confidence intervals, and sample sizes really mean — without the textbook headache.

SaaSCalcHub Editorial Team August 27, 2025 11 min read

Statistical significance answers a single question: is the difference I see between my A and B variants likely to be real, or is it just random noise? The conventional bar is a p-value below 0.05, which means that if there were truly no difference between the variants, a result at least this extreme would show up less than 5% of the time. That bar is widely abused, frequently misunderstood, and yet still essential. This article explains p-values, confidence intervals, sample sizes, and the most common A/B testing mistakes, then shows how to use the A/B Test Significance Calculator to check your tests before shipping.

The cardinal rule: Decide your sample size before you start the test. Peeking and stopping early is the single most common way to declare false wins.

1. What "statistical significance" actually means

Informally, a statistically significant result is one that would be unlikely to appear through random chance alone if the two variants truly performed the same.

The technical statement is more constrained: the p-value is the probability of seeing a difference at least as large as the one observed, assuming there is actually no difference (the null hypothesis is true). A p-value of 0.03 doesn't mean "97% chance the new variant is better." It means "if there were no real effect, we'd see a difference this large only 3% of the time."
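To make the definition concrete, here is a minimal sketch of a two-sided, pooled two-proportion z-test using only the Python standard library. The function name and the conversion counts are hypothetical, and this is one common way to compute the p-value, not necessarily the method any particular calculator uses.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one conversion rate,
    # so the standard error is built from the pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a difference at least this large in either direction,
    # assuming there is no real difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: A converts 500 of 10,000, B converts 560 of 10,000.
p = two_proportion_p_value(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"p-value: {p:.3f}")
```

In this hypothetical case B converts at 5.6% against 5.0% for A, yet the p-value still lands slightly above 0.05, which shows how a lift that looks healthy can fall short of the conventional bar.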

The practical interpretation for product teams: if your p-value is below 0.05, you can reasonably act on the result. Below 0.01, you can act confidently. Between 0.05 and 0.10, you have a signal worth retesting but not yet worth shipping.

2. The 5% threshold (and why some teams use 1%)

The 95% confidence (p < 0.05) bar comes from Ronald Fisher in the 1920s. It is a convention, not a law of nature.

Confidence level | p-value threshold | When to use
90% | < 0.10 | Low-stakes UX changes, button colors
95% | < 0.05 | Default for most product A/B tests
99% | < 0.01 | Pricing changes, checkout flow, anything revenue-critical
99.9% | < 0.001 | Algorithmic changes that affect everyone permanently

Higher confidence requires larger samples, which means longer tests. Pick the level appropriate to the cost of being wrong.
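As a rough illustration of that cost, the sketch below estimates how the required sample scales with the confidence level when power is held at 80%. It assumes the usual normal approximation, in which sample size is proportional to the square of the sum of the two critical z-values. The function name is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(confidence, power=0.80, reference_confidence=0.95):
    """Sample size at `confidence` relative to the 95% default (two-sided)."""

    def z_sum_squared(conf):
        z_alpha = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided critical value
        z_beta = NormalDist().inv_cdf(power)
        return (z_alpha + z_beta) ** 2

    return z_sum_squared(confidence) / z_sum_squared(reference_confidence)


for level in (0.90, 0.95, 0.99, 0.999):
    ratio = relative_sample_size(level)
    print(f"{level:.1%} confidence: {ratio:.2f}x the 95% sample")
```

Under these assumptions, moving from 95% to 99% confidence costs roughly half again as much traffic for the same minimum detectable effect.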

3. Sample size: the part everyone underestimates

The single biggest source of A/B testing failure is underpowered tests — running with too few users to detect the effect you're hoping to see.

Required sample size depends on:

  • Baseline conversion rate — lower baseline needs more sample
  • Minimum detectable effect (MDE) — the smallest lift you care about
  • Confidence level (1 − α, usually 95%)
  • Statistical power (1 − β, usually 80%)

Rough sample sizes per variant at a 5% baseline, 95% confidence (two-sided), and 80% power, using the standard normal approximation for comparing two proportions:

Minimum detectable lift | Sample needed per variant
5% relative lift | ~122,000
10% relative lift | ~31,000
20% relative lift | ~8,200
50% relative lift | ~1,500
100% relative lift | ~430

If you are getting 1,000 visitors per variant per week and your baseline conversion is 5%, you can only reliably detect a ~30% relative lift in a 4-week test. Trying to declare a 10% lift "significant" with that sample is statistical fiction.
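The traffic example above can be checked with a short sketch of the standard two-sided sample size formula for comparing two proportions (normal approximation, unpooled variance). The function name is hypothetical, and other calculators may use slightly different formulas and return somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, confidence=0.95, power=0.80):
    """Approximate users needed per variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 1,000 visitors per variant per week for 4 weeks.
available = 1_000 * 4
for lift in (0.10, 0.30):
    needed = sample_size_per_variant(baseline=0.05, relative_lift=lift)
    verdict = "reachable" if needed <= available else "out of reach"
    print(f"{lift:.0%} relative lift: {needed:,} per variant ({verdict})")
```

With 4,000 users per variant, the 30% lift fits inside the budget while the 10% lift needs several times more traffic than is available.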

4. The peeking problem (and why it inflates false positives)

If you check your test every day and stop the first time it crosses significance, you will get false positives constantly. The math: with a true effect of zero and daily peeking over 30 days, you have close to a 30% chance of crossing the 0.05 line at least once purely by chance, several times the nominal 5% rate.
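A quick simulation can illustrate the inflation. This sketch assumes equal traffic every day and models the test statistic under the null hypothesis as a sum of independent standard normal daily increments, which is a simplification of a real A/A test. All names and parameters are hypothetical.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks=30, alpha=0.05, simulations=20_000, seed=0):
    """Share of null experiments that cross significance at any daily look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for day in range(1, looks + 1):
            running_sum += rng.gauss(0, 1)  # one day of pure noise
            z = running_sum / day ** 0.5    # z-statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1        # stopped early on a false win
                break
    return false_positives / simulations


print(f"1 look:   {peeking_false_positive_rate(looks=1):.1%}")
print(f"30 looks: {peeking_false_positive_rate(looks=30):.1%}")
```

Under this approximation a single look stays near the nominal 5%, while 30 looks push the false positive rate far above it.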

The fixes:

  1. Pre-register the sample size and don't look until you hit it
  2. Use sequential testing methods designed for continuous monitoring (Bayesian methods, AGILE, Group Sequential Testing)
  3. Use a stricter threshold if you must peek (e.g., require p < 0.005 if you check daily)

Most teams don't use sophisticated methods. The cleanest practice is: compute sample size up front, run to that sample, then check once.

5. Confidence intervals beat p-values

A confidence interval gives you a range for the likely true effect, not just a yes/no on significance. Two examples:

  • "Variant B converts 5.2% (95% CI: 4.8%–5.6%) vs Variant A 5.0% (95% CI: 4.6%–5.4%)" — overlap → no clear winner
  • "Variant B converts 6.5% (95% CI: 6.1%–6.9%) vs Variant A 5.0% (95% CI: 4.6%–5.4%)" — no overlap → clear winner

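Reporting confidence intervals forces you to internalize the uncertainty. P-values alone hide it. Treat overlap as a quick but conservative check: two intervals can overlap slightly while the difference is still significant, so in borderline cases compute the confidence interval on the difference itself and see whether it excludes zero.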
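One way to report the uncertainty directly is a confidence interval on the difference between the two conversion rates. The sketch below uses the unpooled normal (Wald) approximation. The function name is hypothetical, and the counts are made up to roughly match the two examples above.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for the absolute difference (B minus A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Hypothetical counts with 10,000 users per variant.
for label, conv_b in (("5.2% vs 5.0%", 520), ("6.5% vs 5.0%", 650)):
    low, high = lift_confidence_interval(500, 10_000, conv_b, 10_000)
    verdict = "excludes zero" if low > 0 or high < 0 else "includes zero"
    print(f"{label}: difference CI [{low:+.2%}, {high:+.2%}] ({verdict})")
```

An interval that includes zero is consistent with no real difference, while an interval entirely above zero supports calling B the winner.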

6. Common mistakes that produce false wins

1. Running multiple tests simultaneously without correction

If you run 20 A/B tests at 95% confidence simultaneously, on average 1 of them will be a false positive even if every variant is actually identical to control. Apply a Bonferroni correction (divide your threshold by the number of tests) or use a more sophisticated FDR control.
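The arithmetic behind this can be sketched in a few lines. The example assumes the tests are independent, implements the Bonferroni threshold as described, and uses the Benjamini-Hochberg procedure as one common form of FDR control. The function names and p-values are hypothetical.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the test passes FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value sits under its stepped threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        passed[idx] = rank <= max_rank
    return passed


print(f"Chance of at least one false win in 20 tests: {family_wise_error_rate(20):.0%}")
print(f"Bonferroni threshold for 20 tests: {bonferroni_threshold(20):.4f}")

p_values = [0.001, 0.012, 0.025, 0.2, 0.6]  # hypothetical results from 5 tests
print("Bonferroni:", [p <= bonferroni_threshold(len(p_values)) for p in p_values])
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two: on these made-up p-values it keeps only the first result, while Benjamini-Hochberg keeps the first three.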

2. Slicing by segment after the test

"It wasn't significant overall but it was significant for mobile users on Tuesday" is data snooping. If you didn't pre-register the segment, you cannot claim significance for it.

3. Optimizing for the wrong metric

A test that lifts signup rate by 20% but tanks paid conversion is a loss, not a win. Always test downstream metrics where you can.

4. Ignoring the novelty effect

Users react to anything new for the first few weeks. Run tests long enough to capture steady-state behavior: typically at least 2 full weeks or until you reach your required sample size, whichever is longer.

5. Statistically significant but practically meaningless

With enough sample size, a 0.1% lift can be "statistically significant." Whether shipping that lift is worth the engineering and risk is a separate question. Always ask: does this effect size justify shipping?

7. Bayesian A/B testing: the modern alternative

Bayesian methods produce probabilities you can actually act on: "There is a 92% probability that Variant B beats Variant A by at least 5%." That is much easier to communicate than a p-value.

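Bayesian methods are also more forgiving of peeking: the posterior remains a valid summary of the evidence whenever you look at it. That is not a free pass, though. If you stop the moment a probability threshold is crossed, you can still ship more false winners than a fixed-sample test would. The other tradeoffs are that they require a prior, and the math is less standardized across tools.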
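For a sense of where such a probability comes from, here is a minimal Monte Carlo sketch of a Beta-Binomial model. It assumes a uniform Beta(1, 1) prior on each variant's conversion rate, which is only one possible choice of prior. The function name and counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, min_relative_lift=0.0,
                   draws=50_000, seed=0, prior_alpha=1.0, prior_beta=1.0):
    """Posterior probability that B beats A by at least `min_relative_lift`."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(prior + conversions, prior + misses).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        if rate_b > rate_a * (1 + min_relative_lift):
            wins += 1
    return wins / draws


# Hypothetical counts: A converts 500 of 10,000, B converts 560 of 10,000.
print(f"P(B > A):              {prob_b_beats_a(500, 10_000, 560, 10_000):.1%}")
print(f"P(B beats A by >= 5%): {prob_b_beats_a(500, 10_000, 560, 10_000, 0.05):.1%}")
```

Raising `min_relative_lift` lowers the probability, which is the point: the statement being evaluated is about an effect size you care about, not just any difference.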

If you have access to a platform with built-in Bayesian or sequential statistics (Statsig, Optimizely Stats Engine, GrowthBook), prefer it for product testing. If you're computing manually or with a basic calculator, stick with frequentist methods and pre-registered sample sizes.

8. When A/B testing doesn't fit

A/B testing assumes you have enough traffic to reach statistically meaningful sample sizes in a reasonable time. For a small SaaS product with under 1,000 weekly users on the page being tested, traditional A/B testing is not the right tool. Alternatives:

  • Qualitative research — 5 user interviews often beat a misleading low-power test
  • Multi-armed bandit — better for low-traffic exploration
  • Sequential decision-making — ship the change, monitor for harm, roll back if needed
  • Holdouts — release to about 90% of users, hold out the remaining 10% as control over a longer period
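To show what the multi-armed bandit option looks like in practice, here is a small simulation of Thompson sampling, one common bandit strategy. It assumes Beta(1, 1) priors and made-up true conversion rates that a real system would not know. The function name is hypothetical.

```python
import random


def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate Thompson sampling and return how many visitors each arm received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each arm from its posterior,
        # then show this visitor the arm with the highest draw.
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:  # simulated conversion
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: control 5%, variant 10%.
traffic = thompson_sampling(true_rates=[0.05, 0.10], visitors=5_000)
print(f"Visitors sent to control: {traffic[0]}, to variant: {traffic[1]}")
```

The design tradeoff is that traffic shifts toward the better-looking arm as evidence accumulates, so fewer users see the weaker variant, at the cost of a less clean significance statement than a fixed 50/50 split.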

Next steps

Three practical steps:

  1. Before your next test: compute the required sample size in the A/B Test Significance Calculator. If you can't reach it in 4 weeks, redesign the test.
  2. After your next test: plug in the conversion counts and verify significance with the same calculator. Pair the p-value with the confidence interval.
  3. If your test affects acquisition channels, re-run channel CAC after a few weeks with the CAC Calculator. A "winning" test that quietly inflates CAC is not a win.

Then build the habit: every product change worth shipping is worth either testing properly or being honest that you're shipping on intuition.

Business & SaaS Disclaimer

This article is for educational purposes. Actual business performance varies based on many factors. SaaSCalcHub does not provide business or financial advice. Consult business advisors, CPAs, and consultants for your specific situation.

Last updated: Jun 3, 2026