How to Run a Valid A/B Test
From hypothesis to statistical significance: a step-by-step guide to running A/B tests that produce decisions you can trust.
- Step 1: Write the hypothesis before you build anything
- Step 2: Pick one primary metric
- Step 3: Calculate required sample size before you launch
- Step 4: Randomize correctly
- Step 5: Run the test for the planned duration, then stop
- Step 6: Analyze the result correctly
- Step 7: Decide what to ship and what to learn
- Common A/B testing mistakes to avoid
- What to do next
A valid A/B test compares two versions of an experience by randomly assigning users to each version, measuring a single primary metric, and running until it reaches a sample size, fixed in advance, that is large enough to detect the effect you care about. The result is statistically significant if the p-value (the probability of seeing a difference at least as large as the one observed if the two versions truly performed the same) is below your threshold, typically 0.05.
Most SaaS A/B tests fail not because the variant was bad but because the test was set up wrong. Underpowered tests miss real wins. Overpowered tests waste traffic. Peeking at results early inflates false positive rates. This guide walks through the seven steps that separate a credible experiment from a hopeful one.
Step 1: Write the hypothesis before you build anything
A valid hypothesis is specific, testable, and tied to a metric. Avoid hypotheses like "the new pricing page will be better." Use the format:
If we [change X], then [primary metric Y] will improve by [Z%] because [reason].
A real example: "If we remove the credit card requirement from the trial signup, then trial-to-paid conversion will increase by 15% because more users will reach product activation before payment friction."
The "because" clause matters. It forces you to articulate the mechanism, which makes the test interpretable regardless of which way it lands.
Step 2: Pick one primary metric
The fastest way to corrupt a test is to track many metrics and report whichever one wins. Pick one primary metric in advance. Everything else is secondary.
Good primary metrics for SaaS A/B tests:
- Activation rate: percentage of new users who complete a defined activation event within X days
- Trial-to-paid conversion: percentage of trial signups who become paying customers
- Free-to-paid conversion: percentage of free-tier users who upgrade to paid
- Click-through rate on a CTA (only valid for top-of-funnel tests; almost never the right primary metric for SaaS)
- 30-day retention: percentage of users active 30 days after signup
If you cannot pick one, your hypothesis is not specific enough. Go back to Step 1.
Step 3: Calculate required sample size before you launch
Underpowered tests cannot detect real effects. Overpowered tests waste calendar weeks. Calculate the minimum sample size given your baseline conversion rate, the minimum effect you care about, and your desired statistical power.
The standard inputs are:
- Baseline conversion rate (control): your current rate
- Minimum detectable effect (MDE): the smallest improvement that would be worth shipping
- Statistical significance level (alpha): usually 0.05 (5% false positive rate tolerated)
- Statistical power (1 - beta): usually 0.80 (80% chance of detecting a real effect at least as large as the MDE)
| Baseline conversion rate | MDE | Required sample per variant (alpha 0.05, power 0.80) |
|---|---|---|
| 5% | 10% relative lift (to 5.5%) | ~31,000 |
| 5% | 20% relative lift (to 6.0%) | ~7,800 |
| 10% | 10% relative lift (to 11.0%) | ~15,000 |
| 10% | 20% relative lift (to 12.0%) | ~3,800 |
| 20% | 10% relative lift (to 22.0%) | ~6,400 |
| 20% | 20% relative lift (to 24.0%) | ~1,600 |
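Figures like those in the table can be approximated with a few lines of code. The sketch below uses a common normal-approximation formula for a two-sided two-proportion test; the function name `sample_size_per_variant` is illustrative, and the formula is an assumption rather than the one any particular calculator uses.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # rate the variant must reach to hit the MDE
    p_bar = (p1 + p2) / 2                # average rate, used for the null-hypothesis variance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Baseline rate and relative MDE pairs taken from the table above.
for baseline, mde in [(0.05, 0.10), (0.10, 0.10), (0.10, 0.20)]:
    print(baseline, mde, sample_size_per_variant(baseline, mde))
```

Calculators use slightly different variance approximations, so expect results within several percent of the table rather than an exact match. The pattern is what matters: halving the MDE roughly quadruples the required sample.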
For low-traffic SaaS sites, the required sample sizes can take months to accumulate. That is not a test problem; that is a traffic problem. Use the A/B Test Significance Calculator to plug in your numbers.
If you cannot get the required sample within 4-6 weeks, you have three options: test a bigger change with a larger expected effect, accept higher uncertainty, or do not test and ship based on judgment.
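To check a test against the 4-6 week window, divide the total required sample by your weekly eligible traffic. A minimal sketch, where the helper name and the traffic figure of 5,000 users per week are hypothetical:

```python
from math import ceil


def weeks_to_reach_sample(sample_per_variant, weekly_eligible_users, variants=2):
    """Whole weeks needed for every variant to reach its required sample."""
    total_needed = sample_per_variant * variants
    return ceil(total_needed / weekly_eligible_users)


# Hypothetical traffic: 5,000 eligible new users per week, split across two variants.
print(weeks_to_reach_sample(31_000, 5_000))  # 62,000 / 5,000 = 12.4, rounded up to 13
```

At that traffic level, the 5% baseline with a 10% relative MDE would take about three months, well outside the window, which is the point at which the three options above apply.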
Step 4: Randomize correctly
Randomization is what makes the test causal. Without it, you have a comparison of self-selected groups, not an experiment.
- Use a deterministic hash of the user ID (or session ID for logged-out tests) so the same user always sees the same variant.
- Split 50/50 unless you have a specific reason not to (some teams use 90/10 for risky changes).
- Randomize at the right level. Assign variants per user, not per page view, or returning visitors will see both variants.
- Exclude internal users, bot traffic, and any segments you cannot ship to (some teams exclude paid customers from pricing-page tests).
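The deterministic-hash rule can be sketched as follows. This is one possible approach, not a prescribed implementation: the function name `assign_variant`, the experiment name, and the choice of SHA-256 with 10,000 buckets are all assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'variant'."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket from 0 to 9,999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "variant" if bucket < treatment_share * 10_000 else "control"


# The same user and experiment always produce the same answer.
print(assign_variant("user-123", "trial-no-card"))
print(assign_variant("user-123", "trial-no-card"))
```

Because the assignment depends only on the IDs, no lookup table is needed, and a 90/10 split is just `treatment_share=0.1`. Exclusions such as internal users and bots would be filtered out before this function is called.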
Step 5: Run the test for the planned duration, then stop
The most common way to invalidate a test is "peeking." If you check the p-value every day and stop the test the first time it crosses 0.05, you will declare false positives at 2x to 5x the nominal rate, because every extra look is another chance for random noise to cross the threshold. The math is unforgiving here.
Three rules to follow:
- Decide the sample size or end date before launching. Stick to it.
- If you must do interim analyses, use sequential testing methods, such as the mixture sequential probability ratio test (mSPRT) or alpha-spending, that adjust for repeated looks. Most SaaS teams do not need this.
- Run for at least one full week to capture the day-of-week effect. SaaS conversion varies significantly between weekdays and weekends.
The test ends when you said it would, not when the result looks good.
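A small simulation makes the cost of peeking concrete. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. All parameters (400 runs, 14 days, 100 users per arm per day, a 20% conversion rate, the seed) are hypothetical choices made to keep the example fast.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=400, days=14, users_per_day=100, rate=0.20, seed=7):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(days):
            # Both arms draw from the same rate, so there is no real effect to find.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            p = p_value(conv_a, n, conv_b, n)
            crossed = crossed or p < 0.05   # a peeker would have stopped here
        peek_hits += crossed
        final_hits += p < 0.05              # only the planned final look counts
    return peek_hits / runs, final_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"daily peeking: {peeking_rate:.1%}, single final look: {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate should come out several times higher. Exact values depend on the seed and parameters.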
Step 6: Analyze the result correctly
After the test ends, calculate the difference between variants and the p-value for that difference. The standard test for conversion rates is a two-proportion z-test or chi-square test. For continuous metrics like revenue per user, use a t-test or non-parametric equivalent.
A worked example. Control had 8,200 users with 410 conversions (5.0%). Variant had 8,150 users with 481 conversions (5.9%).
- Absolute lift: 0.9 percentage points
- Relative lift: 18%
- Two-proportion z-test p-value: approximately 0.011
Since p < 0.05, the difference is statistically significant. The variant is the winner.
| Variant | Users | Conversions | Conversion rate |
|---|---|---|---|
| Control | 8,200 | 410 | 5.00% |
| Variant | 8,150 | 481 | 5.90% |
Always report the confidence interval too, not just the point estimate. A 95% confidence interval for the absolute lift in this example is roughly 0.2 to 1.6 percentage points. The true effect is probably somewhere in that range; the 18% relative lift is only the point estimate, not a precise prediction.
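The worked example can be reproduced with the standard library alone. This is a sketch of an uncorrected two-proportion z-test with a normal-approximation interval; the function name `two_proportion_test` is illustrative, and the pooled-for-the-test, unpooled-for-the-interval convention is one common choice among several.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (absolute lift, two-sided p-value, confidence interval for the lift)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lift = rate_b - rate_a
    # Pooled standard error for the hypothesis test (assumes no true difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(lift / se_pooled)))
    # Unpooled standard error for the interval around the observed lift.
    se_unpooled = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    margin = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se_unpooled
    return lift, p_value, (lift - margin, lift + margin)


# Control: 410 of 8,200. Variant: 481 of 8,150.
lift, p_value, (low, high) = two_proportion_test(410, 8200, 481, 8150)
print(f"lift {lift:.2%}, p = {p_value:.3f}, 95% CI {low:.2%} to {high:.2%}")
```

For these inputs the uncorrected test gives a p-value of about 0.011 and an interval of roughly 0.2 to 1.6 percentage points. A continuity-corrected test yields a slightly larger p-value (about 0.012), which does not change the conclusion here.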
Step 7: Decide what to ship and what to learn
A statistically significant result is the beginning of the decision, not the end. Three more questions matter:
- Is the effect large enough to be worth shipping? A 2% lift on a tiny feature might not justify the engineering cost of maintenance.
- Does the result hold across segments? Check the lift among new users, returning users, mobile, desktop, and top traffic channels. A test that wins overall but loses on mobile is a red flag.
- What did you learn that informs the next test? Even a losing variant teaches you something about your users.
Ship the winner if the effect is meaningful, the segments hold up, and the change is not introducing technical debt or UX issues. Document the result somewhere durable; in two years you will not remember what you tested.
Common A/B testing mistakes to avoid
- Multiple testing without correction. If you run 20 A/B tests at a 0.05 threshold and none of the variants has a real effect, you should still expect about 1 false positive by chance alone.
- Sample ratio mismatch. If you split 50/50 but the actual user counts are 8,200 and 7,400, something is wrong with your randomization or your tracking. Investigate before trusting any result.
- Novelty effects. New features often see a temporary lift just because they are new. Run tests for at least 2 weeks if possible.
- Testing too small a change. If the variant is a button color tweak, the expected effect is tiny and you will need huge sample sizes for a credible answer.
- Ignoring downstream metrics. A test that lifts free-trial signups but tanks paid conversion is not a win.
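Two of the mistakes above can be checked numerically. The sketch below computes the chance of at least one false positive across several independent tests, and runs a chi-square goodness-of-fit check for sample ratio mismatch. The function names are illustrative, and the common practice of flagging a mismatch at a strict threshold such as 0.001 is an assumption, not a rule from this guide.

```python
from math import erfc, sqrt


def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of several independent no-effect tests looks significant."""
    return 1 - (1 - alpha) ** num_tests


def srm_p_value(users_a, users_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = users_a + users_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (users_a - expected_a) ** 2 / expected_a + (users_b - expected_b) ** 2 / expected_b
    # With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


print(f"{chance_of_any_false_positive(20):.0%}")  # 1 - 0.95**20, about 64%
print(srm_p_value(8200, 7400))  # far below 0.001: investigate before trusting results
print(srm_p_value(8200, 8150))  # well above 0.05: the split looks healthy
```

The 8,200 versus 7,400 split from the list above is far too lopsided to be chance, while a split like 8,200 versus 8,150 is ordinary random variation.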
What to do next
Build a habit of running tests, not a habit of running one test. The compounding learning is the point. Even modest single-test lifts (1-3%) compound into significant gains across a year of disciplined experimentation. Plug your numbers into the A/B Test Significance Calculator before launching every test so you know what you are committing to.
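The compounding claim is simple arithmetic. As a rough sketch, assuming each shipped winner multiplies the same metric and the lifts persist (the count of 12 winners per year is hypothetical):

```python
def compounded_gain(lift_per_win, wins):
    """Overall multiplier on a metric after several winning tests ship in sequence."""
    return (1 + lift_per_win) ** wins


# Hypothetical year: 12 shipped winners at a 2% relative lift each.
print(f"{compounded_gain(0.02, 12) - 1:.1%}")  # 1.02**12 - 1, about 26.8%
```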
Last updated: Jun 3, 2026