How to Run a Valid A/B Test

From hypothesis to statistical significance: a step-by-step guide to running A/B tests that produce decisions you can trust.

SaaSCalcHub Editorial Team | February 20, 2026 | 12 min read

A valid A/B test compares two versions of an experience by randomly assigning users to each version, measuring a single primary metric, and continuing until the sample size is large enough to detect the effect you care about. The result is statistically significant if the p-value (the probability of seeing a difference at least as large as the one observed when there is no real difference between the versions) is below your threshold, typically 0.05.

Most SaaS A/B tests fail not because the variant was bad but because the test was set up wrong. Underpowered tests miss real wins. Overpowered tests waste traffic. Peeking at results early inflates false positive rates. This guide walks through the seven steps that separate a credible experiment from a hopeful one.

Step 1: Write the hypothesis before you build anything

A valid hypothesis is specific, testable, and tied to a metric. Avoid hypotheses like "the new pricing page will be better." Use the format:

If we [change X], then [primary metric Y] will improve by [Z%] because [reason].

A real example: "If we remove the credit card requirement from the trial signup, then trial-to-paid conversion will increase by 15% because more users will reach product activation before payment friction."

The "because" clause matters. It forces you to articulate the mechanism, which makes the test interpretable regardless of which way it lands.

Step 2: Pick one primary metric

The fastest way to corrupt a test is to track many metrics and report whichever one wins. Pick one primary metric in advance. Everything else is secondary.

Good primary metrics for SaaS A/B tests:

  • Activation rate: percentage of new users who complete a defined activation event within X days
  • Trial-to-paid conversion: percentage of trial signups who become paying customers
  • Free-to-paid conversion: percentage of free-tier users who upgrade to paid
  • Click-through rate on a CTA: percentage of viewers who click the call to action (only valid for top-of-funnel tests; almost never the right primary metric for SaaS)
  • 30-day retention: percentage of users active 30 days after signup

If you cannot pick one, your hypothesis is not specific enough. Go back to Step 1.

Step 3: Calculate required sample size before you launch

Underpowered tests cannot detect real effects. Overpowered tests waste calendar weeks. Calculate the minimum sample size given your baseline conversion rate, the minimum effect you care about, and your desired statistical power.

The standard inputs are:

  • Baseline conversion rate (control): your current rate
  • Minimum detectable effect (MDE): the smallest improvement that would be worth shipping
  • Statistical significance level (alpha): usually 0.05 (5% false positive rate tolerated)
  • Statistical power (1 - beta): usually 0.80 (80% chance of detecting a real effect)
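
Baseline conversion rate | Minimum detectable effect (MDE)  | Required sample per variant
-------------------------+----------------------------------+----------------------------
5%                       | 10% relative lift (to 5.5%)      | ~31,000
5%                       | 20% relative lift (to 6.0%)      | ~7,800
10%                      | 10% relative lift (to 11.0%)     | ~15,000
10%                      | 20% relative lift (to 12.0%)     | ~3,800
20%                      | 10% relative lift (to 22.0%)     | ~6,400
20%                      | 20% relative lift (to 24.0%)     | ~1,600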

For low-traffic SaaS sites, the required sample sizes can take months to accumulate. That is not a test problem; that is a traffic problem. Use the A/B Test Significance Calculator to plug in your numbers.

If you cannot get the required sample within 4-6 weeks, you have three options: test a bigger change with a larger expected effect, accept higher uncertainty, or do not test and ship based on judgment.
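
As a rough cross-check on the table above, the sketch below applies the common normal-approximation formula for a two-sided two-proportion test. It is an illustration under stated assumptions, not the calculator's actual implementation: the function names and the weekly traffic figure are hypothetical, and results can differ from the table by a few percent depending on which variant of the formula a tool uses.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_finish(n_per_variant, weekly_eligible_users, variants=2):
    """Calendar weeks needed if all eligible traffic enters the test."""
    return math.ceil(n_per_variant * variants / weekly_eligible_users)


# 10% baseline, 20% relative lift: lands near the ~3,800 row of the table.
n = sample_size_per_variant(baseline=0.10, relative_mde=0.20)
print(n)

# Hypothetical traffic: 1,500 eligible users per week across both variants.
print(weeks_to_finish(n, weekly_eligible_users=1500))
```

If the number of weeks comes out well beyond the 4-6 week window, that is the signal to pick one of the three options above rather than launch and hope.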

Step 4: Randomize correctly

Randomization is what makes the test causal. Without it, you have a comparison of self-selected groups, not an experiment.

  • Use a deterministic hash of the user ID (or session ID for logged-out tests) so the same user always sees the same variant.
  • Split 50/50 unless you have a specific reason not to (some teams use 90/10 for risky changes).
  • Randomize at the right level. Test the user experience at the user level, not the page-view level, or returning visitors will see both variants.
  • Exclude internal users, bot traffic, and any segments you cannot ship to (some teams exclude paid customers from pricing-page tests).
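
The deterministic-hash approach from the first bullet can be sketched as follows. This is a minimal illustration, not a production assignment service: the function name, the experiment name, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def assign_variant(user_id, experiment, variant_share=0.5):
    """Deterministically map a user to 'control' or 'variant'."""
    # Salting with the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex digits -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "variant" if bucket < variant_share else "control"


# The same user always lands in the same arm for a given experiment.
first = assign_variant("user-123", "no-credit-card-trial")
second = assign_variant("user-123", "no-credit-card-trial")
print(first == second)

# Across many users the split should be close to the requested share.
counts = {"control": 0, "variant": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "no-credit-card-trial")] += 1
print(counts)
```

Because the bucket depends only on the experiment name and the ID, no assignment table needs to be stored, and a 90/10 split is just `variant_share=0.1`.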

Step 5: Run the test for the planned duration, then stop

The single most common test invalidation is "peeking." If you check the p-value every day and stop the test the first time it crosses 0.05, you will declare false positives at 2x to 5x the nominal rate. The math is unforgiving here.

Three rules to follow:

  1. Decide the sample size or end date before launching. Stick to it.
  2. If you must do interim analyses, use sequential testing methods (mSPRT, alpha-spending) that adjust for repeated looks. Most SaaS teams do not need this.
  3. Run for at least one full week to capture the day-of-week effect. SaaS conversion varies significantly between weekdays and weekends.

The test ends when you said it would, not when the result looks good.
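
To see why peeking is so costly, the simulation below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares stopping at the first look that crosses the threshold with a single analysis at the planned end. The number of looks, users per look, and conversion rate are arbitrary assumptions chosen to keep the example fast.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 if there is no variance yet)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(n_tests=1000, looks=10, users_per_look=100, rate=0.10, seed=7):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided threshold for alpha = 0.05
    peeking_hits = final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed_any_look = False
        z = 0.0
        for _ in range(looks):
            # Same true rate in both arms: any "win" here is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            z = z_stat(conv_a, n, conv_b, n)
            if abs(z) > z_crit:
                crossed_any_look = True  # a peeker would have stopped here
        peeking_hits += crossed_any_look
        final_hits += abs(z) > z_crit    # only the planned final analysis
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first p < 0.05:  {peeking_rate:.1%} false positives")
print(f"single planned analysis: {fixed_rate:.1%} false positives")
```

The single planned analysis should stay near the nominal 5%, while the peeking strategy should come out several times higher, and it gets worse as the number of looks grows.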

Step 6: Analyze the result correctly

After the test ends, calculate the difference between variants and the p-value for that difference. The standard test for conversion rates is a two-proportion z-test or chi-square test. For continuous metrics like revenue per user, use a t-test or non-parametric equivalent.

A worked example. Control had 8,200 users with 410 conversions (5.0%). Variant had 8,150 users with 481 conversions (5.9%).

  • Absolute lift: 0.9 percentage points
  • Relative lift: 18%
  • Two-proportion z-test p-value: 0.012

Since p < 0.05, the difference is statistically significant. The variant is the winner.

Variant | Users | Conversions | Conversion rate
--------+-------+-------------+----------------
Control | 8,200 | 410         | 5.00%
Variant | 8,150 | 481         | 5.90%

Always report the confidence interval too, not just the point estimate. A 95% confidence interval for the absolute lift in this example is roughly 0.2 to 1.6 percentage points, or about 4% to 32% in relative terms. The true effect is probably somewhere in that range; the 18% relative lift is the point estimate at the center of that interval, not a precise prediction.
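
The worked example can be reproduced with a few lines of standard-library Python. This sketch assumes a pooled two-proportion z-test without continuity correction and a simple Wald interval for the difference; the function name is hypothetical. Expect a p-value close to the 0.012 quoted above, with small differences depending on whether a tool applies a continuity correction.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided z-test plus a Wald confidence interval for the absolute lift."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the test (no difference under the null).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return {"diff": diff, "relative_lift": diff / p_a, "p_value": p_value, "ci": ci}


result = two_proportion_test(conv_a=410, n_a=8200, conv_b=481, n_b=8150)
low, high = result["ci"]
print(f"absolute lift: {result['diff'] * 100:.2f} pp")
print(f"relative lift: {result['relative_lift']:.1%}")
print(f"p-value:       {result['p_value']:.3f}")
print(f"95% CI:        {low * 100:.2f} to {high * 100:.2f} pp")
```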

Step 7: Decide what to ship and what to learn

A statistically significant result is the beginning of the decision, not the end. Three more questions matter:

  • Is the effect large enough to be worth shipping? A 2% lift on a tiny feature might not justify the ongoing engineering cost of maintaining it.
  • Does the result hold across segments? Check the lift among new users, returning users, mobile, desktop, and your top traffic channels. A test that wins overall but loses on mobile is a red flag. Keep in mind that each segment has a smaller sample than the whole test, so segment-level differences are noisier than the overall result.
  • What did you learn that informs the next test? Even a losing variant teaches you something about your users.

Ship the winner if the effect is meaningful, the segments hold up, and the change is not introducing technical debt or UX issues. Document the result somewhere durable; in two years you will not remember what you tested.

Common A/B testing mistakes to avoid

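  • Multiple testing without correction. If you run 20 A/B tests at a 0.05 threshold and none of the variants has a real effect, you should still expect about 1 false positive by chance alone, and the chance of seeing at least one is roughly 64%. The same applies to checking many metrics or segments within a single test.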
  • Sample ratio mismatch. If you split 50/50 but the actual user counts are 8,200 and 7,400, something is wrong with your randomization or your tracking. Investigate before trusting any result.
  • Novelty effects. New features often see a temporary lift just because they are new. Run tests for at least 2 weeks if possible.
  • Testing too small a change. If the variant is a button color tweak, the expected effect is tiny and you will need huge sample sizes for a credible answer.
  • Ignoring downstream metrics. A test that lifts free-trial signups but tanks paid conversion is not a win.
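
Two of the mistakes above are easy to check numerically. The sketch below tests for sample ratio mismatch with a chi-square goodness-of-fit test and computes the chance of at least one false positive across several tests. The function names are hypothetical, and the false-positive calculation assumes the tests are independent and that none of the variants has a real effect.

```python
import math


def srm_p_value(n_control, n_variant, expected_share_variant=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for the traffic split."""
    total = n_control + n_variant
    exp_variant = total * expected_share_variant
    exp_control = total - exp_variant
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    # With 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** n_tests


# The mismatched split from the bullet above versus the worked example's split.
print(f"8,200 vs 7,400: p = {srm_p_value(8200, 7400):.2e}")
print(f"8,200 vs 8,150: p = {srm_p_value(8200, 8150):.2f}")

# 20 tests at alpha = 0.05, and the per-test threshold a Bonferroni correction would use.
print(f"chance of at least one false positive: {chance_of_any_false_positive(20):.0%}")
bonferroni_alpha = 0.05 / 20
```

A very small SRM p-value means the observed split is unlikely under correct randomization; many teams use a strict cutoff such as 0.001 for this check so that it only fires on real problems.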

What to do next

Build a habit of running tests, not a habit of running one test. The compounding learning is the point. Even modest single-test lifts (1-3%) compound into significant gains across a year of disciplined experimentation. Plug your numbers into the A/B Test Significance Calculator before launching every test so you know what you are committing to.
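
The compounding claim is simple arithmetic. The sketch below assumes, purely for illustration, that each winning test multiplies the same metric and that a team ships 12 winners in a year; both the function name and the count of winners are hypothetical.

```python
def compounded_gain(lift_per_win, winning_tests):
    """Total relative gain when each winning test multiplies the same metric."""
    return (1 + lift_per_win) ** winning_tests - 1


for lift in (0.01, 0.02, 0.03):
    total = compounded_gain(lift, winning_tests=12)
    print(f"{lift:.0%} per win x 12 wins -> {total:.1%} total")
```

Under those assumptions, twelve 2% wins add up to roughly a 27% improvement, which is more than the 24% you would get by simply adding the lifts together.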

Last updated: Jun 3, 2026