How to Run a Valid A/B Test

From hypothesis to statistical significance: a step-by-step guide to running A/B tests that produce decisions you can trust.

SaaSCalcHub Editorial Team · February 20, 2026 · 12 min read

Step 1: Write the hypothesis before you build anything
Step 2: Pick one primary metric
Step 3: Calculate required sample size before you launch
Step 4: Randomize correctly
Step 5: Run the test for the planned duration, then stop
Step 6: Analyze the result correctly
Step 7: Decide what to ship and what to learn
Common A/B testing mistakes to avoid
What to do next

A valid A/B test compares two versions of an experience by randomly assigning users to each version, measuring a single primary metric, and continuing until the sample size is large enough to detect the effect you care about. The result is statistically significant if the p-value, the probability of seeing a difference at least as large as the one observed when there is truly no difference between the versions, is below your threshold, typically 0.05.

Most SaaS A/B tests fail not because the variant was bad but because the test was set up wrong. Underpowered tests miss real wins. Overpowered tests waste traffic. Peeking at results early inflates false positive rates. This guide walks through the seven steps that separate a credible experiment from a hopeful one.

Step 1: Write the hypothesis before you build anything

A valid hypothesis is specific, testable, and tied to a metric. Avoid hypotheses like "the new pricing page will be better." Use the format:

If we [change X], then [primary metric Y] will improve by [Z%] because [reason].

A real example: "If we remove the credit card requirement from the trial signup, then trial-to-paid conversion will increase by 15% because more users will reach product activation before payment friction."

The "because" clause matters. It forces you to articulate the mechanism, which makes the test interpretable regardless of which way it lands.

Step 2: Pick one primary metric

The fastest way to corrupt a test is to track many metrics and report whichever one wins. Pick one primary metric in advance. Everything else is secondary.

Good primary metrics for SaaS A/B tests:

Activation rate: percentage of new users who complete a defined activation event within X days
Trial-to-paid conversion: percentage of trial signups who become paying customers
Free-to-paid conversion: percentage of free-tier users who upgrade to paid
Click-through rate on a CTA: only valid for top-of-funnel tests, and almost never the right primary metric for SaaS
30-day retention: percentage of users active 30 days after signup

If you cannot pick one, your hypothesis is not specific enough. Go back to Step 1.
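To see why reporting "whichever metric wins" is risky, here is a minimal sketch of how quickly the chance of a spurious win grows with the number of metrics tracked. It assumes the metrics are independent and that the variant has no real effect on any of them; real metrics are usually correlated, so treat the numbers as a rough guide. The function name is illustrative.

```python
def chance_of_false_alarm(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_metrics` unaffected metrics
    crosses the significance threshold, assuming independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics -> {chance_of_false_alarm(k):.0%} chance of a spurious win")
```

With a single pre-registered metric the false positive risk stays at the 5% you signed up for; with ten metrics it is closer to 40%.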

Step 3: Calculate required sample size before you launch

Underpowered tests cannot detect real effects. Overpowered tests waste calendar weeks. Calculate the minimum sample size given your baseline conversion rate, the minimum effect you care about, and your desired statistical power.

The standard inputs are:

Baseline conversion rate (control): your current rate
Minimum detectable effect (MDE): the smallest improvement that would be worth shipping
Statistical significance level (alpha): usually 0.05 (5% false positive rate tolerated)
Statistical power (1 - beta): usually 0.80 (80% chance of detecting a real effect)

Baseline	MDE	Required sample per variant (alpha = 0.05, power = 0.80)
5%	10% relative lift (to 5.5%)	~31,000
5%	20% relative lift (to 6.0%)	~7,800
10%	10% relative lift (to 11.0%)	~15,000
10%	20% relative lift (to 12.0%)	~3,800
20%	10% relative lift (to 22.0%)	~6,400
20%	20% relative lift (to 24.0%)	~1,600

For low-traffic SaaS sites, the required sample sizes can take months to accumulate. That is not a test problem; that is a traffic problem. Use the A/B Test Significance Calculator to plug in your numbers.

If you cannot get the required sample within 4-6 weeks, you have three options: test a bigger change with a larger expected effect, accept higher uncertainty (a larger MDE or lower power), or skip the test and ship based on judgment.
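The sample sizes in the table above can be approximated with the standard normal-approximation formula for comparing two proportions. The sketch below is one common variant of that formula (two-sided alpha, unpooled variance), not necessarily the one the calculator uses, so expect results within a few percent of the table rather than an exact match. The function names and the weekly traffic figure are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH variant for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_complete(n_per_variant: int, weekly_visitors: int, variants: int = 2) -> int:
    """Whole weeks needed, rounded up so every weekday is covered equally."""
    return math.ceil(n_per_variant * variants / weekly_visitors)


n = sample_size_per_variant(baseline=0.05, relative_mde=0.10)
print(n, "users per variant")
# 4,000 eligible visitors per week is a made-up traffic figure for illustration.
print(weeks_to_complete(n, weekly_visitors=4000), "weeks at 4,000 visitors/week")
```

Halving the MDE roughly quadruples the required sample, which is why small target lifts are so expensive on low-traffic pages.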

Step 4: Randomize correctly

Randomization is what makes the test causal. Without it, you have a comparison of self-selected groups, not an experiment.

Use a deterministic hash of the user ID (or session ID for logged-out tests) so the same user always sees the same variant.
Split 50/50 unless you have a specific reason not to (some teams use 90/10 for risky changes).
Randomize at the right level. Test the user experience at the user level, not the page-view level, or returning visitors will see both variants.
Exclude internal users, bot traffic, and any segments you cannot ship to (some teams exclude paid customers from pricing-page tests).
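Deterministic hash-based assignment can be sketched as follows. This is a minimal illustration, not a production bucketing service: the function name, the experiment label, and the choice of SHA-256 are assumptions. Including the experiment name in the hashed key is an added design choice so that the same user is not always placed in the same arm across different tests.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Map a user to 'control' or 'variant' the same way on every call."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "variant" if bucket < treatment_share else "control"


# The same user always lands in the same arm of a given experiment.
print(assign_variant("user-1842", "no-credit-card-trial"))
print(assign_variant("user-1842", "no-credit-card-trial"))
```

Because the assignment depends only on the ID and the experiment name, it needs no lookup table and survives across sessions and devices as long as the ID is stable. A 90/10 split is a matter of passing `treatment_share=0.1`.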

Step 5: Run the test for the planned duration, then stop

The most common way to invalidate a test is "peeking." If you check the p-value every day and stop the test the first time it crosses 0.05, you will declare false positives at roughly 2x to 5x the nominal rate, depending on how many times you look. The math is unforgiving here.

Three rules to follow:

Decide the sample size or end date before launching. Stick to it.
If you must do interim analyses, use sequential testing methods (mSPRT, alpha-spending) that adjust for repeated looks. Most SaaS teams do not need this.
Run for at least one full week to capture the day-of-week effect. SaaS conversion varies significantly between weekdays and weekends.

The test ends when you said it would, not when the result looks good.
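A small simulation makes the cost of peeking concrete. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares stopping at the first daily look that crosses the threshold against a single look on the planned end date. The traffic, rate, run count, and seed are arbitrary assumptions chosen to keep the run short.

```python
import math
import random


def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic (0.0 when there is no variance yet)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(runs=300, days=14, users_per_day=200, rate=0.10, seed=1):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided threshold for alpha = 0.05
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        crossed_any_day = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            # A peeker would stop and declare a winner on the first crossing.
            if abs(z_stat(conv_a, n, conv_b, n)) > z_crit:
                crossed_any_day = True
        significant_at_end = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
        peeking_hits += crossed_any_day
        fixed_hits += significant_at_end
    return peeking_hits / runs, fixed_hits / runs


peek_rate, fixed_rate = simulate_aa_tests()
print(f"false positives with daily peeking: {peek_rate:.1%}")
print(f"false positives with one final look: {fixed_rate:.1%}")
```

The single final look should land near the nominal 5%, while the daily-peeking rate comes out several times higher even though nothing about the product changed.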

Step 6: Analyze the result correctly

After the test ends, calculate the difference between variants and the p-value for that difference. The standard test for conversion rates is a two-proportion z-test or chi-square test. For continuous metrics like revenue per user, use a t-test or non-parametric equivalent.

A worked example. Control had 8,200 users with 410 conversions (5.0%). Variant had 8,150 users with 481 conversions (5.9%).

Absolute lift: 0.9 percentage points
Relative lift: 18%
Two-proportion z-test p-value (two-sided): approximately 0.011

Since p < 0.05, the difference is statistically significant. The variant is the winner.

Variant	Users	Conversions	Conversion rate
Control	8,200	410	5.00%
Variant	8,150	481	5.90%

Always report the confidence interval too, not just the point estimate. A 95% confidence interval for the absolute lift in this example is roughly 0.2 to 1.6 percentage points. Effects anywhere in that range are consistent with the data; the 0.9-point (18% relative) lift is the point estimate at the center of the interval, not a precise prediction.
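The worked example can be reproduced with a short two-proportion z-test using only the standard library. This is a sketch under the usual normal approximation: it uses the pooled standard error for the test and the unpooled one for the interval, which is a common convention rather than the only option. The function name and return shape are illustrative.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                          z_crit: float = 1.96) -> dict:
    """Two-sided z-test for B vs A plus a 95% interval for the absolute lift."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error: assumes no true difference (the null hypothesis).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    # Unpooled standard error for the confidence interval.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_a,
        "z": z,
        "p_value": p_value,
        "ci_95": (diff - z_crit * se, diff + z_crit * se),
    }


result = two_proportion_z_test(conv_a=410, n_a=8200, conv_b=481, n_b=8150)
for name, value in result.items():
    print(name, value)
```

On these inputs the p-value comes out at about 0.011 and the interval at about 0.2 to 1.6 percentage points. Small differences from other tools usually come from rounding or from a continuity correction.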

Step 7: Decide what to ship and what to learn

A statistically significant result is the beginning of the decision, not the end. Three more questions matter:

Is the effect large enough to be worth shipping? A 2% lift on a tiny feature might not justify the ongoing engineering and maintenance cost.
Does the result hold across segments? Check the lift among new users, returning users, mobile, desktop, and top traffic channels. A test that wins overall but loses on mobile is a red flag. Treat segment cuts as exploratory, though: each segment has a smaller sample, so some will diverge by chance.
What did you learn that informs the next test? Even a losing variant teaches you something about your users.

Ship the winner if the effect is meaningful, the segments hold up, and the change is not introducing technical debt or UX issues. Document the result somewhere durable; in two years you will not remember what you tested.

Common A/B testing mistakes to avoid

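Multiple testing without correction. If you run 20 A/B tests at a 0.05 threshold and none of the variants has a real effect, you should still expect about 1 false positive by chance alone, and the chance of seeing at least one is roughly 64%.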
Sample ratio mismatch. If you split 50/50 but the actual user counts are 8,200 and 7,400, something is wrong with your randomization or your tracking. Investigate before trusting any result.
Novelty effects. New features often see a temporary lift just because they are new. Run tests for at least 2 weeks if possible.
Testing too small a change. If the variant is a button color tweak, the expected effect is tiny and you will need huge sample sizes for a credible answer.
Ignoring downstream metrics. A test that lifts free-trial signups but tanks paid conversion is not a win.
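The sample ratio mismatch check from the list above can be automated with a chi-square goodness-of-fit test on the user counts. The sketch below handles the two-arm case, where the chi-square p-value with one degree of freedom can be computed directly from `math.erfc`. The function name is illustrative, and the 0.001 alert threshold is an assumed convention, not a rule from this guide.

```python
import math


def srm_p_value(n_control: int, n_variant: int,
                expected_control_share: float = 0.5) -> float:
    """P-value for 'the observed split matches the intended split' (2 arms)."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # For one degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


for control, variant in [(8200, 7400), (8200, 8150)]:
    p = srm_p_value(control, variant)
    status = "investigate before trusting results" if p < 0.001 else "split looks fine"
    print(f"{control} vs {variant}: p = {p:.3g} -> {status}")
```

The 8,200 vs 7,400 split is flagged with a vanishingly small p-value, while the 8,200 vs 8,150 split from the worked example is well within normal random variation.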

What to do next

Build a habit of running tests, not a habit of running one test. The compounding learning is the point. Even modest single-test lifts (1-3%) compound into significant gains across a year of disciplined experimentation. Plug your numbers into the A/B Test Significance Calculator before launching every test so you know what you are committing to.
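The compounding claim is simple arithmetic. The sketch below assumes each winning test lifts the same metric multiplicatively and that the lift persists after shipping; both are simplifications, and the number of wins per year is a made-up input.

```python
def compounded_gain(lift_per_win: float, wins: int) -> float:
    """Total relative gain after `wins` shipped improvements of equal size."""
    return (1 + lift_per_win) ** wins - 1


for lift in (0.01, 0.02, 0.03):
    print(f"{lift:.0%} per win, 10 wins -> {compounded_gain(lift, 10):.1%} total")
```

Ten shipped wins of 2% each add up to roughly a 22% gain, noticeably more than any single test is likely to deliver.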

Frequently Asked Questions

What are the steps to run a valid A/B test?

A valid A/B test starts with a specific, testable hypothesis, picks a single primary metric in advance, calculates the required sample size before launching, randomizes users correctly and consistently, runs for the planned duration without stopping early, and then analyzes the result with the appropriate statistical test. Skipping any of these steps, especially predetermining sample size, is a common way tests produce misleading or unreliable conclusions.

Why is it important to pick one primary metric before testing?

Tracking many metrics and reporting whichever one happens to show a positive result is a common way to fool yourself, since the odds of at least one metric moving by chance increase with the number tracked. Committing to a single primary metric in advance, tied to a specific hypothesis, keeps the test result honest and actionable.

How do you calculate the sample size needed for an A/B test?

Required sample size depends on your baseline conversion rate, the minimum lift you want to be able to detect, and your desired statistical significance level and power, and it increases sharply as the target lift shrinks. For low-traffic pages or small target effects, the required sample can take many weeks or months to accumulate, which is a real constraint that should be considered before designing a test.

Why shouldn't you stop an A/B test early once it looks significant?

Checking results daily and stopping as soon as the test crosses statistical significance — known as peeking — substantially inflates the true rate of false positives compared to the nominal 5% threshold. The recommended practice is to decide the sample size or end date before launching and stick to it, or to use sequential testing methods specifically designed for repeated monitoring.

Calculators referenced in this guide

A/B Test Significance Calculator

Drop in visitors and conversions — get instant z-test result.

Business & SaaS Disclaimer

This article is for educational purposes. Actual business performance varies based on many factors. SaaSCalcHub is not business or financial advice. Consult business advisors, CPAs, and consultants for your specific situation.

Last updated: Jul 19, 2026