A/B Testing for Marketing: How to Analyze Experiment Results

A/B testing is the gold standard for data-driven marketing decisions. Instead of guessing whether a new headline, email subject line, or landing page will perform better, you test it against the current version and let the data decide. But running a test is the easy part — analyzing the results correctly is where most marketing teams go wrong.

What Is A/B Testing?

An A/B test (also called a split test) randomly divides your audience into two groups:

Control (A): Sees the current version
Treatment (B): Sees the new version

You measure a key metric (conversion rate, click-through rate, revenue per visitor) for each group and determine whether the difference is statistically significant, meaning a difference this large would be unlikely to appear through random chance alone if the two versions truly performed the same.

Step 1: Define Your Hypothesis

Before running any test, write down:

What you are testing: "Changing the CTA button from green to orange"
What you expect: "The orange button will increase click-through rate"
Your primary metric: Click-through rate (CTR)
Your secondary metrics: Conversion rate, bounce rate (for monitoring, not deciding)

A clear hypothesis prevents you from cherry-picking metrics after the test.

Step 2: Calculate Required Sample Size

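The biggest mistake in A/B testing is ending the test too early. You need enough data to give the test adequate statistical power, that is, a high probability of detecting a real effect of the size you care about.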

Key inputs:

Baseline conversion rate: Your current rate (e.g., 3.2%)
Minimum detectable effect (MDE): The smallest improvement worth detecting (e.g., 20% relative increase → 3.2% to 3.84%)
Significance level: α = .05 (standard)
Statistical power: 80% (standard)

Sample Size Examples

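Baseline Rate	MDE (Relative)	Sample Per Group
1%	20%	42,691
3%	20%	13,911
5%	20%	8,155
10%	20%	3,839
3%	50%	2,515
10%	50%	683

Figures use the standard normal-approximation formula for a two-sided test at α = .05 and 80% power. Other calculators use slightly different approximations and may give somewhat different numbers.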

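Lower baseline rates and smaller expected effects require dramatically more traffic. Required sample size grows with the inverse square of the difference you want to detect, so halving the MDE roughly quadruples the traffic you need.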
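The calculation can be sketched in a few lines of Python. This is a minimal version that assumes the common normal-approximation formula for a two-sided test with equal group sizes; the function name and parameters are illustrative, and dedicated calculators may use slightly different approximations.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per group for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # the rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)  # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Baseline 10%, hoping to detect a 20% relative lift (10% to 12%)
n_per_group = sample_size_per_group(baseline=0.10, relative_mde=0.20)
print(n_per_group)
```

Running it with a smaller MDE shows how quickly the requirement grows: the denominator is the squared absolute difference, so small effects are expensive to detect.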

Step 3: Run the Test Properly

Random assignment: Each visitor is randomly assigned to A or B
Simultaneous testing: Both versions run at the same time (never A this week, B next week)
Full duration: Run until you hit the pre-calculated sample size, not until you see significance
Single primary metric: Decide what "winning" means before the test starts
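One common way to get random yet stable assignment is to hash a visitor identifier, so the same person always sees the same version. The sketch below assumes you have some persistent visitor ID (for example, a cookie value); the function name, the experiment label, and the 50/50 split are illustrative choices, not a prescribed implementation.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str) -> str:
    """Deterministically bucket a visitor into 'A' or 'B' with a 50/50 split."""
    # Including the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-random bucket from 0 to 99
    return "A" if bucket < 50 else "B"


# Sanity check: the split across many hypothetical visitors should be near 50/50.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}", "cta-color")] += 1
print(counts)
```

Because the assignment depends only on the ID and the experiment name, a returning visitor lands in the same group without any stored state.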

Step 4: Analyze the Results

The Core Analysis: Two-Proportion Z-Test

For conversion rate tests (binary outcome: converted or not), the standard test is a two-proportion z-test.

Example data:

Group	Visitors	Conversions	Rate
Control (A)	5,000	160	3.20%
Treatment (B)	5,000	198	3.96%

The question: Is the 0.76 percentage point difference (3.96% - 3.20%) statistically significant?

Calculating by Hand

Pooled proportion: p = (160 + 198) / (5000 + 5000) = 0.0358

Standard error: SE = √(p × (1-p) × (1/n₁ + 1/n₂)) = √(0.0358 × 0.9642 × 0.0004) = 0.00372

Z-statistic: z = (0.0396 - 0.0320) / 0.00372 = 2.04

P-value: For z = 2.04, p = .041 (two-tailed)

Conclusion: p < .05 → The difference is statistically significant. Treatment B wins.
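The same hand calculation can be reproduced with the Python standard library alone. This is a small sketch of the pooled two-proportion z-test using the example data; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) using the pooled standard error."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = two_proportion_z_test(160, 5000, 198, 5000)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Without intermediate rounding, z comes out near 2.045 rather than 2.04, because the hand calculation rounds the standard error to 0.00372 first. The p-value still rounds to .041.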

Using Software

In Excel, SPSS, or Python, you can compute this more easily:

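Python (requires the statsmodels package; proportions_ztest is not part of scipy.stats):

from statsmodels.stats.proportion import proportions_ztest

# Conversions and visitors, listed treatment first so a positive z favors B
stat, p_value = proportions_ztest([198, 160], [5000, 5000])
print(f"z = {stat:.3f}, p = {p_value:.3f}")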

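SPSS: Use Analyze → Descriptive Statistics → Crosstabs with group and outcome as the two variables, and request Chi-square under Statistics. For a 2×2 table, the Pearson chi-square without continuity correction equals the square of the z-statistic, so it gives the same two-tailed p-value.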

Step 5: Calculate the Effect Size and Confidence Interval

Statistical significance alone is not enough. Calculate:

Relative Lift

Lift = (Treatment Rate - Control Rate) / Control Rate × 100

= (3.96% - 3.20%) / 3.20% × 100 = 23.75% relative improvement

Confidence Interval for the Difference

The 95% CI tells you the range of plausible true differences:

CI = (p₂ - p₁) ± 1.96 × SE = 0.0076 ± 1.96 × 0.00372 = [0.0003, 0.0149]

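This means the true difference is plausibly between 0.03 and 1.49 percentage points. The entire interval is above zero, which agrees with the significant z-test, but the lower bound sits close to zero, so the true improvement could be very small.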
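Both quantities are easy to compute together. The sketch below uses the unpooled standard error, which is the usual choice for a confidence interval on a difference; for this data it is almost identical to the pooled value of 0.00372 used above. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def lift_and_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return relative lift (%) and a CI for the absolute difference in rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    lift_pct = diff / rate_a * 100
    # Unpooled standard error: each group keeps its own variance estimate.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return lift_pct, (diff - z_crit * se, diff + z_crit * se)


lift, (low, high) = lift_and_ci(160, 5000, 198, 5000)
print(f"lift = {lift:.2f}%, 95% CI = [{low:.4f}, {high:.4f}]")
```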

Step 6: Make the Decision

Consider three factors:

Statistical significance: Is p < .05?
Practical significance: Is the lift meaningful for your business? A 0.1% improvement might be statistically significant but not worth the implementation effort
Confidence interval width: A wide interval means high uncertainty — you might want more data

Decision Framework

Significant?	Meaningful Lift?	Decision
Yes	Yes	Implement B
Yes	No	Not worth the change
No	Either	Keep A (or rerun with a larger pre-planned sample)
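The framework translates directly into a small function. The threshold min_meaningful_lift_pct is a hypothetical business input that you would set before the test; it is not something the statistics can supply.

```python
def decide(p_value, lift_pct, min_meaningful_lift_pct, alpha=0.05):
    """Map test results onto the three rows of the decision table."""
    if p_value >= alpha:
        return "Keep A"  # not significant, whatever the observed lift
    if lift_pct >= min_meaningful_lift_pct:
        return "Implement B"  # significant and large enough to matter
    return "Not worth the change"  # significant but too small (or negative)


# Example data from above, assuming a 10% relative lift is the business threshold
print(decide(p_value=0.041, lift_pct=23.75, min_meaningful_lift_pct=10))
```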

Common A/B Testing Mistakes

1. Peeking at Results Too Early

Checking results daily and stopping when you see significance dramatically inflates false positive rates. A test that looks significant at day 3 may not be significant at day 14.

Fix: Pre-calculate your sample size and wait until you reach it. Use sequential testing methods if you must check early.
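A quick simulation makes the problem concrete. The sketch below runs A/A tests, where both groups share the same true rate, so every "significant" result is a false positive. The number of tests, looks, visitors per look, and the 10% rate are arbitrary illustrative settings.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Pooled two-proportion z-test for equal group sizes n; True if p < alpha."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variation yet, so no test is possible
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(n_tests=400, looks=10, visitors_per_look=200, rate=0.10, seed=1):
    """Return (false positive rate when peeking, rate when testing once at the end)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            flagged_at_any_look = flagged_at_any_look or significant_now
        peeking_hits += flagged_at_any_look  # would have stopped at some look
        final_hits += significant_now  # result at the planned sample size only
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate_aa_tests()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"test once at the end: {final_rate:.1%}")
```

The single end-of-test check should land near the nominal 5%, while stopping at the first significant look produces a noticeably higher rate, since every extra look is another chance for noise to cross the threshold.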

2. Running Too Many Variants

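Testing A vs. B vs. C vs. D vs. E increases the chance of a false positive. Five variants allow 10 pairwise comparisons; at α = .05 each, the chance of at least one false positive is about 40% (1 − 0.95^10, treating the comparisons as independent). Even comparing only the four treatments against control gives about 18.5%.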

Fix: Limit to 2–3 variants. Apply Bonferroni correction if testing multiple: α_adjusted = .05 / number of comparisons.
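The arithmetic behind these figures is short enough to check directly. The sketch below treats the comparisons as independent, which is a simplifying assumption (comparisons that share a group are correlated), so the family-wise rates are approximations.

```python
from math import comb


def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_alpha(n_comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_comparisons


variants = 5
all_pairs = comb(variants, 2)  # every variant against every other: 10
vs_control = variants - 1      # each treatment against control only: 4
print(f"{familywise_error_rate(all_pairs):.3f}")   # 0.401
print(f"{familywise_error_rate(vs_control):.3f}")  # 0.185
print(bonferroni_alpha(vs_control))                # 0.0125
```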

3. Testing on Wrong Metrics

Optimizing for click-through rate when your goal is revenue can lead to worse outcomes. More clicks do not always mean more money.

Fix: Choose your primary metric based on the business objective, not what is easiest to measure.

4. Ignoring Segment Differences

A test that shows no overall effect may have strong positive effects for one segment and negative for another — canceling out in aggregate.

Fix: Pre-register key segments (mobile vs. desktop, new vs. returning) and analyze them separately.
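The sketch below shows how opposite segment effects can cancel out. The counts are hypothetical, invented purely for illustration, and the per-segment threshold applies a Bonferroni correction because two segments mean two comparisons.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: (conversions_A, visitors_A, conversions_B, visitors_B)
segments = {
    "mobile": (100, 2500, 140, 2500),   # B looks better on mobile
    "desktop": (140, 2500, 100, 2500),  # B looks worse on desktop
}
alpha_per_segment = 0.05 / len(segments)  # Bonferroni across the segments

# Summing the columns gives the aggregate test, where the effects cancel.
overall = [sum(counts[i] for counts in segments.values()) for i in range(4)]
print(f"overall: p = {z_test_p_value(*overall):.3f}")
for name, counts in segments.items():
    p = z_test_p_value(*counts)
    print(f"{name}: p = {p:.4f}, significant = {p < alpha_per_segment}")
```

In this made-up data the aggregate shows no difference at all, while each segment on its own passes the corrected threshold, in opposite directions.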

5. Not Accounting for Seasonality

Running a test during Black Friday and comparing to a baseline from October will show inflated results.

Fix: Always run control and treatment simultaneously. Never compare to a historical baseline.

Beyond Simple A/B Tests

Multivariate testing: Test combinations of changes (headline + image + CTA) simultaneously
Bayesian A/B testing: Get probability statements ("93% chance B is better") instead of p-values
Multi-armed bandit: Automatically shift traffic to the winning variant during the test
Causal impact analysis: For changes you cannot A/B test (pricing, store layout), use time-series methods
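As a taste of the Bayesian approach, the sketch below estimates the probability that B's true rate exceeds A's for the example data from Step 4. It assumes uniform Beta(1, 1) priors and uses simple Monte Carlo sampling; both are illustrative choices, and a real analysis might use informed priors or a dedicated library.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


probability = prob_b_beats_a(160, 5000, 198, 5000)
print(f"P(B better than A) is roughly {probability:.1%}")
```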
