A/B Testing a Checkout Page: An E-commerce Case Study That Lifted Conversions 18%
An online retailer had a checkout abandonment problem: 68% of shoppers who added an item to the cart never completed the purchase. The team had a hunch that the multi-step checkout was the culprit. Instead of redesigning on instinct, they ran a proper A/B test. This is the full analysis, from hypothesis to the decision.
The Hypothesis
Before touching the data, we wrote the hypothesis down — this discipline is what separates a real experiment from a guess.
- Control (A): the existing 3-step checkout (shipping → payment → review)
- Variant (B): a single-page checkout with all fields on one screen
- H₀ (null): the checkout design has no effect on conversion rate
- H₁ (alternative): the single-page checkout has a different conversion rate
- Primary metric: completed purchases ÷ checkout sessions started
We chose a two-tailed test because we genuinely did not know the direction of the effect: a single page could just as easily hurt conversion by overwhelming users.
Step 1: Calculate Sample Size First
One of the most common A/B testing mistakes is stopping the test the moment it "looks" significant. To avoid it, we fixed the sample size in advance.
Inputs:
- Baseline conversion (control): 32%
- Minimum detectable effect we cared about: 3 percentage points (a lift to 35%)
- Significance level (α): 0.05
- Power (1 − β): 0.80
A power calculation returned roughly 3,900 sessions per group. We agreed to run the test until both groups hit that number and not peek at significance before then. Traffic was split 50/50 by a random assignment cookie.
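For readers who want to reproduce the power calculation, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_group` is illustrative, and the result may differ by a few sessions from what a dedicated power tool reports, since tools use slightly different approximations.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate sessions per group for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    # Sum of the binomial variances under the two assumed conversion rates.
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)


# Inputs from the list above: 32% baseline, 35% target, alpha 0.05, power 0.80.
n_per_group = sample_size_per_group(0.32, 0.35)
print(n_per_group)
```

With the inputs above, this lands close to the roughly 3,900 sessions per group used for the test.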
Step 2: Run the Test Cleanly
Two rules kept the experiment valid:
- Randomization at the visitor level, so the same user always saw the same version (no flip-flopping that confuses behavior).
- Run for full weeks. Weekend and weekday shoppers behave differently; we ran exactly two full weeks so each variant saw the same mix of days.
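One way to get this kind of visitor-level stickiness is to derive the group from a hash of a stable visitor ID instead of storing a random draw. The sketch below shows that alternative; it is not the retailer's cookie implementation, and `visitor_id`, the experiment name, and `assign_variant` are hypothetical.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "checkout-layout") -> str:
    """Map a visitor to "A" or "B" deterministically, with a roughly 50/50 split."""
    # Including the experiment name keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # integer in 0..99
    return "A" if bucket < 50 else "B"


# The same visitor always lands in the same group.
assert assign_variant("visitor-123") == assign_variant("visitor-123")
```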
After 14 days:
| Group | Sessions | Conversions | Conversion Rate |
|---|---|---|---|
| A (Control) | 4,118 | 1,318 | 32.0% |
| B (Single-page) | 4,090 | 1,545 | 37.8% |
A 5.8 percentage point raw lift. But is it real or noise? That is what the test decides.
Step 3: The Statistical Test
Conversion is a yes/no outcome across two groups, which means a chi-square test of independence (or, equivalently here, a two-proportion z-test) is the right tool.
In SPSS
If your data is one row per session with two columns (Group, Converted):
Analyze > Descriptive Statistics > Crosstabs
- Row: Group
- Column: Converted
- Click Statistics and tick Chi-square
- Click Cells and tick Observed, Expected, and Row percentages (with Group in the rows, row percentages give each group's conversion rate)
The Output
The Chi-Square Tests table returned:
- Pearson Chi-Square = 30.1, df = 1, p < .001
Because p < .001 is well below our α of 0.05, we reject the null hypothesis. The difference in conversion rate is statistically significant — it is very unlikely to be a fluke of random traffic.
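The same test can be reproduced outside SPSS. The sketch below is a minimal standard-library Python version that takes the counts from the results table; the function name `two_proportion_test` is illustrative. It relies on the fact that, for a 2×2 table, the square of the pooled two-proportion z statistic equals the Pearson chi-square without continuity correction.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (z, chi_square, two_sided_p) for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both groups share one pooled conversion rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, z ** 2, p_value


# Counts from the results table: control first, then single-page.
z, chi_square, p_value = two_proportion_test(1318, 4118, 1545, 4090)
print(f"z = {z:.2f}, chi-square = {chi_square:.1f}, p = {p_value:.2g}")
```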
Effect Size — Don't Skip It
Significance tells you the difference is unlikely to be chance; effect size tells you whether it is big enough to matter. For a 2×2 table the usual measure is the phi coefficient, available in the Crosstabs Statistics box (tick Phi and Cramer's V).
- Phi = 0.061
That is below even Cohen's 0.10 benchmark for a small effect, but across tens of thousands of checkout sessions a month, a small effect on conversion is worth a lot of money. This is the key lesson: statistical significance and business significance are different questions, and you need both numbers.
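Phi can also be computed directly from the four cell counts of the 2×2 table. This is a small sketch assuming the counts from the results table; `phi_coefficient` is an illustrative name, not an SPSS or library function.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table of counts laid out as [[a, b], [c, d]]."""
    numerator = a * d - b * c
    # Product of the two row totals and the two column totals.
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Rows: control, single-page. Columns: not converted, converted.
phi = phi_coefficient(4118 - 1318, 1318, 4090 - 1545, 1545)
print(round(phi, 3))
```

With this layout, a positive phi means the single-page group is associated with converting. Squaring phi and multiplying by the total number of sessions recovers the Pearson chi-square, which makes a convenient cross-check.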
Step 4: Quantify the Business Impact
The relative lift was (37.8 − 32.0) / 32.0 = 18.1%.
With an average order value of $64 and roughly 40,000 checkout sessions per month, the projected additional revenue was:
- Extra conversions per month ≈ 40,000 × 0.058 = 2,320
- Extra monthly revenue ≈ 2,320 × $64 ≈ $148,000
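The arithmetic above can be wrapped in a small helper so the projection is easy to rerun with different traffic or order values. This is a sketch that assumes the observed lift persists after rollout; `project_monthly_impact` is an illustrative name.

```python
def project_monthly_impact(sessions_per_month, rate_control, rate_variant, avg_order_value):
    """Project relative lift, extra conversions, and extra revenue per month."""
    absolute_lift = rate_variant - rate_control      # in proportion points
    relative_lift = absolute_lift / rate_control
    extra_conversions = sessions_per_month * absolute_lift
    extra_revenue = extra_conversions * avg_order_value
    return relative_lift, extra_conversions, extra_revenue


relative_lift, extra_conversions, extra_revenue = project_monthly_impact(
    sessions_per_month=40_000, rate_control=0.320, rate_variant=0.378, avg_order_value=64
)
print(f"{relative_lift:.1%} lift, {extra_conversions:,.0f} extra orders, ${extra_revenue:,.0f}")
```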
The team rolled out the single-page checkout to 100% of traffic the following week.
Common A/B Testing Mistakes (And How We Avoided Them)
- Peeking and early stopping. Checking significance daily and stopping at the first "p < .05" pushes the false-positive rate well above the nominal 5%. We fixed sample size in advance.
- Testing too many things at once. We changed only the checkout layout. If we had also changed the button color and the shipping copy, a win would not tell us which change worked.
- Ignoring segments. We later split results by device and found the lift was driven almost entirely by mobile users, while desktop was flat. Always look one level deeper before generalizing.
- Reporting only the p-value. We reported the conversion rates, the lift, the p-value, and the effect size. A p-value with no rates cannot be interpreted.
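The cost of peeking is easy to see in a simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares stopping at the first significant look with testing once at the planned sample size. The number of simulated tests, looks, and batch size are illustrative assumptions, not the retailer's data.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n sessions in each arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return se > 0 and abs(conv_b - conv_a) / n / se > Z_CRIT


def simulate_aa_tests(n_tests=500, looks=10, batch=100, rate=0.32, seed=0):
    """Return (false-positive rate with peeking, false-positive rate at the final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        ever_significant = False
        significant = False
        for look in range(1, looks + 1):
            # Each look adds one batch of sessions per arm at the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            significant = is_significant(conv_a, conv_b, look * batch)
            ever_significant = ever_significant or significant
        peeking_hits += ever_significant  # would have stopped at some look
        final_hits += significant         # tested once, at the planned size
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first p < .05: {peeking_rate:.1%}; fixed sample size: {fixed_rate:.1%}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the fixed-sample rate, and in runs like this it is typically well above it.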
How to Report It
A chi-square test of independence examined the relationship between checkout design and purchase completion. Conversion was significantly higher for the single-page checkout (37.8%) than the three-step control (32.0%), χ²(1) = 30.1, p < .001, φ = 0.06. The single-page design produced an 18.1% relative lift in conversions.
The Takeaway
A/B testing is not about having opinions about design — it is about replacing opinions with evidence. The structure is always the same: write the hypothesis, fix the sample size, randomize cleanly, run a chi-square or z-test, report significance and effect size, then translate it into money.