How to A/B test cold emails (the math, the variables, the discipline)
How do you A/B test cold emails properly?
Short answer: with sufficient volume (500+ sends per variant as a minimum), one variable changed at a time, and the discipline to wait for the test to complete before drawing conclusions. Most B2B cold email A/B testing is statistical noise dressed up as insight.
TL;DR — the rules
| Rule | Standard |
|---|---|
| Minimum sample per variant | 500 sends |
| Variables changed simultaneously | 1 |
| Test duration | Full sequence cycle (3 weeks typical) |
| Significance threshold | 95% confidence (use a calculator) |
| List assignment | Random split of the same list |
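The "95% confidence" rule can be checked without an online calculator. Below is a minimal sketch of a pooled two-proportion z-test, which is one common way such calculators work. The function name and the reply counts are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(replies_a, sends_a, replies_b, sends_b):
    """Return (z, two-sided p-value) for the difference in reply rates."""
    rate_a = replies_a / sends_a
    rate_b = replies_b / sends_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (replies_a + replies_b) / (sends_a + sends_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, for illustration only: 25/500 replies vs 35/500 replies.
z, p = two_proportion_z_test(25, 500, 35, 500)
print(f"z = {z:.2f}, p = {p:.3f}, significant at 95%: {p < 0.05}")
```

With these example counts (5% vs 7% reply rate at 500 sends each) the p-value comes out well above 0.05, so the apparent winner would not clear the 95% threshold.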
Variables worth testing
| Variable | Impact range |
|---|---|
| Subject line | 5–25% on opens |
| First line | 10–30% on reply |
| Email length | 5–15% on reply |
| Call-to-action | 10–25% on meeting booking |
| Sender name (founder vs SDR) | 20–50% on reply |
| Send time | 5–15% on opens |
| Day of week | 5–10% on opens |
Test the high-impact variables first — first line and sender name. Test lower-impact variables (send time) only once the higher levers are optimised.
Sample size math
For a 5% reply rate, to detect a 2-point lift to 7% with 95% confidence:
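Roughly 800–1,000 sends per variant as a bare minimum, and that gives slightly less than an even chance of detecting the lift. At the conventional 80% power, the standard two-proportion formula calls for about 2,200 sends per variant.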
Most teams declare a winner after 100 sends. At that sample, the noise is bigger than the signal.
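The sample size arithmetic can be reproduced with the usual normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: a two-sided test, equal-sized variants, and a power setting that the text above does not specify. The function names `sends_per_variant` and `noise_band` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sends_per_variant(base_rate, target_rate, confidence=0.95, power=0.80):
    """Sends needed per variant to detect base_rate -> target_rate (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / (target_rate - base_rate) ** 2)


def noise_band(base_rate, target_rate, sends, confidence=0.95):
    """Half-width of the confidence interval on the difference between two variants."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
    return z_alpha * sqrt(variance / sends)


print(sends_per_variant(0.05, 0.07))              # assumes 80% power
print(sends_per_variant(0.05, 0.07, power=0.50))  # assumes 50% power
print(f"{noise_band(0.05, 0.07, 100):.3f}")       # margin at 100 sends per variant
```

At 80% power the formula asks for roughly 2,200 sends per variant. Dropping to 50% power brings it down to about 1,100, which is close to the 800–1,000 figure quoted above. At 100 sends per variant the margin on the difference is around 6.6 percentage points, more than three times the 2-point lift being tested.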
What kills A/B test validity
Multiple variables changed. Subject + first line + CTA all different — you cannot attribute the lift to any one.
Different lists. Variant A on list 1, Variant B on list 2. Lists differ; the test is meaningless.
Different time windows. Variant A in March, Variant B in May. Seasonality varies; the test is meaningless.
Early stopping. Declaring a winner at 100 sends because A is ahead. An early lead at that volume is mostly noise and often reverses by the time the full sample is in.
Vanity-metric optimisation. Optimising for opens alone, ignoring reply rate.
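The "different lists" and "different time windows" problems both go away when a single list is shuffled once and split, with both halves sent in the same window. Tools with built-in A/B testing typically do this for you. As a minimal sketch of the idea, with a hypothetical `split_list` helper and placeholder addresses:

```python
import random


def split_list(prospects, seed=42):
    """Shuffle one prospect list and split it into two near-equal variants."""
    shuffled = list(prospects)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Placeholder addresses, for illustration only.
prospects = [f"contact{i}@example.com" for i in range(1000)]
variant_a, variant_b = split_list(prospects)
print(len(variant_a), len(variant_b))
```

Because assignment is random, differences in seniority, industry, or list source are spread evenly across both variants instead of being confounded with the copy change.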
The iteration cadence
A serious cold email program:
| Cadence | Activity |
|---|---|
| Weekly | Monitor performance; flag anomalies |
| Bi-weekly | Review test results; design next test |
| Quarterly | Major sequence refresh based on 6–8 tests |
| Annually | Full library rebuild |
This produces ~25–30 tested variables per year. Compounded, the sequence library lifts 30–60% over 12 months vs. an unchanged baseline.
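To see how small wins compound into a range like that, here is a rough illustration. Every input is an assumption made for the example, not a measured figure: 26 bi-weekly tests per year, a quarter to a third of tests producing a winner, and each winner adding a 5% relative lift. `compounded_lift` is a hypothetical helper.

```python
def compounded_lift(tests_per_year, win_rate, lift_per_win):
    """Cumulative relative lift when each winning test multiplies performance."""
    wins = tests_per_year * win_rate
    return (1 + lift_per_win) ** wins - 1


# All three inputs are illustrative assumptions, not measured figures.
for win_rate in (0.25, 0.35):
    lift = compounded_lift(tests_per_year=26, win_rate=win_rate, lift_per_win=0.05)
    print(f"win rate {win_rate:.0%}: {lift:.0%} cumulative lift")
```

Under these assumptions the cumulative lift falls inside the 30–60% range. The point is that most of the gain comes from the number of valid tests completed, not from any single large win.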
For UAE & KSA teams
- Smaller list volumes make MENA-specific A/B testing harder. A test may need 4–8 weeks rather than the typical 3.
- Combine UAE + KSA into one test for the same persona where lists are small.
- Seasonality affects timing. Avoid running tests across Ramadan boundaries.
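A quick way to estimate test length for a small list is to add the enrolment time to the sequence cycle. The sketch below assumes prospects are enrolled at a steady weekly rate and that the test ends when the last cohort finishes a 3-week sequence. The weekly volumes and the `weeks_per_test` helper are hypothetical.

```python
from math import ceil


def weeks_per_test(sends_per_variant, weekly_new_prospects, sequence_weeks=3, variants=2):
    """Weeks until the last enrolled cohort has finished the full sequence."""
    enrolment_weeks = ceil(sends_per_variant * variants / weekly_new_prospects)
    # The last cohort starts in the final enrolment week, then runs the sequence.
    return (enrolment_weeks - 1) + sequence_weeks


# Hypothetical weekly volumes, for illustration only.
print(weeks_per_test(500, weekly_new_prospects=1000))  # whole list enrolled in week 1
print(weeks_per_test(500, weekly_new_prospects=250))   # smaller regional list
```

With 1,000 new prospects a week, the 500-per-variant minimum fits in the typical 3-week window. At 250 a week, the same test stretches to 6 weeks, which is also why pooling UAE and KSA lists for one persona can be worth it.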
What MAVEN does about it
A/B test design + execution is part of every Sales Process Program and Apollo Quick-Start.
Frequently asked
Should I test subject lines or body first?
First line. Bigger impact on reply.
How long does a test take?
3 weeks minimum to capture the full sequence cycle.
Use Apollo's built-in A/B testing?
Yes — it handles the splitting cleanly.
Test sender names?
Yes, periodically. Founder-sent often outperforms SDR-sent by 20–50%.
What's the most overlooked test variable?
Email length. Many teams test subject obsessively and ignore that 90-word emails outperform 200-word ones materially.
Post 81 of our outbound + sales OS series.