A/B Testing vs Multi-Armed Bandit: Which Experimentation Method for PMs?

TL;DR — Which Experimentation Method Should PMs Actually Use?

A/B testing gives you statistical confidence with a catch: you need traffic, time, and a willingness to keep serving the losing variant until the test ends. Multi-armed bandit (MAB) optimizes for speed with a different catch: you might exploit before you have explored enough to know what actually works. The judgment that separates senior PMs from the rest isn't picking the better framework. It's knowing which organizational context makes each one a career-limiting move or a career-accelerating one.

I sat on a hiring committee at a FAANG-level company where a candidate with a flawless MAB implementation on a low-traffic onboarding flow got passed over. The hiring manager's verdict during debrief: "Great execution, zero judgment on whether we should have run an A/B test to establish a baseline first." The candidate prepared the mechanics. They missed the signal.

You're a PM interviewing at Google, Meta, or a Series C+ startup who can explain statistical significance but gets pushed back when hiring managers ask "when would you not use this?" You currently earn $160K-$220K base, have 3-7 years experience, and your pain point is interview answers that sound textbook-correct but fail the "been in the room" test. This article gives you the specific scenarios, numbers, and counterintuitive framings that change your signal from "prepared candidate" to "person who has shipped."

The Real Decision Framework: When A/B Testing Is a Trap

Most candidates answer "when do you use A/B testing?" with sample size formulas. The candidates who get offers answer with organizational scars.

The scenario that separates levels: You're at a company with 100K monthly active users. Your onboarding conversion is 12%. You want to test a new tutorial flow. An A/B test with 80% power, 5% significance, and a 2% minimum detectable effect requires ~40,000 users per variant, or 80,000 in total (the exact figure swings widely depending on whether that 2% is absolute or relative, so say which one you mean). If roughly 1,000 new users reach onboarding each day, that's 80 days at a 50/50 split. In those 80 days, your quarterly OKR review happens, your engineer rotates, and your competitor ships something similar.
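If you want to sanity-check a timeline like this before quoting it, here is a minimal sketch using the standard normal-approximation formula for a two-sided, two-proportion test. The function names and the 1,000 new users per day are illustrative assumptions, not figures from any real product. Note how the answer moves by well over an order of magnitude depending on whether the 2% is read as absolute percentage points or as relative lift.

```python
import math
from statistics import NormalDist

def users_per_variant(baseline, lift, relative=False, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion z-test."""
    p1 = baseline
    # Relative lift scales the baseline; absolute lift adds percentage points.
    p2 = baseline * (1 + lift) if relative else baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def days_to_finish(per_variant, daily_users, variants=2):
    """Days needed if every eligible user is enrolled and split evenly."""
    return math.ceil(per_variant * variants / daily_users)

# 12% baseline, 2% MDE, read both ways.
absolute_n = users_per_variant(0.12, 0.02)                 # 12% -> 14%
relative_n = users_per_variant(0.12, 0.02, relative=True)  # 12% -> 12.24%

DAILY_NEW_USERS = 1_000  # hypothetical traffic into onboarding
print("absolute:", absolute_n, "per variant,", days_to_finish(absolute_n, DAILY_NEW_USERS), "days")
print("relative:", relative_n, "per variant,", days_to_finish(relative_n, DAILY_NEW_USERS), "days")
```

In an interview, the number matters less than showing that you know which inputs drive it: baseline rate, effect size, and how many users actually reach the surface each day.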

The counterintuitive truth: A/B testing isn't wrong here. It's organizationally expensive in a way that MAB isn't.

But here's the hiring committee pushback I watched a candidate fail:

Hiring Manager: "Why not MAB?"

Candidate: "MAB would optimize faster."

HM: "And what do we lose?"

Candidate: [silence, then] "Statistical rigor?"

HM: "No. We lose the ability to ever know if the winner was real or lucky. You just told me you'd sacrifice learning for speed on a flow you'll redesign in 6 months anyway."

The candidate prepared the frameworks. They didn't prepare the trade-off narrative. The judgment: A/B testing is for when the organizational cost of being wrong exceeds the time cost of waiting. MAB is for when the time cost of waiting exceeds the reversibility cost of being slightly wrong.

Multi-Armed Bandit: The Undercover Career Risk

MAB sounds modern. In interviews, it's also where overprepared candidates self-select out.

The specific failure mode: A candidate at Meta described using Thompson Sampling for notification timing. It sounded sophisticated. Then the cross-functional interviewer asked: "Your 'exploration' phase showed the 2pm send performed 15% worse for 3 weeks. At what exploitation percentage did you commit?" The candidate hadn't set a commit threshold. They'd let the bandit run for 6 weeks, never reached statistical confidence on any arm, and declared the arm receiving the most traffic the winner.

The judgment: MAB without a stopping rule isn't experimentation. It's expensive randomization with better PR.
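To make "stopping rule" concrete, here is a rough sketch of Thompson Sampling on two arms with an explicit commit threshold, a minimum number of pulls per arm, and a hard user budget. Everything numeric here is an assumption chosen for illustration: the true rates, the 95% commit threshold, the 200-pull floor, and the review cadence. It is one possible rule, not a recommended standard.

```python
import random

def prob_best(successes, failures, rng, draws=2000):
    """Monte Carlo estimate of P(arm has the highest true rate), Beta(1, 1) priors."""
    wins = [0] * len(successes)
    for _ in range(draws):
        samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]

def run_bandit(true_rates, rng, commit_at=0.95, min_pulls=200,
               max_users=20_000, check_every=500):
    k = len(true_rates)
    successes, failures = [0] * k, [0] * k
    for user in range(1, max_users + 1):
        # Thompson Sampling: draw one plausible rate per arm, serve the highest draw.
        sampled = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        arm = sampled.index(max(sampled))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        # Stopping rule, evaluated only at fixed review points.
        if user % check_every == 0:
            pulls = [s + f for s, f in zip(successes, failures)]
            p_best = prob_best(successes, failures, rng)
            leader = p_best.index(max(p_best))
            if min(pulls) >= min_pulls and p_best[leader] >= commit_at:
                return {"decision": "commit", "arm": leader, "users": user}
    # Budget exhausted without meeting the rule: say so, do not crown the busiest arm.
    return {"decision": "no_winner", "arm": None, "users": max_users}

rng = random.Random(7)               # seeded so the simulation is repeatable
result = run_bandit([0.10, 0.12], rng)  # hypothetical true conversion rates
print(result)
```

The design point is the last return statement: when the budget runs out, the honest output is "no winner," which is the answer the candidate above could not give.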

The real framework senior PMs use:

Factor	A/B Test	MAB
Decision reversibility	Irreversible (pricing, architecture)	Reversible (button color, copy)
Sample timeline	Fixed, known	Adaptive, unknown
Organizational learning	High: you know why it won	Low: you know that it won
Political capital needed	High: you need buy-in to wait	Low: you optimize continuously

The candidates who advance don't just know this table. They tell the scene where they got this wrong.

Preparation Checklist — Before Your PM Interview

Work through a structured preparation system (the PM Interview Playbook covers Google-specific experimentation frameworks with real debrief examples where candidates missed the "organizational cost" layer).

Prepare one A/B test scenario with exact numbers. "At 50K MAU, 8% baseline conversion, 2% MDE, we needed 62 days. I presented this timeline to leadership and we chose MAB instead because the feature had a 4-week competitive window." Not: "I consider sample size."

Prepare your "when I was wrong" story. "I used MAB for a pricing test because it felt faster. We never achieved stable convergence. Now I use A/B for pricing, MAB for UI micro-optimizations."

Know your company's actual scale. If interviewing at Google, know that Search experiments often run at 0.1% traffic allocation with millions of impressions. If at a Series B startup, know that "statistical significance" is often a luxury good you can't afford.

Script your "it depends" opener. "For irreversible decisions with high organizational stakes, A/B. For reversible, low-stakes optimizations where speed beats certainty, MAB. The error I see most often is using MAB because it feels more sophisticated, not because the context demands it."

Prepare for the pushback. "But doesn't MAB always win on regret?" Your answer: "Only if you define regret as short-term conversion lift. If you define it as 'chance of shipping the wrong winner and cementing it for quarters,' A/B often wins."

Mistakes to Avoid — BAD vs. GOOD

BAD: "A/B testing requires large sample sizes, while MAB is more efficient."

GOOD: "At my last company, we had 30K users. An A/B test for our checkout flow would have taken 90 days. We ran MAB for 14 days, got a 4% lift, and I presented it as a win. Six months later, we couldn't replicate the result. Now I know: that wasn't a win. That was noise I couldn't distinguish from signal because I stopped exploring too early."

BAD: "MAB is better for dynamic environments."

GOOD: "I used MAB for our recommendation algorithm because items changed daily. The bandit kept exploring new items. But I never defined 'exploration success'—whether an item got enough impressions to evaluate. We promoted items that won on 200 impressions against control's 20,000. The hiring committee at [Company] asked me how I prevented this. I didn't have an answer then. I do now: minimum exposure thresholds before exploitation, even in MAB."
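A minimum exposure threshold can be as simple as a gate in front of the promotion decision. The sketch below uses the 200 and 20,000 impression counts from the story; the click counts, the 2,000-impression floor, and the function names are hypothetical. It also computes a Wilson score interval to show why 200 impressions is too little to call a winner.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def promotable(arms, min_impressions=2_000):
    """Names of arms with enough exposure to be judged at all."""
    return [name for name, (_, impressions) in arms.items()
            if impressions >= min_impressions]

# (clicks, impressions); click counts are made up for illustration.
arms = {"control": (1_000, 20_000), "new_item": (14, 200)}

for name, (clicks, impressions) in arms.items():
    lo, hi = wilson_interval(clicks, impressions)
    print(f"{name}: rate={clicks / impressions:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")

print("eligible for promotion:", promotable(arms))
```

Here the new item's observed rate is higher, but its interval is wide enough to overlap control's, and the gate keeps it out of the promotion decision until it has earned enough impressions.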

BAD: "Statistical significance matters in both."

GOOD: "In my Google onsite, the interviewer asked why we use 95% confidence. I said 'industry standard.' He pushed: 'We run thousands of tests. At 95%, about 5% of the tests where nothing really changed still come back as winners. That's roughly 50 false winners for every 1,000 such tests. How do we handle that?' I froze. The answer isn't the confidence level. It's the organizational cost of false positives multiplied by volume. At Google scale, you need multiple-comparison corrections such as Bonferroni or false discovery rate control. At startup scale, you need to know whether you can afford any false positives."
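For readers who have heard "Bonferroni" and "false discovery rate" but never applied them, here is a small sketch comparing an uncorrected 0.05 cutoff with a Bonferroni correction and the Benjamini-Hochberg procedure. The ten p-values are invented for illustration; in practice a stats library would handle this, and the hand-rolled functions are only meant to show the mechanics.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only if p <= alpha / m, controlling the chance of any false positive."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value sits under its own threshold.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values from ten concurrent experiments.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

naive = [p <= 0.05 for p in p_values]
print("uncorrected winners:      ", sum(naive))
print("Benjamini-Hochberg winners:", sum(benjamini_hochberg(p_values)))
print("Bonferroni winners:       ", sum(bonferroni(p_values)))
```

The same ten results yield a different number of "winners" under each rule, which is the trade-off to narrate: stricter control means fewer false wins and more real wins left on the table.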

FAQ

Q: I said "A/B test for everything" in my Meta interview and got rejected. Was I wrong?

You signaled rigidity, not rigor. Meta's infrastructure supports both. The judgment they wanted: "For News Feed ranking changes that affect billions of impressions, A/B with holdout. For notification copy tests with 48-hour decision windows, MAB." The specific reframe: "I used to think A/B was the responsible choice. Now I think the responsible choice is matching method to decision cost."

Q: How do I answer "design an experiment" if I don't know the company's scale?

Ask. Then anchor. "Before I commit: what's the monthly traffic to this surface, and what's the decision reversibility? At 100K visitors with reversible UI, I'd propose MAB with 20% minimum exploration and a 2-week review gate. At 10M visitors with pricing implications, I'd propose A/B with 1% holdout and quarterly business review." The phrase "review gate" signals you've managed real experiments, not just studied them.

Q: The PM Interview Playbook mentions "organizational cost" as a framework. How do I use it without sounding scripted?

Drop the phrase once, with a specific scar. "I learned organizational cost the hard way. We ran a 6-week A/B test for a feature that got deprioritized in week 5. The test finished. No one cared. Now I ask: 'If this test runs 8 weeks and the world changes in week 4, do we still learn something durable?' If no, I question whether we should experiment at all, or just ship and monitor." That's the signal. Not the framework name. The specific failure and revised heuristic.