Claude Skill

A/B Test Setup Coach

Designs an A/B test that can produce a real conclusion — not "the test was inconclusive" 4 weeks in.


What it does

Helps you design a test before launching: hypothesis, success metric, sample size, duration, what could go wrong. Plus the analysis plan so you know upfront how to read the result.

When to use

✓ Before any A/B test, especially landing page / CTA / pricing changes
✓ When your last few tests were "inconclusive" and you want to know why
✓ Pre-quarter when planning what to test

When not to use

✗ You have < 1000 conversions/week — sample size will betray you
✗ The change is small and clearly not worth a test — just ship it

Install

Download the .zip, then unzip into your Claude skills folder.

mkdir -p ~/.claude/skills
unzip ~/Downloads/ab-test-setup-coach.zip -d ~/.claude/skills/

# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.

SKILL.md

---
name: ab-test-setup-coach
description: Use when designing an A/B test or experiment. Triggers on "A/B test", "split test", "experiment setup", "test hypothesis".
---

# A/B Test Setup Coach

A/B tests fail in predictable ways: underpowered, ambiguous metric, leaky variants, novelty effect. Designing for these upfront beats discovering them at week 4.

## Required inputs

1. **What you're testing** — the specific change between control and variant
2. **Hypothesis** — what you predict and why (not "it'll improve things")
3. **Success metric** — the ONE primary metric you'll judge by
4. **Guardrail metrics** — things that shouldn't get worse
5. **Current baseline** — conversion rate, sample volume per week

## Output

### Sample size calculation
- Detect a Y% relative lift in metric X
- At Z% statistical power
- At a confidence level of W% (significance level α = 100% − W%)
- → need N users (or other randomization units) per variant

If N exceeds what you can get in 4-6 weeks, you can't run this test as designed. Options: pick a bigger effect to detect, lower the power requirement, or pick a different test.
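
A minimal sketch of the calculation above, assuming a binary conversion metric, a two-sided two-proportion z-test, and the usual normal approximation. The function names `sample_size_per_variant` and `weeks_needed` are illustrative and not part of the skill; the defaults (95% confidence, 80% power) are common conventions, not requirements.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect a relative lift.

    Assumes a two-sided two-proportion z-test (normal approximation).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def weeks_needed(n_per_variant, weekly_visitors, variants=2):
    """Whole weeks needed to fill every variant at the given weekly traffic."""
    return math.ceil(n_per_variant * variants / weekly_visitors)


# Illustrative inputs: 2.3% baseline, 20% relative lift, 8,000 visitors/week.
n = sample_size_per_variant(0.023, 0.20)
print(n, weeks_needed(n, 8000))
```

With these illustrative inputs the estimate comes out at roughly 18,000 users per variant, or about 5 weeks at 8,000 visitors per week, which sits inside the 4-6 week window. Halving the detectable lift roughly quadruples N, which is why small effects are often out of reach.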

### Test design
- **Sample allocation** — 50/50 typically; 90/10 if risk of broken variant is high
- **Duration** — minimum 2 weekly cycles, prefer 4 weeks for B2B
- **Randomization unit** — user (most common) vs. session vs. account
- **Stopping criteria** — fixed duration (no peeking) or sequential testing
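
One common way to keep assignment stable per randomization unit is to hash the unit ID together with the experiment name. The sketch below assumes string IDs and a single variant; `assign_variant` and the experiment name `pricing-page-v2` are hypothetical, not something the skill prescribes.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a unit (user, session, or account ID) to an arm.

    Hashing the experiment name with the ID keeps the same unit in the same
    arm on every visit, and decorrelates assignments across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # approximately uniform in [0, 1)
    return "variant" if bucket < treatment_share else "control"


# 50/50 split by default; pass treatment_share=0.1 for a 90/10 split.
print(assign_variant("user-42", "pricing-page-v2"))
print(assign_variant("user-42", "pricing-page-v2", treatment_share=0.1))
```

Passing an account ID instead of a user ID switches the randomization unit without changing the code, which matters in B2B where several users share one purchase decision.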

### Guardrails
Variables that shouldn't get worse even if primary improves:
- Revenue per user
- Day-7 retention
- Support ticket volume
- Specific segment performance

### Analysis plan (write BEFORE running)
- Primary metric pass/fail criteria
- How you'll handle small-but-significant results
- How you'll handle null results
- What you'll do if guardrails are breached
- How long after test ends before you make the call (to let lagging metrics surface)
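
Writing the plan as code before launch is one way to make the pass/fail criteria hard to renegotiate later. The sketch below assumes a binary primary metric and a pooled two-proportion z-test; `two_proportion_test`, `decide`, and the thresholds (`alpha`, `min_lift`) are illustrative placeholders to be replaced with whatever you pre-register.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_c, n_c, conv_v, n_v):
    """Two-sided pooled z-test comparing variant vs. control conversion rates."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"relative_lift": (p_v - p_c) / p_c, "p_value": p_value}


def decide(result, alpha=0.05, min_lift=0.05, guardrails_ok=True):
    """Apply decision rules that were fixed before the test started."""
    if not guardrails_ok:
        return "do not ship: guardrail breached"
    if result["p_value"] >= alpha:
        return "null result: keep control"
    if result["relative_lift"] < min_lift:
        return "significant but below minimum worthwhile lift: keep control"
    return "ship variant"


# Illustrative counts: 460/20,000 control vs. 552/20,000 variant conversions.
result = two_proportion_test(460, 20000, 552, 20000)
print(result, decide(result))
```

The `min_lift` branch is where the "small-but-significant" case gets handled: a result can clear the significance bar and still be too small to justify shipping.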

### Common failure modes to pre-empt
- **Novelty effect** — variant wins in week 1, fades by week 4
- **Selection bias** — variants weren't randomly assigned
- **SRM** (sample ratio mismatch) — split came out 55/45 instead of the planned 50/50, which suggests a bug in assignment or logging rather than a real effect
- **Cherry-picking segments** — "it worked for enterprise" post-hoc
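
SRM in particular can be checked mechanically. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom on the observed arm sizes, using only the standard library; `srm_p_value` is a hypothetical helper, and the 0.001 alert threshold is a commonly used convention assumed here, not something the skill specifies.

```python
import math


def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """P-value for whether observed arm sizes match the planned split."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed convention; stricter than the usual 0.05

for control_n, variant_n in [(5050, 4950), (5500, 4500)]:
    p = srm_p_value(control_n, variant_n)
    status = "SRM suspected" if p < SRM_THRESHOLD else "split looks fine"
    print(control_n, variant_n, status)
```

A 5,050/4,950 split is well within chance, while 5,500/4,500 on the same traffic is far outside it. If the check fails, investigate the assignment and logging pipeline before reading any metric.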

## Anti-patterns
- Multiple primary metrics (now everything is "directionally positive")
- "Inconclusive" framing for null results (null IS a result)
- Peeking and stopping early when one variant is ahead
- Testing 5 things at once and not knowing which moved the metric
- Shipping the winner without first checking whether it harms any subgroup
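
To see why peeking is on this list, the simulation below runs A/A tests (no true difference) and compares the false-positive rate when you stop at the first significant look against the rate when you only test once at the end. It is an idealized sketch: the z-statistic is modeled as a running sum of standard normal increments at equally spaced looks, and `false_positive_rates` and its defaults are assumptions for illustration.

```python
import math
import random


def false_positive_rates(n_sims=2000, looks=10, z_crit=1.96, seed=0):
    """Return (peeking_rate, fixed_rate) under the null hypothesis."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_any = False
        z = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)  # z-statistic at this look
            if abs(z) > z_crit:
                crossed_any = True             # a peeker would stop here
        peeking_hits += crossed_any
        fixed_hits += abs(z) > z_crit          # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.3f}, fixed duration: {fixed_rate:.3f}")
```

With ten looks, the peeking rate typically lands in the range of 15 to 20 percent while the fixed-duration rate stays near the nominal 5 percent. Sequential testing methods exist precisely to correct for this by tightening the threshold at each look.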

## When to skip A/B testing
- Effect would have to be huge to matter; just ship the better option
- Sample size will never be enough; do qualitative research instead
- You'd ship the variant regardless of result (then save the cycles)

Example prompts

Once installed, try these prompts in Claude:

I want to test a new pricing page layout. Current weekly visitors: 8K. Current conversion: 2.3%. Help me set up the test.
My last 5 tests were "inconclusive." Audit my test design and tell me what I'm doing wrong.
