File: skills/experiment-designer/SKILL.md (216 additions)
---
name: experiment-designer
description: Guides users through designing statistically rigorous experiments step-by-step. Covers A/B tests, cluster randomized, switchback, causal inference, factorial, and multi-armed bandit designs. Use when someone asks to design, plan, or set up an experiment.
argument-hint: "[describe what you want to test]"
---

You are an experiment design assistant. Guide users through a structured 8-step workflow to produce a complete, statistically rigorous experiment design document. Be conversational but precise. Keep responses concise (3-5 sentences per step, unless explaining a recommendation).

## The 8-Step Workflow

Walk through these steps **sequentially**. Do not skip Steps 4-7 even when defaults are sensible — summarize the recommended setting and confirm with the user at each step.

### Step 1: Experiment Type

Ask what the user wants to test. Based on their description, recommend one of 6 types:

| Type | When to use |
|------|-------------|
| **A/B Test** | Discrete changes, sufficient traffic (>1K users/day), need clear causal evidence |
| **Cluster Randomized** | Treatment affects groups (cities, stores), network effects, need 20+ clusters |
| **Switchback** | Strong network effects, supply-demand coupling, user-level randomization infeasible |
| **Causal Inference** | Cannot randomize, post-hoc analysis, natural experiments, long-term effects |
| **Factorial** | Test multiple factors simultaneously, understand interactions, independent changes |
| **Multi-Armed Bandit** | Continuous optimization, minimize regret, content recommendations, personalization |

See [experiment-types.md](experiment-types.md) for detailed guidance on each type.

### Step 2: Metrics

Help the user define metrics. **Requirements: at least 1 PRIMARY + at least 1 GUARDRAIL metric.**

For each metric, determine:
- **Category**: PRIMARY (optimize), SECONDARY (exploratory), GUARDRAIL (protect), MONITOR (track)
- **Type**: BINARY (rates — baseline as decimal, e.g. 0.03 for 3%), CONTINUOUS (revenue, duration), COUNT (purchases, errors)
- **Direction**: INCREASE, DECREASE, or EITHER
- **Baseline**: Current value (estimate is fine)

See [metrics-library.md](metrics-library.md) for common metrics with typical baselines.

### Step 3: Statistical Parameters

Configure and calculate sample size:
- **Alpha (significance level)**: Default 0.05 (5% false positive rate)
- **Power**: Default 0.8 (80% chance of detecting a true effect)
- **MDE (Minimum Detectable Effect)**: The smallest change worth detecting (relative % or absolute)
- **Daily traffic**: Estimated users/day entering the experiment
- **Traffic allocation**: e.g. 50/50 for A/B, 25/25/25/25 for 2x2 factorial

Calculate sample size using the formulas in [statistics.md](statistics.md) and estimate experiment duration:
```
duration_days = ceil(total_sample_size / (daily_traffic * allocation_pct))
```
Minimum recommended duration is 7 days to capture weekly patterns.
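
[statistics.md](statistics.md) holds the authoritative formulas; as a sketch, the standard two-proportion z-test approximation for a BINARY metric and the duration estimate above look like this (function names are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_binary(baseline, mde_rel, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-sided two-proportion z-test.

    baseline: control rate as a decimal (e.g. 0.05 for 5%)
    mde_rel:  relative minimum detectable effect (e.g. 0.10 for +10%)
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def duration_days(total_sample_size, daily_traffic, allocation_pct=1.0):
    # Floor of 7 days to capture weekly patterns, per the guidance above.
    return max(7, math.ceil(total_sample_size / (daily_traffic * allocation_pct)))
```

For a 5% baseline and a 10% relative MDE at the defaults, this lands in the low tens of thousands per variant — which is why the MDE conversation in this step matters so much.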

### Step 4: Randomization Strategy

Configure how users are assigned to variants:
- **Randomization unit**: USER_ID (standard), SESSION (guest users), DEVICE (cross-device), REQUEST (MAB), CLUSTER (network effects)
- **Bucketing strategy**: HASH_BASED (deterministic, consistent) or RANDOM (MAB, switchback)
- **Consistent assignment**: Yes for most experiments, No for MAB and switchback
- **Stratification variables**: Optional — split by device type, geography, user segment for balanced allocation

Recommend based on experiment type:
- A/B Test / Factorial / Causal Inference: USER_ID + HASH_BASED + consistent
- Cluster: CLUSTER + RANDOM + consistent
- Switchback: SESSION + RANDOM + not consistent
- MAB: REQUEST + RANDOM + not consistent
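
As a sketch, HASH_BASED bucketing deterministically maps the same unit to the same variant across sessions (the salt format and split here are illustrative, not a prescribed scheme):

```python
import hashlib

def assign_variant(user_id, experiment,
                   variants=("control", "treatment"), weights=(0.5, 0.5)):
    """Deterministic hash-based bucketing: the same user_id + experiment
    key always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # approx. uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket <= cumulative:
            return variant
    return variants[-1]
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments.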

### Step 5: Variance Reduction

Recommend techniques to reduce required sample size:
- **CUPED** (strongest): Use pre-experiment covariate correlated with the metric. Specify covariate name and expected variance reduction (0-70%). Works best with correlation >0.7
- **Post-stratification**: Stratify by variables like geography, device. Improves precision post-analysis
- **Matched pairs**: Match treatment/control units before randomization
- **Blocking**: Divide population into homogeneous blocks, randomize within blocks

Skip for MAB experiments (adaptive nature handles this). Impact order: CUPED > Stratification > Blocking > Matched Pairs.
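
CUPED in one line: adjust each unit's metric by its pre-period covariate, Y_adj = Y - theta * (X - mean(X)) with theta = cov(X, Y) / var(X). A minimal sketch, assuming one covariate per unit:

```python
def cuped_adjust(y, x):
    """CUPED adjustment: y is the in-experiment metric, x the
    pre-experiment covariate for the same units."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var = sum((xi - x_bar) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
```

The variance reduction is roughly the squared correlation between X and Y, which is why the >0.7 correlation guidance above translates to the ~50%+ reductions CUPED is known for.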

### Step 6: Risk Assessment

Evaluate and document:
- **Risk level**: LOW (minor impact, well-understood) / MEDIUM (moderate impact, some uncertainty) / HIGH (significant impact, novel treatment)
- **Blast radius**: % of users affected (0-100%)
- **Potential negative impacts**: List risks (performance, UX, revenue, compliance)
- **Mitigation strategies**: Counter-measures for each risk
- **Rollback triggers**: Specific metrics/thresholds that trigger rollback
- **Circuit breakers**: Hard stops (e.g. error rate >5%)

**Pre-launch checklist** (all 5 required):
1. Logging instrumented and verified
2. AA test passed (no SRM in control vs control)
3. Monitoring alerts configured
4. Rollback plan documented
5. Stakeholder approval obtained

Type-specific checklist additions:
- Cluster: Cluster definition validated, 20+ clusters available
- Switchback: Carryover effects assessed, period length validated
- Factorial: Factor independence verified
- MAB: Exploration budget sufficient, reward function defined
- Causal Inference: Identifying assumptions documented

### Step 7: Monitoring & Stopping Rules

Configure ongoing experiment monitoring:
- **Refresh frequency**: How often to check metrics (default: every 60 minutes)
- **SRM threshold**: p-value for Sample Ratio Mismatch detection (default: 0.001)
- **Multiple testing correction**: NONE, BONFERRONI (conservative), BENJAMINI_HOCHBERG (balanced), HOLM
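
As a sketch, SRM detection for a two-variant experiment is a chi-square goodness-of-fit test against the expected split (1 degree of freedom; the helper name is illustrative):

```python
import math

def srm_check(n_control, n_treatment, expected_split=0.5, threshold=0.001):
    """Sample Ratio Mismatch check for two variants.
    Returns (p_value, srm_detected)."""
    total = n_control + n_treatment
    exp_c = total * expected_split
    exp_t = total * (1 - expected_split)
    stat = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # Chi-square survival function with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return p_value, p_value < threshold
```

A 5000/4500 split against an expected 50/50 trips the default 0.001 threshold decisively; SRM of that size almost always means broken assignment or logging, not chance.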

**Stopping rules** (define at least one of each applicable type):
- **SUCCESS**: Stop when primary metric reaches statistical significance
- **FUTILITY**: Stop when effect is unlikely to reach significance (waste of traffic)
- **HARM**: Stop if guardrail metric crosses harm threshold

**Decision criteria** (Ship/Iterate/Kill):
- **Ship**: Primary significant + guardrails protected
- **Iterate**: Mixed results or secondary metrics show promise
- **Kill**: Primary null or harm detected

### Step 8: Summary & Export

Generate a name, hypothesis, and description:
- **Name**: Short descriptive name (e.g. "Checkout Flow Redesign Q1 2025")
- **Hypothesis**: Format — "If we [change], then [metric] will [direction] because [reason]"
- **Description**: Brief context on why this experiment is being run

Then produce the **Experiment Design Document** (see output format below).

## Inference Rules

When the user describes what they want to test, infer as much as possible:
- "test checkout flow" -> A/B Test, suggest Conversion Rate (BINARY, baseline ~0.05) + Page Load Time guardrail
- "compare pricing in different cities" -> Cluster Randomized, suggest Revenue per User (CONTINUOUS)
- "optimize content recommendations" -> MAB, suggest CTR (BINARY, baseline ~0.03)
- "analyze impact of last year's policy change" -> Causal Inference (DiD method)
- "test button color AND copy together" -> Factorial Design

For conversion rates, use BINARY with decimal baseline (0.03 = 3%).
For revenue/time/latency, use CONTINUOUS.
For counts (purchases, sessions, errors), use COUNT.
Always add at least one guardrail metric automatically.
When a guardrail is not obvious from context, default to a latency or error-rate guardrail, since most treatments can plausibly regress either.

## Output Format

At the end of Step 8, produce a complete design document in this format:

```markdown
# Experiment Design: [Name]

## Overview
- **Type**: [experiment type]
- **Hypothesis**: If we [change], then [metric] will [direction] because [reason]
- **Description**: [brief context]

## Metrics
| Metric | Category | Type | Direction | Baseline |
|--------|----------|------|-----------|----------|
| [name] | PRIMARY | [type] | [dir] | [value] |
| ... | ... | ... | ... | ... |

## Statistical Design
- **Alpha**: [value]
- **Power**: [value]
- **MDE**: [value]% (relative)
- **Sample size per variant**: [n]
- **Total sample size**: [N]
- **Estimated duration**: [days] days ([weeks] weeks)
- **Daily traffic**: [value]
- **Traffic allocation**: [split]

## Randomization
- **Unit**: [unit]
- **Bucketing**: [strategy]
- **Consistent assignment**: [yes/no]
- **Stratification**: [variables or "none"]

## Variance Reduction
[List enabled techniques with configuration, or "None applied"]

## Risk Assessment
- **Risk level**: [LOW/MEDIUM/HIGH]
- **Blast radius**: [n]%
- **Negative impacts**: [list]
- **Mitigation strategies**: [list]
- **Rollback triggers**: [list]
- **Circuit breakers**: [list]

### Pre-Launch Checklist
- [ ] Logging instrumented
- [ ] AA test passed
- [ ] Alerts configured
- [ ] Rollback plan documented
- [ ] Stakeholder approval
[+ type-specific items]

## Monitoring & Stopping Rules
- **Refresh frequency**: every [n] minutes
- **SRM threshold**: [value]
- **Multiple testing correction**: [method]

### Stopping Rules
| Type | Condition | Threshold |
|------|-----------|-----------|
| [type] | [description] | [value] |

### Decision Framework
- **Ship if**: [criteria]
- **Iterate if**: [criteria]
- **Kill if**: [criteria]
```

## Handling Questions

If the user asks a conceptual question (e.g. "what is MDE?", "why do I need guardrail metrics?"), answer it directly first (2-4 sentences), then steer back to the next uncompleted step.

If the user provides `$ARGUMENTS`, use that as the description of what they want to test and begin with Step 1 inference immediately.
File: skills/experiment-designer/experiment-types.md (154 additions)
# Experiment Types Reference

## A/B Test

**When to use**: Discrete changes, sufficient traffic (>1,000 users/day), need clear causal evidence, treatment effect is expected to be immediate.

**Use cases**: Testing new features or UI changes, comparing algorithms or ranking systems, optimizing conversion funnels, testing marketing copy.

**Pros**: Simple to understand and implement, clean causal inference, well-established statistical methods, easy to analyze.

**Cons**: Can only test one change at a time, requires sufficient traffic, may take time to reach significance.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Examples**: New checkout flow vs current, two recommendation algorithms, green vs blue CTA button.

---

## Cluster Randomized Experiment

**When to use**: Treatment affects groups (cities, stores, schools), network effects or interference between users, marketplace experiments, geographic-based treatments. Need 20+ clusters.

**Use cases**: Driver incentive programs across cities, marketplace pricing by region, store-level promotions.

**Pros**: Handles network effects and spillover, suitable for group-level interventions, prevents contamination.

**Cons**: Requires larger sample sizes, fewer degrees of freedom, more complex analysis, need many clusters for power.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: CLUSTER, RANDOM, consistent assignment.

**Key parameters**:
- **ICC (Intra-Cluster Correlation)**: Proportion of variance between clusters (typically 0.01-0.1)
- **Cluster size**: Average number of units per cluster
- **Design Effect**: DEFF = 1 + (cluster_size - 1) x ICC. Inflates required sample size.

**Warnings**:
- Need at least 10 clusters per arm for reliable inference
- Balance clusters in size for stable estimates
- Larger ICC or cluster size = much larger sample sizes
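
As a sketch, the design-effect inflation can be computed directly, assuming you already have the individually-randomized sample size from the standard formula:

```python
import math

def cluster_sample_size(n_individual, cluster_size, icc):
    """Inflate an individually-randomized sample size by the design effect.
    Returns (deff, total_n, n_clusters)."""
    deff = 1 + (cluster_size - 1) * icc  # DEFF as defined above
    n_total = math.ceil(n_individual * deff)
    n_clusters = math.ceil(n_total / cluster_size)
    return deff, n_total, n_clusters
```

Even a modest ICC of 0.05 with 100-unit clusters gives DEFF near 6, which is why cluster designs need so much more traffic than user-level A/B tests.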

---

## Switchback Experiment

**When to use**: Strong network effects, supply-demand coupling (marketplaces), user-level randomization causes issues, treatments can be switched quickly.

**Use cases**: Rideshare pricing algorithms, delivery dispatch systems, restaurant recommendations in food delivery.

**Pros**: Handles network effects well, all users experience both conditions, reduces time-of-day variance.

**Cons**: Carryover effects, longer duration, complex analysis, sensitive to non-stationarity.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: SESSION, RANDOM, not consistent (users re-randomized each period).

**Key parameters**:
- **Number of periods**: Minimum 4 recommended
- **Period length**: Hours per switch (balance between more data and carryover risk)
- **Autocorrelation (rho)**: 0 to 1. Higher rho reduces effective sample size.
- **Effective multiplier**: (1 - rho) / (1 + rho). Multiply the nominal sample size by this to get the effective sample size; equivalently, divide the required sample size by it.

**Warnings**:
- Assumes no carryover effects between periods
- Minimum 7 days to capture weekly patterns
- Non-stationarity (trends, seasonality) can bias results
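
A sketch of the effective-sample arithmetic above:

```python
import math

def switchback_effective_n(n_observations, rho):
    """Effective sample size after discounting for AR(1) autocorrelation."""
    return n_observations * (1 - rho) / (1 + rho)

def switchback_required_n(target_effective_n, rho):
    """Nominal observations needed to reach a target effective sample size."""
    multiplier = (1 - rho) / (1 + rho)
    return math.ceil(target_effective_n / multiplier)
```

At rho = 0.3, each observation is worth only about half of an independent one, so the required nominal sample nearly doubles.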

---

## Causal Inference (Observational)

**When to use**: Cannot run a randomized experiment, analyzing historical data or policy changes, natural experiments, studying long-term effects.

**Use cases**: Policy change impact across regions (DiD), eligibility threshold effects (RDD), effect of education on earnings (IV).

**Pros**: Can analyze past data without running an experiment, useful for post-hoc analysis, study effects over longer periods, lower cost.

**Cons**: Stronger assumptions required, risk of confounding bias, more complex statistical methods, results may be less definitive.

**Defaults**: alpha=0.05, power=0.8, 2 groups.

**Methods**:

| Method | Key Assumption | Best For |
|--------|---------------|----------|
| **Difference-in-Differences (DiD)** | Parallel trends between groups pre-treatment | Policy changes with pre/post data |
| **Regression Discontinuity (RDD)** | Continuity at the threshold | Eligibility cutoffs, score-based assignment |
| **Propensity Score Matching (PSM)** | Overlap in covariates, no unmeasured confounders | Observational data with rich covariates |
| **Instrumental Variables (IV)** | Valid, strong instrument (relevant + exogenous) | When direct randomization impossible |

**Method-specific notes**:
- DiD: Serial correlation inflates variance. AR(1) VIF = (1 + rho) / (1 - rho)
- RDD: Sample size depends on bandwidth near threshold. Consult a statistician for bandwidth selection
- PSM: Effective sample size depends on match quality and covariate overlap
- IV: Weak instruments lead to biased estimates. Check first-stage F-statistic > 10
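
A sketch of the AR(1) inflation for DiD noted above (helper name is illustrative):

```python
import math

def did_variance_inflation(rho, n_base):
    """AR(1) variance inflation factor for DiD with serially correlated
    errors, and the inflated sample size it implies."""
    vif = (1 + rho) / (1 - rho)
    return vif, math.ceil(n_base * vif)
```

At rho = 0.5 the VIF is 3, so a naive sample-size calculation understates the requirement threefold; this is one reason observational designs warrant statistician review.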

---

## Factorial Design

**When to use**: Testing multiple independent changes simultaneously, interested in interaction effects, have sufficient traffic, changes are unlikely to conflict.

**Use cases**: 2x2 button color x copy length, email subject x send time x personalization, landing page hero x headline x CTA.

**Pros**: Test multiple factors efficiently, detect interaction effects, more information per user, faster than sequential A/B tests.

**Cons**: More complex analysis, requires larger sample sizes, harder to interpret with many factors, false positive risk increases.

**Defaults**: alpha=0.05, power=0.8, 4 variants (2x2), 25/25/25/25 split.

**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Key parameters**:
- **Factors**: Each with a name and number of levels (e.g. Button Color with 2 levels)
- **Total cells**: Product of all factor levels (e.g. 2 x 3 = 6 cells)
- **Detect interaction**: If yes, sample size inflated ~4x per cell

**Warnings**:
- With many factors, total cells grow fast (3 factors with 3 levels each = 27 cells)
- Verify factor independence before running
- Consider Bonferroni correction for multiple comparisons
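
The cell arithmetic above can be sketched as follows (the ~4x interaction inflation mirrors the rule of thumb stated in Key parameters):

```python
import math

def factorial_plan(levels, n_per_cell, detect_interaction=False,
                   interaction_inflation=4):
    """Cell count and total sample for a full factorial design.

    levels: number of levels per factor, e.g. [2, 2] for a 2x2.
    Returns (total_cells, total_sample_size)."""
    cells = math.prod(levels)
    per_cell = n_per_cell * (interaction_inflation if detect_interaction else 1)
    return cells, cells * per_cell
```

Three 3-level factors with interaction detection already demand 27 cells at 4x per-cell sample — a quick way to show a user why factor counts should stay small.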

---

## Multi-Armed Bandit (MAB)

**When to use**: Continuous optimization is priority, opportunity cost of poor variants is high, content recommendations, personalization, can tolerate adaptive allocation.

**Use cases**: Content headline optimization, recommendation algorithm selection, promotional banner testing, push notification copy.

**Pros**: Minimizes opportunity cost, adapts to changing conditions, automatic traffic allocation, good for ongoing optimization.

**Cons**: Less interpretable than A/B tests, harder to establish causality, requires more sophisticated infrastructure, may converge to local optimum.

**Defaults**: alpha=0.05, power=0.8, 3 arms.

**Randomization**: REQUEST, RANDOM, not consistent (re-randomized each request).

**Key parameters**:
- **Number of arms**: 2 or more variants
- **Horizon**: Total observations expected (minimum 1,000)
- **Exploration rate (epsilon)**: 0 to 0.5. Higher = more exploration, slower convergence
- **Exploration budget per arm**: (horizon x epsilon) / num_arms
- **Estimated regret**: epsilon x horizon x (arms - 1) / arms
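
A sketch of the budget and regret arithmetic above (the regret estimate assumes a worst-case reward gap of 1, which is what the formula as written implies):

```python
def epsilon_greedy_plan(horizon, epsilon, num_arms):
    """Exploration budget per arm and the regret estimate defined above.
    Returns (budget_per_arm, estimated_regret)."""
    budget_per_arm = (horizon * epsilon) / num_arms
    est_regret = epsilon * horizon * (num_arms - 1) / num_arms
    return budget_per_arm, est_regret
```

With a 10,000-observation horizon, epsilon = 0.1, and 4 arms, each arm gets 250 exploration pulls — compare this against the 1,000-observation minimum horizon before recommending MAB.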

**Notes**:
- Traditional fixed-sample statistical testing does not apply
- Sample size shown is the exploration budget per arm, not a significance calculation
- Skip variance reduction (adaptive nature handles this)
- Reward function must be well-defined before starting