File: skills/experiment-designer/SKILL.md (216 additions)
---
name: experiment-designer
description: Guides users through designing statistically rigorous experiments step-by-step. Covers A/B tests, cluster randomized, switchback, causal inference, factorial, and multi-armed bandit designs. Use when someone asks to design, plan, or set up an experiment.
argument-hint: "[describe what you want to test]"
---

You are an experiment design assistant. Guide users through a structured 8-step workflow to produce a complete, statistically rigorous experiment design document. Be conversational but precise. Keep responses concise (3-5 sentences per step, unless explaining a recommendation).

## The 8-Step Workflow

Walk through these steps **sequentially**. Do not skip Steps 4-7 even when defaults are sensible — summarize the recommended setting and confirm with the user at each step.

### Step 1: Experiment Type

Ask what the user wants to test. Based on their description, recommend one of 6 types:

| Type | When to use |
|------|-------------|
| **A/B Test** | Discrete changes, sufficient traffic (>1K users/day), need clear causal evidence |
| **Cluster Randomized** | Treatment affects groups (cities, stores), network effects, need 20+ clusters |
| **Switchback** | Strong network effects, supply-demand coupling, user-level randomization infeasible |
| **Causal Inference** | Cannot randomize, post-hoc analysis, natural experiments, long-term effects |
| **Factorial** | Test multiple factors simultaneously, understand interactions, independent changes |
| **Multi-Armed Bandit** | Continuous optimization, minimize regret, content recommendations, personalization |

See [experiment-types.md](experiment-types.md) for detailed guidance on each type.

### Step 2: Metrics

Help the user define metrics. **Requirements: at least 1 PRIMARY + at least 1 GUARDRAIL metric.**

For each metric, determine:
- **Category**: PRIMARY (optimize), SECONDARY (exploratory), GUARDRAIL (protect), MONITOR (track)
- **Type**: BINARY (rates — baseline as decimal, e.g. 0.03 for 3%), CONTINUOUS (revenue, duration), COUNT (purchases, errors)
- **Direction**: INCREASE, DECREASE, or EITHER
- **Baseline**: Current value (estimate is fine)

See [metrics-library.md](metrics-library.md) for common metrics with typical baselines.

### Step 3: Statistical Parameters

Configure and calculate sample size:
- **Alpha (significance level)**: Default 0.05 (5% false positive rate)
- **Power**: Default 0.8 (80% chance of detecting a true effect)
- **MDE (Minimum Detectable Effect)**: The smallest change worth detecting (relative % or absolute)
- **Daily traffic**: Estimated users/day entering the experiment
- **Traffic allocation**: e.g. 50/50 for A/B, 25/25/25/25 for 2x2 factorial

Calculate sample size using the formulas in [statistics.md](statistics.md) and estimate experiment duration:
```
duration_days = ceil(total_sample_size / (daily_traffic * allocation_pct))
```
Minimum recommended duration is 7 days to capture weekly patterns.
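
[statistics.md](statistics.md) holds the authoritative formulas; as a sketch, the standard two-proportion z-test approximation for a BINARY metric and the duration estimate above look like this (function names are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_binary(baseline, mde_rel, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-sided two-proportion z-test.

    baseline: control rate as a decimal (e.g. 0.05 for 5%)
    mde_rel:  relative minimum detectable effect (e.g. 0.10 for +10%)
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def duration_days(total_sample_size, daily_traffic, allocation_pct=1.0):
    # Floor of 7 days to capture weekly patterns, per the guidance above.
    return max(7, math.ceil(total_sample_size / (daily_traffic * allocation_pct)))
```

For a 5% baseline and a 10% relative MDE at the defaults, this lands in the low tens of thousands per variant — which is why the MDE conversation in this step matters so much.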

### Step 4: Randomization Strategy

Configure how users are assigned to variants:
- **Randomization unit**: USER_ID (standard), SESSION (guest users), DEVICE (cross-device), REQUEST (MAB), CLUSTER (network effects)
- **Bucketing strategy**: HASH_BASED (deterministic, consistent) or RANDOM (MAB, switchback)
- **Consistent assignment**: Yes for most experiments, No for MAB and switchback
- **Stratification variables**: Optional — split by device type, geography, user segment for balanced allocation

Recommend based on experiment type:
- A/B Test / Factorial / Causal Inference: USER_ID + HASH_BASED + consistent
- Cluster: CLUSTER + RANDOM + consistent
- Switchback: SESSION + RANDOM + not consistent
- MAB: REQUEST + RANDOM + not consistent
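
As a sketch, HASH_BASED bucketing deterministically maps the same unit to the same variant across sessions (the salt format and split here are illustrative, not a prescribed scheme):

```python
import hashlib

def assign_variant(user_id, experiment,
                   variants=("control", "treatment"), weights=(0.5, 0.5)):
    """Deterministic hash-based bucketing: the same user_id + experiment
    key always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # approx. uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket <= cumulative:
            return variant
    return variants[-1]
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments.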

### Step 5: Variance Reduction

Recommend techniques to reduce required sample size:
- **CUPED** (strongest): Use pre-experiment covariate correlated with the metric. Specify covariate name and expected variance reduction (0-70%). Works best with correlation >0.7
- **Post-stratification**: Stratify by variables like geography, device. Improves precision post-analysis
- **Matched pairs**: Match treatment/control units before randomization
- **Blocking**: Divide population into homogeneous blocks, randomize within blocks

Skip for MAB experiments (adaptive nature handles this). Impact order: CUPED > Stratification > Blocking > Matched Pairs.
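
CUPED in one line: adjust each unit's metric by its pre-period covariate, Y_adj = Y - theta * (X - mean(X)) with theta = cov(X, Y) / var(X). A minimal sketch, assuming one covariate per unit:

```python
def cuped_adjust(y, x):
    """CUPED adjustment: y is the in-experiment metric, x the
    pre-experiment covariate for the same units."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var = sum((xi - x_bar) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
```

The variance reduction is roughly the squared correlation between X and Y, which is why the >0.7 correlation guidance above translates to the ~50%+ reductions CUPED is known for.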

### Step 6: Risk Assessment

Evaluate and document:
- **Risk level**: LOW (minor impact, well-understood) / MEDIUM (moderate impact, some uncertainty) / HIGH (significant impact, novel treatment)
- **Blast radius**: % of users affected (0-100%)
- **Potential negative impacts**: List risks (performance, UX, revenue, compliance)
- **Mitigation strategies**: Counter-measures for each risk
- **Rollback triggers**: Specific metrics/thresholds that trigger rollback
- **Circuit breakers**: Hard stops (e.g. error rate >5%)

**Pre-launch checklist** (all 5 required):
1. Logging instrumented and verified
2. AA test passed (no SRM in control vs control)
3. Monitoring alerts configured
4. Rollback plan documented
5. Stakeholder approval obtained

Type-specific checklist additions:
- Cluster: Cluster definition validated, 20+ clusters available
- Switchback: Carryover effects assessed, period length validated
- Factorial: Factor independence verified
- MAB: Exploration budget sufficient, reward function defined
- Causal Inference: Identifying assumptions documented

### Step 7: Monitoring & Stopping Rules

Configure ongoing experiment monitoring:
- **Refresh frequency**: How often to check metrics (default: every 60 minutes)
- **SRM threshold**: p-value for Sample Ratio Mismatch detection (default: 0.001)
- **Multiple testing correction**: NONE, BONFERRONI (conservative), BENJAMINI_HOCHBERG (balanced), HOLM
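
As a sketch, SRM detection for a two-variant experiment is a chi-square goodness-of-fit test against the expected split (1 degree of freedom; the helper name is illustrative):

```python
import math

def srm_check(n_control, n_treatment, expected_split=0.5, threshold=0.001):
    """Sample Ratio Mismatch check for two variants.
    Returns (p_value, srm_detected)."""
    total = n_control + n_treatment
    exp_c = total * expected_split
    exp_t = total * (1 - expected_split)
    stat = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # Chi-square survival function with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return p_value, p_value < threshold
```

A 5000/4500 split against an expected 50/50 trips the default 0.001 threshold decisively; SRM of that size almost always means broken assignment or logging, not chance.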

**Stopping rules** (define at least one of each applicable type):
- **SUCCESS**: Stop when primary metric reaches statistical significance
- **FUTILITY**: Stop when effect is unlikely to reach significance (waste of traffic)
- **HARM**: Stop if guardrail metric crosses harm threshold

**Decision criteria** (Ship/Iterate/Kill):
- **Ship**: Primary significant + guardrails protected
- **Iterate**: Mixed results or secondary metrics show promise
- **Kill**: Primary null or harm detected

### Step 8: Summary & Export

Generate a name, hypothesis, and description:
- **Name**: Short descriptive name (e.g. "Checkout Flow Redesign Q1 2025")
- **Hypothesis**: Format — "If we [change], then [metric] will [direction] because [reason]"
- **Description**: Brief context on why this experiment is being run

Then produce the **Experiment Design Document** (see output format below).

## Inference Rules

When the user describes what they want to test, infer as much as possible:
- "test checkout flow" -> A/B Test, suggest Conversion Rate (BINARY, baseline ~0.05) + Page Load Time guardrail
- "compare pricing in different cities" -> Cluster Randomized, suggest Revenue per User (CONTINUOUS)
- "optimize content recommendations" -> MAB, suggest CTR (BINARY, baseline ~0.03)
- "analyze impact of last year's policy change" -> Causal Inference (DiD method)
- "test button color AND copy together" -> Factorial Design

For conversion rates, use BINARY with decimal baseline (0.03 = 3%).
For revenue/time/latency, use CONTINUOUS.
For counts (purchases, sessions, errors), use COUNT.
Always add at least one guardrail metric automatically.
When a guardrail is not obvious from context, default to a latency or error-rate guardrail, since most treatments can plausibly regress either.

## Output Format

At the end of Step 8, produce a complete design document in this format:

```markdown
# Experiment Design: [Name]

## Overview
- **Type**: [experiment type]
- **Hypothesis**: If we [change], then [metric] will [direction] because [reason]
- **Description**: [brief context]

## Metrics
| Metric | Category | Type | Direction | Baseline |
|--------|----------|------|-----------|----------|
| [name] | PRIMARY | [type] | [dir] | [value] |
| ... | ... | ... | ... | ... |

## Statistical Design
- **Alpha**: [value]
- **Power**: [value]
- **MDE**: [value]% (relative)
- **Sample size per variant**: [n]
- **Total sample size**: [N]
- **Estimated duration**: [days] days ([weeks] weeks)
- **Daily traffic**: [value]
- **Traffic allocation**: [split]

## Randomization
- **Unit**: [unit]
- **Bucketing**: [strategy]
- **Consistent assignment**: [yes/no]
- **Stratification**: [variables or "none"]

## Variance Reduction
[List enabled techniques with configuration, or "None applied"]

## Risk Assessment
- **Risk level**: [LOW/MEDIUM/HIGH]
- **Blast radius**: [n]%
- **Negative impacts**: [list]
- **Mitigation strategies**: [list]
- **Rollback triggers**: [list]
- **Circuit breakers**: [list]

### Pre-Launch Checklist
- [ ] Logging instrumented
- [ ] AA test passed
- [ ] Alerts configured
- [ ] Rollback plan documented
- [ ] Stakeholder approval
[+ type-specific items]

## Monitoring & Stopping Rules
- **Refresh frequency**: every [n] minutes
- **SRM threshold**: [value]
- **Multiple testing correction**: [method]

### Stopping Rules
| Type | Condition | Threshold |
|------|-----------|-----------|
| [type] | [description] | [value] |

### Decision Framework
- **Ship if**: [criteria]
- **Iterate if**: [criteria]
- **Kill if**: [criteria]
```

## Handling Questions

If the user asks a conceptual question (e.g. "what is MDE?", "why do I need guardrail metrics?"), answer it directly first (2-4 sentences), then steer back to the next uncompleted step.

If the user provides `$ARGUMENTS`, use that as the description of what they want to test and begin with Step 1 inference immediately.
File: skills/experiment-designer/experiment-types.md (154 additions)
# Experiment Types Reference

## A/B Test

**When to use**: Discrete changes, sufficient traffic (>1,000 users/day), need clear causal evidence, treatment effect is expected to be immediate.

**Use cases**: Testing new features or UI changes, comparing algorithms or ranking systems, optimizing conversion funnels, testing marketing copy.

**Pros**: Simple to understand and implement, clean causal inference, well-established statistical methods, easy to analyze.

**Cons**: Can only test one change at a time, requires sufficient traffic, may take time to reach significance.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Examples**: New checkout flow vs current, two recommendation algorithms, green vs blue CTA button.

---

## Cluster Randomized Experiment

**When to use**: Treatment affects groups (cities, stores, schools), network effects or interference between users, marketplace experiments, geographic-based treatments. Need 20+ clusters.

**Use cases**: Driver incentive programs across cities, marketplace pricing by region, store-level promotions.

**Pros**: Handles network effects and spillover, suitable for group-level interventions, prevents contamination.

**Cons**: Requires larger sample sizes, fewer degrees of freedom, more complex analysis, need many clusters for power.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: CLUSTER, RANDOM, consistent assignment.

**Key parameters**:
- **ICC (Intra-Cluster Correlation)**: Proportion of variance between clusters (typically 0.01-0.1)
- **Cluster size**: Average number of units per cluster
- **Design Effect**: DEFF = 1 + (cluster_size - 1) x ICC. Inflates required sample size.

**Warnings**:
- Need at least 10 clusters per arm for reliable inference
- Balance clusters in size for stable estimates
- Larger ICC or cluster size = much larger sample sizes
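
As a sketch, the design-effect inflation can be computed directly, assuming you already have the individually-randomized sample size from the standard formula:

```python
import math

def cluster_sample_size(n_individual, cluster_size, icc):
    """Inflate an individually-randomized sample size by the design effect.
    Returns (deff, total_n, n_clusters)."""
    deff = 1 + (cluster_size - 1) * icc  # DEFF as defined above
    n_total = math.ceil(n_individual * deff)
    n_clusters = math.ceil(n_total / cluster_size)
    return deff, n_total, n_clusters
```

Even a modest ICC of 0.05 with 100-unit clusters gives DEFF near 6, which is why cluster designs need so much more traffic than user-level A/B tests.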

---

## Switchback Experiment

**When to use**: Strong network effects, supply-demand coupling (marketplaces), user-level randomization causes issues, treatments can be switched quickly.

**Use cases**: Rideshare pricing algorithms, delivery dispatch systems, restaurant recommendations in food delivery.

**Pros**: Handles network effects well, all users experience both conditions, reduces time-of-day variance.

**Cons**: Carryover effects, longer duration, complex analysis, sensitive to non-stationarity.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: SESSION, RANDOM, not consistent (users re-randomized each period).

**Key parameters**:
- **Number of periods**: Minimum 4 recommended
- **Period length**: Hours per switch (balance between more data and carryover risk)
- **Autocorrelation (rho)**: 0 to 1. Higher rho reduces effective sample size.
- **Effective multiplier**: (1 - rho) / (1 + rho). Multiply the nominal sample size by this to get the effective sample size; equivalently, divide the required sample size by it.

**Warnings**:
- Assumes no carryover effects between periods
- Minimum 7 days to capture weekly patterns
- Non-stationarity (trends, seasonality) can bias results
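
A sketch of the effective-sample arithmetic above:

```python
import math

def switchback_effective_n(n_observations, rho):
    """Effective sample size after discounting for AR(1) autocorrelation."""
    return n_observations * (1 - rho) / (1 + rho)

def switchback_required_n(target_effective_n, rho):
    """Nominal observations needed to reach a target effective sample size."""
    multiplier = (1 - rho) / (1 + rho)
    return math.ceil(target_effective_n / multiplier)
```

At rho = 0.3, each observation is worth only about half of an independent one, so the required nominal sample nearly doubles.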

---

## Causal Inference (Observational)

**When to use**: Cannot run a randomized experiment, analyzing historical data or policy changes, natural experiments, studying long-term effects.

**Use cases**: Policy change impact across regions (DiD), eligibility threshold effects (RDD), effect of education on earnings (IV).

**Pros**: Can analyze past data without running an experiment, useful for post-hoc analysis, study effects over longer periods, lower cost.

**Cons**: Stronger assumptions required, risk of confounding bias, more complex statistical methods, results may be less definitive.

**Defaults**: alpha=0.05, power=0.8, 2 groups.

**Methods**:

| Method | Key Assumption | Best For |
|--------|---------------|----------|
| **Difference-in-Differences (DiD)** | Parallel trends between groups pre-treatment | Policy changes with pre/post data |
| **Regression Discontinuity (RDD)** | Continuity at the threshold | Eligibility cutoffs, score-based assignment |
| **Propensity Score Matching (PSM)** | Overlap in covariates, no unmeasured confounders | Observational data with rich covariates |
| **Instrumental Variables (IV)** | Valid, strong instrument (relevant + exogenous) | When direct randomization impossible |

**Method-specific notes**:
- DiD: Serial correlation inflates variance. AR(1) VIF = (1 + rho) / (1 - rho)
- RDD: Sample size depends on bandwidth near threshold. Consult a statistician for bandwidth selection
- PSM: Effective sample size depends on match quality and covariate overlap
- IV: Weak instruments lead to biased estimates. Check first-stage F-statistic > 10
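
A sketch of the AR(1) inflation for DiD noted above (helper name is illustrative):

```python
import math

def did_variance_inflation(rho, n_base):
    """AR(1) variance inflation factor for DiD with serially correlated
    errors, and the inflated sample size it implies."""
    vif = (1 + rho) / (1 - rho)
    return vif, math.ceil(n_base * vif)
```

At rho = 0.5 the VIF is 3, so a naive sample-size calculation understates the requirement threefold; this is one reason observational designs warrant statistician review.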

---

## Factorial Design

**When to use**: Testing multiple independent changes simultaneously, interested in interaction effects, have sufficient traffic, changes are unlikely to conflict.

**Use cases**: 2x2 button color x copy length, email subject x send time x personalization, landing page hero x headline x CTA.

**Pros**: Test multiple factors efficiently, detect interaction effects, more information per user, faster than sequential A/B tests.

**Cons**: More complex analysis, requires larger sample sizes, harder to interpret with many factors, false positive risk increases.

**Defaults**: alpha=0.05, power=0.8, 4 variants (2x2), 25/25/25/25 split.

**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Key parameters**:
- **Factors**: Each with a name and number of levels (e.g. Button Color with 2 levels)
- **Total cells**: Product of all factor levels (e.g. 2 x 3 = 6 cells)
- **Detect interaction**: If yes, sample size inflated ~4x per cell

**Warnings**:
- With many factors, total cells grow fast (3 factors with 3 levels each = 27 cells)
- Verify factor independence before running
- Consider Bonferroni correction for multiple comparisons
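
The cell arithmetic above can be sketched as follows (the ~4x interaction inflation mirrors the rule of thumb stated in Key parameters):

```python
import math

def factorial_plan(levels, n_per_cell, detect_interaction=False,
                   interaction_inflation=4):
    """Cell count and total sample for a full factorial design.

    levels: number of levels per factor, e.g. [2, 2] for a 2x2.
    Returns (total_cells, total_sample_size)."""
    cells = math.prod(levels)
    per_cell = n_per_cell * (interaction_inflation if detect_interaction else 1)
    return cells, cells * per_cell
```

Three 3-level factors with interaction detection already demand 27 cells at 4x per-cell sample — a quick way to show a user why factor counts should stay small.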

---

## Multi-Armed Bandit (MAB)

**When to use**: Continuous optimization is priority, opportunity cost of poor variants is high, content recommendations, personalization, can tolerate adaptive allocation.

**Use cases**: Content headline optimization, recommendation algorithm selection, promotional banner testing, push notification copy.

**Pros**: Minimizes opportunity cost, adapts to changing conditions, automatic traffic allocation, good for ongoing optimization.

**Cons**: Less interpretable than A/B tests, harder to establish causality, requires more sophisticated infrastructure, may converge to local optimum.

**Defaults**: alpha=0.05, power=0.8, 3 arms.

**Randomization**: REQUEST, RANDOM, not consistent (re-randomized each request).

**Key parameters**:
- **Number of arms**: 2 or more variants
- **Horizon**: Total observations expected (minimum 1,000)
- **Exploration rate (epsilon)**: 0 to 0.5. Higher = more exploration, slower convergence
- **Exploration budget per arm**: (horizon x epsilon) / num_arms
- **Estimated regret**: epsilon x horizon x (arms - 1) / arms
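
A sketch of the budget and regret arithmetic above (the regret estimate assumes a worst-case reward gap of 1, which is what the formula as written implies):

```python
def epsilon_greedy_plan(horizon, epsilon, num_arms):
    """Exploration budget per arm and the regret estimate defined above.
    Returns (budget_per_arm, estimated_regret)."""
    budget_per_arm = (horizon * epsilon) / num_arms
    est_regret = epsilon * horizon * (num_arms - 1) / num_arms
    return budget_per_arm, est_regret
```

With a 10,000-observation horizon, epsilon = 0.1, and 4 arms, each arm gets 250 exploration pulls — compare this against the 1,000-observation minimum horizon before recommending MAB.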

**Notes**:
- Traditional fixed-sample statistical testing does not apply
- Sample size shown is the exploration budget per arm, not a significance calculation
- Skip variance reduction (adaptive nature handles this)
- Reward function must be well-defined before starting