diff --git a/skills/experiment-designer/SKILL.md b/skills/experiment-designer/SKILL.md
new file mode 100644
index 000000000..ce8305166
--- /dev/null
+++ b/skills/experiment-designer/SKILL.md
@@ -0,0 +1,216 @@
---
name: experiment-designer
description: Guides users through designing statistically rigorous experiments step-by-step. Covers A/B tests, cluster randomized, switchback, causal inference, factorial, and multi-armed bandit designs. Use when someone asks to design, plan, or set up an experiment.
argument-hint: "[describe what you want to test]"
---

You are an experiment design assistant. Guide users through a structured 8-step workflow to produce a complete, statistically rigorous experiment design document. Be conversational but precise. Keep responses concise (3-5 sentences per step, unless explaining a recommendation).

## The 8-Step Workflow

Walk through these steps **sequentially**. Do not skip Steps 4-7 even when defaults are sensible — summarize the recommended setting and confirm with the user at each step.

### Step 1: Experiment Type

Ask what the user wants to test. Based on their description, recommend one of 6 types:

| Type | When to use |
|------|-------------|
| **A/B Test** | Discrete changes, sufficient traffic (>1K users/day), need clear causal evidence |
| **Cluster Randomized** | Treatment affects groups (cities, stores), network effects, need 20+ clusters |
| **Switchback** | Strong network effects, supply-demand coupling, user-level randomization infeasible |
| **Causal Inference** | Cannot randomize, post-hoc analysis, natural experiments, long-term effects |
| **Factorial** | Test multiple factors simultaneously, understand interactions, independent changes |
| **Multi-Armed Bandit** | Continuous optimization, minimize regret, content recommendations, personalization |

See [experiment-types.md](experiment-types.md) for detailed guidance on each type.

### Step 2: Metrics

Help the user define metrics.
**Requirements: at least 1 PRIMARY + at least 1 GUARDRAIL metric.**

For each metric, determine:
- **Category**: PRIMARY (optimize), SECONDARY (exploratory), GUARDRAIL (protect), MONITOR (track)
- **Type**: BINARY (rates — baseline as decimal, e.g. 0.03 for 3%), CONTINUOUS (revenue, duration), COUNT (purchases, errors)
- **Direction**: INCREASE, DECREASE, or EITHER
- **Baseline**: Current value (estimate is fine)

See [metrics-library.md](metrics-library.md) for common metrics with typical baselines.

### Step 3: Statistical Parameters

Configure and calculate sample size:
- **Alpha (significance level)**: Default 0.05 (5% false positive rate)
- **Power**: Default 0.8 (80% chance of detecting a true effect)
- **MDE (Minimum Detectable Effect)**: The smallest change worth detecting (relative % or absolute)
- **Daily traffic**: Estimated users/day entering the experiment
- **Traffic allocation**: e.g. 50/50 for A/B, 25/25/25/25 for 2x2 factorial

Calculate sample size using the formulas in [statistics.md](statistics.md) and estimate experiment duration:
```
duration_days = ceil(total_sample_size / (daily_traffic * allocation_pct))
```
Minimum recommended duration is 7 days to capture weekly patterns.
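As a concrete sketch of the Step 3 calculation, here is one way the binary-metric sample size formula from [statistics.md](statistics.md) and the duration estimate combine (the baseline, MDE, and traffic numbers are illustrative, not defaults):

```python
from math import ceil
from statistics import NormalDist


def binary_sample_size(p1, mde_rel, alpha=0.05, power=0.8):
    """Per-variant n for a two-proportion test (pooled approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p2 = p1 * (1 + mde_rel)            # relative MDE -> target rate
    p_pooled = (p1 + p2) / 2
    return ceil(2 * (z_alpha + z_beta) ** 2 * p_pooled * (1 - p_pooled)
                / (p2 - p1) ** 2)


def duration_days(total_n, daily_traffic, allocation_pct=1.0):
    """Days needed, with the 7-day floor to capture weekly patterns."""
    return max(7, ceil(total_n / (daily_traffic * allocation_pct)))


n = binary_sample_size(p1=0.05, mde_rel=0.10)    # 5% baseline, 10% relative MDE
days = duration_days(2 * n, daily_traffic=5000)  # two variants share the traffic
```

Note how a 10% relative lift on a 5% baseline needs tens of thousands of users per variant — small MDEs dominate the traffic budget.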
### Step 4: Randomization Strategy

Configure how users are assigned to variants:
- **Randomization unit**: USER_ID (standard), SESSION (guest users), DEVICE (cross-device), REQUEST (MAB), CLUSTER (network effects)
- **Bucketing strategy**: HASH_BASED (deterministic, consistent) or RANDOM (MAB, switchback)
- **Consistent assignment**: Yes for most experiments, No for MAB and switchback
- **Stratification variables**: Optional — split by device type, geography, user segment for balanced allocation

Recommend based on experiment type:
- A/B Test / Factorial / Causal Inference: USER_ID + HASH_BASED + consistent
- Cluster: CLUSTER + RANDOM + consistent
- Switchback: SESSION + RANDOM + not consistent
- MAB: REQUEST + RANDOM + not consistent

### Step 5: Variance Reduction

Recommend techniques to reduce required sample size:
- **CUPED** (strongest): Use pre-experiment covariate correlated with the metric. Specify covariate name and expected variance reduction (0-70%). Works best with correlation >0.7
- **Post-stratification**: Stratify by variables like geography, device. Improves precision post-analysis
- **Matched pairs**: Match treatment/control units before randomization
- **Blocking**: Divide population into homogeneous blocks, randomize within blocks

Skip for MAB experiments (adaptive nature handles this). Impact order: CUPED > Stratification > Blocking > Matched Pairs.

### Step 6: Risk Assessment

Evaluate and document:
- **Risk level**: LOW (minor impact, well-understood) / MEDIUM (moderate impact, some uncertainty) / HIGH (significant impact, novel treatment)
- **Blast radius**: % of users affected (0-100%)
- **Potential negative impacts**: List risks (performance, UX, revenue, compliance)
- **Mitigation strategies**: Counter-measures for each risk
- **Rollback triggers**: Specific metrics/thresholds that trigger rollback
- **Circuit breakers**: Hard stops (e.g.
error rate >5%)

**Pre-launch checklist** (all 5 required):
1. Logging instrumented and verified
2. AA test passed (no SRM in control vs control)
3. Monitoring alerts configured
4. Rollback plan documented
5. Stakeholder approval obtained

Type-specific checklist additions:
- Cluster: Cluster definition validated, 20+ clusters available
- Switchback: Carryover effects assessed, period length validated
- Factorial: Factor independence verified
- MAB: Exploration budget sufficient, reward function defined
- Causal Inference: Identifying assumptions documented

### Step 7: Monitoring & Stopping Rules

Configure ongoing experiment monitoring:
- **Refresh frequency**: How often to check metrics (default: every 60 minutes)
- **SRM threshold**: p-value for Sample Ratio Mismatch detection (default: 0.001)
- **Multiple testing correction**: NONE, BONFERRONI (conservative), BENJAMINI_HOCHBERG (balanced), HOLM

**Stopping rules** (define at least one of each applicable type):
- **SUCCESS**: Stop when primary metric reaches statistical significance
- **FUTILITY**: Stop when effect is unlikely to reach significance (waste of traffic)
- **HARM**: Stop if guardrail metric crosses harm threshold

**Decision criteria** (Ship/Iterate/Kill):
- **Ship**: Primary significant + guardrails protected
- **Iterate**: Mixed results or secondary metrics show promise
- **Kill**: Primary null or harm detected

### Step 8: Summary & Export

Generate a name, hypothesis, and description:
- **Name**: Short descriptive name (e.g. "Checkout Flow Redesign Q1 2025")
- **Hypothesis**: Format — "If we [change], then [metric] will [direction] because [reason]"
- **Description**: Brief context on why this experiment is being run

Then produce the **Experiment Design Document** (see output format below).
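The deterministic hash-based bucketing recommended in Step 4 can be sketched as follows (a minimal illustration; the experiment and user IDs are made up, and weights are assumed to sum to 1):

```python
import hashlib


def assign_variant(user_id, experiment_id,
                   variants=("control", "treatment"), weights=(0.5, 0.5)):
    """Deterministic bucketing: hashing (experiment, user) yields the same
    variant on every call, so assignment is consistent without any storage."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point <= cumulative:
            return variant
    return variants[-1]  # guard against float rounding at the boundary
```

Salting the hash with the experiment ID keeps assignments independent across experiments, so a user bucketed into treatment in one test is not systematically bucketed into treatment in the next.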
## Inference Rules

When the user describes what they want to test, infer as much as possible:
- "test checkout flow" -> A/B Test, suggest Conversion Rate (BINARY, baseline ~0.05) + Page Load Time guardrail
- "compare pricing in different cities" -> Cluster Randomized, suggest Revenue per User (CONTINUOUS)
- "optimize content recommendations" -> MAB, suggest CTR (BINARY, baseline ~0.03)
- "analyze impact of last year's policy change" -> Causal Inference (DiD method)
- "test button color AND copy together" -> Factorial Design

For conversion rates, use BINARY with decimal baseline (0.03 = 3%).
For revenue/time/latency, use CONTINUOUS.
For counts (purchases, sessions, errors), use COUNT.
Always add at least one guardrail metric automatically.

## Output Format

At the end of Step 8, produce a complete design document in this format:

```markdown
# Experiment Design: [Name]

## Overview
- **Type**: [experiment type]
- **Hypothesis**: If we [change], then [metric] will [direction] because [reason]
- **Description**: [brief context]

## Metrics
| Metric | Category | Type | Direction | Baseline |
|--------|----------|------|-----------|----------|
| [name] | PRIMARY | [type] | [dir] | [value] |
| ... | ... | ... | ... | ... |

## Statistical Design
- **Alpha**: [value]
- **Power**: [value]
- **MDE**: [value]% (relative)
- **Sample size per variant**: [n]
- **Total sample size**: [N]
- **Estimated duration**: [days] days ([weeks] weeks)
- **Daily traffic**: [value]
- **Traffic allocation**: [split]

## Randomization
- **Unit**: [unit]
- **Bucketing**: [strategy]
- **Consistent assignment**: [yes/no]
- **Stratification**: [variables or "none"]

## Variance Reduction
[List enabled techniques with configuration, or "None applied"]

## Risk Assessment
- **Risk level**: [LOW/MEDIUM/HIGH]
- **Blast radius**: [n]%
- **Negative impacts**: [list]
- **Mitigation strategies**: [list]
- **Rollback triggers**: [list]
- **Circuit breakers**: [list]

### Pre-Launch Checklist
- [ ] Logging instrumented
- [ ] AA test passed
- [ ] Alerts configured
- [ ] Rollback plan documented
- [ ] Stakeholder approval
[+ type-specific items]

## Monitoring & Stopping Rules
- **Refresh frequency**: every [n] minutes
- **SRM threshold**: [value]
- **Multiple testing correction**: [method]

### Stopping Rules
| Type | Condition | Threshold |
|------|-----------|-----------|
| [type] | [description] | [value] |

### Decision Framework
- **Ship if**: [criteria]
- **Iterate if**: [criteria]
- **Kill if**: [criteria]
```

## Handling Questions

If the user asks a conceptual question (e.g. "what is MDE?", "why do I need guardrail metrics?"), answer it directly first (2-4 sentences), then steer back to the next uncompleted step.

If the user provides `$ARGUMENTS`, use that as the description of what they want to test and begin with Step 1 inference immediately.
diff --git a/skills/experiment-designer/experiment-types.md b/skills/experiment-designer/experiment-types.md
new file mode 100644
index 000000000..de28ade2a
--- /dev/null
+++ b/skills/experiment-designer/experiment-types.md
@@ -0,0 +1,154 @@
# Experiment Types Reference

## A/B Test

**When to use**: Discrete changes, sufficient traffic (>1,000 users/day), need clear causal evidence, treatment effect is expected to be immediate.

**Use cases**: Testing new features or UI changes, comparing algorithms or ranking systems, optimizing conversion funnels, testing marketing copy.

**Pros**: Simple to understand and implement, clean causal inference, well-established statistical methods, easy to analyze.

**Cons**: Can only test one change at a time, requires sufficient traffic, may take time to reach significance.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Examples**: New checkout flow vs current, two recommendation algorithms, green vs blue CTA button.

---

## Cluster Randomized Experiment

**When to use**: Treatment affects groups (cities, stores, schools), network effects or interference between users, marketplace experiments, geographic-based treatments. Need 20+ clusters.

**Use cases**: Driver incentive programs across cities, marketplace pricing by region, store-level promotions.

**Pros**: Handles network effects and spillover, suitable for group-level interventions, prevents contamination.

**Cons**: Requires larger sample sizes, fewer degrees of freedom, more complex analysis, need many clusters for power.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: CLUSTER, RANDOM, consistent assignment.
**Key parameters**:
- **ICC (Intra-Cluster Correlation)**: Proportion of variance between clusters (typically 0.01-0.1)
- **Cluster size**: Average number of units per cluster
- **Design Effect**: DEFF = 1 + (cluster_size - 1) x ICC. Inflates required sample size.

**Warnings**:
- Need at least 10 clusters per arm for reliable inference
- Balance clusters in size for stable estimates
- Larger ICC or cluster size = much larger sample sizes

---

## Switchback Experiment

**When to use**: Strong network effects, supply-demand coupling (marketplaces), user-level randomization causes issues, treatments can be switched quickly.

**Use cases**: Rideshare pricing algorithms, delivery dispatch systems, restaurant recommendations in food delivery.

**Pros**: Handles network effects well, all users experience both conditions, reduces time-of-day variance.

**Cons**: Carryover effects, longer duration, complex analysis, sensitive to non-stationarity.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: SESSION, RANDOM, not consistent (users re-randomized each period).

**Key parameters**:
- **Number of periods**: Minimum 4 recommended
- **Period length**: Hours per switch (balance between more data and carryover risk)
- **Autocorrelation (rho)**: 0 to 1. Higher rho reduces effective sample size.
- **Effective multiplier**: (1 - rho) / (1 + rho). Applied to sample size.

**Warnings**:
- Assumes no carryover effects between periods
- Minimum 7 days to capture weekly patterns
- Non-stationarity (trends, seasonality) can bias results

---

## Causal Inference (Observational)

**When to use**: Cannot run a randomized experiment, analyzing historical data or policy changes, natural experiments, studying long-term effects.

**Use cases**: Policy change impact across regions (DiD), eligibility threshold effects (RDD), effect of education on earnings (IV).
**Pros**: Can analyze past data without running an experiment, useful for post-hoc analysis, study effects over longer periods, lower cost.

**Cons**: Stronger assumptions required, risk of confounding bias, more complex statistical methods, results may be less definitive.

**Defaults**: alpha=0.05, power=0.8, 2 groups.

**Methods**:

| Method | Key Assumption | Best For |
|--------|---------------|----------|
| **Difference-in-Differences (DiD)** | Parallel trends between groups pre-treatment | Policy changes with pre/post data |
| **Regression Discontinuity (RDD)** | Continuity at the threshold | Eligibility cutoffs, score-based assignment |
| **Propensity Score Matching (PSM)** | Overlap in covariates, no unmeasured confounders | Observational data with rich covariates |
| **Instrumental Variables (IV)** | Valid, strong instrument (relevant + exogenous) | When direct randomization impossible |

**Method-specific notes**:
- DiD: Serial correlation inflates variance. AR(1) VIF = (1 + rho) / (1 - rho)
- RDD: Sample size depends on bandwidth near threshold. Consult a statistician for bandwidth selection
- PSM: Effective sample size depends on match quality and covariate overlap
- IV: Weak instruments lead to biased estimates. Check first-stage F-statistic > 10

---

## Factorial Design

**When to use**: Testing multiple independent changes simultaneously, interested in interaction effects, have sufficient traffic, changes are unlikely to conflict.

**Use cases**: 2x2 button color x copy length, email subject x send time x personalization, landing page hero x headline x CTA.

**Pros**: Test multiple factors efficiently, detect interaction effects, more information per user, faster than sequential A/B tests.

**Cons**: More complex analysis, requires larger sample sizes, harder to interpret with many factors, false positive risk increases.

**Defaults**: alpha=0.05, power=0.8, 4 variants (2x2), 25/25/25/25 split.
**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Key parameters**:
- **Factors**: Each with a name and number of levels (e.g. Button Color with 2 levels)
- **Total cells**: Product of all factor levels (e.g. 2 x 3 = 6 cells)
- **Detect interaction**: If yes, sample size inflated ~4x per cell

**Warnings**:
- With many factors, total cells grow fast (3 factors with 3 levels each = 27 cells)
- Verify factor independence before running
- Consider Bonferroni correction for multiple comparisons

---

## Multi-Armed Bandit (MAB)

**When to use**: Continuous optimization is priority, opportunity cost of poor variants is high, content recommendations, personalization, can tolerate adaptive allocation.

**Use cases**: Content headline optimization, recommendation algorithm selection, promotional banner testing, push notification copy.

**Pros**: Minimizes opportunity cost, adapts to changing conditions, automatic traffic allocation, good for ongoing optimization.

**Cons**: Less interpretable than A/B tests, harder to establish causality, requires more sophisticated infrastructure, may converge to local optimum.

**Defaults**: alpha=0.05, power=0.8, 3 arms.

**Randomization**: REQUEST, RANDOM, not consistent (re-randomized each request).

**Key parameters**:
- **Number of arms**: 2 or more variants
- **Horizon**: Total observations expected (minimum 1,000)
- **Exploration rate (epsilon)**: 0 to 0.5.
Higher = more exploration, slower convergence
- **Exploration budget per arm**: (horizon x epsilon) / num_arms
- **Estimated regret**: epsilon x horizon x (arms - 1) / arms

**Notes**:
- Traditional fixed-sample statistical testing does not apply
- Sample size shown is the exploration budget per arm, not a significance calculation
- Skip variance reduction (adaptive nature handles this)
- Reward function must be well-defined before starting

diff --git a/skills/experiment-designer/metrics-library.md b/skills/experiment-designer/metrics-library.md
new file mode 100644
index 000000000..ed111f58d
--- /dev/null
+++ b/skills/experiment-designer/metrics-library.md
@@ -0,0 +1,64 @@
# Common Metrics Library

Use these as starting points when helping users select metrics. Adjust baselines to match the user's actual data when available.

## Binary Metrics

Baselines expressed as decimals (0.05 = 5%). Use proportions test for sample size calculation.

| Metric | Typical Category | Direction | Baseline | Notes |
|--------|-----------------|-----------|----------|-------|
| Conversion Rate | PRIMARY | INCREASE | 0.05 (5%) | Users completing a desired action |
| Click-Through Rate (CTR) | PRIMARY | INCREASE | 0.03 (3%) | Users clicking a specific element |
| Bounce Rate | GUARDRAIL | DECREASE | 0.40 (40%) | Users leaving without interaction |
| Signup Rate | PRIMARY | INCREASE | 0.02 (2%) | Visitors creating an account |
| Retention Rate (7-day) | PRIMARY | INCREASE | 0.25 (25%) | Users returning within 7 days |

## Continuous Metrics

Baselines as raw values. Provide variance (sigma^2) or standard deviation for sample size calculation. If unknown, default to CV=10% (variance = (baseline * 0.1)^2).
| Metric | Typical Category | Direction | Baseline | Variance | Notes |
|--------|-----------------|-----------|----------|----------|-------|
| Revenue per User | PRIMARY | INCREASE | $50 | 2,500 (sd=50) | Average revenue generated per user |
| Average Order Value | PRIMARY | INCREASE | $75 | 1,849 (sd=43) | Average transaction value |
| Session Duration | SECONDARY | INCREASE | 180s | 14,400 (sd=120) | Average session length in seconds |
| Page Load Time | GUARDRAIL | DECREASE | 1,500ms | 250,000 (sd=500) | Average page load in milliseconds |
| Engagement Score | SECONDARY | INCREASE | 7.5 | 6.25 (sd=2.5) | Composite engagement metric |

## Count Metrics

Baselines as average counts. Use Poisson approximation (variance = lambda = baseline).

| Metric | Typical Category | Direction | Baseline | Notes |
|--------|-----------------|-----------|----------|-------|
| Number of Purchases | PRIMARY | INCREASE | 1.5 | Purchases per user |
| Pages per Session | SECONDARY | INCREASE | 4.0 | Pages viewed per session |
| Error Count | GUARDRAIL | DECREASE | 0.1 | Errors encountered per user |
| Feature Usage Count | SECONDARY | INCREASE | 2.5 | Times a feature is used per user |

## Choosing the Right Metric Type

| If the metric is... | Use type | Baseline format |
|---------------------|----------|-----------------|
| A rate or percentage (conversion, CTR, bounce) | BINARY | Decimal (0.05 for 5%) |
| A dollar amount, duration, or score | CONTINUOUS | Raw value ($50, 180s) |
| A count of events per user | COUNT | Average count (1.5) |

## Metric Category Guidelines

- **PRIMARY**: The metric you're trying to move. Every experiment needs at least one.
- **GUARDRAIL**: Metrics that must not degrade. Every experiment needs at least one. Common guardrails: error rate, page load time, bounce rate, revenue (if not primary).
- **SECONDARY**: Interesting to observe but not the primary goal. Helps explain results.
- **MONITOR**: Operational metrics to track (e.g. traffic volume, latency). Not used in statistical analysis.

## Quick Metric Selection by Experiment Goal

| Goal | Suggested PRIMARY | Suggested GUARDRAIL |
|------|-------------------|---------------------|
| Improve conversion funnel | Conversion Rate | Bounce Rate, Page Load Time |
| Increase revenue | Revenue per User | Conversion Rate, Error Count |
| Boost engagement | Session Duration or Engagement Score | Bounce Rate, Error Count |
| Optimize content | CTR | Bounce Rate, Revenue per User |
| Reduce friction | Signup Rate or Conversion Rate | Page Load Time, Error Count |
| Improve retention | Retention Rate (7-day) | Revenue per User, Error Count |

diff --git a/skills/experiment-designer/statistics.md b/skills/experiment-designer/statistics.md
new file mode 100644
index 000000000..f2d811055
--- /dev/null
+++ b/skills/experiment-designer/statistics.md
@@ -0,0 +1,137 @@
# Statistical Methods Reference

## Sample Size Formulas

All formulas give the per-variant sample size for two-tailed tests with z-scores: z_alpha = Z(1 - alpha/2), z_beta = Z(power).

### Binary Metrics (Proportions Test)
```
n = 2 * (z_alpha + z_beta)^2 * p_pooled * (1 - p_pooled) / (p2 - p1)^2

where p_pooled = (p1 + p2) / 2
      p1 = baseline rate (decimal, e.g. 0.05 for 5%)
      p2 = p1 + absolute effect size
```

### Continuous Metrics (Two-Sample T-Test)
```
n = 2 * (z_alpha + z_beta)^2 * variance / effect_size^2

where variance = sigma^2 (if not known, default: (baseline * 0.1)^2)
      effect_size = baseline * (mde_pct / 100) for relative MDE
```

### Count Metrics (Poisson Approximation)
```
n = 2 * (z_alpha + z_beta)^2 * lambda / effect_size^2

where lambda = baseline count rate
```

## Adjustments

### Unequal Traffic Allocation
When control/treatment split is not 50/50:
```
adjusted_n = n * (1 + r)^2 / (4 * r)

where r = treatment_allocation / control_allocation
```

### Multiple Variants
When more than 2 variants (including control):
```
adjusted_n = n * (k - 1)

where k = number of variants
```
Consider multiple testing correction (Bonferroni, Holm, or Benjamini-Hochberg).

### Cluster Design Effect
For cluster randomized experiments:
```
DEFF = 1 + (m - 1) * ICC
adjusted_n = n * DEFF
clusters_per_arm = ceil(adjusted_n / m)

where m = average cluster size
      ICC = intra-cluster correlation (typically 0.01 - 0.1)
```

### Switchback Autocorrelation
For switchback experiments:
```
effective_multiplier = (1 - rho) / (1 + rho)
adjusted_n = ceil(n / effective_multiplier)
effective_periods = floor(num_periods * effective_multiplier)

where rho = temporal autocorrelation (0 to 1)
```

### Factorial Cells
For factorial designs:
```
total_cells = product of all factor levels
total_n = n_per_cell * total_cells

If detecting interactions: interaction_n = n_per_cell * 4 * total_cells
```

### Causal Inference: DiD Serial Correlation
```
VIF = (1 + rho) / (1 - rho)
adjusted_n = ceil(n * VIF)
```

### MAB Exploration Budget
For epsilon-greedy multi-armed bandit:
```
explore_budget = ceil(horizon * epsilon)
per_arm_explore = ceil(explore_budget / num_arms)
estimated_regret = epsilon * horizon * (arms - 1) / arms
```

## Duration Estimation

```
effective_daily_traffic = daily_traffic * (sum_of_allocation_pcts / 100)
days = ceil(total_sample_size / effective_daily_traffic)
weeks = ceil(days / 7)
```

Minimum recommended duration: **7 days** (to capture weekly patterns). Consider adding 2-3 buffer days for ramp-up/cool-down.

## MDE Feasibility Check

Given available traffic and max duration, calculate the achievable MDE. Note that available_n is the per-variant sample size, so divide total traffic by the number of variants:
```
available_n = (daily_traffic * max_days) / num_variants
achievable_mde = (z_alpha + z_beta) * sqrt(2 * variance / available_n)   # continuous
achievable_mde = (z_alpha + z_beta) * sqrt(2 * p * (1-p) / available_n)  # binary
achievable_mde = (z_alpha + z_beta) * sqrt(2 * lambda / available_n)     # count
```

If achievable MDE > target MDE, the experiment is not feasible with current traffic.

## Common Warnings

| Condition | Warning |
|-----------|---------|
| n < 100 per variant | Very small sample. Results may be unreliable |
| MDE < 1% relative | Very small effect needs large traffic or long duration |
| Alpha > 0.05 | Risk of false positives increases |
| Power < 0.8 | Risk of false negatives increases |
| Clusters < 10 per arm | Fewer clusters = unreliable inference |
| Cluster size imbalance | Balance clusters for stable estimates |
| Duration < 7 days | May miss weekly patterns |
| Duration > 90 days | Consider whether the experiment is worth running this long |

## Standard Defaults

| Parameter | Default | Notes |
|-----------|---------|-------|
| Alpha | 0.05 | 5% false positive rate (two-tailed) |
| Power | 0.80 | 80% chance of detecting a true effect |
| MDE | 5% relative | Smallest change worth detecting |
| Traffic allocation | 50/50 | Balanced split is most efficient |
| Variants | 2 | Control + treatment |
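As a sketch, the continuous-metric formula and the cluster design-effect adjustment above combine as follows (the baseline, ICC, and cluster size are illustrative, and this is not a substitute for a dedicated power-analysis library):

```python
from math import ceil
from statistics import NormalDist


def continuous_n(baseline, mde_pct, sd=None, alpha=0.05, power=0.8):
    """Per-variant n for a two-sample comparison of means (normal approximation)."""
    sd = sd if sd is not None else baseline * 0.1   # CV = 10% default
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    effect = baseline * mde_pct / 100               # relative MDE -> absolute effect
    return ceil(2 * z ** 2 * sd ** 2 / effect ** 2)


def cluster_adjust(n, cluster_size, icc):
    """Inflate per-variant n by the design effect; also return clusters per arm."""
    deff = 1 + (cluster_size - 1) * icc
    adjusted = ceil(n * deff)
    return adjusted, ceil(adjusted / cluster_size)


n = continuous_n(baseline=50, mde_pct=5, sd=50)     # revenue per user, sd = $50
adj_n, clusters = cluster_adjust(n, cluster_size=100, icc=0.05)
```

With cluster_size=100 and ICC=0.05 the design effect is 5.95, so clustering inflates the required sample nearly sixfold — which is why the warnings above stress ICC and cluster size.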