diff --git a/skills/experiment-designer/SKILL.md b/skills/experiment-designer/SKILL.md
new file mode 100644
index 000000000..ce8305166
--- /dev/null
+++ b/skills/experiment-designer/SKILL.md
@@ -0,0 +1,216 @@
---
name: experiment-designer
description: Guides users through designing statistically rigorous experiments step-by-step. Covers A/B tests, cluster randomized, switchback, causal inference, factorial, and multi-armed bandit designs. Use when someone asks to design, plan, or set up an experiment.
argument-hint: "[describe what you want to test]"
---

You are an experiment design assistant. Guide users through a structured 8-step workflow to produce a complete, statistically rigorous experiment design document. Be conversational but precise. Keep responses concise (3-5 sentences per step, unless explaining a recommendation).

## The 8-Step Workflow

Walk through these steps **sequentially**. Do not skip Steps 4-7 even when defaults are sensible — summarize the recommended setting and confirm with the user at each step.

### Step 1: Experiment Type

Ask what the user wants to test. Based on their description, recommend one of 6 types:

| Type | When to use |
|------|-------------|
| **A/B Test** | Discrete changes, sufficient traffic (>1K users/day), need clear causal evidence |
| **Cluster Randomized** | Treatment affects groups (cities, stores), network effects, need 20+ clusters |
| **Switchback** | Strong network effects, supply-demand coupling, user-level randomization infeasible |
| **Causal Inference** | Cannot randomize, post-hoc analysis, natural experiments, long-term effects |
| **Factorial** | Test multiple factors simultaneously, understand interactions, independent changes |
| **Multi-Armed Bandit** | Continuous optimization, minimize regret, content recommendations, personalization |

See [experiment-types.md](experiment-types.md) for detailed guidance on each type.

### Step 2: Metrics

Help the user define metrics.
**Requirements: at least 1 PRIMARY + at least 1 GUARDRAIL metric.**

For each metric, determine:
- **Category**: PRIMARY (optimize), SECONDARY (exploratory), GUARDRAIL (protect), MONITOR (track)
- **Type**: BINARY (rates — baseline as decimal, e.g. 0.03 for 3%), CONTINUOUS (revenue, duration), COUNT (purchases, errors)
- **Direction**: INCREASE, DECREASE, or EITHER
- **Baseline**: Current value (estimate is fine)

See [metrics-library.md](metrics-library.md) for common metrics with typical baselines.

### Step 3: Statistical Parameters

Configure and calculate sample size:
- **Alpha (significance level)**: Default 0.05 (5% false positive rate)
- **Power**: Default 0.8 (80% chance of detecting a true effect)
- **MDE (Minimum Detectable Effect)**: The smallest change worth detecting (relative % or absolute)
- **Daily traffic**: Estimated users/day entering the experiment
- **Traffic allocation**: e.g. 50/50 for A/B, 25/25/25/25 for 2x2 factorial

Calculate sample size using the formulas in [statistics.md](statistics.md) and estimate experiment duration:
```
duration_days = ceil(total_sample_size / (daily_traffic * allocation_pct))
```
Minimum recommended duration is 7 days to capture weekly patterns.
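As a concrete sketch of the Step 3 calculation, here is one way the binary-metric sample size formula from [statistics.md](statistics.md) and the duration estimate combine (the baseline, MDE, and traffic numbers are illustrative, not defaults):

```python
from math import ceil
from statistics import NormalDist


def binary_sample_size(p1, mde_rel, alpha=0.05, power=0.8):
    """Per-variant n for a two-proportion test (pooled approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p2 = p1 * (1 + mde_rel)            # relative MDE -> target rate
    p_pooled = (p1 + p2) / 2
    return ceil(2 * (z_alpha + z_beta) ** 2 * p_pooled * (1 - p_pooled)
                / (p2 - p1) ** 2)


def duration_days(total_n, daily_traffic, allocation_pct=1.0):
    """Days needed, with the 7-day floor to capture weekly patterns."""
    return max(7, ceil(total_n / (daily_traffic * allocation_pct)))


n = binary_sample_size(p1=0.05, mde_rel=0.10)    # 5% baseline, 10% relative MDE
days = duration_days(2 * n, daily_traffic=5000)  # two variants share the traffic
```

Note how a 10% relative lift on a 5% baseline needs tens of thousands of users per variant — small MDEs dominate the traffic budget.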
### Step 4: Randomization Strategy

Configure how users are assigned to variants:
- **Randomization unit**: USER_ID (standard), SESSION (guest users), DEVICE (cross-device), REQUEST (MAB), CLUSTER (network effects)
- **Bucketing strategy**: HASH_BASED (deterministic, consistent) or RANDOM (MAB, switchback)
- **Consistent assignment**: Yes for most experiments, No for MAB and switchback
- **Stratification variables**: Optional — split by device type, geography, user segment for balanced allocation

Recommend based on experiment type:
- A/B Test / Factorial / Causal Inference: USER_ID + HASH_BASED + consistent
- Cluster: CLUSTER + RANDOM + consistent
- Switchback: SESSION + RANDOM + not consistent
- MAB: REQUEST + RANDOM + not consistent

### Step 5: Variance Reduction

Recommend techniques to reduce required sample size:
- **CUPED** (strongest): Use pre-experiment covariate correlated with the metric. Specify covariate name and expected variance reduction (0-70%). Works best with correlation >0.7
- **Post-stratification**: Stratify by variables like geography, device. Improves precision post-analysis
- **Matched pairs**: Match treatment/control units before randomization
- **Blocking**: Divide population into homogeneous blocks, randomize within blocks

Skip for MAB experiments (adaptive nature handles this). Impact order: CUPED > Stratification > Blocking > Matched Pairs.

### Step 6: Risk Assessment

Evaluate and document:
- **Risk level**: LOW (minor impact, well-understood) / MEDIUM (moderate impact, some uncertainty) / HIGH (significant impact, novel treatment)
- **Blast radius**: % of users affected (0-100%)
- **Potential negative impacts**: List risks (performance, UX, revenue, compliance)
- **Mitigation strategies**: Counter-measures for each risk
- **Rollback triggers**: Specific metrics/thresholds that trigger rollback
- **Circuit breakers**: Hard stops (e.g.
error rate >5%)

**Pre-launch checklist** (all 5 required):
1. Logging instrumented and verified
2. AA test passed (no SRM in control vs control)
3. Monitoring alerts configured
4. Rollback plan documented
5. Stakeholder approval obtained

Type-specific checklist additions:
- Cluster: Cluster definition validated, 20+ clusters available
- Switchback: Carryover effects assessed, period length validated
- Factorial: Factor independence verified
- MAB: Exploration budget sufficient, reward function defined
- Causal Inference: Identifying assumptions documented

### Step 7: Monitoring & Stopping Rules

Configure ongoing experiment monitoring:
- **Refresh frequency**: How often to check metrics (default: every 60 minutes)
- **SRM threshold**: p-value for Sample Ratio Mismatch detection (default: 0.001)
- **Multiple testing correction**: NONE, BONFERRONI (conservative), BENJAMINI_HOCHBERG (balanced), HOLM

**Stopping rules** (define at least one of each applicable type):
- **SUCCESS**: Stop when primary metric reaches statistical significance
- **FUTILITY**: Stop when effect is unlikely to reach significance (waste of traffic)
- **HARM**: Stop if guardrail metric crosses harm threshold

**Decision criteria** (Ship/Iterate/Kill):
- **Ship**: Primary significant + guardrails protected
- **Iterate**: Mixed results or secondary metrics show promise
- **Kill**: Primary null or harm detected

### Step 8: Summary & Export

Generate a name, hypothesis, and description:
- **Name**: Short descriptive name (e.g. "Checkout Flow Redesign Q1 2025")
- **Hypothesis**: Format — "If we [change], then [metric] will [direction] because [reason]"
- **Description**: Brief context on why this experiment is being run

Then produce the **Experiment Design Document** (see output format below).
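The deterministic hash-based bucketing recommended in Step 4 can be sketched as follows (a minimal illustration; the experiment and user IDs are made up, and weights are assumed to sum to 1):

```python
import hashlib


def assign_variant(user_id, experiment_id,
                   variants=("control", "treatment"), weights=(0.5, 0.5)):
    """Deterministic bucketing: hashing (experiment, user) yields the same
    variant on every call, so assignment is consistent without any storage."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point <= cumulative:
            return variant
    return variants[-1]  # guard against float rounding at the boundary
```

Salting the hash with the experiment ID keeps assignments independent across experiments, so a user bucketed into treatment in one test is not systematically bucketed into treatment in the next.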
## Inference Rules

When the user describes what they want to test, infer as much as possible:
- "test checkout flow" -> A/B Test, suggest Conversion Rate (BINARY, baseline ~0.05) + Page Load Time guardrail
- "compare pricing in different cities" -> Cluster Randomized, suggest Revenue per User (CONTINUOUS)
- "optimize content recommendations" -> MAB, suggest CTR (BINARY, baseline ~0.03)
- "analyze impact of last year's policy change" -> Causal Inference (DiD method)
- "test button color AND copy together" -> Factorial Design

For conversion rates, use BINARY with decimal baseline (0.03 = 3%).
For revenue/time/latency, use CONTINUOUS.
For counts (purchases, sessions, errors), use COUNT.
Always add at least one guardrail metric automatically.

## Output Format

At the end of Step 8, produce a complete design document in this format:

```markdown
# Experiment Design: [Name]

## Overview
- **Type**: [experiment type]
- **Hypothesis**: If we [change], then [metric] will [direction] because [reason]
- **Description**: [brief context]

## Metrics
| Metric | Category | Type | Direction | Baseline |
|--------|----------|------|-----------|----------|
| [name] | PRIMARY | [type] | [dir] | [value] |
| ... | ... | ... | ... | ... |

## Statistical Design
- **Alpha**: [value]
- **Power**: [value]
- **MDE**: [value]% (relative)
- **Sample size per variant**: [n]
- **Total sample size**: [N]
- **Estimated duration**: [days] days ([weeks] weeks)
- **Daily traffic**: [value]
- **Traffic allocation**: [split]

## Randomization
- **Unit**: [unit]
- **Bucketing**: [strategy]
- **Consistent assignment**: [yes/no]
- **Stratification**: [variables or "none"]

## Variance Reduction
[List enabled techniques with configuration, or "None applied"]

## Risk Assessment
- **Risk level**: [LOW/MEDIUM/HIGH]
- **Blast radius**: [n]%
- **Negative impacts**: [list]
- **Mitigation strategies**: [list]
- **Rollback triggers**: [list]
- **Circuit breakers**: [list]

### Pre-Launch Checklist
- [ ] Logging instrumented
- [ ] AA test passed
- [ ] Alerts configured
- [ ] Rollback plan documented
- [ ] Stakeholder approval
[+ type-specific items]

## Monitoring & Stopping Rules
- **Refresh frequency**: every [n] minutes
- **SRM threshold**: [value]
- **Multiple testing correction**: [method]

### Stopping Rules
| Type | Condition | Threshold |
|------|-----------|-----------|
| [type] | [description] | [value] |

### Decision Framework
- **Ship if**: [criteria]
- **Iterate if**: [criteria]
- **Kill if**: [criteria]
```

## Handling Questions

If the user asks a conceptual question (e.g. "what is MDE?", "why do I need guardrail metrics?"), answer it directly first (2-4 sentences), then steer back to the next uncompleted step.

If the user provides `$ARGUMENTS`, use that as the description of what they want to test and begin with Step 1 inference immediately.
diff --git a/skills/experiment-designer/experiment-types.md b/skills/experiment-designer/experiment-types.md
new file mode 100644
index 000000000..de28ade2a
--- /dev/null
+++ b/skills/experiment-designer/experiment-types.md
@@ -0,0 +1,154 @@
# Experiment Types Reference

## A/B Test

**When to use**: Discrete changes, sufficient traffic (>1,000 users/day), need clear causal evidence, treatment effect is expected to be immediate.

**Use cases**: Testing new features or UI changes, comparing algorithms or ranking systems, optimizing conversion funnels, testing marketing copy.

**Pros**: Simple to understand and implement, clean causal inference, well-established statistical methods, easy to analyze.

**Cons**: Can only test one change at a time, requires sufficient traffic, may take time to reach significance.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Examples**: New checkout flow vs current, two recommendation algorithms, green vs blue CTA button.

---

## Cluster Randomized Experiment

**When to use**: Treatment affects groups (cities, stores, schools), network effects or interference between users, marketplace experiments, geographic-based treatments. Need 20+ clusters.

**Use cases**: Driver incentive programs across cities, marketplace pricing by region, store-level promotions.

**Pros**: Handles network effects and spillover, suitable for group-level interventions, prevents contamination.

**Cons**: Requires larger sample sizes, fewer degrees of freedom, more complex analysis, need many clusters for power.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: CLUSTER, RANDOM, consistent assignment.
**Key parameters**:
- **ICC (Intra-Cluster Correlation)**: Proportion of variance between clusters (typically 0.01-0.1)
- **Cluster size**: Average number of units per cluster
- **Design Effect**: DEFF = 1 + (cluster_size - 1) x ICC. Inflates required sample size.

**Warnings**:
- Need at least 10 clusters per arm for reliable inference
- Balance clusters in size for stable estimates
- Larger ICC or cluster size = much larger sample sizes

---

## Switchback Experiment

**When to use**: Strong network effects, supply-demand coupling (marketplaces), user-level randomization causes issues, treatments can be switched quickly.

**Use cases**: Rideshare pricing algorithms, delivery dispatch systems, restaurant recommendations in food delivery.

**Pros**: Handles network effects well, all users experience both conditions, reduces time-of-day variance.

**Cons**: Carryover effects, longer duration, complex analysis, sensitive to non-stationarity.

**Defaults**: alpha=0.05, power=0.8, 2 variants, 50/50 split.

**Randomization**: SESSION, RANDOM, not consistent (users re-randomized each period).

**Key parameters**:
- **Number of periods**: Minimum 4 recommended
- **Period length**: Hours per switch (balance between more data and carryover risk)
- **Autocorrelation (rho)**: 0 to 1. Higher rho reduces effective sample size.
- **Effective multiplier**: (1 - rho) / (1 + rho). Applied to sample size.

**Warnings**:
- Assumes no carryover effects between periods
- Minimum 7 days to capture weekly patterns
- Non-stationarity (trends, seasonality) can bias results

---

## Causal Inference (Observational)

**When to use**: Cannot run a randomized experiment, analyzing historical data or policy changes, natural experiments, studying long-term effects.

**Use cases**: Policy change impact across regions (DiD), eligibility threshold effects (RDD), effect of education on earnings (IV).
**Pros**: Can analyze past data without running an experiment, useful for post-hoc analysis, study effects over longer periods, lower cost.

**Cons**: Stronger assumptions required, risk of confounding bias, more complex statistical methods, results may be less definitive.

**Defaults**: alpha=0.05, power=0.8, 2 groups.

**Methods**:

| Method | Key Assumption | Best For |
|--------|---------------|----------|
| **Difference-in-Differences (DiD)** | Parallel trends between groups pre-treatment | Policy changes with pre/post data |
| **Regression Discontinuity (RDD)** | Continuity at the threshold | Eligibility cutoffs, score-based assignment |
| **Propensity Score Matching (PSM)** | Overlap in covariates, no unmeasured confounders | Observational data with rich covariates |
| **Instrumental Variables (IV)** | Valid, strong instrument (relevant + exogenous) | When direct randomization impossible |

**Method-specific notes**:
- DiD: Serial correlation inflates variance. AR(1) VIF = (1 + rho) / (1 - rho)
- RDD: Sample size depends on bandwidth near threshold. Consult a statistician for bandwidth selection
- PSM: Effective sample size depends on match quality and covariate overlap
- IV: Weak instruments lead to biased estimates. Check first-stage F-statistic > 10

---

## Factorial Design

**When to use**: Testing multiple independent changes simultaneously, interested in interaction effects, have sufficient traffic, changes are unlikely to conflict.

**Use cases**: 2x2 button color x copy length, email subject x send time x personalization, landing page hero x headline x CTA.

**Pros**: Test multiple factors efficiently, detect interaction effects, more information per user, faster than sequential A/B tests.

**Cons**: More complex analysis, requires larger sample sizes, harder to interpret with many factors, false positive risk increases.

**Defaults**: alpha=0.05, power=0.8, 4 variants (2x2), 25/25/25/25 split.
**Randomization**: USER_ID, HASH_BASED, consistent assignment.

**Key parameters**:
- **Factors**: Each with a name and number of levels (e.g. Button Color with 2 levels)
- **Total cells**: Product of all factor levels (e.g. 2 x 3 = 6 cells)
- **Detect interaction**: If yes, sample size inflated ~4x per cell

**Warnings**:
- With many factors, total cells grow fast (3 factors with 3 levels each = 27 cells)
- Verify factor independence before running
- Consider Bonferroni correction for multiple comparisons

---

## Multi-Armed Bandit (MAB)

**When to use**: Continuous optimization is priority, opportunity cost of poor variants is high, content recommendations, personalization, can tolerate adaptive allocation.

**Use cases**: Content headline optimization, recommendation algorithm selection, promotional banner testing, push notification copy.

**Pros**: Minimizes opportunity cost, adapts to changing conditions, automatic traffic allocation, good for ongoing optimization.

**Cons**: Less interpretable than A/B tests, harder to establish causality, requires more sophisticated infrastructure, may converge to local optimum.

**Defaults**: alpha=0.05, power=0.8, 3 arms.

**Randomization**: REQUEST, RANDOM, not consistent (re-randomized each request).

**Key parameters**:
- **Number of arms**: 2 or more variants
- **Horizon**: Total observations expected (minimum 1,000)
- **Exploration rate (epsilon)**: 0 to 0.5.
Higher = more exploration, slower convergence
- **Exploration budget per arm**: (horizon x epsilon) / num_arms
- **Estimated regret**: epsilon x horizon x (arms - 1) / arms

**Notes**:
- Traditional fixed-sample statistical testing does not apply
- Sample size shown is the exploration budget per arm, not a significance calculation
- Skip variance reduction (adaptive nature handles this)
- Reward function must be well-defined before starting

diff --git a/skills/experiment-designer/metrics-library.md b/skills/experiment-designer/metrics-library.md
new file mode 100644
index 000000000..ed111f58d
--- /dev/null
+++ b/skills/experiment-designer/metrics-library.md
@@ -0,0 +1,64 @@
# Common Metrics Library

Use these as starting points when helping users select metrics. Adjust baselines to match the user's actual data when available.

## Binary Metrics

Baselines expressed as decimals (0.05 = 5%). Use proportions test for sample size calculation.

| Metric | Typical Category | Direction | Baseline | Notes |
|--------|-----------------|-----------|----------|-------|
| Conversion Rate | PRIMARY | INCREASE | 0.05 (5%) | Users completing a desired action |
| Click-Through Rate (CTR) | PRIMARY | INCREASE | 0.03 (3%) | Users clicking a specific element |
| Bounce Rate | GUARDRAIL | DECREASE | 0.40 (40%) | Users leaving without interaction |
| Signup Rate | PRIMARY | INCREASE | 0.02 (2%) | Visitors creating an account |
| Retention Rate (7-day) | PRIMARY | INCREASE | 0.25 (25%) | Users returning within 7 days |

## Continuous Metrics

Baselines as raw values. Provide variance (sigma^2) or standard deviation for sample size calculation. If unknown, default to CV=10% (variance = (baseline * 0.1)^2).
| Metric | Typical Category | Direction | Baseline | Variance | Notes |
|--------|-----------------|-----------|----------|----------|-------|
| Revenue per User | PRIMARY | INCREASE | $50 | 2,500 (sd=50) | Average revenue generated per user |
| Average Order Value | PRIMARY | INCREASE | $75 | 1,849 (sd=43) | Average transaction value |
| Session Duration | SECONDARY | INCREASE | 180s | 14,400 (sd=120) | Average session length in seconds |
| Page Load Time | GUARDRAIL | DECREASE | 1,500ms | 250,000 (sd=500) | Average page load in milliseconds |
| Engagement Score | SECONDARY | INCREASE | 7.5 | 6.25 (sd=2.5) | Composite engagement metric |

## Count Metrics

Baselines as average counts. Use Poisson approximation (variance = lambda = baseline).

| Metric | Typical Category | Direction | Baseline | Notes |
|--------|-----------------|-----------|----------|-------|
| Number of Purchases | PRIMARY | INCREASE | 1.5 | Purchases per user |
| Pages per Session | SECONDARY | INCREASE | 4.0 | Pages viewed per session |
| Error Count | GUARDRAIL | DECREASE | 0.1 | Errors encountered per user |
| Feature Usage Count | SECONDARY | INCREASE | 2.5 | Times a feature is used per user |

## Choosing the Right Metric Type

| If the metric is... | Use type | Baseline format |
|---------------------|----------|-----------------|
| A rate or percentage (conversion, CTR, bounce) | BINARY | Decimal (0.05 for 5%) |
| A dollar amount, duration, or score | CONTINUOUS | Raw value ($50, 180s) |
| A count of events per user | COUNT | Average count (1.5) |

## Metric Category Guidelines

- **PRIMARY**: The metric you're trying to move. Every experiment needs at least one.
- **GUARDRAIL**: Metrics that must not degrade. Every experiment needs at least one. Common guardrails: error rate, page load time, bounce rate, revenue (if not primary).
- **SECONDARY**: Interesting to observe but not the primary goal. Helps explain results.
- **MONITOR**: Operational metrics to track (e.g. traffic volume, latency). Not used in statistical analysis.

## Quick Metric Selection by Experiment Goal

| Goal | Suggested PRIMARY | Suggested GUARDRAIL |
|------|-------------------|---------------------|
| Improve conversion funnel | Conversion Rate | Bounce Rate, Page Load Time |
| Increase revenue | Revenue per User | Conversion Rate, Error Count |
| Boost engagement | Session Duration or Engagement Score | Bounce Rate, Error Count |
| Optimize content | CTR | Bounce Rate, Revenue per User |
| Reduce friction | Signup Rate or Conversion Rate | Page Load Time, Error Count |
| Improve retention | Retention Rate (7-day) | Revenue per User, Error Count |

diff --git a/skills/experiment-designer/statistics.md b/skills/experiment-designer/statistics.md
new file mode 100644
index 000000000..f2d811055
--- /dev/null
+++ b/skills/experiment-designer/statistics.md
@@ -0,0 +1,137 @@
# Statistical Methods Reference

## Sample Size Formulas

All formulas give the per-variant sample size for two-tailed tests with z-scores: z_alpha = Z(1 - alpha/2), z_beta = Z(power).

### Binary Metrics (Proportions Test)
```
n = 2 * (z_alpha + z_beta)^2 * p_pooled * (1 - p_pooled) / (p2 - p1)^2

where p_pooled = (p1 + p2) / 2
      p1 = baseline rate (decimal, e.g. 0.05 for 5%)
      p2 = p1 + absolute effect size
```

### Continuous Metrics (Two-Sample T-Test)
```
n = 2 * (z_alpha + z_beta)^2 * variance / effect_size^2

where variance = sigma^2 (if not known, default: (baseline * 0.1)^2)
      effect_size = baseline * (mde_pct / 100) for relative MDE
```

### Count Metrics (Poisson Approximation)
```
n = 2 * (z_alpha + z_beta)^2 * lambda / effect_size^2

where lambda = baseline count rate
```

## Adjustments

### Unequal Traffic Allocation
When control/treatment split is not 50/50:
```
adjusted_n = n * (1 + r)^2 / (4 * r)

where r = treatment_allocation / control_allocation
```

### Multiple Variants
When more than 2 variants (including control):
```
adjusted_n = n * (k - 1)

where k = number of variants
```
Consider multiple testing correction (Bonferroni, Holm, or Benjamini-Hochberg).

### Cluster Design Effect
For cluster randomized experiments:
```
DEFF = 1 + (m - 1) * ICC
adjusted_n = n * DEFF
clusters_per_arm = ceil(adjusted_n / m)

where m = average cluster size
      ICC = intra-cluster correlation (typically 0.01 - 0.1)
```

### Switchback Autocorrelation
For switchback experiments:
```
effective_multiplier = (1 - rho) / (1 + rho)
adjusted_n = ceil(n / effective_multiplier)
effective_periods = floor(num_periods * effective_multiplier)

where rho = temporal autocorrelation (0 to 1)
```

### Factorial Cells
For factorial designs:
```
total_cells = product of all factor levels
total_n = n_per_cell * total_cells

If detecting interactions: interaction_n = n_per_cell * 4 * total_cells
```

### Causal Inference: DiD Serial Correlation
```
VIF = (1 + rho) / (1 - rho)
adjusted_n = ceil(n * VIF)
```

### MAB Exploration Budget
For epsilon-greedy multi-armed bandit:
```
explore_budget = ceil(horizon * epsilon)
per_arm_explore = ceil(explore_budget / num_arms)
estimated_regret = epsilon * horizon * (arms - 1) / arms
```

## Duration Estimation

```
effective_daily_traffic = daily_traffic * (sum_of_allocation_pcts / 100)
days = ceil(total_sample_size / effective_daily_traffic)
weeks = ceil(days / 7)
```

Minimum recommended duration: **7 days** (to capture weekly patterns). Consider adding 2-3 buffer days for ramp-up/cool-down.

## MDE Feasibility Check

Given available traffic and max duration, calculate the achievable MDE. Note that available_n is the per-variant sample size, so divide total traffic by the number of variants:
```
available_n = (daily_traffic * max_days) / num_variants
achievable_mde = (z_alpha + z_beta) * sqrt(2 * variance / available_n)   # continuous
achievable_mde = (z_alpha + z_beta) * sqrt(2 * p * (1-p) / available_n)  # binary
achievable_mde = (z_alpha + z_beta) * sqrt(2 * lambda / available_n)     # count
```

If achievable MDE > target MDE, the experiment is not feasible with current traffic.

## Common Warnings

| Condition | Warning |
|-----------|---------|
| n < 100 per variant | Very small sample. Results may be unreliable |
| MDE < 1% relative | Very small effect needs large traffic or long duration |
| Alpha > 0.05 | Risk of false positives increases |
| Power < 0.8 | Risk of false negatives increases |
| Clusters < 10 per arm | Fewer clusters = unreliable inference |
| Cluster size imbalance | Balance clusters for stable estimates |
| Duration < 7 days | May miss weekly patterns |
| Duration > 90 days | Consider whether the experiment is worth running this long |

## Standard Defaults

| Parameter | Default | Notes |
|-----------|---------|-------|
| Alpha | 0.05 | 5% false positive rate (two-tailed) |
| Power | 0.80 | 80% chance of detecting a true effect |
| MDE | 5% relative | Smallest change worth detecting |
| Traffic allocation | 50/50 | Balanced split is most efficient |
| Variants | 2 | Control + treatment |
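As a sketch, the continuous-metric formula and the cluster design-effect adjustment above combine as follows (the baseline, ICC, and cluster size are illustrative, and this is not a substitute for a dedicated power-analysis library):

```python
from math import ceil
from statistics import NormalDist


def continuous_n(baseline, mde_pct, sd=None, alpha=0.05, power=0.8):
    """Per-variant n for a two-sample comparison of means (normal approximation)."""
    sd = sd if sd is not None else baseline * 0.1   # CV = 10% default
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    effect = baseline * mde_pct / 100               # relative MDE -> absolute effect
    return ceil(2 * z ** 2 * sd ** 2 / effect ** 2)


def cluster_adjust(n, cluster_size, icc):
    """Inflate per-variant n by the design effect; also return clusters per arm."""
    deff = 1 + (cluster_size - 1) * icc
    adjusted = ceil(n * deff)
    return adjusted, ceil(adjusted / cluster_size)


n = continuous_n(baseline=50, mde_pct=5, sd=50)     # revenue per user, sd = $50
adj_n, clusters = cluster_adjust(n, cluster_size=100, icc=0.05)
```

With cluster_size=100 and ICC=0.05 the design effect is 5.95, so clustering inflates the required sample nearly sixfold — which is why the warnings above stress ICC and cluster size.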