chore(experiments): revamp Experimentation docs (#8969)
1 parent 5b0f9da · commit 506a59f · 7 changed files with 179 additions and 173 deletions
---
title: Experiment significance
---

import { FormulaScreenshot } from 'components/FormulaScreenshot'

export const TrendExperimentCalculationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/trend-experiment-calculation-light.png"
export const TrendExperimentCalculationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/trend-experiment-calculation-dark.png"
export const FunnelExperimentCalculationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-experiment-calculation-light.png"
export const FunnelExperimentCalculationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-experiment-calculation-dark.png"
export const FunnelSignificanceLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-significance-light.png"
export const FunnelSignificanceDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-significance-dark.png"

Below are all the formulas and calculations we use to determine the significance of an experiment.

## Bayesian experimentation

In the field of experimentation, there are two primary statistical approaches: frequentist and Bayesian.

We adopt the Bayesian methodology because it directly answers the question: "Is variant A better than variant B?" This approach minimizes judgment errors, which are more common with the frequentist method.

> In a frequentist approach, you start with a null hypothesis, which typically represents the current state of things or no effect. For example, the null hypothesis might state that there is no difference between variant A and variant B.
> The goal is to collect enough data to disprove this null hypothesis. However, disproving the null hypothesis does not directly tell us that "A is better than B." It only tells us that there is a statistically significant difference between the two. This approach can often lead to misinterpretations, especially if the context of the difference isn't considered.

Our Bayesian experimentation method focuses on two key parameters during experiments:

1. **Probability of each variant being the best:** This metric helps us understand which variant is more likely to outperform the other.
2. **Significance of the results:** We evaluate whether the observed differences between variants are statistically meaningful.

## Funnel experiment calculations

Funnel experiments compare conversion rates. For example, if you want to measure the change in the conversion rate for subscribing to your site, you would use this type of experiment.

### 1. Probability of being the best

We use Monte Carlo simulations to determine the probability of each variant being the best.

Each variant can be modeled as a beta distribution, with the alpha parameter equal to the number of conversions and the beta parameter equal to the number of failures for that variant. For each variant, we sample from their respective distributions to get a conversion rate. We perform 100,000 simulation runs in our calculations.

The probability of a variant being the best is given by:

<FormulaScreenshot
  imageLight={FunnelExperimentCalculationLight}
  imageDark={FunnelExperimentCalculationDark}
  alt="Funnel experiment calculation"
  classes="rounded"
/>

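As a concrete sketch of this simulation (function and variable names are illustrative, not PostHog's implementation; the beta parameterization follows the description above):

```python
import random

def probability_of_being_best(conversions, failures, simulations=100_000, seed=0):
    """Monte Carlo estimate of P(variant is best) for funnel experiments.

    Each variant is modeled as Beta(alpha=conversions, beta=failures).
    """
    rng = random.Random(seed)
    wins = [0] * len(conversions)
    for _ in range(simulations):
        # Sample one conversion rate per variant and record which one won.
        rates = [rng.betavariate(c, f) for c, f in zip(conversions, failures)]
        wins[rates.index(max(rates))] += 1
    return [w / simulations for w in wins]

# Example: control converted 100/1000 users, test converted 130/1000.
probs = probability_of_being_best([100, 130], [900, 870])
```

With these inputs the test variant wins the large majority of simulation runs, so its probability of being the best is close to 1.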
### 2. Statistical significance

To calculate significance, we measure the expected loss, as described in [VWO's SmartStats whitepaper](https://vwo.com/downloads/VWO_SmartStats_technical_whitepaper.pdf).

To do this, we run a Monte Carlo simulation and calculate the loss as:

<FormulaScreenshot
  imageLight={FunnelSignificanceLight}
  imageDark={FunnelSignificanceDark}
  alt="Funnel significance"
  classes="rounded"
/>

This represents the expected loss in conversion rate if you choose any other variant. If this loss is below 1%, we declare the results significant.

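A minimal sketch of the expected-loss calculation, reusing the same beta model as the probability-of-best simulation (names are illustrative, not PostHog's code):

```python
import random

def expected_loss(conversions, failures, chosen, simulations=100_000, seed=0):
    """Expected conversion-rate loss from shipping `chosen` instead of the best variant.

    Each variant is modeled as Beta(alpha=conversions, beta=failures).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(simulations):
        rates = [rng.betavariate(c, f) for c, f in zip(conversions, failures)]
        # Loss this run: how much better the best variant was (0 if `chosen` won).
        total += max(max(rates) - rates[chosen], 0.0)
    return total / simulations

# Shipping the clearly better test variant carries only a tiny expected loss.
loss = expected_loss([100, 130], [900, 870], chosen=1)
significant = loss < 0.01  # below 1% → results declared significant
```

Picking the weaker variant instead would yield an expected loss close to the 3-percentage-point gap between the two conversion rates, well above the 1% threshold.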
## Trend experiment calculations

Trend experiments capture count data. For example, if you want to measure the change in the total count of clicks, you would use this type of experiment.

### 1. Probability of being the best

We use Monte Carlo simulations to determine the probability of each variant being the best.

Each variant can be modeled as a gamma distribution, with the shape parameter equal to the trend count and the exposure parameter equal to the relative exposure for that variant. For each variant, we sample from their respective distributions to get a count value. We perform 100,000 simulation runs in our calculations.

The probability of a variant being the best is given by:

<FormulaScreenshot
  imageLight={TrendExperimentCalculationLight}
  imageDark={TrendExperimentCalculationDark}
  alt="Trend experiment calculation"
  classes="rounded"
/>

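A sketch of the same simulation for count data (illustrative names; we assume the exposure enters as the gamma rate parameter, matching the description above — `gammavariate` takes scale = 1 / rate):

```python
import random

def trend_probability_of_being_best(counts, exposures, simulations=100_000, seed=0):
    """Monte Carlo estimate of P(variant is best) for trend (count) experiments.

    Each variant is modeled as Gamma(shape=count, rate=relative exposure),
    so a variant with low exposure has its sampled rate scaled up accordingly.
    """
    rng = random.Random(seed)
    wins = [0] * len(counts)
    for _ in range(simulations):
        samples = [rng.gammavariate(c, 1.0 / e) for c, e in zip(counts, exposures)]
        wins[samples.index(max(samples))] += 1
    return [w / simulations for w in wins]

# A variant with fewer raw counts but half the exposure can still win.
probs = trend_probability_of_being_best([1000, 800], [1.0, 0.5])
```

Here the second variant's 800 events came from half the exposure, so its implied rate (~1600) dominates, illustrating the note on exposure below.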
> **Trend experiment exposure**
>
> Trend experiments compare counts of events. Since count data can refer to the total count of events or the number of unique users, we use a proxy metric to measure exposure. The number of times the `feature_flag_called` event returns control or test is used as the respective exposure for the variant. This event is sent automatically when you call `posthog.getFeatureFlag()`.
>
> Note that a variant with a lower count can still have a higher probability of being the best if its exposure is much smaller. This is because the relative exposure is taken into account when calculating probabilities.

### 2. Statistical significance

To calculate significance, we measure p-values using a [Poisson means test](https://www.evanmiller.org/statistical-formulas-for-programmers.html#count_test). Results are significant when the p-value is less than 0.05.

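As an illustrative sketch of such a test: conditional on the total count, one variant's count is binomially distributed with success probability proportional to its exposure, and for large counts we can use a normal approximation to that binomial. This approximation and the names are ours, not necessarily PostHog's exact implementation:

```python
import math

def poisson_means_test(count1, exposure1, count2, exposure2):
    """Approximate two-sided p-value for whether two Poisson rates differ.

    Conditional on n = count1 + count2, count1 is distributed
    Binomial(n, exposure1 / (exposure1 + exposure2)); for large counts
    we approximate that binomial with a normal distribution.
    """
    n = count1 + count2
    p = exposure1 / (exposure1 + exposure2)
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = abs(count1 - mean) / sd
    # Two-sided tail probability of a standard normal at |z|.
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Equal exposures, clearly different counts → well below 0.05.
p_value = poisson_means_test(1000, 1.0, 1200, 1.0)
```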
## How do we determine final significance?

For your results and conclusions to be valid, any experiment must have significant exposure. For instance, if you test a product change and only one user sees the change, you can't extrapolate from that single user that the change will be beneficial or detrimental. This principle holds true for any simple randomized controlled experiment, such as those used in testing new drugs or vaccines.

Even with a large sample size (e.g. ~10,000 participants), results can still be ambiguous. For example, if the difference in conversion rates between variants is less than 1%, it becomes difficult to determine if one variant is truly better than the other. To achieve statistical significance, there must be a sufficient difference between the conversion rates given the exposure size.

PostHog computes this statistical significance for you automatically. We display on the results page when your experiment has reached statistically significant results, making it safe to draw conclusions and terminate the experiment.

In the early days of an experiment, data can vary wildly, and one variant can sometimes seem overwhelmingly better. In this case, our significance calculations might say that the results are significant, even though they shouldn't be trusted yet because more data is needed.

Therefore, we have additional criteria to determine what we call **final significance**. Before each variant in an experiment reaches 100 unique users, we default to considering the results as not significant. Additionally, if the combined probability of all test variants being the best is less than 90%, we also default to considering the results as not significant.

You'll see the green significance banner only when all three conditions are met:

- Each variant has more than 100 unique users.
- The statistical significance calculations confirm significance.
- The combined probability of all test variants being the best is greater than 90%.
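The three conditions above can be sketched as a simple gate (function and argument names are illustrative, not PostHog's internal API):

```python
def is_finally_significant(unique_users_per_variant, stats_significant, p_test_variants_best):
    """Gate raw statistical significance behind the two extra criteria.

    All three conditions must hold before the green banner is shown.
    """
    enough_users = all(n > 100 for n in unique_users_per_variant)
    return enough_users and stats_significant and p_test_variants_best > 0.9

# Early on, a "significant" result with only ~40 users per variant is still rejected.
early = is_finally_significant([40, 38], stats_significant=True, p_test_variants_best=0.97)
mature = is_finally_significant([480, 512], stats_significant=True, p_test_variants_best=0.97)
```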
---
title: Sample size and running time
---

import { FormulaScreenshot } from 'components/FormulaScreenshot'

export const LehrEquationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/lehr-equation-light.png"
export const LehrEquationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/lehr-equation-dark.png"
export const SampleSizeDeterminationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/sample-size-determination-light.png"
export const SampleSizeDeterminationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/sample-size-determination-dark.png"

When creating an experiment, we provide recommended running times and sample sizes based on the parameters you choose. These values serve as **estimates** for how long to run the experiment. You can end the experiment early if you observe a [significant effect](/docs/experiments/experiment-significance).

For trend experiments, we use [Lehr's equation](http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/styled-4/code-12/#poisson-distributed-or-count-data) to determine sample sizes.

<FormulaScreenshot
  imageLight={LehrEquationLight}
  imageDark={LehrEquationDark}
  alt="Lehr equation"
  classes="rounded"
/>

Here, lambda1 (λ1) represents the baseline count data from the past two weeks, and lambda2 (λ2) is calculated as `baseline count + MDE * (baseline count)`. The MDE (minimum detectable effect) is the minimum acceptable improvement you select in the UI.

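As a sketch, Lehr's quick formula for Poisson counts is commonly written as n = 4 / (√λ1 − √λ2)² per group, at roughly 80% power and a 5% significance level. This is an illustration of the formula above with the MDE expressed as a fraction, not a reproduction of PostHog's code:

```python
import math

def lehr_sample_size(baseline_count, mde):
    """Per-group sample size for a trend experiment via Lehr's equation.

    lambda2 = baseline_count + mde * baseline_count, as described above.
    `mde` is a fraction, e.g. 0.2 for a 20% minimum detectable effect.
    """
    lambda1 = baseline_count
    lambda2 = baseline_count + mde * baseline_count
    return 4 / (math.sqrt(lambda1) - math.sqrt(lambda2)) ** 2

# A larger MDE needs far fewer samples to detect.
n_small_effect = lehr_sample_size(baseline_count=10, mde=0.1)
n_large_effect = lehr_sample_size(baseline_count=10, mde=0.5)
```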
For funnel experiments, we use the general [sample size determination](https://en.wikipedia.org/wiki/Sample_size_determination) formula, with 80% power and a 5% significance level. The formula is as follows:

<FormulaScreenshot
  imageLight={SampleSizeDeterminationLight}
  imageDark={SampleSizeDeterminationDark}
  alt="Sample size determination"
  classes="rounded"
/>

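A textbook sketch of this kind of two-proportion sample size calculation (the z-values encode 80% power and two-sided 5% significance; the function name and the use of an absolute lift are our illustrative assumptions, not PostHog's code):

```python
import math

Z_ALPHA = 1.96    # two-sided 5% significance level
Z_BETA = 0.8416   # 80% power

def funnel_sample_size(baseline_rate, mde):
    """Per-variant sample size to detect an absolute lift of `mde`
    over `baseline_rate` in a two-proportion comparison.
    """
    p = baseline_rate
    return math.ceil(2 * (Z_ALPHA + Z_BETA) ** 2 * p * (1 - p) / mde ** 2)

# Smaller minimum detectable effects require far larger samples.
n_2pct = funnel_sample_size(0.10, 0.02)
n_5pct = funnel_sample_size(0.10, 0.05)
```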
## Minimum detectable effect (MDE)

When setting up an experiment, the minimum detectable effect (MDE) is a required parameter for estimating sample size and running time. The MDE represents the smallest change or difference that you want to be able to detect in your experiment. Essentially, it's the minimum improvement you would consider significant.

To make things easier, we provide an estimated MDE value by default. This enables you to create experiments quickly without getting stuck on this step. However, we encourage you to review and adjust this value based on your specific goals.

It's important to understand that **the recommended sample size decreases as the MDE increases.** In other words, if you are looking for a smaller improvement, you'll need a larger sample size to detect it. This is because smaller effects are harder to identify and require more data to ensure the results are statistically significant.

Conversely, if you're looking for a larger improvement, you won't need as many samples because the effect is easier to spot. This relationship helps ensure that your experiment is sensitive enough to detect the changes you're interested in.
This file was deleted.
---
title: Traffic allocation
---

By default, we use PostHog's multivariate [feature flags](/docs/feature-flags) to evenly assign people to variations (unless you choose to [run an experiment without feature flags](/docs/experiments/running-experiments-without-feature-flags)). The experiment feature flag is initialized automatically when you create your experiment.

In any experiment, there is one control group and up to nine test groups. Each user is randomly assigned to one of these groups based on their `distinctId`. This assignment is stable, meaning the same user will remain in the same group even when they revisit your page.

We achieve this by creating a SHA-1 hash from a combination of the feature flag key and the user's `distinctId`, converting the first 15 hexadecimal characters of this hash into a large integer, and then dividing this integer by a predefined large constant to normalize it to a float between 0 and 1. If this float is less than a specified threshold percentage, the feature is enabled for the user; otherwise, it is not.
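A minimal sketch of this hashing scheme. The `.` separator and the exact scaling constant are illustrative assumptions based on the description above, not necessarily PostHog's implementation:

```python
import hashlib

# 15 hex characters encode values up to 16**15 - 1, so dividing by this
# constant normalizes the hash to a float between 0 and 1.
LONG_SCALE = float(0xFFFFFFFFFFFFFFF)

def variant_hash(flag_key: str, distinct_id: str) -> float:
    """Deterministically map a user to a float in [0, 1] for a given flag."""
    combined = f"{flag_key}.{distinct_id}"  # assumed separator, for illustration
    digest = hashlib.sha1(combined.encode()).hexdigest()
    return int(digest[:15], 16) / LONG_SCALE

def is_enabled(flag_key: str, distinct_id: str, rollout_percentage: float) -> bool:
    # The feature is on when the user's hash falls below the threshold.
    return variant_hash(flag_key, distinct_id) < rollout_percentage / 100

# The assignment is stable: the same user always lands in the same bucket.
h1 = variant_hash("my-experiment", "user-42")
h2 = variant_hash("my-experiment", "user-42")
```

Because SHA-1 output is effectively uniform, roughly half of a large user population falls below a 50% threshold, which is what makes the even split across variants work.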
It's important to note that when dealing with low data volumes (less than 1,000 users per variant), the difference in variant exposure can be as much as 20%. This means a test variant could have only 800 people, while the control variant has 1,000. All our calculations take this exposure discrepancy into account.