chore(experiments): revamp Experimentation docs #8969

Merged
merged 28 commits on Jul 16, 2024
Changes from 4 commits

Commits (28):
d02015a  init (jurajmajerik, Jul 15, 2024)
d3b4483  fix (jurajmajerik, Jul 15, 2024)
50ccecc  fix (jurajmajerik, Jul 15, 2024)
e9ced54  fix (jurajmajerik, Jul 15, 2024)
c1dbc8f  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
9bab8b6  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
c9d155d  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
14d4f68  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
ce0f08f  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
ab39883  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
a285332  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
ac77f9d  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
8d60a50  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
d74a84a  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
df6f289  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
d401a74  Update contents/docs/experiments/experiment-significance.mdx (jurajmajerik, Jul 16, 2024)
571f5d3  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
2cd49dc  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
aac9009  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
2b91897  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
b92176c  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
517fde8  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
c0a743d  Update contents/docs/experiments/sample-size-running-time.mdx (jurajmajerik, Jul 16, 2024)
52746c2  Update contents/docs/experiments/traffic-allocation.mdx (jurajmajerik, Jul 16, 2024)
e3192f0  Update vercel.json (jurajmajerik, Jul 16, 2024)
c986ed4  fix (jurajmajerik, Jul 16, 2024)
04d495d  fix (jurajmajerik, Jul 16, 2024)
2290767  fix (jurajmajerik, Jul 16, 2024)
103 changes: 103 additions & 0 deletions contents/docs/experiments/experiment-significance.mdx
@@ -0,0 +1,103 @@
---
title: Experiment significance
---

import { FormulaScreenshot } from 'components/FormulaScreenshot'
export const TrendExperimentCalculationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/trend-experiment-calculation-light.png"
export const TrendExperimentCalculationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/trend-experiment-calculation-dark.png"
export const FunnelExperimentCalculationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-experiment-calculation-light.png"
export const FunnelExperimentCalculationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-experiment-calculation-dark.png"
export const FunnelSignificanceLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-significance-light.png"
export const FunnelSignificanceDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-significance-dark.png"

Below are all the formulas and calculations we use to determine the significance of an experiment.

## Bayesian experimentation

In the field of experimentation, there are two primary statistical approaches: frequentist and Bayesian.

We adopt the Bayesian methodology because it directly answers the question: "Is variant A better than variant B?" This approach minimizes judgment errors, which are more common with the frequentist method.

> In a frequentist approach, you start with a null hypothesis, which typically represents the current state of things or no effect. For example, the null hypothesis might state that there is no difference between variant A and variant B. The goal is to collect enough data to disprove this null hypothesis. However, disproving the null hypothesis does not directly tell us that "A is better than B." It only tells us that there is a statistically significant difference between the two. This approach can often lead to misinterpretations, especially if the context of the difference isn't considered.

Our Bayesian experimentation method focuses on two key parameters during experiments:

1. **Probability of each variant being the best:** This metric helps us understand which variant is more likely to outperform the other.
2. **Significance of the results:** We evaluate whether the observed differences between variants are statistically meaningful.

## Funnel experiment calculations

Funnel experiments compare conversion rates. For example, if you want to measure the change in the conversion rate for subscribing to your site, you would use this type of experiment.

#### 1. Probability of being the best

We use Monte Carlo simulations to determine the probability of each variant being the best. Each variant can be modeled as a beta distribution, with the alpha parameter equal to the number of conversions and the beta parameter equal to the number of failures for that variant. For each variant, we sample from their respective distributions to get a conversion rate. We perform 100,000 simulation runs in our calculations.

The probability of a variant being the best is given by:

<FormulaScreenshot
imageLight={FunnelExperimentCalculationLight}
imageDark={FunnelExperimentCalculationDark}
alt="Funnel experiment calculation"
classes="rounded"
/>
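
For illustration, here is a minimal sketch of this simulation using NumPy. The variant names, conversion counts, and random seed are made up for the example; this is not PostHog's actual implementation.

```python
import numpy as np

SIMULATIONS = 100_000
rng = np.random.default_rng(42)

# Hypothetical results: (conversions, failures) per variant
variants = {
    "control": (100, 900),  # 10% conversion rate
    "test": (120, 880),     # 12% conversion rate
}

# Sample a conversion rate from each variant's Beta(conversions, failures) distribution
samples = {
    name: rng.beta(conversions, failures, SIMULATIONS)
    for name, (conversions, failures) in variants.items()
}

# A variant "wins" a simulation run when its sampled conversion rate is the highest
stacked = np.vstack(list(samples.values()))
wins = (stacked == stacked.max(axis=0)).sum(axis=1)

for name, win_count in zip(samples, wins):
    print(f"P({name} is best) = {win_count / SIMULATIONS:.3f}")
```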

#### 2. Statistical significance

To calculate significance, we measure the expected loss, as described in [VWO's SmartStats whitepaper](https://vwo.com/downloads/VWO_SmartStats_technical_whitepaper.pdf).

To do this, we run a Monte Carlo simulation and calculate the loss as:

<FormulaScreenshot
imageLight={FunnelSignificanceLight}
imageDark={FunnelSignificanceDark}
alt="Funnel significance"
classes="rounded"
/>

This represents the expected loss in conversion rate if you choose any other variant. If this loss is below 1%, we declare the results significant.
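
As a rough sketch of the two-variant case (with hypothetical counts, not PostHog's actual code), the expected loss of shipping the test variant is the average amount of conversion rate you give up in the simulation runs where control turns out to be better:

```python
import numpy as np

SIMULATIONS = 100_000
rng = np.random.default_rng(42)

# Sampled conversion rates from each variant's Beta distribution (hypothetical counts)
control = rng.beta(100, 900, SIMULATIONS)
test = rng.beta(120, 880, SIMULATIONS)

# Expected loss of choosing "test": how much conversion rate is given up, on average,
# in the runs where "control" is actually better
expected_loss = np.mean(np.maximum(control - test, 0))

print(f"Expected loss = {expected_loss:.4f}")
print("Significant" if expected_loss < 0.01 else "Not significant")
```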

## Trend experiment calculations

Trend experiments capture count data. For example, if you want to measure the change in the total count of clicks, you would use this type of experiment.

#### 1. Probability of being the best

We use Monte Carlo simulations to determine the probability of each variant being the best. Each variant can be modeled as a gamma distribution, with the shape parameter equal to the trend count and the exposure parameter equal to the relative exposure for that variant. For each variant, we sample from their respective distributions to get a count value. We perform 100,000 simulation runs in our calculations.

The probability of a variant being the best is given by:

<FormulaScreenshot
imageLight={TrendExperimentCalculationLight}
imageDark={TrendExperimentCalculationDark}
alt="Trend experiment calculation"
classes="rounded"
/>
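
For illustration, a minimal sketch of this simulation is shown below. Treating the exposure parameter as the rate of the gamma distribution (i.e. `scale = 1 / relative exposure`) is an assumption made for this example, as are the counts and exposures themselves.

```python
import numpy as np

SIMULATIONS = 100_000
rng = np.random.default_rng(42)

# Hypothetical results: (event count, relative exposure) per variant
variants = {
    "control": (1000, 1.0),
    "test": (900, 0.8),  # fewer events, but also less exposure
}

# Sample an exposure-adjusted count from each variant's gamma distribution:
# shape = observed count, scale = 1 / relative exposure
samples = {
    name: rng.gamma(shape=count, scale=1.0 / exposure, size=SIMULATIONS)
    for name, (count, exposure) in variants.items()
}

stacked = np.vstack(list(samples.values()))
wins = (stacked == stacked.max(axis=0)).sum(axis=1)

for name, win_count in zip(samples, wins):
    print(f"P({name} is best) = {win_count / SIMULATIONS:.3f}")
```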

>**Trend experiment exposure**
>
>
>Trend experiments compare counts of events. Since count data can refer to the total count of events or the number of unique users, we use a proxy metric to measure exposure. The number of times the `feature_flag_called` event returns `control` or `test` is used as the respective exposure for the variant. This event is sent automatically when you call `posthog.getFeatureFlag()`.
>
>It's important to note that a variant with a lower raw count can still have a higher probability of being the best if its exposure is much smaller. This is because the relative exposure is taken into account when calculating probabilities.

#### 2. Statistical significance

To calculate significance, we measure p-values using a [Poisson means test](https://www.evanmiller.org/statistical-formulas-for-programmers.html#count_test). Results are significant when the p-value is less than 0.05.
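
One common way to carry out this test is via the equivalent conditional binomial formulation: under the null hypothesis of equal event rates, the split of the total count between the two variants follows a binomial distribution determined by their exposures. Below is a minimal sketch using SciPy, with hypothetical counts and exposures (not PostHog's actual code).

```python
from scipy.stats import binomtest

# Hypothetical event counts and relative exposures for the two variants
control_count, control_exposure = 1000, 1.0
test_count, test_exposure = 1150, 1.0

# Under the null hypothesis of equal rates, the control count follows
# Binomial(total count, control's share of the exposure)
p_null = control_exposure / (control_exposure + test_exposure)
result = binomtest(control_count, control_count + test_count, p_null)

print(f"p-value = {result.pvalue:.4f}")
print("Significant" if result.pvalue < 0.05 else "Not significant")
```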

## How do we determine final significance?

For your results and conclusions to be valid, any experiment must have significant exposure. For instance, if you test a product change and only one user sees the change, you can't extrapolate from that single user that the change will be beneficial or detrimental for your entire user base. This principle holds true for any simple randomized controlled experiment, such as those used in testing new drugs or vaccines.

Even with a large sample size (e.g., approximately 10,000 participants), results can still be ambiguous. For example, if the difference in conversion rates between variants is less than 1%, it becomes difficult to determine if one variant is truly better than the other. To achieve statistical significance, there must be a sufficient difference between the conversion rates given the exposure size.

PostHog computes this statistical significance for you automatically. We notify you when your experiment has reached statistically significant results, making it safe to draw conclusions and terminate the experiment.
Contributor
How do we notify you? In the experiment results screen or some place else?

Contributor Author
Hmm good point. There's no notification, we only show it in the results page.

In the early days of an experiment, data can vary wildly, and sometimes one variant can seem overwhelmingly better. In this case, our significance calculations might report the results as significant even though there isn't yet enough data to support that conclusion.

Therefore, we have additional criteria to determine what we call **final significance**. Until each variant in an experiment reaches 100 unique users, we default to considering the results not significant. Additionally, if the combined probability of all test variants being the best is less than 90%, we also default to considering the results not significant.

You'll see the green significance banner only when all three conditions are met, as illustrated in the sketch below:
- Each variant has more than 100 unique users.
- The statistical significance calculations confirm significance.
- The combined probability of all test variants being the best is greater than 90%.
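
Put together, this final check roughly amounts to the following sketch. The function and parameter names are hypothetical, made up for illustration only.

```python
def is_finally_significant(
    min_unique_users_per_variant: int,
    stats_test_significant: bool,
    probability_test_variants_best: float,
) -> bool:
    """Illustrative combination of the three conditions listed above."""
    return (
        min_unique_users_per_variant > 100
        and stats_test_significant
        and probability_test_variants_best > 0.9
    )

print(is_finally_significant(450, True, 0.97))  # True: all three conditions met
print(is_finally_significant(80, True, 0.97))   # False: not enough unique users yet
```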
40 changes: 40 additions & 0 deletions contents/docs/experiments/sample-size-running-time.mdx
@@ -0,0 +1,40 @@
---
title: Sample size and running time
---

import { FormulaScreenshot } from 'components/FormulaScreenshot'
export const LehrEquationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/lehr-equation-light.png"
export const LehrEquationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/lehr-equation-dark.png"
export const SampleSizeDeterminationLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/sample-size-determination-light.png"
export const SampleSizeDeterminationDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/sample-size-determination-dark.png"

When creating an experiment, we provide recommended running times and sample sizes based on the parameters you choose. Please note that these values serve as estimates of how long to run the experiment. You can end the experiment early if you observe a significant effect.

For trend experiments, we use [Lehr's equation](http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/styled-4/code-12/#poisson-distributed-or-count-data) to determine sample sizes.

<FormulaScreenshot
imageLight={LehrEquationLight}
imageDark={LehrEquationDark}
alt="Lehr equation"
classes="rounded"
/>

Here, lambda1 (λ1) represents the baseline count data from the past two weeks, and lambda2 (λ2) is calculated as `baseline count + MDE * (baseline count)`. The MDE (Minimum Detectable Effect) is the minimum acceptable improvement you select in the UI.
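
For illustration, here is a minimal sketch of this calculation. The constant 4 corresponds to roughly 80% power at a 5% significance level, and the baseline count and MDE values are made up for the example.

```python
import math

def lehr_sample_size(baseline_count: float, mde: float) -> float:
    """Per-variant sample size via Lehr's quick formula for count data."""
    lambda_1 = baseline_count
    lambda_2 = baseline_count + mde * baseline_count
    return 4.0 / (math.sqrt(lambda_1) - math.sqrt(lambda_2)) ** 2

# Example: a baseline of 20 events over the past two weeks and a 30% minimum detectable effect
print(round(lehr_sample_size(20, 0.3)))  # roughly 10 per variant
```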

For funnel experiments, we use the general [Sample size determination](https://en.wikipedia.org/wiki/Sample_size_determination) formula, with 80% power and 5% significance level. The formula is as follows:

<FormulaScreenshot
imageLight={SampleSizeDeterminationLight}
imageDark={SampleSizeDeterminationDark}
alt="Sample size determination"
classes="rounded"
/>

These values serve as estimates for how long to run the experiment. It's possible to conclude experiments early if you observe a significant effect.
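
For reference, one common form of this calculation looks like the sketch below (using SciPy for the normal quantiles). The baseline conversion rate and MDE are hypothetical, and treating the MDE as a relative improvement is an assumption made for this example.

```python
from scipy.stats import norm

def funnel_sample_size(p1: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-variant sample size for detecting a relative improvement of `mde` over conversion rate `p1`."""
    p2 = p1 * (1 + mde)                # conversion rate we want to be able to detect
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for a 5% significance level
    z_beta = norm.ppf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Example: 10% baseline conversion rate and a 20% minimum detectable effect (10% -> 12%)
print(round(funnel_sample_size(0.10, 0.20)))  # roughly 3,800 participants per variant
```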

## Minimum Detectable Effect (MDE)
When setting up an experiment, the Minimum Detectable Effect (MDE) is a required parameter for estimating sample size and running time. The MDE represents the smallest change or difference that you want to be able to detect in your experiment. Essentially, it's the minimum improvement you would consider significant.

To make things easier, we provide an estimated MDE value by default, so you can create experiments quickly without getting stuck on this step. However, we encourage you to review and adjust this value based on your specific goals. If you have a particular MDE in mind, feel free to set it accordingly.

It's important to understand that the recommended sample size decreases as the MDE increases. In other words, if you are looking for a smaller improvement, you'll need a larger sample size to detect it. This is because smaller effects are harder to identify and require more data to ensure the results are statistically significant. Conversely, if you're looking for a larger improvement, you won't need as many samples because the effect is easier to spot. This relationship helps ensure that your experiment is sensitive enough to detect the changes you're interested in.
28 changes: 0 additions & 28 deletions contents/docs/experiments/significance.mdx

This file was deleted.

11 changes: 11 additions & 0 deletions contents/docs/experiments/traffic-allocation.mdx
@@ -0,0 +1,11 @@
---
title: Traffic allocation
---

By default, we use PostHog's multivariate [feature flags](/docs/feature-flags) to assign people to variations (unless you choose to [run an experiment without feature flags](/docs/experiments/running-experiments-without-feature-flags)). The experiment feature flag is initialized automatically when you create your experiment.

In any experiment, there is one control group and up to nine test groups. Each user is randomly assigned to one of these groups based on their `distinctId`. This assignment is stable, meaning the same user will remain in the same group even when they revisit your page.

We achieve this by creating a SHA-1 hash from a combination of the feature flag key and the `distinctId`, converting the first 15 characters of this hash (in hexadecimal) into a large integer, and then dividing this integer by a predefined large constant to normalize it to a float between 0 and 1. If this float is less than a specified threshold percentage, the feature is enabled for the user; otherwise, it is not.
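
For illustration, a minimal sketch of this kind of bucketing logic is shown below. The `.` separator, the `LONG_SCALE` constant, and the function names are assumptions made for the example, not PostHog's exact implementation.

```python
import hashlib

LONG_SCALE = float(0xFFFFFFFFFFFFFFF)  # 15 hex digits; the large normalizing constant

def variant_hash(flag_key: str, distinct_id: str) -> float:
    """Map a (flag key, distinct ID) pair to a stable float between 0 and 1."""
    digest = hashlib.sha1(f"{flag_key}.{distinct_id}".encode("utf-8")).hexdigest()
    return int(digest[:15], 16) / LONG_SCALE

def assigned_variant(flag_key: str, distinct_id: str, variants: dict) -> str:
    """Pick a variant by walking cumulative rollout percentages."""
    value = variant_hash(flag_key, distinct_id)
    cumulative = 0.0
    for name, percentage in variants.items():
        cumulative += percentage
        if value < cumulative:
            return name
    return "control"  # fallback if percentages don't sum to 1

# Example: a 50/50 split; the same distinct ID always lands in the same variant
print(assigned_variant("my-experiment", "user-123", {"control": 0.5, "test": 0.5}))
```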

It's important to note that when dealing with low data volumes (less than 1,000 users per variant), the difference in variant exposure can be as much as 20%. This means a test variant could have only 800 people, while the control variant has 1,000. All our calculations take this exposure discrepancy into account.