Bayesian Protein Fold Change Analysis

Robert Millikin edited this page Aug 26, 2019 · 55 revisions

The Bayesian protein fold-change analysis is a new feature of FlashLFQ, starting in version 1.0.0. It is inspired by Triqler, though there are a number of differences.

You may wish to complete the vignette and take a look at the Bayesian quantification results as you read this Wiki article.

This algorithm is designed to answer two questions for each protein:

  1. How much is the protein changing from condition A to condition B?
  2. What is the probability that the protein's change is noise (caused by biological/sample prep/instrument/etc. variance)?

We will do this in a few steps:

  1. Estimate the fold-change of each protein between two conditions
  2. Estimate "noise" in the fold-change measurements between two conditions
  3. Perform a Bayesian null hypothesis significance test, where the "noise" determined in step 2 is the null hypothesis

General: Estimating the fold-change of a protein via Bayesian statistics

We will begin with the assumption that we are analyzing bottom-up proteomics data, though the methods described here are extensible to many other types of data.

FlashLFQ takes in a list of peptide identifications and quantifies them via MS1 peakfinding. The peptides are digestion products of proteins that were in the sample. We would like to know the quantity of the protein in each sample. We will assume that we are performing a relative quantification; calculating the absolute quantity of a protein in a sample is outside the scope of this article, though it could be done with this method given peptide standards. In other words, we would like to quantify a protein in a sample relative to another sample; we would like to quantify a change in abundance.

Once FlashLFQ has performed the peptide-level quantification, we can use Bayesian statistics to estimate how much a protein is changing between any two sets of samples. We could also do multi-set analysis (similar to ANOVA) but we will leave that for another day.

First, we generate a list of fold-change measurements for the list of peptides that constitute the protein. These are usually fairly noisy. An example from the vignette is provided in the following table.

| Protein | Peptide | Log2 Peptide Intensity Sample1 | Log2 Peptide Intensity Sample2 | Log2 Change |
| --- | --- | --- | --- | --- |
| A0A1M0SW71 | HELAQLLGFENYAFK | 18.464 | 19.438 | 0.974 |
| A0A1M0SW71 | LVAVVPSPSWGR | 20.018 | 20.988 | 0.970 |
| A0A1M0SW71 | QLEFGLFDFR | N/A | 18.696 | N/A |
| A0A1M0SW71 | TNPLLTPFELPPFSK | 19.656 | 20.733 | 1.078 |
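
The log2 changes in this table can be reproduced directly from the intensity columns: because the intensities are already log2-transformed, the log2 fold-change is a simple difference. A quick sketch in Python:

```python
# Log2 peptide intensities from the table above (vignette protein A0A1M0SW71);
# None marks the missing QLEFGLFDFR measurement in Sample1.
sample1 = [18.464, 20.018, None, 19.656]
sample2 = [19.438, 20.988, 18.696, 20.733]

# Subtracting log2 intensities gives the log2 fold-change;
# pairs with a missing value are skipped.
log2_changes = [b - a for a, b in zip(sample1, sample2) if a is not None]
print(log2_changes)  # ~[0.974, 0.970, 1.077] (the table shows 1.078 due to rounding)
```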

We use fold-change rather than absolute intensity because each peptide ionizes with a different efficiency. If we compared absolute intensities in a 2-sample comparison, we would be comparing (18.5, 20.0, 19.7) vs. (19.4, 20.1, 18.7, 20.7), and the statistical significance of such a comparison would be poor. Alternatively, we could estimate the ionization efficiency of each peptide from the data; for now, though, we have opted for the simpler fold-change approach.

In the above example, the data points (the log changes) are fairly precise, clustered around 1.0. One value is missing, however. We could impute an intensity in place of the missing value. For now, we will continue on with the three data points we have.

We would like to know, primarily, the mean of these values. Some measurement of statistical significance or uncertainty would be useful, too. One option would be to calculate the arithmetic mean and standard deviation of these 3 numbers and perform a t-test. Instead, for reasons described in step 3, we will use a method of Bayesian estimation originally described by John Kruschke. We have made modifications to Kruschke's method where appropriate; these changes are annotated in the text, with justifications.

At the heart of Bayesian statistics is Bayes' Rule. We will use Bayes' Rule to calculate a probability ("posterior probability") of a fold-change given an initial belief in the probability of a particular fold-change ("prior probability") and some data. To accomplish this, we will fit Student's t-distribution to the peptide-level fold-changes, and use the mean as an estimate of the fold-change. This is not too different from simply taking the average of the peptide-level fold-changes and using that value as the protein's fold-change, except that it is more robust to outliers. It also provides a distribution of estimates instead of a single point estimate, as well as a means to perform a null hypothesis significance test; these properties are explored further in step 3.

Student's t-distribution has the parameters mu (mean), sigma (standard deviation), and nu ("degrees of freedom", or degree of normality). FlashLFQ estimates all three of these parameters from the peptide fold-change datapoints, but we are primarily concerned with the mean, which we will use as an estimate of the protein's fold-change. We will also use the estimated standard deviation in the estimation of "noise".

To use Bayes' Rule, we need to define the prior probability of a particular fold-change before any data is taken into account. Because in principle the fold-change could take practically any value, we should use a very uncertain prior. We will represent the prior probability distribution for the estimate of the mean as another Student's t-distribution [note: this is slightly different from Kruschke's normally-distributed prior]. The standard deviation of this distribution should be very high (100 times the data's standard deviation), because we want to consider a wide range of possibilities for the fold-change. We will center the t-distribution at the arithmetic mean of the data points, which in this case is (0.974 + 0.970 + 1.078) / 3 = 1.007, because this seems like a reasonable place to default to. The nu ("degrees of freedom") parameter we will specify as 1.0, to make the tails heavy; again, because we are fairly uncertain what the actual fold-change is.
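
A minimal sketch of how this diffuse prior could be constructed from the data, written with a generic location-scale Student's t density (this is illustrative Python, not FlashLFQ's implementation):

```python
import math
from statistics import mean, stdev

def t_pdf(x, mu, sigma, nu):
    """Density of a location-scale Student's t-distribution."""
    z = (x - mu) / sigma
    coefficient = (math.gamma((nu + 1) / 2)
                   / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma))
    return coefficient * (1 + z * z / nu) ** (-(nu + 1) / 2)

# Peptide-level log2 fold-changes from the table above:
data = [0.974, 0.970, 1.078]

# Diffuse prior on the mean: centered at the arithmetic mean of the data,
# sigma = 100x the data's standard deviation, nu = 1 for heavy tails.
prior_mu = mean(data)            # ~1.007
prior_sigma = 100 * stdev(data)
prior_nu = 1.0

# The prior is nearly flat across any plausible fold-change; its density
# at 0 and at 1 is almost identical:
print(t_pdf(0.0, prior_mu, prior_sigma, prior_nu))
print(t_pdf(1.0, prior_mu, prior_sigma, prior_nu))
```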

Now that we have our datapoints and we have defined the prior, we need to estimate the posterior probability distribution. This is mathematically challenging. Instead of trying to derive and evaluate an integral, we will sample from the posterior probability distribution using a Markov Chain Monte Carlo (MCMC) algorithm. Each iteration of the MCMC algorithm generates one Student's t-distribution that is hypothesized to represent the data. The algorithm 1) calculates the likelihood that the distribution represents the data, 2) calculates the likelihood of the distribution occurring, based on the prior probabilities for its parameters, 3) combines them with Bayes' Rule, and 4) adapts towards more likely distributions.

The following is an example of a likelihood calculation. The top and bottom panels each show a t-distribution generated by one iteration of the MCMC algorithm. The likelihood of the hypothesized distribution representing the data can be calculated by multiplying the probability density at each datapoint. The bottom t-distribution has a higher likelihood of representing the data than the top distribution. The actual algorithm uses a random number generator to propose new hypothesized t-distributions, and accepts or rejects each proposal based on the change in likelihood.
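
The likelihood calculation itself is just a product of densities. A small illustration, comparing a hypothesized t-distribution centered far from the data against one centered near it (the parameter values here are arbitrary examples, not FlashLFQ's):

```python
import math

def t_pdf(x, mu, sigma, nu):
    """Density of a location-scale Student's t-distribution."""
    z = (x - mu) / sigma
    coefficient = (math.gamma((nu + 1) / 2)
                   / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma))
    return coefficient * (1 + z * z / nu) ** (-(nu + 1) / 2)

# Peptide-level log2 fold-changes from the table above:
data = [0.974, 0.970, 1.078]

def likelihood(mu, sigma, nu):
    # Multiply the density at each datapoint (independent observations).
    result = 1.0
    for x in data:
        result *= t_pdf(x, mu, sigma, nu)
    return result

# A hypothesis centered far from the data vs. one centered near it:
poor_fit = likelihood(mu=0.0, sigma=0.1, nu=3.0)
good_fit = likelihood(mu=1.0, sigma=0.1, nu=3.0)
print(poor_fit, good_fit)  # the second is larger by many orders of magnitude
```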

The algorithm also combines the likelihood of the distribution representing the data with the prior likelihood of the distribution occurring in the first place based on the priors; this is not depicted here.

We can perform as many iterations of the MCMC algorithm as we like; by default, FlashLFQ will perform 1000 initial iterations and disregard these (a "burn-in" phase), followed by another 3000 iterations, which are kept. These distributions should look fairly similar if the data is precise with a large number of data points, and look dissimilar if the data is imprecise or with few data points.
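
Putting the pieces together, the sampling scheme described above can be sketched as a minimal Metropolis sampler with a 1000-iteration burn-in and 3000 kept iterations. This is illustrative Python, not FlashLFQ's implementation; in particular, the flat priors on sigma and nu over broad ranges are simplifying assumptions of this sketch:

```python
import math
import random
from statistics import mean, stdev, median

def t_logpdf(x, mu, sigma, nu):
    """Log-density of a location-scale Student's t-distribution."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log(1 + z * z / nu))

# Peptide-level log2 fold-changes from the table above:
data = [0.974, 0.970, 1.078]
prior_center, prior_scale = mean(data), 100 * stdev(data)

def log_posterior(mu, sigma, nu):
    # Heavy-tailed t prior on mu; flat priors on sigma and nu within broad
    # bounds (a simplification for this sketch).
    if not (1e-4 < sigma < 100.0 and 1.0 <= nu <= 100.0):
        return -math.inf
    return (t_logpdf(mu, prior_center, prior_scale, 1.0)
            + sum(t_logpdf(x, mu, sigma, nu) for x in data))

random.seed(1)
mu, sigma, nu = prior_center, stdev(data), 3.0
current = log_posterior(mu, sigma, nu)
burn_in, kept = 1000, 3000
mu_samples = []
for i in range(burn_in + kept):
    # Propose a new t-distribution with symmetric Gaussian jitter,
    # then accept or reject based on the change in posterior probability.
    proposal = (mu + random.gauss(0, 0.05),
                sigma + random.gauss(0, 0.05),
                nu + random.gauss(0, 0.5))
    candidate = log_posterior(*proposal)
    if math.log(random.random()) < candidate - current:
        mu, sigma, nu = proposal
        current = candidate
    if i >= burn_in:  # discard the burn-in phase
        mu_samples.append(mu)

fold_change = median(mu_samples)
print(round(fold_change, 3))  # point estimate of the log2 fold-change, near 1.0
```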

The following plot shows 20 representative distributions fit to the three datapoints described above.

Once the MCMC algorithm has finished generating the t-distributions, we can estimate the mean, standard deviation, and nu from them. If 3000 t-distributions have been generated, then we have 3000 estimates of the mean, 3000 estimates of the standard deviation, and 3000 estimates of nu. Plotting this data with a histogram shows the posterior probability distribution for each parameter.

The following plots show posterior distributions for the 3 datapoints described above, with 300000 MCMC iterations performed.

We can also get a point estimate out of the distribution by simply averaging the values, or taking the median. FlashLFQ uses the median of the values as the point estimate.

Step 1: Estimate protein change between two conditions

We estimate the fold-change of each protein between the samples of the two conditions (i.e., samples of condition A vs. samples of condition B) using the technique described above. Specifically, we fit a series of t-distributions to the peptide-level fold-changes for each protein. The prior probability distributions for the mean, standard deviation, and nu parameters are weak/diffuse. After the series of t-distributions have been fit to a protein's peptide-level fold-change data, we estimate the protein's (log2-transformed) fold-change by aggregating all of the means of the t-distributions and taking the median value.

Step 2: Estimate noise

Second, we want to estimate what counts as a non-important change for a protein. For example, if a protein is very confidently changing by 0.01 log2 units, we may not really care; this is very close to no change. We could impose some lower limit on the magnitude of the change that we care about (a "fold-change cutoff"). However, it would be best to have some algorithmic or quantitative way to determine what this cutoff should be. We will use an empirical null strategy for this.

Each protein has a point estimate of the mean, standard deviation, and nu. FlashLFQ estimates the noise from the standard deviation of the typical protein. We could take the median protein's standard deviation, but the distribution of standard deviations tends to be skewed, making the median too conservative for an estimate of the noise. Instead, we use the mode (the most common value) of the distribution. We estimate the mode by first finding the smallest interval that represents 10% of the standard deviations (i.e., the most dense part of the distribution). The median of the data within this interval is used as an estimate of the mode. This number is used as the fold-change cutoff (the null hypothesis in the next step). This is a heuristic; future versions of FlashLFQ may alter this specific calculation, but the principle will likely largely stay the same.
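
The mode-estimation heuristic can be sketched as follows, using hypothetical stand-in data for the distribution of protein standard deviations (a skewed lognormal sample here; the real distribution comes from the fitted proteins):

```python
import random
from statistics import median

def estimate_mode(values, fraction=0.1):
    """Estimate the mode as the median of the narrowest interval that
    contains `fraction` of the sorted values (the heuristic described above)."""
    xs = sorted(values)
    width = max(1, int(len(xs) * fraction))
    # Find the narrowest window of `width` consecutive sorted values,
    # i.e., the densest part of the distribution.
    start = min(range(len(xs) - width + 1),
                key=lambda i: xs[i + width - 1] - xs[i])
    return median(xs[start:start + width])

# Hypothetical skewed distribution of per-protein standard deviations:
random.seed(0)
sds = [random.lognormvariate(-2.5, 0.6) for _ in range(500)]

mode_estimate = estimate_mode(sds)
print(mode_estimate, median(sds))  # the mode sits below the skewed median
```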

The following figure shows the distribution of protein standard deviations in the vignette. The green dashed line shows FlashLFQ's estimate of the mode. You can either specify your own fold-change cutoff in FlashLFQ, or the software can use the estimate of the mode standard deviation as the null hypothesis.

Step 3: Perform Bayesian null hypothesis significance testing

Frequentist methods for null hypothesis significance testing (e.g., the t-test) are commonly used in proteomics, but we choose a Bayesian approach here for three main reasons:

  1. It is easier to interpret the result. The result of a Bayesian hypothesis test is called a posterior error probability (PEP). This is the probability that a protein's fold-change falls below the specified noise level. In other words, if a protein's PEP is 0.3, there is a 30% chance that the protein's fold-change is noise (symbolically, P(H0|D), the probability that the null hypothesis is true given the data). If a t-test is performed instead, the resulting p-value tells you the probability of obtaining data at least as extreme as what was observed, given that the null hypothesis is true (symbolically, P(D|H0)). The p-value is fairly difficult to interpret; if 1000 t-tests were performed on proteins that are truly unchanging, about 10 of those tests would yield p-values less than 0.01 by random chance alone (0.01 * 1000 = 10). You would need to perform a multiple testing correction to account for this (e.g., by estimating the false discovery rate). The FDR is certainly more understandable than the p-value, but the p-value itself is not a very good measurement of "importance" because of the t-test's point null hypothesis.
  2. The null hypothesis makes more sense. The null hypothesis of the t-test is that the fold-change of the protein is exactly zero. This is a priori not true, because even if the change is very small (e.g., 0.01) it is not exactly equal to zero. Thus we must introduce a fold-change cutoff to filter out high-precision but unimportant fold-changes. This fold-change cutoff, however, has no basis in frequentist statistics. It would be better if we could specify our own interval null hypothesis (e.g., -0.1 to 0.1), but this is not possible with a t-test. It is, however, possible in Bayesian statistics.
  3. It is more robust to outlier measurements. The underlying assumption of a t-test is that the data is normally distributed (the central limit theorem works pretty well, if you have enough data). Quantitative mass spectrometry-based proteomics data, however, is fairly sparse and noisy. It would be better if we could specify an underlying model for the data that is more robust to outliers instead of a normal distribution, such as a Student's t-distribution, but this is not possible with a t-test. It is possible in Bayesian statistics.

There are several ways to perform a hypothesis test with Bayesian statistics. We based ours on Kruschke's estimation (linked above), except that we specify a different prior probability distribution for the mean. We will perform the same Markov Chain Monte Carlo algorithm as before, except we will no longer use a weak/diffuse prior for the mean, but rather a specifically chosen prior to reflect the theoretical principles behind a null hypothesis significance test. We will also incorporate the fact that the null hypothesis is an interval, rather than a single value (e.g., [-0.2 to 0.2] instead of exactly zero).

A priori, given no data, we would be uncertain whether a protein is significantly changing, because there is no evidence to classify the protein as signal or noise. This view should be reflected by the prior probability distribution, because it is the probability distribution before any data (evidence) is taken into account. Mathematically, we translate this to a 50% probability of the null hypothesis being true and a 50% probability of the alternative hypothesis being true, with the two being mutually exclusive. Thus, we want a prior probability distribution for the mean in which 50% of the probability is contained within the interval null hypothesis and 50% is outside of it. We will use a Student's t-distribution centered at a fold-change of zero, with a standard deviation equal to the fold-change cutoff and a nu (degrees of freedom) of 1.0, as the prior probability distribution for the mean. We chose a heavy-tailed Student's t-distribution instead of Kruschke's normal distribution because it reflects the belief that it is perfectly reasonable to have a large fold-change, given sufficient evidence; a normal prior would not perform well if the protein changed by a large amount. This is why nu is set to 1.0 - it results in heavy tails. The sigma (standard deviation) of the t-distribution is set equal to the fold-change cutoff because this makes the cumulative probability contained within the interval null hypothesis equal to 50%.
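
This 50% property can be checked directly: a Student's t-distribution with nu = 1 is the Cauchy distribution, whose CDF has a closed form. A quick check (the 0.2 cutoff here is an arbitrary example):

```python
import math

def cauchy_cdf(x, loc=0.0, scale=1.0):
    """CDF of Student's t with nu = 1 (the Cauchy distribution)."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi

cutoff = 0.2  # example interval null hypothesis: [-0.2, 0.2]

# With the prior centered at 0 and sigma set equal to the cutoff, exactly
# half of the prior probability mass lies inside the null interval:
mass_inside = cauchy_cdf(cutoff, 0.0, cutoff) - cauchy_cdf(-cutoff, 0.0, cutoff)
print(mass_inside)  # 0.5
```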

Once we have specified our priors, we then perform the same MCMC sampling method as described above. The parameter that is of most interest to us is the mean; we want to know if the mean of the measurements is greater than the noise, and with what probability. The probability that the protein's fold-change is noise is the number of MCMC iterations in which the mean fell below the cutoff divided by the total number of MCMC iterations performed. So if 1000 out of the 3000 generated t-distributions' means fell below the fold-change cutoff, the PEP is ~0.333.
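
The PEP calculation is then a simple fraction. A sketch with hypothetical sampled means, treating the null hypothesis as the interval [-cutoff, cutoff]:

```python
# Hypothetical means sampled by the MCMC chain, and a fold-change cutoff:
sampled_means = [0.10, 0.15, 0.25, 0.30, 0.35, 0.05]
cutoff = 0.2

# PEP = fraction of posterior samples whose mean falls inside the null interval.
pep = sum(abs(m) < cutoff for m in sampled_means) / len(sampled_means)
print(pep)  # 0.5 (3 of the 6 sampled means fall below the cutoff)
```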

In summary, we take our data (peptide fold-change measurements) and combine it with the prior probability distribution using Bayes' rule, resulting in a posterior probability distribution. We sample this distribution with a Markov Chain Monte Carlo algorithm. More MCMC samples results in a more precise estimate of the posterior probability distribution, at the cost of increased computation time.

Examples

The posterior distribution can be plotted, along with the data, the prior, and the null hypothesis, for visual inspection. The posterior distribution can look somewhat noisy with relatively few iterations of the MCMC algorithm and a broad posterior distribution. Increasing the number of MCMC iterations increases the precision of the estimates of the fold-change, standard deviation, the PEP, and the false discovery rate, but 3000 iterations per protein is generally sufficient to get a good estimate of these values.

The following are three example plots of the posterior probability distribution (in blue) along with the peptide-level fold-changes (i.e., the data, in red) for the peptides assigned to the annotated protein. The data is from the vignette. 300000 MCMC iterations were performed so that the probability distributions appear more smooth. The probability distribution was binned in 0.01 fold-change increments.