Calculation of false discovery rate (FDR) and protein scores
PIA can automatically calculate the false discovery rate at the PSM, peptide and protein levels. For this, it is recommended that the spectra are identified using a merged target-decoy database, or that the results of separate target and decoy identifications are merged before being provided to PIA. While the internal decoy approach provided by search engines like Mascot or Proteome Discoverer is supported, an approach with a handcrafted target-decoy database is recommended.
For the PSM level FDR calculation, first a search engine (or pre-processing) score or probability needs to be provided. Furthermore, an accession pattern for the decoys needs to be given (unless the search engine internal approach is used, which needs to be selected if appropriate).
Based on the selected score/probability and the decoy pattern, the list of PSMs is first sorted and then traversed in descending order, starting with the highest scoring PSM. The FDR up to each PSM index i is calculated by the formula nrDecoys_i / nrTargets_i, where nrDecoys_i and nrTargets_i are the numbers of decoy and target PSMs up to index i. The index ensures that if several PSMs have the same score/probability, all of them are assigned the same FDR.
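The traversal described above can be sketched in a few lines of Python. This is only an illustration, not PIA's implementation: the function name `psm_fdr`, the `(score, accession)` tuple layout, the default decoy pattern `^DECOY_` and the choice to report 1.0 while no target has been seen yet are all assumptions made for the example.

```python
import re
from itertools import groupby


def psm_fdr(psms, decoy_pattern=r"^DECOY_"):
    """Return (psm, fdr) pairs, ordered from highest to lowest score.

    psms: list of (score, accession) tuples (hypothetical layout).
    decoy_pattern: regular expression marking decoy accessions (assumed).
    """
    decoy_re = re.compile(decoy_pattern)
    # Sort in descending order, so the highest scoring PSM comes first.
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    n_targets = 0
    n_decoys = 0
    result = []
    # Handle all PSMs with the same score as one index, so ties share one FDR.
    for _, tied in groupby(ranked, key=lambda p: p[0]):
        tied = list(tied)
        for _, accession in tied:
            if decoy_re.search(accession):
                n_decoys += 1
            else:
                n_targets += 1
        # Assumption: report 1.0 if only decoys were seen so far.
        fdr = n_decoys / n_targets if n_targets else 1.0
        result.extend((psm, fdr) for psm in tied)
    return result


example = [(90.0, "P1"), (80.0, "P2"), (70.0, "DECOY_P3"), (70.0, "P4"), (60.0, "P5")]
for psm, fdr in psm_fdr(example):
    print(psm, round(fdr, 3))
```

In the example, the two PSMs with score 70.0 form one index and therefore receive the same FDR of 1/3 (one decoy, three targets).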
In addition to the FDR, the q-value is calculated for each index. It is the lowest FDR value found at or after the given index in the score-sorted list of PSMs.
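The q-value definition above amounts to a running minimum taken from the end of the sorted list. A minimal sketch, assuming the FDR values are already given in score-sorted order (best PSM first); the function name `q_values` is hypothetical:

```python
def q_values(fdrs):
    """Return the q-value for each index of a score-sorted list of FDR values.

    The q-value at index i is the minimum FDR at or after index i.
    """
    q = [0.0] * len(fdrs)
    running_min = float("inf")
    # Walk from the worst score to the best and keep the smallest FDR seen.
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        q[i] = running_min
    return q


print(q_values([0.0, 0.0, 0.5, 0.5, 0.25]))
```

Unlike the raw FDR, the resulting q-values never decrease when moving down the sorted list, which makes them suitable for applying a threshold.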
Finally, to smooth the FDR distribution and to combine the results of several search engines or MS runs, the FDR score and the combined FDR score (Jones et al 2009, Proteomics, PMID: 19253293) can be calculated. In brief, the FDR score is an FDR value that is interpolated between the search engine scores. For the combined FDR score, the n-th root of the product of the FDR scores of the spectrum matches from n different search engine results for the same spectrum is used as the basis for a merged FDR calculation.
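As an illustration of the n-th root step only, the following sketch computes the geometric mean of the FDR scores that n search engines assigned to the same spectrum. Reading the n-th root as a geometric mean is an interpretation based on Jones et al., and the function name `average_fdr_score` is hypothetical; the subsequent merged FDR calculation on these values is not shown.

```python
import math


def average_fdr_score(fdr_scores):
    """Geometric mean of the FDR scores of one spectrum from n search engines."""
    if not fdr_scores:
        raise ValueError("at least one FDR score is required")
    n = len(fdr_scores)
    # n-th root of the product of the n per-engine FDR scores.
    return math.prod(fdr_scores) ** (1.0 / n)


# One spectrum identified by two search engines (made-up values).
print(average_fdr_score([0.01, 0.04]))
```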
If an alternative or more advanced method for estimating false positives at the PSM level, such as Percolator or the calculation of posterior error probabilities (PEPs), was applied externally to the PSMs, the FDR calculation performed by PIA need not be executed at all.
For the peptide level, essentially the same options as for the PSM level FDR calculation are available. The only difference is that the PSMs are merged into peptides based on the amino acid sequence and, if selected, also the PTM states. In addition to the search engine scores, the PSM level (combined) FDR score can be used as the base score for the FDR calculation.
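The merge of PSMs into peptides can be pictured as grouping by a key that is either the sequence alone or the sequence together with its modifications. The sketch below assumes a hypothetical PSM layout (a dict with `sequence`, `modifications` and `score` entries) and is not taken from PIA's code:

```python
from collections import defaultdict


def group_psms_into_peptides(psms, consider_modifications=False):
    """Group PSMs into peptides by sequence and, optionally, PTM state.

    psms: list of dicts with the keys 'sequence', 'modifications'
    (a list of (position, name) tuples) and 'score' (hypothetical layout).
    """
    peptides = defaultdict(list)
    for psm in psms:
        if consider_modifications:
            # Sort so the order in which modifications are listed is irrelevant.
            key = (psm["sequence"], tuple(sorted(psm["modifications"])))
        else:
            key = (psm["sequence"], ())
        peptides[key].append(psm)
    return dict(peptides)


psms = [
    {"sequence": "PEPTIDEK", "modifications": [], "score": 50.0},
    {"sequence": "PEPTIDEK", "modifications": [(3, "Phospho")], "score": 40.0},
    {"sequence": "ELVISK", "modifications": [], "score": 30.0},
]
print(len(group_psms_into_peptides(psms)))        # grouped by sequence only
print(len(group_psms_into_peptides(psms, True)))  # grouped by sequence and PTM state
```

With the sequence-only key, the modified and unmodified forms of the same sequence end up in one peptide; with the PTM state included, they are kept apart.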
Before the merge into peptides, and therefore before the peptide level FDR calculation, all PSMs that do not pass a PSM level FDR threshold can (and should) be filtered out. A basic search engine score, Percolator values or PEPs can also be used for this filtering if no PSM level FDR calculation was performed or needed.
For the protein level, one of the inference algorithms (Report all, Occam's Razor or Spectrum Extractor) needs to be selected first. These always need a PSM level base score for the calculation of the final protein score. Any search engine score or post-processing score, such as Percolator values or PEPs, can be used as the base score if it was provided externally. The default is to use the PSM level (combined) FDR score. The PSMs should also be filtered before protein inference, e.g. at a PSM level FDR of 1%.
Besides the inference algorithm, the scoring method must also be selected. The recommended method depends on the selected base score; the two major methods are "additive scoring" and "multiplicative scoring". "Additive scoring" should be selected if the base score is a score like the "Mascot ion score", which is not capped and for which a higher score means a better identification. With this method, the protein score is calculated by adding up the best score of each of the protein's peptides or, if selected, of each PSM.
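A minimal sketch of additive scoring, assuming the protein's PSM scores are available as a mapping from peptide to a list of scores. The function name, the data layout and the reading of the per-PSM option as "sum over all PSMs" are assumptions made for this example:

```python
def additive_protein_score(peptide_psm_scores, use_all_psms=False):
    """Add up base scores for one protein (higher score means better).

    peptide_psm_scores: dict mapping a peptide to the list of its PSM scores
    (hypothetical layout).
    use_all_psms: if True, every PSM contributes; otherwise only the best
    PSM of each peptide does.
    """
    total = 0.0
    for scores in peptide_psm_scores.values():
        total += sum(scores) if use_all_psms else max(scores)
    return total


protein = {"PEPTIDEK": [30.0, 45.0], "ELVISK": [20.0]}
print(additive_protein_score(protein))                     # best PSM per peptide
print(additive_protein_score(protein, use_all_psms=True))  # every PSM
```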
When using scores like the FDR score or PEPs, "multiplicative scoring" should be applied. Here, the final protein score is the log10 of the product of each peptide's best PSM value. (Actually, to be numerically stable, the log10 values of the peptides are added up.) Thus a higher protein score means better evidence for the respective protein here as well.
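The following sketch illustrates why the logarithms are summed instead of multiplying the values directly: a product of many small values can underflow, while the sum of their logarithms stays well within range. Since FDR scores and PEPs lie between 0 and 1, their log10 values are zero or negative, so this sketch negates the sum to obtain a score where higher means better, in line with the statement above. The negation, the function name and the `floor` parameter (which guards against log10(0)) are assumptions of this example and are not confirmed by the text.

```python
import math


def multiplicative_protein_score(best_values, floor=1e-300):
    """Combine each peptide's best FDR score or PEP into one protein score.

    best_values: one value in [0, 1] per peptide, lower meaning better.
    floor: smallest value passed to log10 (assumption, avoids log10(0)).
    Returns the negated sum of the log10 values (sign is an assumption).
    """
    return -sum(math.log10(max(value, floor)) for value in best_values)


# Two peptides with best values 0.01 and 0.001 (made-up numbers).
print(multiplicative_protein_score([0.01, 0.001]))
```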
Finally, PIA can calculate the FDR and q-value for the proteins in the same way as described for the PSM level, but using the protein scores for the sorting.