Large scale analysis with Philosopher pipeline

Felipe Leprevost edited this page Feb 18, 2020 · 2 revisions

In this example we will process and analyze the Clear Cell Renal Carcinoma cohort data from CPTAC 3 using MSFragger, Philosopher and TMT-Integrator. You will learn how to process a large dataset composed of multiple fractionated TMT-labeled samples. This tutorial contains the details needed to reproduce the published results. (There may be some small differences in the results, as our tools continue to be improved.)

We will need:

  • Philosopher (version 2.1.1 or higher)

  • MSFragger (version 2.3 or higher)

  • TMT-Integrator (version 2.3 or higher)

  • Java 8 Runtime Environment (required by MSFragger)

  • mzML files from the Clear Cell Renal Carcinoma data set from CPTAC 3 (instructions for download below)

  • A protein sequence database (see this example)

  • A computer server running GNU/Linux with at least 64GB of RAM

We ran this example on Red Hat Linux 7, so the commands shown below are Linux compatible. To reproduce this on a Windows machine, you will need to adjust the folder separators ('\' on Windows, '/' on Linux).

Download the data set

The CPTAC 3 data can be downloaded from the NIH/CPTAC data portal, which may require installation of the IBM Aspera Connect browser extension and application. You'll also need to agree to the terms of use.

Select the mzML files you want to download. In this example we will use the 'Proteome' (non-phospho enriched) fraction. No file conversion is needed because we are using the mzML files provided by the consortium, but you will need to unzip/decompress the files.

Organize the workspace

Start by creating a folder for the entire analysis called 6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma. Inside it, create a folder called whole for the whole proteome data we've downloaded. This directory should hold 23 folders, one per multiplexed TMT-labeled sample, each containing all fractions for that sample. Also create a folder called bin for the software tools we will use, a folder called params for the parameter files, and a folder called database for our protein sequence FASTA file.

The workspace structure should look like this:

6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma
├── whole
│   ├── 01CPTAC_CCRCC_W_JHU_20171007
│   ├── 02CPTAC_CCRCC_W_JHU_20171003
│   ├── 03CPTAC_CCRCC_W_JHU_20171022
│   ├── 04CPTAC_CCRCC_W_JHU_20171026
│   ├── 05CPTAC_CCRCC_W_JHU_20171030
│   ├── 06CPTAC_CCRCC_W_JHU_20171120
│   ├── 07CPTAC_CCRCC_W_JHU_20171127
│   ├── 08CPTAC_CCRCC_W_JHU_20171205
│   ├── 09CPTAC_CCRCC_W_JHU_20171215
│   ├── 10CPTAC_CCRCC_W_JHU_20180119
│   ├── 11CPTAC_CCRCC_W_JHU_20180126
│   ├── 12CPTAC_CCRCC_W_JHU_20180202
│   ├── 13CPTAC_CCRCC_W_JHU_20180215
│   ├── 14CPTAC_CCRCC_W_JHU_20180223
│   ├── 15CPTAC_CCRCC_W_JHU_20180315
│   ├── 16CPTAC_CCRCC_W_JHU_20180322
│   ├── 17CPTAC_CCRCC_W_JHU_20180517
│   ├── 18CPTAC_CCRCC_W_JHU_20180521
│   ├── 19CPTAC_CCRCC_W_JHU_20180526
│   ├── 20CPTAC_CCRCC_W_JHU_20180602
│   ├── 21CPTAC_CCRCC_W_JHU_20180621
│   ├── 22CPTAC_CCRCC_W_JHU_20180625
│   └── 23CPTAC_CCRCC_W_JHU_20180629
├── bin
│   ├── MSFragger-2.3.jar
│   └── philosopher
├── params
│   ├── fragger.params
│   └── philosopher.yaml
└── database
    └── 2020-02-18-decoys-reviewed-contam-UP000005640.fas
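
The folder layout above can also be created programmatically. The following is a minimal Python sketch, not part of the Philosopher tooling: the function names workspace_dirs and create_workspace are hypothetical, and the folder names are copied from the tree above. It only creates empty directories; the tools, parameter files, database and mzML files still have to be placed by hand.

```python
from pathlib import Path
from typing import List

# Sample folder names copied from the workspace tree shown above.
SAMPLE_FOLDERS = [
    "01CPTAC_CCRCC_W_JHU_20171007", "02CPTAC_CCRCC_W_JHU_20171003",
    "03CPTAC_CCRCC_W_JHU_20171022", "04CPTAC_CCRCC_W_JHU_20171026",
    "05CPTAC_CCRCC_W_JHU_20171030", "06CPTAC_CCRCC_W_JHU_20171120",
    "07CPTAC_CCRCC_W_JHU_20171127", "08CPTAC_CCRCC_W_JHU_20171205",
    "09CPTAC_CCRCC_W_JHU_20171215", "10CPTAC_CCRCC_W_JHU_20180119",
    "11CPTAC_CCRCC_W_JHU_20180126", "12CPTAC_CCRCC_W_JHU_20180202",
    "13CPTAC_CCRCC_W_JHU_20180215", "14CPTAC_CCRCC_W_JHU_20180223",
    "15CPTAC_CCRCC_W_JHU_20180315", "16CPTAC_CCRCC_W_JHU_20180322",
    "17CPTAC_CCRCC_W_JHU_20180517", "18CPTAC_CCRCC_W_JHU_20180521",
    "19CPTAC_CCRCC_W_JHU_20180526", "20CPTAC_CCRCC_W_JHU_20180602",
    "21CPTAC_CCRCC_W_JHU_20180621", "22CPTAC_CCRCC_W_JHU_20180625",
    "23CPTAC_CCRCC_W_JHU_20180629",
]


def workspace_dirs(root: Path) -> List[Path]:
    """Return every directory of the layout: 23 sample folders plus bin, params, database."""
    dirs = [root / "whole" / sample for sample in SAMPLE_FOLDERS]
    dirs += [root / name for name in ("bin", "params", "database")]
    return dirs


def create_workspace(root: Path) -> None:
    """Create the (empty) directories; existing ones are left untouched."""
    for directory in workspace_dirs(root):
        directory.mkdir(parents=True, exist_ok=True)
```

Calling create_workspace(Path("6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma")) from the parent directory would create the empty layout. Because exist_ok is set, re-running it does not disturb files already in place.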

Inside each of the 23 sample folders under whole, place the mzML files corresponding to all fractions for that multiplexed sample, plus an annotation file for the TMT channels:

.
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f01.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f02.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f03.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f04.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f05.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f06.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f07.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f08.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f09.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f10.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f11.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f12.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f13.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f14.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f15.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f16.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f17.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f18.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f19.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f20.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f21.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f22.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f23.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f24.mzML
├── 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_fA.mzML
└── annotation.txt
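
Before starting a long run, it can help to confirm that every sample folder has the expected contents. The sketch below is an illustration under simple assumptions, with hypothetical function names: it only checks that at least one .mzML file and a file named annotation.txt are present. It does not check the number of fractions.

```python
from pathlib import Path
from typing import Iterable, List


def check_sample_files(names: Iterable[str]) -> List[str]:
    """Return a list of problems found in one sample folder's file names."""
    names = list(names)
    problems = []
    if not any(name.endswith(".mzML") for name in names):
        problems.append("no mzML files found")
    if "annotation.txt" not in names:
        problems.append("annotation.txt is missing")
    return problems


def check_sample_folder(folder: Path) -> List[str]:
    """Apply check_sample_files to the files actually present in a folder."""
    return check_sample_files(p.name for p in folder.iterdir())
```

An empty list means the folder passed both checks. Looping check_sample_folder over the folders in whole gives a quick report before the search is launched.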

The annotation file is a simple text file with mappings between the TMT channels and the sample labels, which is needed to generate the final reports. Each data set folder should contain a text file called annotation.txt with the mapping. Below is an example of the annotation file for the data set #01:

126 CPT0079430001
127N CPT0023360001
127C CPT0023350003
128N CPT0079410003
128C CPT0087040003
129N CPT0077310003
129C CPT0077320001
130N CPT0087050003
130C CPT0002270011
131N pool01

The given labels for each cohort and data set can also be found on the NIH CPTAC data portal (in the CPTAC_CCRCC_metadata folder).
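
Since a mistyped channel or a duplicated line in annotation.txt would only surface at the reporting stage, a quick sanity check can be useful. The following Python sketch assumes the two-column, whitespace-separated layout shown in the example above and the ten channel names that appear there. The function names are hypothetical and this is not Philosopher's own parser.

```python
from typing import Dict, List

# The ten channel names that appear in the example annotation file above.
TMT10_CHANNELS = ["126", "127N", "127C", "128N", "128C",
                  "129N", "129C", "130N", "130C", "131N"]


def parse_annotation(text: str) -> Dict[str, str]:
    """Parse 'channel label' lines into a dict; raise on malformed or repeated lines."""
    mapping: Dict[str, str] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 'channel label', got {line!r}")
        channel, label = fields
        if channel in mapping:
            raise ValueError(f"line {line_number}: channel {channel} listed twice")
        mapping[channel] = label
    return mapping


def missing_channels(mapping: Dict[str, str]) -> List[str]:
    """Return the expected channels that have no label assigned."""
    return [channel for channel in TMT10_CHANNELS if channel not in mapping]
```

For the data set #01 example, parse_annotation yields ten entries and missing_channels returns an empty list.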

Set up the MSFragger parameter file

We will use the parameter file displayed below for our analysis. More details about each parameter can be found in the MSFragger wiki.

num_threads = 0                             # Number of CPU threads to use. 
database_name = /workspace/6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma/database/2020-02-18-decoys-reviewed-contam-UP000005640.fas                # Path to the protein database file in FASTA format.

precursor_mass_lower = -20                  # Lower bound of the precursor mass window.
precursor_mass_upper = 20                   # Upper bound of the precursor mass window.
precursor_mass_units = 1                    # Precursor mass tolerance units (0 for Da, 1 for ppm).
precursor_true_tolerance = 20               # True precursor mass tolerance (window is +/- this value).
precursor_true_units = 1                    # True precursor mass tolerance units (0 for Da, 1 for ppm).
fragment_mass_tolerance = 20                # Fragment mass tolerance (window is +/- this value).
fragment_mass_units = 1                     # Fragment mass tolerance units (0 for Da, 1 for ppm).
calibrate_mass = 0                          # Perform mass calibration (0 for OFF, 1 for ON, 2 for ON and find optimal parameters).
write_calibrated_mgf = 0                    # Write calibrated MS2 scan to a MGF file (0 for No, 1 for Yes).
decoy_prefix = rev_                         # Prefix added to the decoy protein ID.

isotope_error = 0/1/2                       # Also search for MS/MS events triggered on specified isotopic peaks.
mass_offsets = 0                            # Creates multiple precursor tolerance windows with specified mass offsets.
precursor_mass_mode = selected              # One of isolated/selected/recalculated.

localize_delta_mass = 0                     # Include fragment ions mass-shifted by unknown modifications (recommended for open
                                            # and mass offset searches) (0 for OFF, 1 for ON).
delta_mass_exclude_ranges = (-1.5,3.5)      # Exclude mass range for shifted ions searching.
fragment_ion_series = b,y                   # Ion series used in search, specify any of a,b,c,x,y,z (comma separated).

search_enzyme_name = Trypsin                # Name of enzyme to be written to the pepXML file.
search_enzyme_cutafter = KR                 # Residues after which the enzyme cuts.
search_enzyme_butnotafter = P               # Residues that the enzyme will not cut before.

num_enzyme_termini = 2                      # 0 for non-enzymatic, 1 for semi-enzymatic, and 2 for fully-enzymatic.
allowed_missed_cleavage = 1                 # Allowed number of missed cleavages per peptide. Maximum value is 5.

clip_nTerm_M = 1                            # Specifies the trimming of a protein N-terminal methionine as a variable modification (0 or 1).

# maximum of 16 mods - amino acid codes, * for any amino acid,
# [ and ] specifies protein termini, n and c specifies
# peptide termini
variable_mod_01 = 15.9949 M 3
variable_mod_02 = 42.0106 [^ 1
variable_mod_03 = 229.162932 n^ 1
variable_mod_04 = 229.162932 S 1

allow_multiple_variable_mods_on_residue = 0 # Allow each residue to be modified by multiple variable modifications (0 or 1).
max_variable_mods_per_peptide = 3           # Maximum total number of variable modifications per peptide.
max_variable_mods_combinations = 50000      # Maximum number of modified forms allowed for each peptide (up to 65534).

output_file_extension = pepXML              # File extension of output files.
output_format = pepXML                      # File format of output files (pepXML or tsv).
output_report_topN = 1                      # Reports top N PSMs per input spectrum.
output_max_expect = 50                      # Suppresses reporting of PSM if top hit has expectation value greater than this threshold.
report_alternative_proteins = 0             # Report alternative proteins for peptides that are found in multiple proteins (0 for no, 1 for yes).

precursor_charge = 1 4                      # Assumed range of potential precursor charge states. Only relevant when override_charge is set to 1.
override_charge = 0                         # Ignores precursor charge and uses charge state specified in precursor_charge range (0 or 1).

digest_min_length = 7                       # Minimum length of peptides to be generated during in-silico digestion.
digest_max_length = 50                      # Maximum length of peptides to be generated during in-silico digestion.
digest_mass_range = 500.0 5000.0            # Mass range of peptides to be generated during in-silico digestion in Daltons.
max_fragment_charge = 2                     # Maximum charge state for theoretical fragments to match (1-4).
# excluded_scan_list_file =                 # Text file containing a list of scan names to be ignored in the search.

track_zero_topN = 0                         # Track top N unmodified peptide results separately from main results internally for boosting features. Should be
                                            # set to a number greater than output_report_topN if zero bin boosting is desired.
zero_bin_accept_expect = 0.00               # Ranks a zero-bin hit above all non-zero-bin hit if it has expectation less than this value.
zero_bin_mult_expect = 1.00                 # Multiplies expect value of PSMs in the zero-bin during  results ordering (set to less than 1 for boosting).
add_topN_complementary = 0                  # Inserts complementary ions corresponding to the top N most intense fragments in each experimental spectra.

minimum_peaks = 15                          # Minimum number of peaks in experimental spectrum for matching.
use_topN_peaks = 150                        # Pre-process experimental spectrum to only use top N peaks.
deisotope = 1                               # Perform deisotoping or not (0=no, 1=yes and assume singleton peaks single charged, 2=yes and assume singleton
                                            # peaks single or double charged).
min_fragments_modelling = 3                 # Minimum number of matched peaks in PSM for inclusion in statistical modeling.
min_matched_fragments = 4                   # Minimum number of matched peaks for PSM to be reported.
minimum_ratio = 0.01                        # Filters out all peaks in experimental spectrum less intense than this multiple of the base peak intensity.
clear_mz_range = 125.5 131.5                # Removes peaks in this m/z range prior to matching.
remove_precursor_peak = 0                   # Remove precursor peaks from tandem mass spectra. 0 = not remove; 1 = remove the peak with precursor charge;
                                            # 2 = remove the peaks with all charge states.
remove_precursor_range = -1.5,1.5           # m/z range in removing precursor peaks. Unit: Da.
intensity_transform = 0                     # Transform peaks intensities with sqrt root. 0 = not transform; 1 = transform using sqrt root.

# Fixed modifications
add_Cterm_peptide = 0.000000
add_Nterm_peptide = 0.000000
add_Cterm_protein = 0.000000
add_Nterm_protein = 0.000000
add_G_glycine = 0.000000
add_A_alanine = 0.000000
add_S_serine = 0.000000
add_P_proline = 0.000000
add_V_valine = 0.000000
add_T_threonine = 0.000000
add_C_cysteine = 57.021464
add_L_leucine = 0.000000
add_I_isoleucine = 0.000000
add_N_asparagine = 0.000000
add_D_aspartic_acid = 0.000000
add_Q_glutamine = 0.000000
add_K_lysine = 229.162932
add_E_glutamic_acid = 0.000000
add_M_methionine = 0.000000
add_H_histidine = 0.000000
add_F_phenylalanine = 0.000000
add_R_arginine = 0.000000
add_Y_tyrosine = 0.000000
add_W_tryptophan = 0.000000
add_B_user_amino_acid = 0.000000
add_J_user_amino_acid = 0.000000
add_O_user_amino_acid = 0.000000
add_U_user_amino_acid = 0.000000
add_X_user_amino_acid = 0.000000
add_Z_user_amino_acid = 0.000000
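
A parameter file of this length is easy to mistype, so it can be worth reading it back and reviewing the settings that matter most for a TMT search. The sketch below is a simple illustration, not MSFragger's own parser: it assumes the "key = value # comment" layout shown above, and the function names are hypothetical. It collects the fixed modifications with a non-zero mass.

```python
import re
from typing import Dict

# Keys such as add_C_cysteine or add_Nterm_peptide; add_topN_complementary is not a modification.
FIXED_MOD_KEY = re.compile(r"^add_([A-Z]_\w+|[CN]term_(peptide|protein))$")


def parse_fragger_params(text: str) -> Dict[str, str]:
    """Read 'key = value  # comment' lines into a dict of strings."""
    params: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing or whole-line comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def nonzero_fixed_mods(params: Dict[str, str]) -> Dict[str, float]:
    """Return the fixed modifications whose mass shift is not zero."""
    mods: Dict[str, float] = {}
    for key, value in params.items():
        if FIXED_MOD_KEY.match(key) and float(value) != 0.0:
            mods[key] = float(value)
    return mods
```

Applied to the file above, nonzero_fixed_mods would report add_C_cysteine (57.021464) and add_K_lysine (229.162932), the carbamidomethyl and TMT lysine masses set in the listing.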

Set up the Philosopher pipeline configuration file

We will run Philosopher in its automated pipeline mode, which runs all the necessary steps for us. Because we have multiple folders, running every step manually would be tedious. First we need to set up the philosopher.yaml configuration file. The file is divided into two parts: the upper part lists all the commands the program is able to automate, and the sections that follow hold the parameters for each individual command. We will set each desired command to yes in the upper part and then configure the individual steps. The example below is what we will use for the analysis. The documentation page explains the meaning of each parameter and how to adjust it for your analysis.

# Philosopher pipeline configuration file.
#
# The pipeline mode automates the processing done by Philosopher and other tools. First, check
# the steps you want to execute in the commands section and change them to
# 'yes'. For each selected command, go to its section and adjust the parameters
# accordingly to your analysis.
#
# If you want to include MSFragger and TMT-Integrator into your analysis, you will
# have to download them separately, and then add their location in their 
# configuration section.
#
# Usage:
# philosopher pipeline --config <this_configuration_file> [list_of_data_set_folders]

analytics: false                               # reports when a workspace is created for usage estimation (default true)
slackToken:                                    # specify the Slack API token
slackChannel:                                  # specify the channel name

commands:
  workspace: yes                               # manage the experiment workspace for the analysis
  database: yes                                # target-decoy database formatting
  comet: no                                    # peptide spectrum matching with Comet
  msfragger: yes                               # peptide spectrum matching with MSFragger
  peptideprophet: yes                          # peptide assignment validation
  ptmprophet: no                               # PTM site localization
  proteinprophet: no                           # protein identification validation
  filter: yes                                  # statistical filtering, validation and False Discovery Rates assessment
  freequant: yes                               # label-free Quantification
  labelquant: yes                              # isobaric Labeling-Based Relative Quantification
  bioquant: no                                 # protein report based on Uniprot protein clusters
  report: yes                                  # multi-level reporting for both narrow-searches and open-searches
  abacus: yes                                  # combined analysis of LC-MS/MS results
  tmtintegrator: no                            # integrates channel abundances from multiple TMT samples

database:
  protein_database: /workspace/6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma/database/2020-02-18-decoys-reviewed-contam-UP000005640.fas                           # path to the target-decoy protein database
  decoy_tag: rev_                              # prefix tag added to decoy sequences

comet:
  noindex: true                                # skip raw file indexing
  param:                                       # comet parameter file (default "comet.params.txt")
  raw: mzML                                    # format of the spectra file

msfragger:                                     # v2.3
  path: /workspace/6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma/bin/MSFragger-2.3.jar      # path to MSFragger jar
  memory: 60                                   # how much memory in GB to use
  param: /workspace/6_CPTAC3_Clear_Cell_Renal_Cell_Carcinoma/params/fragger.params     # MSFragger parameter file
  raw: mzML                                    # spectra format
  num_threads: 0                               # 0=poll CPU to set num threads; else specify num threads directly (max 64)
  precursor_mass_lower: -50                    # lower bound of the precursor mass window
  precursor_mass_upper: 50                     # upper bound of the precursor mass window
  precursor_mass_units: 1                      # 0=Daltons, 1=ppm
  precursor_true_tolerance: 20                 # true precursor mass tolerance (window is +/- this value)
  precursor_true_units: 1                      # 0=Daltons, 1=ppm
  fragment_mass_tolerance: 20                  # fragment mass tolerance (window is +/- this value)
  fragment_mass_units: 1                       # fragment mass tolerance units (0 for Da, 1 for ppm)
  calibrate_mass: 2                            # 0=Off, 1=On, 2=On and find optimal parameters
  deisotope: 1                                 # activates deisotoping.
  isotope_error: 0/1/2                         # 0=off, 0/1/2 (standard C13 error)
  mass_offsets: 0                              # allow for additional precursor mass window shifts. Multiplexed with isotope_error. mass_offsets = 0/79.966 can be used as a restricted ‘open’ search that looks for unmodified and phosphorylated peptides (on any residue)
  precursor_mass_mode: selected                # selected or isolated
  localize_delta_mass: 0                       # this allows shifted fragment ions - fragment ions with mass increased by the calculated mass difference, to be included in scoring
  delta_mass_exclude_ranges: (-1.5,3.5)        # exclude mass range for shifted ions searching
  fragment_ion_series: b,y                     # ion series used in search
  search_enzyme_name: Trypsin                  # name of enzyme to be written to the pepXML file
  search_enzyme_cutafter: KR                   # residues after which the enzyme cuts
  search_enzyme_butnotafter: P                 # residues that the enzyme will not cut before
  num_enzyme_termini: 2                        # 2 for enzymatic, 1 for semi-enzymatic, 0 for nonspecific digestion
  allowed_missed_cleavage: 1                   # maximum value is 5
  clip_nTerm_M: 1                              # specifies the trimming of a protein N-terminal methionine as a variable modification (0 or 1)
  variable_mod_01: 15.99490 M 3                # variable modification
  variable_mod_02: 42.01060 [^ 1               # variable modification
  variable_mod_03:                             # variable modification
  variable_mod_04:                             # variable modification
  variable_mod_05:                             # variable modification
  variable_mod_06:                             # variable modification
  variable_mod_07:                             # variable modification
  allow_multiple_variable_mods_on_residue: 0   # static mods are not considered
  max_variable_mods_per_peptide: 3             # maximum of 5
  max_variable_mods_combinations: 5000         # maximum of 65534, limits number of modified peptides generated from sequence
  output_file_extension: pepXML                # file extension of output files
  output_format: pepXML                        # file format of output files (pepXML or tsv)
  output_report_topN: 1                        # reports top N PSMs per input spectrum
  output_max_expect: 50                        # suppresses reporting of PSM if top hit has expectation greater than this threshold
  report_alternative_proteins: 0               # 0=no, 1=yes
  precursor_charge: 1 4                        # assume range of potential precursor charge states. Only relevant when override_charge is set to 1
  override_charge: 0                           # 0=no, 1=yes to override existing precursor charge states with precursor_charge parameter
  digest_min_length: 7                         # minimum length of peptides to be generated during in-silico digestion
  digest_max_length: 50                        # maximum length of peptides to be generated during in-silico digestion
  digest_mass_range: 500.0 5000.0              # mass range of peptides to be generated during in-silico digestion in Daltons
  max_fragment_charge: 2                       # maximum charge state for theoretical fragments to match (1-4)
  track_zero_topN: 0                           # in addition to topN results, keep track of top results in zero bin
  zero_bin_accept_expect: 0                    # boost top zero bin entry to top if it has expect under 0.01 - set to 0 to disable
  zero_bin_mult_expect: 1                      # disabled if above passes - multiply expect of zero bin for ordering purposes (does not affect reported expect)
  add_topN_complementary: 0                    # inserts complementary ions corresponding to the top N most intense fragments in each experimental spectra
  minimum_peaks: 15                            # required minimum number of peaks in spectrum to search (default 10)
  use_topN_peaks: 100                          # pre-process experimental spectrum to only use top N peaks
  min_fragments_modelling: 2                   # minimum number of matched peaks in PSM for inclusion in statistical modeling
  min_matched_fragments: 4                     # minimum number of matched peaks for PSM to be reported
  minimum_ratio: 0.01                          # filters out all peaks in experimental spectrum less intense than this multiple of the base peak intensity
  clear_mz_range: 0.0 0.0                      # for iTRAQ/TMT type data; will clear out all peaks in the specified m/z range
  remove_precursor_peak: 0                     # remove precursor peaks from tandem mass spectra. 0=not remove; 1=remove the peak with precursor charge; 2=remove the peaks with all charge states.
  remove_precursor_range: -1.5,1.5             # m/z range in removing precursor peaks. Unit: Da.
  intensity_transform: 0                       # transform peaks intensities with sqrt root. 0=not transform; 1=transform using sqrt root.
  add_Cterm_peptide: 0.000000                  # c-term peptide fixed modifications
  add_Cterm_protein: 0.000000                  # c-term protein fixed modifications
  add_Nterm_peptide: 0.000000                  # n-term peptide fixed modifications
  add_Nterm_protein: 0.000000                  # n-term protein fixed modifications
  add_A_alanine: 0.000000                      # alanine fixed modifications 
  add_C_cysteine: 57.021464                    # cysteine fixed modifications 
  add_D_aspartic_acid: 0.000000                # aspartic acid fixed modifications
  add_E_glutamic_acid: 0.000000                # glutamic acid fixed modifications
  add_F_phenylalanine: 0.000000                # phenylalanine fixed modifications
  add_G_glycine: 0.000000                      # glycine fixed modifications
  add_H_histidine: 0.000000                    # histidine fixed modifications
  add_I_isoleucine: 0.000000                   # isoleucine fixed modifications
  add_K_lysine: 0.000000                       # lysine fixed modifications
  add_L_leucine: 0.000000                      # leucine fixed modifications
  add_M_methionine: 0.000000                   # methionine fixed modifications
  add_N_asparagine: 0.000000                   # asparagine fixed modifications
  add_P_proline: 0.000000                      # proline fixed modifications
  add_Q_glutamine: 0.000000                    # glutamine fixed modifications
  add_R_arginine: 0.000000                     # arginine fixed modifications
  add_S_serine: 0.000000                       # serine fixed modifications
  add_T_threonine: 0.000000                    # threonine fixed modifications
  add_V_valine: 0.000000                       # valine fixed modifications
  add_W_tryptophan: 0.000000                   # tryptophan fixed modifications
  add_Y_tyrosine: 0.000000                     # tyrosine fixed modifications
  
peptideprophet:                                # v5.2
  extension: pepXML                            # pepXML file extension
  clevel: 0                                    # set Conservative Level in neg_stdev from the neg_mean, low numbers are less conservative, high numbers are more conservative
  accmass: true                                # use Accurate Mass model binning
  decoyprobs: true                             # compute possible non-zero probabilities for Decoy entries on the last iteration
  enzyme: trypsin                              # enzyme used in sample (optional)
  exclude: false                               # exclude deltaCn*, Mascot*, and Comet* results from results (default Penalize * results)
  expectscore: true                            # use expectation value as the only contributor to the f-value for modeling
  forcedistr: false                            # bypass quality control checks, report model despite bad modeling
  glyc: false                                  # enable peptide Glyco motif model
  icat: false                                  # apply ICAT model (default Autodetect ICAT)
  instrwarn: false                             # warn and continue if combined data was generated by different instrument models
  leave: false                                 # leave alone deltaCn*, Mascot*, and Comet* results from results (default Penalize * results)
  maldi: false                                 # enable MALDI mode
  masswidth: 5                                 # model mass width (default 5)
  minpeplen: 7                                 # minimum peptide length not rejected (default 7)
  minpintt: 2                                  # minimum number of NTT in a peptide used for positive pI model (default 2)
  minpiprob: 0.9                               # minimum probability after first pass of a peptide used for positive pI model (default 0.9)
  minprob: 0.05                                # report results with minimum probability (default 0.05)
  minrtntt: 2                                  # minimum number of NTT in a peptide used for positive RT model (default 2)
  minrtprob: 0.9                               # minimum probability after first pass of a peptide used for positive RT model (default 0.9)
  neggamma: false                              # use Gamma distribution to model the negative hits
  noicat: false                                # do not apply ICAT model (default Autodetect ICAT)
  nomass: false                                # disable mass model
  nonmc: false                                 # disable NMC missed cleavage model
  nonparam: true                               # use semi-parametric modeling, must be used in conjunction with --decoy option
  nontt: false                                 # disable NTT enzymatic termini model
  optimizefval: false                          # (SpectraST only) optimize f-value function f(dot,delta) using PCA
  phospho: false                               # enable peptide Phospho motif model
  pi: false                                    # enable peptide pI model
  ppm: true                                    # use PPM mass error instead of Dalton for mass modeling
  zero: false                                  # report results with minimum probability 0

ptmprophet:                                    # v5.2
  autodirect: false                            # use direct evidence when the lability is high, use in combination with LABILITY
  cions:                                       # use specified C-term ions, separate multiple ions by commas (default: y for CID, z for ETD)
  direct: false                                # use only direct evidence for evaluating PTM site probabilities
  em: 2                                        # set EM models to 0 (no EM), 1 (Intensity EM Model Applied) or 2 (Intensity and Matched Peaks EM Models Applied)
  static: false                                # use static fragppmtol for all PSMs instead of dynamically estimates offsets and tolerances
  fragppmtol: 15                               # when computing PSM-specific mass_offset and mass_tolerance, use specified default +/- MS2 mz tolerance on fragment ions
  ifrags: false                                # use internal fragments for localization
  keepold: false                               # retain old PTMProphet results in the pepXML file
  lability: false                              # compute Lability of PTMs
  massdiffmode: false                          # use the Mass Difference and localize
  massoffset: 0                                # adjust the massdiff by offset (0 = use default)
  maxfragz: 0                                  # limit maximum fragment charge (default: 0=precursor charge, negative values subtract from precursor charge)
  maxthreads: 4                                # use specified number of threads for processing
  mino: 0                                      # use specified number of pseudo-counts when computing Oscore (0 = use default)
  minprob: 0                                   # use specified minimum probability to evaluate peptides
  mods:                                        # specify modifications
  nions:                                       # use specified N-term ions, separate multiple ions by commas (default: a,b for CID, c for ETD)
  nominofactor: false                          # disable MINO factor correction when MINO= is set greater than 0 (default: apply MINO factor correction)
  ppmtol: 1                                    # use specified +/- MS1 ppm tolerance on peptides which may have a slight offset depending on search parameters
  verbose: false                               # produce Warnings to help troubleshoot potential PTM shuffling or mass difference issues

proteinprophet:                                # v5.2
  accuracy: false                              # equivalent to --minprob 0
  allpeps: false                               # consider all possible peptides in the database in the confidence model
  confem: false                                # use the EM to compute probability given the confidence
  delude: false                                # do NOT use peptide degeneracy information when assessing proteins
  excludezeros: false                          # exclude zero prob entries
  fpkm: false                                  # model protein FPKM values
  glyc: false                                  # highlight peptide N-glycosylation motif
  icat: false                                  # highlight peptide cysteines
  instances: false                             # use Expected Number of Ion Instances to adjust the peptide probabilities prior to NSP adjustment
  iprophet: false                              # input is from iProphet
  logprobs: false                              # use the log of the probabilities in the Confidence calculations
  maxppmdiff: 1000000                          # maximum peptide mass difference in PPM (default 20)
  minprob: 0.05                                # peptideProphet probability threshold (default 0.05)
  mufactor: 1                                  # fudge factor to scale MU calculation (default 1)
  nogroupwts: false                            # check peptide's Protein weight against the threshold (default: check peptide's Protein Group weight against threshold)
  nonsp: false                                 # do not use NSP model
  nooccam: false                               # non-conservative maximum protein list
  noprotlen: false                             # do not report protein length
  normprotlen: false                           # normalize NSP using Protein Length
  protmw: false                                # get protein mol weights
  softoccam: false                             # peptide weights are apportioned equally among proteins within each Protein Group (less conservative protein count estimate)
  unmapped: false                              # report results for UNMAPPED proteins

filter:
  psmFDR: 0.01                                 # psm FDR level (default 0.01)
  peptideFDR: 0.01                             # peptide FDR level (default 0.01)
  ionFDR: 0.01                                 # peptide ion FDR level (default 0.01)
  proteinFDR: 0.01                             # protein FDR level (default 0.01)
  peptideProbability: 0.7                      # top peptide probability threshold for the FDR filtering (default 0.7)
  proteinProbability: 0.5                      # protein probability threshold for the FDR filtering (not used with the razor algorithm) (default 0.5)
  peptideWeight: 0.9                           # threshold for defining peptide uniqueness (default 1)
  razor: true                                  # use razor peptides for protein FDR scoring
  picked: true                                 # apply the picked FDR algorithm before the protein scoring
  mapMods: false                               # map modifications acquired by an open search
  models: true                                 # print model distribution
  sequential: true                             # alternative algorithm that estimates FDR using both filtered PSM and Protein lists

freequant:
  peakTimeWindow: 0.4                          # specify the time windows for the peak (minute) (default 0.4)
  retentionTimeWindow: 3                       # specify the retention time window for xic (minute) (default 3)
  tolerance: 10                                # m/z tolerance in ppm (default 10)
  isolated: false                              # use the isolated ion instead of the selected ion for quantification

labelquant:
  annotation: annotation.txt                   # annotation file with custom names for the TMT channels
  bestPSM: false                               # select the best PSMs for protein quantification
  level: 2                                     # ms level for the quantification
  minProb: 0.7                                 # only use PSMs with a minimum probability score
  brand:                                       # isobaric labeling brand (tmt, itraq)
  plex: 10                                     # number of channels
  purity: 0.5                                  # ion purity threshold (default 0.5)
  removeLow: 0.05                              # ignore the lowest 5% of PSMs based on their summed abundances
  tolerance: 20                                # m/z tolerance in ppm (default 20)
  uniqueOnly: false                            # report quantification based on only unique peptides

report:
  msstats: false                               # create an output compatible to MSstats
  withDecoys: false                            # add decoy observations to reports
  mzID: false                                  # create a mzID output

bioquant:
  organismUniProtID:                           # UniProt proteome ID
  level: 0.9                                   # cluster identity level (default 0.9)
                  
abacus:
  protein: true                                # global level protein report
  peptide: true                                # global level peptide report
  proteinProbability: 0.9                      # minimum protein probability (default 0.9)
  peptideProbability: 0.5                      # minimum peptide probability (default 0.5)
  uniqueOnly: false                            # report TMT quantification based on only unique peptides
  reprint: false                               # create abacus reports using the Reprint format

tmtintegrator:                                 # v1.1.2
  path:                                        # path to TMT-Integrator jar
  memory: 6                                    # memory allocation, in Gb
  output:                                      # the location of output files
  channel_num: 10                              # number of channels in the multiplex (e.g. 10, 11)
  ref_tag: Bridge                              # unique tag for identifying the reference channel (Bridge sample added to each multiplex)
  groupby: -1                                  # level of data summarization (0: PSM aggregation to the gene level; 1: protein; 2: peptide sequence; 3: PTM site; -1: generate reports at all levels)
  psm_norm: false                              # perform additional retention time-based normalization at the PSM level
  outlier_removal: true                        # perform outlier removal
  prot_norm: -1                                # normalization (0: None; 1: MD (median centering); 2: GN (median centering + variance scaling); -1: generate reports with all normalization options)
  min_pep_prob: 0.9                            # minimum PSM probability threshold (in addition to FDR-based filtering by Philosopher)
  min_purity: 0.5                              # ion purity score threshold
  min_percent: 0.05                            # remove low intensity PSMs (e.g. value of 0.05 indicates removal of PSMs with the summed TMT reporter ions intensity in the lowest 5% of all PSMs)
  unique_pep: false                            # allow PSMs with unique peptides only (if true) or unique plus razor peptides (if false), as classified by Philosopher and defined in PSM.tsv files
  unique_gene: 0                               # additional, gene-level uniqueness filter (0: allow all PSMs; 1: remove PSMs mapping to more than one GENE with evidence of expression in the dataset; 2: remove all PSMs mapping to more than one GENE in the fasta file)
  best_psm: true                               # keep the best PSM only (highest summed TMT intensity) among all redundant PSMs within the same LC-MS run
  prot_exclude: none                           # exclude proteins with specified tags at the beginning of the accession number (e.g. none: no exclusion; sp|,tr| : exclude protein with sp| or tr|)
  allow_overlabel: true                        # allow PSMs with TMT on S (when overlabeling on S was allowed in the database search)
  allow_unlabeled: true                        # allow PSMs without TMT tag or acetylation on the peptide N-terminus
  mod_tag: none                                # PTM info for generation of PTM-specific reports (none: for Global data; S[167],T[181],Y[243]: for Phospho; K[170]: for K-Acetyl)
  min_site_prob: -1                            # site localization confidence threshold (-1: for Global; 0: as determined by the search engine; above 0 (e.g. 0.75): PTMProphet probability, to be used with phosphorylation only)
  ms1_int: true                                # use MS1 precursor ion intensity (if true) or MS2 summed TMT reporter ion intensity (if false) as part of the reference sample abundance estimation
  top3_pep: true                               # use top 3 most intense peptide ions as part of the reference sample abundance estimation
  print_RefInt: false                          # print individual reference sample abundance estimates for each multiplex in the final reports (in addition to the combined reference sample abundance estimate)
  add_Ref: -1                                  # add an artificial reference channel if there is no reference channel (-1: don't add the reference; 0: use summation as the reference; 1: use average as the

Running the pipeline

To start the pipeline, run Philosopher with the pipeline command, passing the configuration file and the folder of every data set that should be processed together.

$ bin/philosopher pipeline --config params/philosopher.yaml 01CPTAC_CCRCC_W_JHU_20171007 02CPTAC_CCRCC_W_JHU_20171003 03CPTAC_CCRCC_W_JHU_20171022 04CPTAC_CCRCC_W_JHU_20171026 05CPTAC_CCRCC_W_JHU_20171030 06CPTAC_CCRCC_W_JHU_20171120 07CPTAC_CCRCC_W_JHU_20171127 08CPTAC_CCRCC_W_JHU_20171205 09CPTAC_CCRCC_W_JHU_20171215 10CPTAC_CCRCC_W_JHU_20180119 11CPTAC_CCRCC_W_JHU_20180126 12CPTAC_CCRCC_W_JHU_20180202 13CPTAC_CCRCC_W_JHU_20180215 14CPTAC_CCRCC_W_JHU_20180223 15CPTAC_CCRCC_W_JHU_20180315 16CPTAC_CCRCC_W_JHU_20180322 17CPTAC_CCRCC_W_JHU_20180517 18CPTAC_CCRCC_W_JHU_20180521 19CPTAC_CCRCC_W_JHU_20180526 20CPTAC_CCRCC_W_JHU_20180602 21CPTAC_CCRCC_W_JHU_20180621 22CPTAC_CCRCC_W_JHU_20180625 23CPTAC_CCRCC_W_JHU_20180629

Each step will be executed consecutively, and no other commands or input from the user are necessary.
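
Typing all 23 folder names by hand is error-prone, so the list can be built with a shell glob instead. The sketch below is only an illustration: the helper name list_datasets is hypothetical, and it assumes every data set folder follows the NNCPTAC_CCRCC_W_JHU_* naming used in the command above. The last line prints the command for review rather than running it.

```shell
# Print the name of every data set folder found in a workspace directory.
# Assumption: folders are named like 01CPTAC_CCRCC_W_JHU_20171007.
list_datasets() {
  for d in "${1:-.}"/[0-9][0-9]CPTAC_CCRCC_W_JHU_*/; do
    [ -d "$d" ] || continue   # skip the unexpanded pattern when nothing matches
    basename "$d"             # keep only the folder name, without the path
  done
}

# Dry run: show the command that would be executed from the workspace folder.
# Remove the leading "echo" to actually start the pipeline.
echo bin/philosopher pipeline --config params/philosopher.yaml $(list_datasets .)
```

Because the glob expands in sorted order, the folders are passed in the same 01 to 23 order as in the hand-written command.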

Wrapping up

When the analysis is done, we will have individual results for each multiplexed TMT sample as well as the combined protein expression matrix containing all TMT channels labeled according to the annotation.txt file. You should have new .tsv files in your workspace, which contain the filtered PSM, peptide, ion, and protein identifications.
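
A quick way to confirm that every multiplex finished is to check each data set folder for the four filtered reports. This is a minimal sketch, not part of Philosopher: the function name check_reports is hypothetical, and the file names psm.tsv, peptide.tsv, ion.tsv and protein.tsv are an assumption about how the reports are named in your version.

```shell
# List any filtered report that is missing or empty in a data set folder.
# Assumption: reports are named psm.tsv, peptide.tsv, ion.tsv, protein.tsv.
check_reports() {
  status=0
  for d in "${1:-.}"/[0-9][0-9]CPTAC_CCRCC_W_JHU_*/; do
    [ -d "$d" ] || continue             # nothing matched the pattern
    for f in psm.tsv peptide.tsv ion.tsv protein.tsv; do
      if [ ! -s "$d$f" ]; then          # -s: file exists and is not empty
        echo "missing or empty: $d$f"
        status=1
      fi
    done
  done
  return "$status"
}

# Usage, from the workspace folder:
#   check_reports . || echo "some data sets need attention"
```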