Generate in silico spectral libraries from glycopeptide lists for data-independent acquisition (DIA) analysis.
This tutorial has been tested on a workstation with an Intel Core i9-12900K CPU, 64 GB RAM, and the Microsoft Windows 10 Version 22H2 (OS Build 19045.2604) operating system. An NVIDIA GeForce RTX 3090 GPU is needed, with CUDA Version 11.6.
The following software and packages are required for spectral library prediction:
- Python (version 3.9.16, Anaconda distribution is recommended)
- PyTorch (version 1.12.1)
- DGL (version 1.0.1)
- numpy (version 1.23.5)
- pandas (version 1.5.2)
- scipy (version 1.10.1)
- scikit-learn (version 1.2.1)
- statsmodels (version 0.13.2)
- h5py (version 3.7.0)
- pymzml (version 2.5.2)
- matplotlib (version 3.6.2)
- tensorboard (version 2.10.0)
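Before running the prediction scripts, it can help to confirm that the active environment matches the pinned versions above. The following is a minimal sketch using only the Python standard library. The helper names (`installed_version`, `version_mismatches`) are illustrative and not part of DeepGlyco, and the distribution names `torch` and `dgl` are assumptions about how PyTorch and DGL are registered in the environment.

```python
from importlib import metadata

# Pinned versions copied from the list above (a subset, for brevity).
# Assumption: PyTorch and DGL are registered as "torch" and "dgl".
REQUIRED = {
    "torch": "1.12.1",
    "dgl": "1.0.1",
    "numpy": "1.23.5",
    "pandas": "1.5.2",
    "h5py": "3.7.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(required, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        # Treat local build tags such as "1.12.1+cu116" as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in version_mismatches(REQUIRED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every listed package matches its pinned version. The same check can be repeated in the GproDIA environment with its own version list.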
GproDIA is not needed for spectral library generation itself, but it is used in the complete DIA data analysis workflow. GproDIA requires the following software and packages:
- Python (version 3.5.6, Anaconda distribution is recommended)
- OpenSWATH (version 2.6.0)
- PyProphet (version 2.1.5)
- msproteomicstools (version 0.11.0)
- numpy (version 1.18.5)
- pandas (version 0.25.3)
- scipy (version 1.4.1)
- scikit-learn (version 0.22.2.post1)
Note that GproDIA and DeepGlyco require different versions of Python, so they must be installed in separate virtual environments. Conda is recommended for managing the environments.
Starting materials of this tutorial are available at ProteomeXchange and iProX.
DIA data in mzML format (3 files):
- 20200615_serum_NGlycP_1hDIA_rep*.mzML
DDA database search results for retention time calibration:
- pGlycoDB-GP-FDR-Pro.txt (in serum_DDA1h_singlerun_pGlyco.zip)
- 20200615_serum_NglycP_1hDDA_HCDFT.mgf
A SWATH window file:
- swath_window.tsv
A background glycans file:
- background_glycan.txt
A starting glycopeptide list:
- serum.glycopeptides.csv
Pretrained models for glycopeptide MS/MS and iRT prediction:
- PXD023980_HumanSerum_gpms2_model.pt
- PXD023980_HumanSerum_gprt_model.pt
Create a directory for the project.
Create a subdirectory named `library` in the project directory, and place the starting glycopeptide list and models in the subfolder:
- library\serum.glycopeptides.csv
- library\PXD023980_HumanSerum_gpms2_model.pt
- library\PXD023980_HumanSerum_gprt_model.pt
Open Anaconda PowerShell Prompt (Python 3.9 for DeepGlyco), set the project folder as the working directory, and store the path to the DeepGlyco scripts in a variable.
cd "Path_to_project_data"
$script_path = "Path_to_DeepGlyco\src"
Predict MS/MS spectra for the glycopeptides in the starting glycopeptide list.
python "$script_path\predict_gpms2.py" `
--model "library\PXD023980_HumanSerum_gpms2_model.pt" `
--in "library\serum.glycopeptides.csv" `
--out "library\serum_prediction_finetune.speclib.h5"
The predicted spectra are saved in an HDF5 file (`*.speclib.h5`).
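To see what the prediction step wrote, the HDF5 file can be walked generically. The internal layout of `*.speclib.h5` is not described in this tutorial, so the sketch below assumes nothing about group or dataset names. The helper `list_datasets` is illustrative and relies only on objects that expose `.items()`, with leaves that carry a `.shape` attribute, as h5py groups and datasets do.

```python
def list_datasets(group, prefix=""):
    """Yield (path, shape) for every dataset-like object below `group`.

    Works with any object exposing `.items()`; leaves are recognised by
    having a `.shape` attribute, as h5py datasets do.
    """
    for name, obj in group.items():
        path = f"{prefix}/{name}"
        if hasattr(obj, "shape"):
            # Leaf: report its path and shape.
            yield path, obj.shape
        elif hasattr(obj, "items"):
            # Nested group: descend into it.
            yield from list_datasets(obj, path)
```

With h5py installed, passing an open `h5py.File(path, "r")` object to `list_datasets` should list every dataset in the library together with its shape.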
Predict iRT for the glycopeptides in the starting glycopeptide list.
python "$script_path\predict_gprt.py" `
--no-validate `
--model "library\PXD023980_HumanSerum_gprt_model.pt" `
--in "library\serum.glycopeptides.csv" `
--out "library\serum_prediction_finetune.speclib.h5"
The predicted iRT values are added to the same HDF5 file (`*.speclib.h5`).
Convert the spectral library from HDF5 to GproDIA binary format.
python "$script_path\convert_speclib_to_GproDIA_assays.py" `
--in "library\serum_prediction_finetune.speclib.h5" `
--glycopeptide "library\serum.glycopeptides.csv" `
--out "library\serum_prediction.assay.pickle"
Create a subdirectory named `mzML` in the project directory, and place the mzML files in the subfolder:
- mzML\20200615_serum_NGlycP_1hDIA_rep*.mzML
Create a subdirectory named `DDA` in the project directory, and place the DDA files in the subfolder:
- DDA\pGlycoDB-GP-FDR-Pro.txt
- DDA\20200615_serum_NglycP_1hDDA_HCDFT.mgf
Place the SWATH window file and background glycan file in the project directory:
- swath_window.tsv
- background_glycan.txt
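Because the later commands use relative paths, a missing or misplaced input only surfaces when a script fails. The following sketch checks the layout described so far. The file names are taken from the steps above, while the helper `missing_inputs` and its constants are illustrative and not part of GproDIA or DeepGlyco.

```python
from pathlib import Path

# Relative paths named in the steps above.
EXPECTED_FILES = [
    "library/serum.glycopeptides.csv",
    "library/PXD023980_HumanSerum_gpms2_model.pt",
    "library/PXD023980_HumanSerum_gprt_model.pt",
    "DDA/pGlycoDB-GP-FDR-Pro.txt",
    "DDA/20200615_serum_NglycP_1hDDA_HCDFT.mgf",
    "swath_window.tsv",
    "background_glycan.txt",
]
# The DIA runs are matched by pattern, since there are several replicates.
MZML_PATTERN = "mzML/20200615_serum_NGlycP_1hDIA_rep*.mzML"


def missing_inputs(project_dir, expected=EXPECTED_FILES, pattern=MZML_PATTERN):
    """Return the expected inputs that are absent from the project folder."""
    root = Path(project_dir)
    missing = [rel for rel in expected if not (root / rel).is_file()]
    # At least one mzML file must match the replicate pattern.
    if not any(root.glob(pattern)):
        missing.append(pattern)
    return missing


if __name__ == "__main__":
    for item in missing_inputs("."):
        print("missing:", item)
```

Run it from the project folder; no output means every starting file is in place.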
Open Anaconda PowerShell Prompt (Python 3.5 for GproDIA), set the project folder as the working directory, and store the path to the GproDIA scripts in a variable.
cd "Path_to_project_data"
$script_path = "Path_to_GproDIA\src"
Extract retention time anchors from the DDA results and combine replicate spectra into consensus spectra within each run.
python "$script_path\build_assays_from_pGlyco_mgf.py" `
--clean_glycan_struct `
--clean_glycan_site `
--psm "DDA\pGlycoDB-GP-FDR-Pro.txt" `
--mgf "DDA\20200615_serum_NglycP_1hDDA_HCDFT.mgf" `
--out "library\serum_1h.assay.pickle"
python "$script_path\remove_redundant_assays.py" `
--within_run `
--action consensus `
--in "library\serum_1h.assay.pickle" `
--out "library\serum_1h_consensus.assay.pickle"
The consensus spectra are saved in Python binary files (`*_consensus.assay.pickle`).
Transform retention times in iRT space to experimental gradient.
python "$script_path\calibrate_assays_rt.py" `
--multiple_runs `
--smooth lowess --lowess_frac 0.25 --lowess_it 3 `
--in "library\serum_prediction.assay.pickle" `
--reference "library\serum_1h_consensus.assay.pickle" `
--out "library\serum_prediction_rtcalibrated.assay.pickle" `
--out_anchor "library\serum_1h_consensus_rtanchor.assay.pickle"
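Conceptually, the calibration step learns a mapping from predicted iRT to observed retention time using glycopeptides found in both the predicted library and the DDA reference, then applies that mapping to every library entry. The sketch below illustrates the idea with piecewise-linear interpolation between anchors. This is a deliberately simplified stand-in for the LOWESS smoothing requested on the command line, not GproDIA's implementation, and the function name `make_rt_mapper` is hypothetical.

```python
from bisect import bisect_right


def make_rt_mapper(anchors):
    """Build a function mapping iRT to run RT from (irt, rt) anchor pairs.

    Piecewise-linear interpolation stands in for LOWESS smoothing here.
    Values outside the anchor range are extrapolated from the nearest
    segment. If an iRT value occurs more than once, the last pair wins.
    """
    points = sorted(dict(anchors).items())
    if len(points) < 2:
        raise ValueError("at least two distinct iRT anchors are required")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def to_rt(irt):
        # Index of the segment's right end, clamped to a valid segment.
        i = min(max(bisect_right(xs, irt), 1), len(xs) - 1)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (irt - x0) / (x1 - x0)

    return to_rt


if __name__ == "__main__":
    # Made-up anchors: (predicted iRT, observed RT in seconds).
    to_rt = make_rt_mapper([(0.0, 100.0), (10.0, 200.0), (20.0, 400.0)])
    print(to_rt(5.0), to_rt(15.0))
```

A smoother such as LOWESS is less sensitive to individual outlying anchors than this interpolation, which passes through every anchor exactly.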
Filter library entries using a set of rules.
python "$script_path\filter_assays.py" `
--swath_windows "swath_window.tsv" `
--min_fragment_mz 200 --max_fragment_mz 2000 `
--in "library\serum_prediction_rtcalibrated.assay.pickle"
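The exact rules applied by `filter_assays.py` are not spelled out in this tutorial. As an illustration of what the options above could express, the sketch below keeps fragments inside an m/z range and, as an assumption borrowed from common DIA library practice, drops fragments that fall inside the precursor's own isolation window. The function names, the inclusive range bounds, and the `(start, end)` window representation are all hypothetical.

```python
def find_window(precursor_mz, windows):
    """Return the (start, end) isolation window containing the precursor."""
    for start, end in windows:
        if start <= precursor_mz < end:
            return (start, end)
    return None


def filter_fragments(precursor_mz, fragment_mzs, windows,
                     min_mz=200.0, max_mz=2000.0):
    """Keep fragment m/z values that pass both illustrative rules."""
    window = find_window(precursor_mz, windows)
    kept = []
    for mz in fragment_mzs:
        if not (min_mz <= mz <= max_mz):
            continue  # outside the allowed fragment m/z range
        if window is not None and window[0] <= mz < window[1]:
            continue  # inside the precursor isolation window (assumed rule)
        kept.append(mz)
    return kept


if __name__ == "__main__":
    # Made-up windows and fragments, for illustration only.
    windows = [(400.0, 425.0), (425.0, 450.0)]
    print(filter_fragments(410.0, [150.0, 204.09, 415.2, 1200.5, 2100.0], windows))
```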
Generate a spectral library containing identification transitions for glycoform inference.
python "$script_path\generate_glycoform_uis_assays.py" `
--swath_windows "swath_window.tsv" `
--min_fragment_mz 200 --max_fragment_mz 2000 `
--background_glycans "background_glycan.txt" `
--max_background_glycan_number 50 `
--in "library\serum_prediction_rtcalibrated_filtered.assay.pickle"
Generate decoy libraries.
python "$script_path\generate_decoy_assays.py" `
--in "library\serum_prediction_rtcalibrated_filtered.assay.pickle"
Combine the target and decoy spectral libraries and convert them to peptide query parameter (PQP) format.
python "$script_path\convert_assays_to_OpenSWATH_library.py" `
--enable_glycoform_uis `
--in "library\serum_prediction_rtcalibrated_filtered_uis.assay.pickle" `
(ls "library" "serum_prediction_rtcalibrated_filtered*_decoy.assay.pickle" | select -ExpandProperty FullName) `
--out "library\serum_prediction_uis.PQP"
Convert the retention time anchors to TraML format.
python "$script_path\filter_assays.py" `
--swath_windows "swath_window.tsv" `
--min_fragment_mz 200 --max_fragment_mz 2000 `
--in "library\serum_1h_consensus_rtanchor.assay.pickle"
python "$script_path\convert_assays_to_traml.py" `
--in "library\serum_1h_consensus_rtanchor_filtered.assay.pickle" `
--out "library\serum_prediction_rtanchor.traML"
The processed spectral library (including decoys) is saved as a PQP file, and the processed anchors are saved as a TraML file. Both files are the input for OpenSWATH.
Change the working directory to the `mzML` subfolder and run the OpenSWATH analysis workflow.
cd "mzML"
ls *.mzML -Name | % {
OpenSwathWorkflow `
-Library:retentionTimeInterpretation seconds `
-RTNormalization:alignmentMethod lowess `
-RTNormalization:estimateBestPeptides `
-mz_extraction_window_ms1 10 -mz_extraction_window_ms1_unit ppm `
-mz_extraction_window 20 -mz_extraction_window_unit ppm `
-enable_ms1 true -enable_ipf true `
-threads 4 `
-swath_windows_file "..\swath_window.tsv" `
-tr "..\library\serum_prediction_uis.PQP" `
-tr_irt "..\library\serum_prediction_rtanchor.traML" `
-in $_ `
-out_osw $_.Replace('.mzML', '.osw') `
-rt_extraction_window 900
}
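The PowerShell loop above runs the same command once per mzML file and derives each output name by swapping the `.mzML` extension for `.osw`. For readers who prefer to drive the runs from Python, the sketch below builds the equivalent argument list. The options are copied from the command above; the helper `openswath_command` is illustrative and not part of OpenSWATH or GproDIA.

```python
from pathlib import Path


def openswath_command(mzml_name, library, irt_anchors, swath_windows, threads=4):
    """Return the OpenSwathWorkflow argument list for one mzML file.

    Only the input and output file names change from run to run.
    """
    out_osw = str(Path(mzml_name).with_suffix(".osw"))
    return [
        "OpenSwathWorkflow",
        "-Library:retentionTimeInterpretation", "seconds",
        "-RTNormalization:alignmentMethod", "lowess",
        "-RTNormalization:estimateBestPeptides",
        "-mz_extraction_window_ms1", "10",
        "-mz_extraction_window_ms1_unit", "ppm",
        "-mz_extraction_window", "20",
        "-mz_extraction_window_unit", "ppm",
        "-enable_ms1", "true",
        "-enable_ipf", "true",
        "-threads", str(threads),
        "-swath_windows_file", swath_windows,
        "-tr", library,
        "-tr_irt", irt_anchors,
        "-in", mzml_name,
        "-out_osw", out_osw,
        "-rt_extraction_window", "900",
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` from within the `mzML` folder, which stops the batch at the first failing run instead of continuing silently.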
Create a subdirectory named `result` in the project directory, and move the OpenSWATH output files (`*.osw`) into the subfolder. Set the working directory to the `result` subfolder.
mkdir "..\result" -f | Out-Null
mv *.osw "..\result"
cd "..\result"
Conduct semi-supervised learning and error-rate estimation.
pyprophet merge `
--template="..\library\serum_prediction_uis.PQP" `
--out="serum_prediction_uis.osw" `
(ls *.osw -Name)
python "$script_path\score_glycopeptide_peakgroups.py" `
--level ms2 `
--threads 4 `
--in "serum_prediction_uis.osw" `
--test
python "$script_path\score_feature_glycoform.py" `
--level ms1 `
--threads 4 `
--in "serum_prediction_uis.osw" `
--test
python "$script_path\score_feature_glycoform.py" `
--level transition `
--threads 4 `
--in "serum_prediction_uis.osw" `
--test
python "$script_path\infer_glycoforms.py" `
--in "serum_prediction_uis.osw"
python "$script_path\infer_glycopeptides.py" `
--context global `
--in "serum_prediction_uis.osw"
Note that the `--test` option enables test mode with a fixed random seed to get reproducible results. This option should be turned off in practical scenarios.
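The internals of the scoring scripts are not shown in this tutorial. As a generic illustration of why a fixed seed gives reproducible results, the sketch below performs a randomised split of item indices, a typical ingredient of semi-supervised learning with cross-validation. The function `shuffled_split` is hypothetical and unrelated to the actual scoring code.

```python
import random


def shuffled_split(n_items, test_fraction=0.5, seed=None):
    """Randomly split range(n_items) into (train, test) index lists.

    With a fixed seed the split is identical on every call; with
    seed=None it is drawn afresh each time.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(n_items * test_fraction)
    return indices[cut:], indices[:cut]


if __name__ == "__main__":
    # Same seed, same split: downstream scores are then reproducible.
    print(shuffled_split(10, seed=1) == shuffled_split(10, seed=1))
```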
Export the results to text report.
python "$script_path\export_results.py" `
--in "serum_prediction_uis.osw" `
--out "serum_prediction_uis.tsv" `
--format legacy_merged `
--glycoform --max_glycoform_qvalue 0.05 `
--max_rs_peakgroup_qvalue 0.05 `
--max_global_glycopeptide_qvalue 0.01 `
--no-transition_quantification
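A quick sanity check on the exported report is to read its header and count its rows before passing it on. The sketch below assumes only that the report is a tab-separated text file with a single header line. It makes no assumption about column names, and the helper `summarize_tsv` is illustrative.

```python
import csv


def summarize_tsv(path):
    """Return (column_names, number_of_data_rows) for a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file is empty
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        columns, rows = summarize_tsv(sys.argv[1])
        print(f"{rows} rows, {len(columns)} columns")
        print("\n".join(columns))
```

An unexpectedly small row count at this point usually means that one of the q-value thresholds removed most candidates, which is easier to diagnose before alignment than after.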
Perform multi-run alignment. Set `$TRIC_path` as the path to TRIC script (`feature_alignment.py`) and run TRIC.
```powershell
$TRIC_path = "C:\Users\{User Name}\.conda\envs\{Python 3.5 Environment}\Scripts\feature_alignment.py"
python $TRIC_path `
--in "serum_prediction_uis.tsv" `
--out "serum_prediction_uis_aligned.tsv" `
--file_format openswath `
--fdr_cutoff 0.01 `
--max_fdr_quality 0.2 `
--mst:useRTCorrection True `
--mst:Stdev_multiplier 3.0 `
--method LocalMST `
--max_rt_diff 90 `
--alignment_score 0.001 `
--frac_selected 0 `
--realign_method lowess `
--disable_isotopic_grouping
```
The aligned results are saved in tab-separated format (`*_aligned.tsv`).
In the Anaconda PowerShell Prompt (Python 3.9 for DeepGlyco), predict MS/MS spectra and iRT values using an extended starting glycopeptide list.
python "$script_path\predict_gpms2.py" `
--pretrained "library\PXD023980_HumanSerum_gpms2_model.pt" `
--in "library\serum_extend.glycopeptides.csv" `
--out "library\serum_extend_prediction_finetune.speclib.h5"
python "$script_path\predict_gprt.py" `
--no-validate `
--pretrained "library\PXD023980_HumanSerum_gprt_model.pt" `
--in "library\serum_extend.glycopeptides.csv" `
--out "library\serum_extend_prediction_finetune.speclib.h5"
python "$script_path\convert_speclib_to_GproDIA_assays.py" `
--in "library\serum_extend_prediction_finetune.speclib.h5" `
--glycopeptide "library\serum_extend.glycopeptides.csv" `
--out "library\serum_extend_prediction.assay.pickle"
In the Anaconda PowerShell Prompt (Python 3.5 for GproDIA), with the project folder as the working directory, process the predicted spectral library.
python "$script_path\calibrate_assays_rt.py" `
--multiple_runs `
--smooth lowess --lowess_frac 0.25 --lowess_it 3 `
--in "library\serum_extend_prediction.assay.pickle" `
--reference "library\serum_1h_consensus.assay.pickle" `
--out "library\serum_extend_prediction_rtcalibrated.assay.pickle" `
--out_anchor "library\serum_1h_consensus_rtanchor.assay.pickle"
python "$script_path\filter_assays.py" `
--swath_windows "swath_window.tsv" `
--min_fragment_mz 200 --max_fragment_mz 2000 `
--in "library\serum_extend_prediction_rtcalibrated.assay.pickle"
python "$script_path\filter_assays.py" `
--swath_windows "swath_window.tsv" `
--min_fragment_mz 200 --max_fragment_mz 2000 `
--in "library\serum_1h_consensus_rtanchor.assay.pickle"
python "$script_path\convert_assays_to_traml.py" `
--in "library\serum_1h_consensus_rtanchor_filtered.assay.pickle" `
--out "library\serum_extend_prediction_rtanchor.traML"
To save computing resources when using a large spectral library, generate a PQP file for a preliminary search without glycoform inference.
python "$script_path\generate_decoy_assays.py" `
--in "library\serum_extend_prediction_rtcalibrated_filtered.assay.pickle"
python "$script_path\convert_assays_to_OpenSWATH_library.py" `
--disable_glycoform_uis `
--in "library\serum_extend_prediction_rtcalibrated_filtered.assay.pickle" `
(ls "library" "serum_extend_prediction_rtcalibrated_filtered*_decoy.assay.pickle" | select -ExpandProperty FullName) `
--out "library\serum_extend_prediction_prelim.PQP"
Run the OpenSWATH analysis workflow.
cd "mzML"
ls *.mzML -Name | % {
OpenSwathWorkflow `
-Library:retentionTimeInterpretation seconds `
-RTNormalization:alignmentMethod lowess `
-RTNormalization:estimateBestPeptides `
-mz_extraction_window_ms1 10 -mz_extraction_window_ms1_unit ppm `
-mz_extraction_window 20 -mz_extraction_window_unit ppm `
-threads 4 `
-swath_windows_file "..\swath_window.tsv" `
-tr "..\library\serum_extend_prediction_prelim.PQP" `
-tr_irt "..\library\serum_extend_prediction_rtanchor.traML" `
-in $_ `
-out_osw $_.Replace('.mzML', '.osw') `
-rt_extraction_window 900
}
Score and export the preliminary search result.
mkdir "..\prelim" -f | Out-Null
mv *.osw "..\prelim"
cd "..\prelim"
pyprophet merge `
--template="..\library\serum_extend_prediction_prelim.PQP" `
--out="serum_extend_prediction_prelim.osw" `
(ls *.osw -Name)
python "$script_path\score_glycopeptide_peakgroups.py" `
--level ms2 `
--threads 4 `
--in "serum_extend_prediction_prelim.osw" `
--test
python "$script_path\infer_glycopeptides.py" `
--context global `
--in "serum_extend_prediction_prelim.osw"
python "$script_path\export_results.py" `
--in "serum_extend_prediction_prelim.osw" `
--out "serum_extend_prediction_prelim.tsv" `
--format legacy_merged `
--max_rs_peakgroup_qvalue 0.05 `
--max_global_glycopeptide_qvalue 1 `
--no-transition_quantification
Generate a subset spectral library based on the preliminary search result. This library contains identification transitions for glycoform inference.
cd ..
python "$script_path\subset_assays.py" `
--in "library\serum_extend_prediction_rtcalibrated_filtered.assay.pickle" `
--out "library\serum_extend_prediction_second_filtered.assay.pickle" `
--result "prelim\serum_extend_prediction_prelim.tsv"
python "$script_path\generate_glycoform_uis_assays.py" `
--swath_windows "swath_window.tsv" `
--min_fragment_mz 200 --max_fragment_mz 2000 `
--background_glycans "background_glycan.txt" `
--max_background_glycan_number 50 `
--in "library\serum_extend_prediction_second_filtered.assay.pickle"
python "$script_path\generate_decoy_assays.py" `
--in "library\serum_extend_prediction_second_filtered.assay.pickle"
python "$script_path\convert_assays_to_OpenSWATH_library.py" `
--enable_glycoform_uis `
--in "library\serum_extend_prediction_second_filtered_uis.assay.pickle" `
(ls "library" "serum_extend_prediction_second_filtered*_decoy.assay.pickle" | select -ExpandProperty FullName) `
--out "library\serum_extend_prediction_uis.PQP"
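The two-pass strategy above first searches the full extended library without glycoform transitions, then rebuilds a smaller library from the entries that appeared in the preliminary report. How `subset_assays.py` matches library entries to report rows is not described here, so the sketch below only illustrates the general idea. The dictionary fields (`peptide`, `glycan`, `charge`) and the function `subset_library` are hypothetical.

```python
def subset_library(entries, identified_keys, key):
    """Keep library entries whose key appears among the identified keys.

    `key` is a function that extracts a hashable identifier from an entry.
    """
    wanted = set(identified_keys)
    return [entry for entry in entries if key(entry) in wanted]


def glycopeptide_key(entry):
    """Hypothetical identifier: peptide, glycan composition and charge."""
    return (entry["peptide"], entry["glycan"], entry["charge"])


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    library = [
        {"peptide": "PEPTJDE", "glycan": "H5N4", "charge": 3},
        {"peptide": "PEPTJDE", "glycan": "H5N4S2", "charge": 3},
        {"peptide": "ANOTHERJ", "glycan": "H5N2", "charge": 2},
    ]
    hits = [("PEPTJDE", "H5N4", 3)]
    print(subset_library(library, hits, glycopeptide_key))
```

Because identification transitions for glycoform inference are generated only for the retained entries, the second-pass library stays much smaller than one built from the full extended list.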
Run the OpenSWATH analysis workflow.
cd "mzML"
ls *.mzML -Name | % {
OpenSwathWorkflow `
-Library:retentionTimeInterpretation seconds `
-RTNormalization:alignmentMethod lowess `
-RTNormalization:estimateBestPeptides `
-mz_extraction_window_ms1 10 -mz_extraction_window_ms1_unit ppm `
-mz_extraction_window 20 -mz_extraction_window_unit ppm `
-enable_ms1 true -enable_ipf true `
-threads 4 `
-swath_windows_file "..\swath_window.tsv" `
-tr "..\library\serum_extend_prediction_uis.PQP" `
-tr_irt "..\library\serum_extend_prediction_rtanchor.traML" `
-in $_ `
-out_osw $_.Replace('.mzML', '.osw') `
-rt_extraction_window 900
}
Score and export the main search result.
mkdir "..\result" -f | Out-Null
mv *.osw "..\result"
cd "..\result"
pyprophet merge `
--template="..\library\serum_extend_prediction_uis.PQP" `
--out="serum_extend_prediction_uis.osw" `
(ls *.osw -Name)
python "$script_path\score_glycopeptide_peakgroups.py" `
--level ms2 `
--threads 4 `
--in "serum_extend_prediction_uis.osw" `
--test
python "$script_path\score_feature_glycoform.py" `
--level ms1 `
--threads 4 `
--in "serum_extend_prediction_uis.osw" `
--test
python "$script_path\score_feature_glycoform.py" `
--level transition `
--threads 4 `
--in "serum_extend_prediction_uis.osw" `
--test
python "$script_path\infer_glycoforms.py" `
--in "serum_extend_prediction_uis.osw"
python "$script_path\infer_glycopeptides.py" `
--context global `
--in "serum_extend_prediction_uis.osw"
python "$script_path\export_results.py" `
--in "serum_extend_prediction_uis.osw" `
--out "serum_extend_prediction_uis.tsv" `
--format legacy_merged `
--glycoform --max_glycoform_qvalue 0.05 `
--max_rs_peakgroup_qvalue 0.05 `
--max_global_glycopeptide_qvalue 0.01 `
--no-transition_quantification
$TRIC_path = "C:\Users\{User Name}\.conda\envs\{Python 3.5 Environment}\Scripts\feature_alignment.py"
python $TRIC_path `
--in "serum_extend_prediction_uis.tsv" `
--out "serum_extend_prediction_uis_aligned.tsv" `
--file_format openswath `
--fdr_cutoff 0.01 `
--max_fdr_quality 0.2 `
--mst:useRTCorrection True `
--mst:Stdev_multiplier 3.0 `
--method LocalMST `
--max_rt_diff 90 `
--alignment_score 0.001 `
--frac_selected 0 `
--realign_method lowess `
--disable_isotopic_grouping
The aligned results are saved in tab-separated format (`*_aligned.tsv`).