Using deep learning to generate in silico spectral libraries for data-independent acquisition (DIA) analysis.
1.1.0
- Dependency on R removed
- FASTA digestion
- Ion mobility prediction (experimental)
For the version used in the Nat Commun 2020 publication, please refer to commit #674e2fb.
The following software and packages are required:
- Python (version 3.7 or later, Anaconda distribution is recommended)
- TensorFlow (version 2.0 or later)
- Keras (packaged with TensorFlow)
For spectral library generation from FASTA files and data preprocessing for training detectability models, the following package is required:
- Biopython (version 1.70 or later)
DeepDIA requires the following Python packages integrated in Anaconda:
- numpy (version 1.18.5)
- pandas (version 0.25.3)
- scipy (version 1.4.1)
- statsmodels (version 0.13.2)
Later versions may be compatible, but have not been tested.
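As a rough aid for the version list above, the following sketch compares the installed packages with the versions named in this README. It assumes Python 3.8 or later, because it relies on the standard library module importlib.metadata; the function names installed_version and compare_versions are illustrative and not part of DeepDIA.

```python
from importlib import metadata

# Versions listed in this README as tested with DeepDIA.
TESTED_VERSIONS = {
    "numpy": "1.18.5",
    "pandas": "0.25.3",
    "scipy": "1.4.1",
    "statsmodels": "0.13.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(tested, lookup=installed_version):
    """Return {package: (tested_version, installed_version, status)}."""
    report = {}
    for package, tested_version in tested.items():
        found = lookup(package)
        if found is None:
            status = "missing"
        elif found == tested_version:
            status = "tested"
        else:
            # Installed, but not the exact version this README was tested with.
            status = "untested"
        report[package] = (tested_version, found, status)
    return report


if __name__ == "__main__":
    for name, (tested, found, status) in compare_versions(TESTED_VERSIONS).items():
        print(f"{name}: tested {tested}, installed {found} ({status})")
```

A status of "untested" does not mean the version is incompatible, only that it differs from the one listed above.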
For model training, NVIDIA graphics cards with CUDA are recommended.
- CUDA Toolkit (version 11.2)
- cuDNN SDK (version 8.1.0)
Download and install Anaconda.
Check that the installation succeeded by running the following command in the Anaconda Prompt:
pip list
Ensure that the following Python packages are installed: numpy, pandas, scipy, and statsmodels. If not, install the missing packages using the following command (as an example for statsmodels):
pip install statsmodels
Ensure that the NVIDIA GPU driver has been installed, then install CUDA and cuDNN with conda. This step can be skipped if you run TensorFlow on CPU only.
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
Install TensorFlow using pip:
pip install tensorflow
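After installation, a quick check like the one below can confirm that TensorFlow imports and show how many GPUs it detects. This is a minimal sketch that assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available; describe_tensorflow is an illustrative name. Zero visible GPUs is expected for a CPU-only setup.

```python
def describe_tensorflow():
    """Summarize the TensorFlow installation and the GPUs it can see."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed"
    # An empty list means TensorFlow will run on CPU only.
    gpus = tf.config.list_physical_devices("GPU")
    return f"TensorFlow {tf.__version__}: {len(gpus)} GPU(s) visible"


if __name__ == "__main__":
    print(describe_tensorflow())
```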
Install Biopython using pip:
pip install biopython
or conda:
conda install -c conda-forge biopython
A peptide list is stored in a comma-separated values (CSV) file with columns named protein and sequence.
"protein","sequence"
"O43504","HDGITVAVHK"
"P56470","VGSSGDIALHINPR"
"Q9UHL4","LDHFNFER"
"P68371","IREEYPDR"
"P01024","AKDQLTCNK"
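A file in this layout can be produced with the Python standard library. The sketch below writes fully quoted protein and sequence columns and reads them back; write_peptide_list and read_peptide_list are illustrative helper names, and quoting every field is an assumption made to match the example above rather than a stated requirement.

```python
import csv


def write_peptide_list(path, rows):
    """Write (protein, sequence) pairs to a CSV file with every field quoted."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
        writer.writerow(["protein", "sequence"])
        writer.writerows(rows)


def read_peptide_list(path):
    """Read a peptide list back as a list of (protein, sequence) tuples."""
    with open(path, newline="") as handle:
        return [(row["protein"], row["sequence"]) for row in csv.DictReader(handle)]
```

For example, write_peptide_list("my.peptide.csv", [("O43504", "HDGITVAVHK")]) creates a two-line file with the header and one peptide.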
Peptides can be collected from public resources.
From the Pan Human Library (Rosenberger, G. et al. Sci. Data 2014, 1, 140031, doi:10.1038/sdata.2014.31), peptide lists have been collected and provided as an example in the data\peptide folder:
- Pan_human.peptide.csv
- Pan_human_charge2.peptide.csv
- Pan_human_charge3.peptide.csv
DeepDIA only supports peptide sequences with standard amino acids (ACDEFGHIKLMNPQRSTVWY) and length <= 50.
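Peptide lists from other sources can be screened against this restriction before prediction. The following sketch separates supported from unsupported sequences; is_supported and split_supported are illustrative names, and treating empty or lowercase sequences as unsupported is an assumption of this example.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MAX_LENGTH = 50


def is_supported(sequence):
    """True if the sequence uses only standard residues and has 1 to 50 of them."""
    return 0 < len(sequence) <= MAX_LENGTH and set(sequence) <= STANDARD_AMINO_ACIDS


def split_supported(sequences):
    """Split sequences into (supported, rejected) lists, preserving order."""
    supported, rejected = [], []
    for sequence in sequences:
        (supported if is_supported(sequence) else rejected).append(sequence)
    return supported, rejected
```

Sequences containing nonstandard residues such as U or X, or longer than 50 residues, end up in the rejected list.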
Prepare a model for MS/MS prediction.
You can use pre-trained models or train your own models. A model trained with HeLa data on Q Exactive HF (Bruderer, R. et al. Mol. Cell. Proteomics 2017, 16, 2296-2309, doi:10.1074/mcp.RA117.000314) is provided as an example in the data\models folder:
- data\models\charge2\epoch_035.hdf5
- data\models\charge3\epoch_034.hdf5
Run predict_ms2.py to predict MS/MS ion intensities for peptide precursors with charge 2+.
python src\predict_ms2.py `
--in data\peptide\Pan_human_charge2.peptide.csv `
--model data\models\charge2\epoch_035.hdf5 `
--charge 2 `
--out data\Pan_human_charge2.prediction.ions.json
The predicted MS/MS ion intensities are saved in a JSON file (*.prediction.ions.json).
Predict MS/MS for charge 3+ following the same steps.
python src\predict_ms2.py `
--in data\peptide\Pan_human_charge3.peptide.csv `
--model data\models\charge3\epoch_034.hdf5 `
--charge 3 `
--out data\Pan_human_charge3.prediction.ions.json
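The two commands above differ only in the charge state and the model file, so they can also be launched from a small Python driver. This is a sketch that uses only the flags shown above (--in, --model, --charge, --out); ms2_command and run_ms2_predictions are illustrative names, and the script is assumed to be started from the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Example models shipped in the data\models folder, keyed by precursor charge.
MODELS = {
    2: Path("data") / "models" / "charge2" / "epoch_035.hdf5",
    3: Path("data") / "models" / "charge3" / "epoch_034.hdf5",
}


def ms2_command(charge, model, data_dir="data", script=Path("src") / "predict_ms2.py"):
    """Build the predict_ms2.py argument list for one charge state."""
    data_dir = Path(data_dir)
    return [
        sys.executable, str(script),
        "--in", str(data_dir / "peptide" / f"Pan_human_charge{charge}.peptide.csv"),
        "--model", str(model),
        "--charge", str(charge),
        "--out", str(data_dir / f"Pan_human_charge{charge}.prediction.ions.json"),
    ]


def run_ms2_predictions(models=MODELS):
    """Run predict_ms2.py once per charge state, stopping at the first failure."""
    for charge, model in models.items():
        subprocess.run(ms2_command(charge, model), check=True)
```

Calling run_ms2_predictions() runs both predictions in sequence. Building the paths with pathlib keeps the driver independent of the shell's line-continuation and path-separator conventions.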
Prepare a model for iRT prediction.
You can use pre-trained models or train your own models. A pre-trained model is provided as an example in the data\models folder:
- data\models\irt\epoch_082.hdf5
Run predict_rt.py.
python src\predict_rt.py `
--in data\peptide\Pan_human.peptide.csv `
--model data\models\irt\epoch_082.hdf5 `
--out data\Pan_human.prediction.irt.csv
The predicted iRT values are saved in a CSV file (*.prediction.irt.csv).
Ensure that the predicted MS/MS and iRT files are present in the data folder.
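This check can be scripted. The sketch below lists which of the prediction files named in the earlier steps are missing from the data folder; missing_predictions is an illustrative name, and the file names assume the Pan Human example was followed unchanged.

```python
from pathlib import Path

# Output files produced by the MS/MS and iRT prediction steps above.
EXPECTED_FILES = [
    "Pan_human_charge2.prediction.ions.json",
    "Pan_human_charge3.prediction.ions.json",
    "Pan_human.prediction.irt.csv",
]


def missing_predictions(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]
```

An empty list from missing_predictions() means all three inputs for the library-building step are in place.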
Run build_assays_from_prediction.py.
python src\build_assays_from_prediction.py `
--peptide data\peptide\Pan_human.peptide.csv `
--ions data\Pan_human_charge2.prediction.ions.json `
data\Pan_human_charge3.prediction.ions.json `
--rt data\Pan_human.prediction.irt.csv `
--out data\Pan_human.prediction.assay.pickle
The generated spectral library is saved in a Python binary file (*.assay.pickle).
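To confirm that the pickle file was written and can be read, it can be loaded with the standard pickle module. The internal structure of the assay object is not described here, so this sketch reports only its type and, where defined, its length; summarize_pickle is an illustrative name. Unpickling executes code stored in the file, so load only files you generated yourself, and note that loading may require the packages used when the file was written to be importable.

```python
import pickle


def summarize_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    # Not every object has a length, so fall back to None.
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```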
Run convert_assays_to_Spectronaut_library.py.
python src\convert_assays_to_Spectronaut_library.py `
--in data\Pan_human.prediction.assay.pickle `
--out data\Pan_human.prediction.library.xls
The generated spectral library is converted to a spreadsheet file (*.library.xls) that is compatible with Spectronaut and DIA-NN.
Tutorials are available in the docs folder.
DeepDIA Tutorial: Spectral Library Generation From Peptide Lists describes the workflow to generate in silico spectral libraries from peptide lists.
DeepDIA Tutorial: Spectral Library Generation with Detectability Prediction describes the complete workflow to generate in silico spectral libraries from proteome databases with detectability filtering.
DeepDIA Tutorial: Training New Models for MS/MS and iRT Prediction describes the workflow for training new models for MS/MS and iRT prediction using data-dependent acquisition (DDA) data.
DeepDIA Tutorial: Training a New Model for Detectability Prediction describes the workflow for training a new model for MS detectability prediction using data-dependent acquisition (DDA) data.
Yang, Y., Liu, X., Shen, C., Lin, Y., Yang, P., Qiao, L. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat Commun 11, 146 (2020). https://doi.org/10.1038/s41467-019-13866-z.
DeepDIA is distributed under a BSD license. See the LICENSE file for details.
Please report any problems directly to the GitHub issue tracker. You can also send feedback to [email protected].