Generative_Protein_Design_Datasets

A repository highlighting and linking existing resources and datasets for generative protein modeling.

Sections

Protein Sequence Datasets
Protein Structure Datasets
Protein-Protein Interaction Datasets
Protein-Ligand Datasets

Protein Sequence Datasets

Suzek, Baris E et al. “UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.”
Bioinformatics (Oxford, England) vol. 31,6 (2015): 926-32. doi:10.1093/bioinformatics/btu739
Description: A collection of protein sequence clusters from the UniProt Knowledgebase.
[Paper], [Dataset/Server]
Bryant, D.H., Bashir, A., Sinai, S. et al. "Deep diversification of an AAV capsid protein by machine learning."
Nat Biotechnol 39, 691–696 (2021). https://doi.org/10.1038/s41587-020-00793-4
Description: DL-engineered adeno-associated virus 2 (AAV2) capsid protein variants.
[Paper], [Dataset/Server]
Jaina Mistry et al. "Pfam: The protein families database in 2021"
Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D412–D419, https://doi.org/10.1093/nar/gkaa913
Description: A collection of protein families, each represented by multiple sequence alignments and hidden Markov models.
[Paper], [Dataset/Server (Legacy Host)], [Dataset/Server (Hosted at InterPro)]
Trinquier, J., Uguzzoni, G., Pagnani, A. et al. "Efficient generative modeling of protein sequences using simple autoregressive models." Nat Commun 12, 5800 (2021). https://doi.org/10.1038/s41467-021-25756-4
Description: Uses data for five protein families (PF00014, PF00072, PF00076, PF00595, PF13354) from Pfam.
[Paper], [Dataset/Server]
Elodie Laine et al. "GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects"
Molecular Biology and Evolution, Volume 36, Issue 11, November 2019, Pages 2604–2619, https://doi.org/10.1093/molbev/msz179
Description: Generates the mutational landscape of a protein given an input alignment
[Paper], [Dataset/Server]
Shin, JE., Riesselman, A.J., Kollasch, A.W. et al. "Protein design and variant prediction using autoregressive generative models."
Nat Commun 12, 2403 (2021). https://doi.org/10.1038/s41467-021-22732-w
Description: PTEN phosphatase deletions & IGP dehydratase insertions and deletions
[Paper], [Dataset/Server]
Typhaine Paysan-Lafosse et al. "InterPro in 2022"
Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D418–D427, https://doi.org/10.1093/nar/gkac993
Description: A database that integrates diverse information about protein families, domains and functional sites, providing a unified view of protein sequence classification and annotation.
[Paper], [Dataset/Server]
Milot Mirdita et al. "Uniclust databases of clustered and deeply annotated protein sequences and alignments"
Nucleic Acids Research, Volume 45, Issue D1, January 2017, Pages D170–D176, https://doi.org/10.1093/nar/gkw1081
Description: A database of protein sequences clustered at different sequence similarity thresholds, providing a resource to reduce redundancy and improve the efficiency of various bioinformatics analyses
[Paper], [Dataset/Server]
Gilson, Michael K et al. “BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology.”
Nucleic acids research vol. 44, D1 (2016): D1045-53. doi:10.1093/nar/gkv1072
Description: A database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug-targets with small, drug-like molecules.
[Paper], [Dataset/Server]
Antje Chang et al. "BRENDA, the ELIXIR core data resource in 2021: new developments and updates"
Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D498–D508, https://doi.org/10.1093/nar/gkaa1025
Description: A database providing detailed information on enzymes such as function, structure, localization, and application, derived from scientific literature and manual curation.
[Paper], [Dataset/Server]
Schmitt, L.T., Paszkowski-Rogacz, M., Jug, F. et al. "Prediction of designer-recombinases for DNA editing with generative deep learning."
Nat Commun 13, 7966 (2022). https://doi.org/10.1038/s41467-022-35614-6
Description: 89 recombinase libraries for loxA-1 to loxX-6 target sites
[Paper], [Dataset/Server]
M. AlQuraishi, “ProteinNet: a standardized data set for machine learning of protein structure,”
BMC Bioinformatics, vol. 20, no. 1, p. 311, Dec. 2019, doi: 10.1186/s12859-019-2932-0.
Description: A standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training/validation/test splits
[Paper], [Dataset/Server]
Dallago, Christian, et al. “FLIP: Benchmark Tasks in Fitness Landscape Inference for Proteins.”
bioRxiv (Cold Spring Harbor Laboratory), Cold Spring Harbor Laboratory, Nov. 2021, https://doi.org/10.1101/2021.11.09.467890.
Description: An open-source benchmark for function prediction, designed to enhance machine learning applications in protein engineering via representation learning scoring.
[Paper], [Dataset]
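
Note on clustered sequence sets: UniRef and Uniclust above provide sequences clustered at fixed similarity thresholds. As a rough illustration of what threshold-based clustering means, here is a minimal Python sketch that greedily clusters toy sequences. The function names `identity` and `greedy_cluster` and the example sequences are hypothetical, and the sketch assumes pre-aligned, equal-length sequences. It is not the procedure used to build either resource.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches at >= threshold.

    Sequences are visited in sorted ID order (an arbitrary choice made here
    for determinism); a sequence that matches no representative starts a
    new cluster and becomes its representative.
    """
    clusters: Dict[str, List[str]] = {}
    for sid in sorted(seqs):
        for rep in clusters:
            if identity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


# Toy sequences (made up for illustration, not taken from any dataset).
toy = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",
    "c": "MSTNPKPQRK",
    "d": "MKTAYLAKQR",
}
clusters_90 = greedy_cluster(toy, 0.9)
clusters_100 = greedy_cluster(toy, 1.0)
```

Lowering the threshold merges more sequences into each cluster, which is why the same source data can be released at several redundancy levels.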

Protein Structure Datasets

Ian Sillitoe et al. "CATH: expanding the horizons of structure-based functional annotations for genome sequences"
Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D280–D284, https://doi.org/10.1093/nar/gky1097
Description: Hierarchical Classification of Protein Structures based on function and evolutionary relationships.
[Paper], [Dataset/Server]
Cuff, J A, and G J Barton. “Evaluation and improvement of multiple sequence methods for protein secondary structure prediction.”
Proteins vol. 34,4 (1999): 508-19. doi:10.1002/(sici)1097-0134(19990301)34:4<508::aid-prot10>3.0.co;2-4
Description: Dataset for Secondary Structure Prediction
[Paper], [CB513 Dataset]
Kryshtafovych, A, Schwede, T, Topf, M, Fidelis, K, Moult, J. "Critical assessment of methods of protein structure prediction (CASP)—Round XIII."
Proteins. 2019; 87: 1011– 1020. https://doi.org/10.1002/prot.25823
Description: A collection of predicted structures for sequences
[Paper], [Dataset/Server]
Protein Data Bank
John-Marc Chandonia et al. "SCOPe: classification of large macromolecular structures in the structural classification of proteins—extended database."
Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D475–D481, https://doi.org/10.1093/nar/gky1134
Description: Database of known protein structures classified based on their structural and evolutionary relationships
[Paper], [Dataset/Server]
Ferdous S, Martin ACR. "AbDb: antibody structure database-a database of PDB-derived antibody structures."
Database (Oxford). 2018;2018:bay040. doi:10.1093/database/bay040
Description: A database of antibody structures.
[Paper], [Dataset/Server]
S. Axelrod and R. Gomez-Bombarelli, “GEOM: Energy-annotated molecular conformations for property prediction and molecular generation.”
arXiv, Feb. 09, 2022. doi: 10.48550/arXiv.2006.05531.
Description: A dataset of 37 million molecular conformations annotated by energy and statistical weight for over 450,000 molecules.
[Paper], [Dataset/Server]
M. AlQuraishi, “ProteinNet: a standardized data set for machine learning of protein structure,”
BMC Bioinformatics, vol. 20, no. 1, p. 311, Dec. 2019, doi: 10.1186/s12859-019-2932-0.
Description: A standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training/validation/test splits
[Paper], [Dataset/Server]
Leonardo V Castorina et al. "PDBench: evaluating computational methods for protein-sequence design."
Bioinformatics, Volume 39, Issue 1, January 2023, btad027, https://doi.org/10.1093/bioinformatics/btad027
Description: Diverse dataset of proteins and tests, intended to evaluate and optimize protein-sequence design
[Paper], [Dataset/Server]
Amelia Villegas-Morcillo et al. "ManyFold: an efficient and flexible library for training and validating protein folding models."
Bioinformatics, Volume 39, Issue 1, January 2023, btac773, https://doi.org/10.1093/bioinformatics/btac773
Description: A flexible, Jax-based library for protein structure prediction using deep learning, capable of fine-tuning and training models from scratch.
[Paper], [Dataset]
Thomas, Neil, et al. “Tuned Fitness Landscapes for Benchmarking Model-Guided Protein Design.”
bioRxiv, 1 Jan. 2022, www.biorxiv.org/content/10.1101/2022.10.28.514293v1.full.
Description: An open-source framework for creating synthetic, biologically-inspired protein landscapes to test and benchmark machine learning-directed evolution (MLDE) strategies.
[Paper], [Dataset]
Dallago, Christian, et al. “FLIP: Benchmark Tasks in Fitness Landscape Inference for Proteins.”
bioRxiv (Cold Spring Harbor Laboratory), Cold Spring Harbor Laboratory, Nov. 2021, https://doi.org/10.1101/2021.11.09.467890.
Description: An open-source benchmark for function prediction, designed to enhance machine learning applications in protein engineering via representation learning scoring.
[Paper], [Dataset]
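
Note on PSSMs: the ProteinNet entry above lists position-specific scoring matrices (PSSMs) derived from multiple sequence alignments. As a hedged illustration of the general idea, the Python sketch below computes a per-column log-odds profile from a toy alignment. The function name `log_odds_profile`, the uniform background, and the pseudocount are assumptions made for this example. This is not ProteinNet's pipeline, which may weight sequences and estimate backgrounds differently.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_profile(msa: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2(smoothed frequency / uniform background) for an aligned MSA."""
    if not msa or any(len(row) != len(msa[0]) for row in msa):
        raise ValueError("MSA rows must be non-empty and of equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*msa):
        # Gaps and non-standard symbols are simply ignored in this sketch.
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


# Toy alignment (made up for illustration); '-' marks a gap.
toy_msa = ["MKT", "MKS", "MRT", "M-T"]
profile = log_odds_profile(toy_msa)
```

Positive scores mark residues that are enriched in a column relative to the background, and negative scores mark residues that are depleted.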

Protein-Protein Interaction Datasets

P. Bryant, G. Pozzati, and A. Elofsson, “Improved prediction of protein-protein interactions using AlphaFold2,”
Nat. Commun., vol. 13, no. 1, Art. no. 1, Mar. 2022, doi: 10.1038/s41467-022-28865-w.
Description: AlphaFold2-predicted protein-protein interactions.
[Paper], [Input Dataset], [Predictions]
A. G. Green, H. Elhabashy, K. P. Brock, R. Maddamsetti, O. Kohlbacher, and D. S. Marks, “Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences,”
Nat. Commun., vol. 12, no. 1, Art. no. 1, Mar. 2021, doi: 10.1038/s41467-021-21636-z.
Description: De novo protein interactions in the E. coli proteome.
[Paper], [Dataset]
D. F. Burke et al., “Towards a structurally resolved human protein interaction network,”
Nat. Struct. Mol. Biol., vol. 30, no. 2, Art. no. 2, Feb. 2023, doi: 10.1038/s41594-022-00910-8.
Description: AlphaFold2-predicted structures for human protein interactions.
[Paper], [Dataset]

Protein-Ligand Datasets

J. Desaphy, G. Bret, D. Rognan, and E. Kellenberger, “sc-PDB: a 3D-database of ligandable binding sites—10 years on,”
Nucleic Acids Res., vol. 43, no. D1, pp. D399–D404, Jan. 2015, doi: 10.1093/nar/gku928.
Description: A database of ligandable sites from the PDB (all-atom description of the protein, its ligand, their binding site and their binding mode)
[Paper], [Dataset/Server]
M. Naderi, R. G. Govindaraj, and M. Brylinski, “eModel-BDB: a database of comparative structure models of drug-target interactions from the Binding Database,”
GigaScience, vol. 7, no. 8, p. giy091, Aug. 2018, doi: 10.1093/gigascience/giy091.
Description: Database of atomic-level models of drug-protein assemblies
[Paper], [Dataset/Server]
PS: Read the paper to see how to extract the relevant structures/models from BDB.
A. Gaulton et al., “ChEMBL: a large-scale bioactivity database for drug discovery,”
Nucleic Acids Res., vol. 40, no. D1, pp. D1100–D1107, Jan. 2012, doi: 10.1093/nar/gkr777.
Description: A database of bioactive molecules with drug-like properties (chemical, bioactivity and genomic data)
[Paper], [Dataset/Server]
P. G. Francoeur et al., “3D Convolutional Neural Networks and a CrossDocked Dataset for Structure-Based Drug Design,”
J. Chem. Inf. Model., vol. 60, no. 9, pp. 4200–4215, Sep. 2020, doi: 10.1021/acs.jcim.0c00411.
Description: Benchmark dataset for protein-ligand binding affinity prediction
[Paper], [Dataset/Server]
