Affinity2Vec: Drug-Target Binding Affinity Prediction Method Developed using Representation Learning, Graph Mining, and Machine Learning
This repository provides an implementation of the Affinity2Vec tool, which is described in a research paper:
Scientific Reports journal
Received: 22 June 2021
Accepted: 08 March 2022
Published: 19 March 2022
This code is implemented using Python 3.8.
For any questions, please contact the first author:
Maha A. Thafar
Email: [email protected]
Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST)
College of Computers and Information Technology, Taif University (TU).
There are several required Python packages to run the code:
- gensim
- numpy
- scikit-learn
- keras
- deepchem
- protVec
- xgboost
- pandas
These packages can be installed using pip or conda, as in the following example:
pip install -r requirements.txt
1. Input folder: contains two folders, one for each of the two datasets:
- Davis dataset,
- KIBA dataset,
- Each of these folders holds all the required data: the drug-target binding affinities (in adjacency matrix format), the drug-drug and target-target similarities (in square matrix format), the drugs' SMILES in dictionary format with the drugs' IDs, and the proteins' amino-acid sequences in dictionary format with the proteins' IDs.
2. Embedding folder: contains two folders, one for each dataset. Each folder holds the generated seq2seq embeddings for the drugs and the generated ProtVec embeddings for the proteins.
3. aupr folder: code that first converts the data to binary labels and then calculates the AUPR evaluation metric.
4. Code_to_generate_Embeddings folder: contains the seq2seq model code and the ProtVec model code that are necessary to generate the embeddings.
5. Predictions Figures folder: contains two figures that show the binding affinities predicted by the best Affinity2Vec model vs. the actual binding affinity values for the Davis and KIBA datasets.
6. PDBBind_Refined folder: contains all materials related to the PDBBind Refined dataset, including the generated embeddings for all compounds' SMILES and proteins' amino-acid sequences.
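The binarize-then-score step handled by the aupr folder can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's own code: the function names, the toy values, and the threshold of 7.0 are all assumptions made for the example, and the thresholds actually used for Davis and KIBA should be taken from the aupr folder itself.

```python
def binarize(affinities, threshold):
    """Label a pair as positive (1) when its true affinity is >= threshold."""
    return [1 if a >= threshold else 0 for a in affinities]


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    Pairs are ranked by predicted score, highest first, and precision is
    accumulated at every rank where a true positive is found. Tied scores
    keep their input order, which is a simplification.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Toy values only (hypothetical, not taken from Davis or KIBA).
true_affinity = [5.0, 7.4, 6.1, 8.2, 5.0, 7.0]
predicted_affinity = [5.3, 7.1, 6.8, 7.9, 5.1, 6.2]

labels = binarize(true_affinity, threshold=7.0)  # illustrative threshold
aupr = average_precision(labels, predicted_affinity)
print(f"AUPR = {aupr:.3f}")
```

Only the true affinities are binarized; the continuous predictions are kept as ranking scores, so the metric measures how well the regression output separates strong binders from weak ones.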
The code consists of two main scripts, one for each dataset; the remaining functions are the same for all datasets and are imported by each main script:
- training_functions.py --> several training and processing functions, such as Cosine_similarity, normalization, etc.
- pathScores.py --> calculates and returns all meta-path scores for the 6 path structures.
- evaluation.py --> defines all evaluation metrics used in our experiments.
- 2 main scripts, one for each dataset:
- Affinity2Vec_Davis.py
- Affinity2Vec_KIBA.py
- Jupyter notebook for the Affinity2Vec models using the PDBBind Refined dataset
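To make the roles of the similarity, normalization, and meta-path steps more concrete, here is a minimal NumPy sketch. It is not the code in training_functions.py or pathScores.py: every function name below is hypothetical, the data are toy values, and the drug-drug-target path is assumed to be one of the 6 path structures. Scoring a path as the product of its edge weights and aggregating by sum and max is likewise an assumption of this sketch.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def min_max_normalize(matrix):
    """Rescale all entries of a matrix to the [0, 1] range."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros_like(matrix, dtype=float)
    return (matrix - lo) / (hi - lo)


def ddt_path_scores(drug_sim, affinity):
    """Sum and max scores over all drug -> drug -> target paths.

    A path d -> k -> t is scored as drug_sim[d, k] * affinity[k, t].
    """
    # products[d, k, t] = drug_sim[d, k] * affinity[k, t]
    products = drug_sim[:, :, None] * affinity[None, :, :]
    return products.sum(axis=1), products.max(axis=1)


# Toy inputs (hypothetical): 3 drug embeddings and a 3 x 2 affinity matrix.
drug_emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
affinity = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])

drug_sim = min_max_normalize(cosine_similarity_matrix(drug_emb))
sum_scores, max_scores = ddt_path_scores(drug_sim, affinity)
print(sum_scores.shape, max_scores.shape)
```

The sum score is equivalent to the matrix product `drug_sim @ affinity`; the explicit 3-D product is used here only so that the max over intermediate drugs can be taken from the same array. In a real train/test split, the affinities of held-out pairs would presumably need to be masked before computing such scores, to avoid leaking test labels into the features.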
To run the code, each script takes 2 parameters from the user: the dataset name and the model version (the default dataset is nr). Run:
python Affinity2Vec_Davis.py
python Affinity2Vec_KIBA.py
- For the source code that we used to generate the drugs' SMILES embeddings, please refer to the main source code:
- For the proteins' amino-acid sequence embeddings, please refer to the main source:
Thafar, M.A., Alshahrani, M., Albaradei, S. et al. Affinity2Vec: drug-target binding affinity prediction through representation learning, graph mining, and machine learning. Sci Rep 12, 4751 (2022). https://doi.org/10.1038/s41598-022-08787-9