PyRMD

PyRMD is a ligand-based virtual screening tool written in Python and powered by machine learning. The project is being developed by the Cosconati Lab at the University of Campania Luigi Vanvitelli and is supported by the AIRC Fellowship for Italy Clementina Colombatti.

Please check our open-access manuscript published in the Journal of Chemical Information and Modeling for a full explanation of PyRMD functions: https://pubs.acs.org/doi/abs/10.1021/acs.jcim.1c00653

Authors: Dr. Giorgio Amendola and Prof. Sandro Cosconati

Installation and Usage

First, users should download and install Anaconda.

Once Anaconda has been installed, download the files from this repository and, from the terminal (Linux, macOS) or the Command Prompt (Windows), enter:

conda env create -f pyrmd_environment.yml

This will install the pyrmd conda environment. Follow the instructions appearing in the terminal until the environment installation is complete.

Adjust the configuration_file.ini with a text editor according to your preferences. To use PyRMD, activate the pyrmd conda environment:

conda activate pyrmd

Then, you are ready to run the software:

python PyRMD_v1.02.py configuration_file.ini

If you need a clean configuration file, run PyRMD without any argument:

python PyRMD_v1.02.py

This will automatically generate a default_config.ini file with default settings.
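To review the generated defaults before editing them, the file can be listed section by section. The sketch below is not part of PyRMD; it assumes that default_config.ini uses standard INI syntax that Python's configparser can read, and it makes no assumption about section or key names. The helper name summarize_config is hypothetical.

```python
import configparser
from pathlib import Path


def summarize_config(path):
    """Return {section: {key: value}} for an INI file, without assuming any key names."""
    # Interpolation is disabled so that literal '%' characters are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    # Preserve the case of option names instead of lowercasing them.
    parser.optionxform = str
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    # Assumed location: the file PyRMD writes when run without arguments.
    config_path = Path("default_config.ini")
    if config_path.exists():
        for section, options in summarize_config(config_path).items():
            print(f"[{section}]")
            for key, value in options.items():
                print(f"  {key} = {value}")
    else:
        print(f"{config_path} not found; run PyRMD without arguments first.")
```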

If you have trouble setting up PyRMD or optimizing its performance, get in touch with us.

Tutorials

The tutorials folder contains two test cases, one for the benchmark mode and one for the screening mode, with all the files and configurations already set up. Users only need to run PyRMD in the respective folders.

Benchmark

The benchmark test case lets you benchmark PyRMD performance using the target bioactivity data downloaded from ChEMBL for the tyrosine kinase MET. The benchmark employs a Repeated Stratified K-Fold approach with 5 folds and 3 repetitions. MET decoy compounds downloaded from DUD-E are also included in the folder to be used as an additional test set. These settings are specified in the configuration_benchmark.ini file, which can be easily modified. To activate the conda environment and run the benchmark, enter:

conda activate pyrmd
python PyRMD_v1.02.py configuration_benchmark.ini

At the end of the calculations, the benchmark_results.csv file will include the averaged benchmark metrics (TPR, FPR, Precision, F-Score, ROC AUC, PRC AUC, and BEDROC) across all the folds and repetitions. Also, the plots ROC_curve.png and PRC_curve.png will be generated.
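As an illustration of what the benchmark scheme involves, the sketch below builds the 15 train/test splits of a Repeated Stratified K-Fold (5 folds, 3 repetitions) and computes TPR, FPR, Precision, and F-Score from binary predictions. It is a minimal standard-library sketch, not PyRMD's own code: the function names and toy labels are hypothetical, and the F-Score is assumed here to be the F1 form.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repetition."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold keeps the class ratio.
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


def classification_metrics(y_true, y_pred):
    """Compute TPR, FPR, precision and F-score (F1 assumed) from binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "F-Score": f_score}


# Toy labels (hypothetical): 10 actives (1) and 20 inactives (0).
labels = [1] * 10 + [0] * 20
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 5 folds x 3 repetitions = 15 splits

# Toy predictions: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
metrics = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
print(metrics)
```

Averaging the per-split metric dictionaries over all 15 splits would give the kind of summary values reported in benchmark_results.csv.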

Screening

The screening test case trains PyRMD with the MET ChEMBL bioactivity data (the same used in the benchmark) and proceeds to screen a small sample of randomly extracted compounds from MCULE. These settings are specified in the configuration_screening.ini file that can be easily modified. To activate the conda environment and run the screening, enter:

conda activate pyrmd
python PyRMD_v1.02.py configuration_screening.ini

At the end of the calculations, the database_predictions.csv file will report a summary of the molecules predicted to be active against MET. For each compound, the file will include the molecule SMILES string, the RMD confidence score (the higher the better), the most similar training active compound and its relative similarity, and a flag indicating whether it is a potential PAINS compound. Also, the predicted_actives.smi SMILES file will be created to be readily used with other cheminformatics/molecular modeling software.
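For a quick look at the predicted actives without additional software, the SMILES file can be read with the standard library. This is a sketch under an assumption: it expects the common .smi layout of one molecule per line, with the SMILES string first and an optional whitespace-separated identifier after it. The exact layout PyRMD writes is not described here, and the helper name read_smiles_file is hypothetical.

```python
from pathlib import Path


def read_smiles_file(path):
    """Return a list of (smiles, name) tuples; name is None when a line has one column."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split(maxsplit=1)
            if not fields:
                continue  # skip blank lines
            smiles = fields[0]
            name = fields[1].strip() if len(fields) > 1 else None
            records.append((smiles, name))
    return records


if __name__ == "__main__":
    # Assumed location: the screening output in the current folder.
    smi_path = Path("predicted_actives.smi")
    if smi_path.exists():
        molecules = read_smiles_file(smi_path)
        print(f"{len(molecules)} predicted actives")
        for smiles, name in molecules[:5]:
            print(smiles, name)
    else:
        print(f"{smi_path} not found; run the screening first.")
```

If the file turns out to start with a header line, that line would be returned as the first record and should be skipped by the caller.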

Training Data

Please check the "Training Dataset Preparation" and "Benchmarking Training Data" sections in the Supporting Information PDF of the JCIM article for a description of what kind of training data to use with PyRMD.

Optimizing PyRMD Performance

Poor performance may be the result of several factors, which include:

  • Insufficient training data, either for the active set or the inactive one
  • Many more inactive compounds than actives in the training set
  • Training data sets that comprise compounds with different mechanisms of action
  • Inadequate epsilon cutoff values

The above cases and others are discussed in the JCIM article. Importantly, adjusting the epsilon cutoff values may readily improve poor performance, especially with regard to TPR, FPR, Precision, and F-Score, as they impact the classification thresholds. The default cutoff value for both the active training set and the inactive training set is 0.95, so that 95% of the training set will be considered in the model building step. This leaves some tolerance for the presence or absence of chemical motifs. Higher values (i.e., closer to 1) for the active training cutoff generally result in more true positives and more false positives alike. In contrast, a higher cutoff for the inactive set should mainly decrease the false positive rate, with some effect on the TPR.
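One way to picture the cutoff, as a sketch only: suppose the model assigns each training compound a distance-like score, and the cutoff selects the threshold within which that fraction of the training compounds falls. Under that assumption, a cutoff of 0.95 keeps 95% of the training compounds inside the threshold, and raising the cutoff can only widen it. The scores and the function name below are hypothetical, and this is not PyRMD's actual code.

```python
import math


def threshold_for_cutoff(scores, cutoff):
    """Smallest threshold such that at least `cutoff` of the scores are <= threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < cutoff <= 1:
        raise ValueError("cutoff must be in (0, 1]")
    ordered = sorted(scores)
    # Number of training compounds that must fall within the threshold.
    needed = math.ceil(cutoff * len(ordered))
    return ordered[needed - 1]


# Hypothetical distance-like scores for 20 training compounds (not real PyRMD output).
training_scores = [0.1 * i for i in range(1, 21)]

for cutoff in (0.84, 0.95, 0.98):
    print(cutoff, round(threshold_for_cutoff(training_scores, cutoff), 2))
```

A wider threshold on the active set admits more screened compounds as actives, which matches the trend described above of more true positives and more false positives.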

We suggest benchmarking all possible combinations of the following epsilon cutoff values: 0.84, 0.95, and 0.98 for epsilon_active and 0.7, 0.84, 0.95, and 0.98 for epsilon_cutoff_inactive, and then identifying the combination with the best TPR/FPR tradeoff. Further information about the epsilon cutoff values is available in the JCIM article and in its Supporting Information PDF.
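The suggested grid amounts to 12 benchmark runs, which can be enumerated as in the sketch below. Picking the best combination requires a criterion: the difference TPR - FPR used here is only one possible choice and is an assumption of this sketch, as are the function name and the example numbers, which are not real PyRMD output.

```python
from itertools import product

# Cutoff values suggested above for the active and inactive training sets.
EPSILON_ACTIVE_VALUES = (0.84, 0.95, 0.98)
EPSILON_INACTIVE_VALUES = (0.7, 0.84, 0.95, 0.98)

combinations = list(product(EPSILON_ACTIVE_VALUES, EPSILON_INACTIVE_VALUES))
print(len(combinations))  # 3 x 4 = 12 benchmark runs


def best_tradeoff(results):
    """Pick the (active, inactive) pair maximizing TPR - FPR.

    `results` maps each pair to a (TPR, FPR) tuple collected from benchmark runs.
    TPR - FPR is only one possible criterion; it is an assumption of this sketch.
    """
    return max(results, key=lambda pair: results[pair][0] - results[pair][1])


# Hypothetical numbers for illustration only, not real PyRMD output.
example_results = {
    (0.95, 0.95): (0.80, 0.10),
    (0.98, 0.95): (0.90, 0.25),
    (0.95, 0.98): (0.78, 0.05),
}
print(best_tradeoff(example_results))
```

Each pair would be written into a copy of the benchmark configuration file and run separately, with the averaged TPR and FPR taken from the corresponding benchmark_results.csv.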

RMD Algorithm

PyRMD implements the Random Matrix Discriminant (RMD) algorithm devised by Lee et al. to identify small molecules endowed with biological activity. Parts of the RMD algorithm code were adapted from the MATLAB version of the RMD and from a Python implementation of Random Matrix Theory proposed by Laksh Aithani.