include utility function for drug/target embedding only

kexinhuang12345 · Oct 28, 2020 · 6b0fa02
1 parent 4ee813d
commit 6b0fa02
Showing 2 changed files with 37 additions and 0 deletions.
diff --git a/CONTRIBUTE.md b/CONTRIBUTE.md
@@ -0,0 +1,28 @@
+## Instructions on how to include a new encoder
+
+Thank you for your interest in DeepPurpose! As more and more models are published, we want to include as many of them, along with their pretrained models, as possible in our framework. Here are step-by-step instructions for doing that:
+
+
+### Step 1: modify the ``utils.py`` file for data and parameters
+
+For any dataset, we expect each drug to be associated with a SMILES string and each protein with an amino acid sequence. However, because different encoders expect different inputs (e.g., MPNN expects a molecular graph), we first need to transform the raw input into the expected format. To do that, define a new function ``smiles2xxx`` or ``target2xxx`` in the ``utils.py`` file that takes a single input SMILES/sequence and outputs the encoding for that input.
+
+Then, in the ``encode_drug`` or ``encode_protein`` function, include an ``elif`` statement that transforms all of the data points in the input dataframe using the newly defined ``smiles2xxx`` or ``target2xxx``.
+
+For special input formats, such as those that need a further transformation on the fly, please add an ``elif`` statement to ``data_process_loader``, ``data_process_DDI_loader``, ``data_process_PPI_loader``, and ``data_process_loader_Protein_Prediction``. You can refer to the examples for CNN in these loaders.
+
+Now, in the ``generate_config`` function, add an ``elif`` statement to include all important encoder parameters (e.g., input dimension, model dimension, etc.). If your encoder has new parameters that you want users to specify in the ``model_initialize`` function, add them to that function's parameters as well, and please specify their default values.
+
+### Step 2: modify the ``encoders.py`` for model definition
+
+In ``encoders.py``, define the encoder model. By default, ``__init__`` should accept ``encoding``, which is either 'drug' or 'protein', and ``**config``, which contains all the model parameters defined by the user. The ``forward`` function should take one feature matrix as input and return the hidden embedding.
+
+### Step 3: modify the training scripts ``DTI.py, DDI.py, PPI.py, CompoundPred.py, ProteinPred.py``
+
+Finally, we need to modify the training wrappers. The files share a similar structure, so we describe one and the rest follow the same pattern. In the ``__init__`` function of the main class, include an ``elif`` statement that defines the model based on the definitions in ``encoders.py``.
+
+That's it! You have successfully included your model in DeepPurpose!
+
+### Test and Write in README file
+
+Before you create a pull request, please also test it locally and send [email protected] a test case. Then, you are good to go!
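
Reviewer note: as a rough illustration of Step 1 in the CONTRIBUTE.md text above, the standalone sketch below shows the shape of a per-input transform, the dispatch branch that applies it to every data point, and the matching config branch. Everything in it is hypothetical and not part of DeepPurpose: the names ``smiles2sketch``, ``encode_drug_sketch`` and ``generate_config_sketch``, the ``'Sketch'`` encoding name, the vocabulary, and the config keys are invented for illustration. The real ``encode_drug`` works on a dataframe, while this sketch uses a plain list to stay dependency-free.

```python
# Hypothetical character vocabulary and maximum length (assumptions).
SKETCH_VOCAB = ['C', 'N', 'O', '(', ')', '=', '1', '2']
SKETCH_MAX_LEN = 10


def smiles2sketch(smiles):
    """Encode ONE SMILES string as a fixed-length list of integer ids.

    Id 0 is used both for padding and for characters outside the vocabulary.
    """
    char_to_id = {ch: i + 1 for i, ch in enumerate(SKETCH_VOCAB)}
    ids = [char_to_id.get(ch, 0) for ch in smiles[:SKETCH_MAX_LEN]]
    return ids + [0] * (SKETCH_MAX_LEN - len(ids))


def encode_drug_sketch(smiles_list, drug_encoding):
    """Apply the per-input transform to every data point.

    In a real dispatch function this would be one more ``elif`` branch
    next to the existing encodings.
    """
    if drug_encoding == 'Sketch':
        # Encode each unique SMILES once, then map back to the full list.
        unique = {s: smiles2sketch(s) for s in set(smiles_list)}
        return [unique[s] for s in smiles_list]
    else:
        raise AttributeError('unknown drug_encoding: ' + str(drug_encoding))


def generate_config_sketch(drug_encoding, sketch_hidden_dim=64):
    """Collect encoder parameters; new parameters get default values."""
    config = {'drug_encoding': drug_encoding}
    if drug_encoding == 'Sketch':
        config['input_dim_drug'] = len(SKETCH_VOCAB) + 1  # +1 for id 0
        config['sketch_hidden_dim'] = sketch_hidden_dim
    return config


encoded = encode_drug_sketch(['CCO', 'C=O', 'CCO'], 'Sketch')
config = generate_config_sketch('Sketch')
```

Encoding each unique input once and mapping the results back avoids repeating an expensive transform for duplicated SMILES strings, which matters more for costly encodings such as graph construction than for this toy one.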
diff --git a/DeepPurpose/utils.py b/DeepPurpose/utils.py
@@ -915,6 +915,15 @@ def load_dict(path):
 	'HIV': 'https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/molnet_publish/hiv.zip'
 	}
 
+def obtain_compound_embedding(net, file, file_type = 'df'):
+    if file_type == 'df':
+        x = np.stack(file['drug_encoding'].values)
+    elif file_type == 'array':
+        x = file
+    else:
+        raise AttributeError("file_type must be 'df' or 'array'")
+
+    return net.model_drug(torch.FloatTensor(x))
 
 def download_unzip(name, path, file_name):
 	if not os.path.exists(path):