Expanded prediction analysis #60

gwaybio · 2022-02-26T23:00:16Z

We received reviews back from the journal, and one suggestion was for us to expand the machine learning prediction analysis.

Currently, we are using both L1000 and Cell Painting data to predict compound MOA. The reviewer asked us to also predict:

compound gene target
compound gene target pathways

I think this is a great idea!

I performed the first step of this analysis in #59 - generating the X and Y matrices required to train our models and evaluate predictions. For example, the updated training data for Cell Painting is here: https://github.com/broadinstitute/lincs-profiling-complementarity/tree/master/2.MOA-prediction/2.data_split/model_data/cp

Next Step

The next step in this analysis is to run these matrices through our machine learning pipeline and return results for plotting. Currently, the pipeline trains several multi-class machine learning models to predict compound MOA. We need to modify this pipeline to also predict compound gene target and compound gene target pathway.

I also think that we need to modify the pipeline to train single-class machine learning models. There are about 30,000 unique pathways, and given our sample size, training one multi-class model across all of them seems infeasible. We can then pass our three different Y matrices (per assay) through this single-class pipeline.
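As a rough illustration of the single-class idea, the sketch below splits a multi-label Y matrix into one binary task per label column and builds a shuffled-label control. It is not code from this repository: the function names, the `min_positives` parameter, and the toy labels are all hypothetical, and model fitting is left out entirely.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def iter_single_label_tasks(
    y_matrix: Sequence[Sequence[int]],
    label_names: Sequence[str],
    min_positives: int = 1,
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (label, binary target vector) for each column of a multi-label Y matrix.

    Columns with fewer than `min_positives` positive rows are skipped,
    since a binary model cannot learn from a label with no examples.
    """
    for j, label in enumerate(label_names):
        column = [row[j] for row in y_matrix]
        if sum(column) >= min_positives:
            yield label, column


def shuffled_copy(values: Sequence[int], seed: int) -> List[int]:
    """Return a permuted copy of one target vector (shuffled-label control)."""
    rng = random.Random(seed)
    out = list(values)
    rng.shuffle(out)
    return out


# Toy Y matrix (made up): rows are profiles, columns are labels.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
labels = ["pathway_a", "pathway_b", "pathway_c"]

tasks = dict(iter_single_label_tasks(Y, labels, min_positives=1))
for label, target in tasks.items():
    control = shuffled_copy(target, seed=0)
    # A binary classifier would be fit here, once on `target` and once on `control`.
    print(label, target, control)
```

The same loop could be run over each of the three Y matrices (MOA, target, pathway) per assay.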

Output

We need performance metrics for each model, and metadata indicating which model, data, single-class vs. multi-class, shuffled status, and prediction.

It would also be great to output matrices of predicted probabilities, with one row per compound and one column per label (MOA, target, or pathway), for each combination of assay, model, single-class/multi-class, and shuffled status.
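One possible shape for the performance output is a long-format table with one row per metric value and a column for each piece of metadata listed above. The sketch below is only an assumption about how that could look: the `ResultRow` class, its field names, the model and metric names, and the zero values are placeholders, not results or names from the pipeline.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass
class ResultRow:
    """One performance value plus the metadata needed for plotting (hypothetical schema)."""

    assay: str        # e.g. "cell_painting" or "L1000"
    model: str        # placeholder model name
    task_type: str    # "single-class" or "multi-class"
    shuffled: bool    # True for the shuffled-label control
    prediction: str   # "moa", "target", or "pathway"
    metric: str       # placeholder metric name
    value: float      # placeholder value, not a real result


def rows_to_csv(rows: Iterable[ResultRow]) -> str:
    """Serialize result rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(ResultRow)],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder rows only; the values are dummies to show the layout.
rows = [
    ResultRow("cell_painting", "example_model", "single-class", False, "target", "example_metric", 0.0),
    ResultRow("cell_painting", "example_model", "single-class", True, "target", "example_metric", 0.0),
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the table in long format means new metadata columns can be added later without changing how existing rows are read.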

gwaybio · 2022-02-26T23:01:51Z

Our current figure 5, which visualizes the results for the multi-class MOA predictions across models, is here: https://github.com/broadinstitute/lincs-profiling-complementarity/blob/master/6.paper_figures/figure5.ipynb

It might be helpful for the output data to mirror the data frame we use for plotting in this notebook. (Note, we will need more metadata in the updated version!)

AdeboyeML · 2022-02-27T18:55:14Z

Yeah, expanding the multi-label predictive analysis will be great, but that will depend on the new datasets (gene targets and gene pathways) and how different or similar their features are to the existing datasets we used for the multi-label prediction.

I will go through the new datasets this week to see what they look like and what I need to modify in the machine learning pipelines.

gwaybio · 2022-03-07T20:06:20Z

Thanks for going through the code, determining next steps, and meeting with me this afternoon, @AdeboyeML.

I heard your concern that >5,000 GO terms is likely to make multi-label classification difficult. Therefore, in #67, I filtered out GO terms that had fewer than 20 compounds. Most GO terms had only 1 compound, so this filtering step drastically reduced the set to 772 GO terms, which is on the same order of magnitude as the number of labels in the MOA prediction task.
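The filtering rule described above can be sketched as follows. This is not the code from #67: the function name, the dictionary layout, and the toy GO identifiers are hypothetical, and only the threshold of 20 compounds comes from the comment.

```python
from typing import Dict, Set


def filter_terms_by_compound_count(
    term_to_compounds: Dict[str, Set[str]],
    min_compounds: int = 20,
) -> Dict[str, Set[str]]:
    """Keep only the terms annotated to at least `min_compounds` distinct compounds."""
    return {
        term: compounds
        for term, compounds in term_to_compounds.items()
        if len(compounds) >= min_compounds
    }


# Toy annotations (made up): term -> set of compound identifiers.
toy_annotations = {
    "GO:example_large": {f"compound_{i}" for i in range(25)},
    "GO:example_edge": {f"compound_{i}" for i in range(20)},
    "GO:example_single": {"compound_0"},
}

kept = filter_terms_by_compound_count(toy_annotations, min_compounds=20)
print(sorted(kept))
```

A term with exactly 20 compounds is kept here, which matches dropping terms with fewer than 20.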

gwaybio mentioned this issue Mar 7, 2022

[Response to Review] Filter GO terms for ML analysis #67

Merged
