Expanded prediction analysis #60

Open
gwaybio opened this issue Feb 26, 2022 · 3 comments

Comments

@gwaybio
Member

gwaybio commented Feb 26, 2022

We received reviews back from the journal, and one suggestion was for us to expand the machine learning prediction analysis.

Currently, we are using both L1000 and Cell Painting data to predict compound MOA. The reviewer asked us to also predict:

  • compound gene target
  • compound gene target pathways

I think this is a great idea!

I performed the first step of this analysis in #59 - generating the X and Y matrices required to train our models and evaluate predictions. For example, the updated training data for Cell Painting is here: https://github.com/broadinstitute/lincs-profiling-complementarity/tree/master/2.MOA-prediction/2.data_split/model_data/cp

Next Step

The next step in this analysis is to run these matrices through our machine learning pipeline and return results for plotting. Currently, the pipeline trains several multi-class machine learning models to predict compound MOA. We need to modify this pipeline to also predict compound gene target and compound gene target pathway.

I also think that we need to modify the pipeline to train single-class machine learning models. There are about 30,000 unique pathways, and given our sample size, training a single multi-class model across all of them seems infeasible. We can then pass our three different Y matrices (per assay) through this single-class pipeline.
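As a rough sketch of what "single-class" could mean here: each column of a Y matrix is treated as its own binary target, and one model is fit per label. The function names, the `min_positives` cutoff, the `train_fn` placeholder, and the toy data are all assumptions for illustration, not part of the existing pipeline.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def iter_single_class_targets(
    y_rows: Sequence[Sequence[int]],
    label_names: Sequence[str],
    min_positives: int = 1,
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (label, binary target vector) for each column of a multi-label Y matrix."""
    for col, label in enumerate(label_names):
        target = [int(row[col]) for row in y_rows]
        # Skip labels with too few positive compounds to train on (hypothetical cutoff).
        if sum(target) >= min_positives:
            yield label, target


def train_single_class_models(
    x_rows: Sequence[Sequence[float]],
    y_rows: Sequence[Sequence[int]],
    label_names: Sequence[str],
    train_fn: Callable,
    min_positives: int = 1,
) -> Dict[str, object]:
    """Fit one binary model per label; train_fn stands in for any model-fitting routine."""
    return {
        label: train_fn(x_rows, target)
        for label, target in iter_single_class_targets(y_rows, label_names, min_positives)
    }


# Toy data: 4 compounds, 2 features, 3 labels (hypothetical names).
x = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.9], [0.7, 0.4]]
y = [[1, 0, 0], [1, 0, 1], [0, 0, 1], [0, 0, 1]]
labels = ["pathway_a", "pathway_b", "pathway_c"]


def positive_rate(_x, target):
    """Stand-in "model": just the fraction of positive compounds for the label."""
    return sum(target) / len(target)


models = train_single_class_models(x, y, labels, positive_rate, min_positives=2)
```

In this toy run, `pathway_b` has no positive compounds and is skipped, so only two "models" are returned. A real `train_fn` would fit and return a classifier instead of a number.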

Output

We need performance metrics for each model, plus metadata indicating the model, the input data, single-class vs. multi-class, shuffled status, and the prediction task.

It would also be great to output matrices of predicted probabilities per compound by label (MOA, target, or pathway) for each assay, model, single-class/multi-class setting, and shuffled status.
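One possible shape for the metrics output is a long-format table with one row per metric value and one column per piece of metadata. The sketch below is only an assumption about how that could look: the `MetricRecord` class, its column names, the model and metric names, and the values are placeholders, not real results or existing pipeline code.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass
class MetricRecord:
    assay: str        # which data the model saw, e.g. "cp" or "L1000"
    model: str        # model name (placeholder below)
    target_type: str  # "moa", "target", or "pathway"
    class_mode: str   # "single-class" or "multi-class"
    shuffled: bool    # True if the model was trained on shuffled data
    metric: str       # metric name (placeholder below)
    value: float      # metric value (placeholder below)


def records_to_tsv(records: Iterable[MetricRecord]) -> str:
    """Serialize records to a tab-separated table with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(MetricRecord)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder rows only; the numbers are not results from any model.
records = [
    MetricRecord("cp", "model_a", "pathway", "single-class", False, "example_metric", 0.5),
    MetricRecord("cp", "model_a", "pathway", "single-class", True, "example_metric", 0.25),
]
tsv = records_to_tsv(records)
```

The per-compound probability matrices could carry the same metadata columns, which would make it straightforward to filter and facet by assay, model, or shuffled status when plotting.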

@gwaybio
Member Author

gwaybio commented Feb 26, 2022

Our current figure 5, which visualizes the results for the multi-class MOA predictions across models, is here: https://github.com/broadinstitute/lincs-profiling-complementarity/blob/master/6.paper_figures/figure5.ipynb

It might be helpful to format the output data so that it mirrors the data frame we use for plotting in this notebook. (Note: we will need more metadata in the updated version!)

@AdeboyeML
Collaborator

Yeah, expanding the multi-label predictive analysis will be great, but it will depend on the new datasets (gene targets and gene pathways) and on how similar their features are to the existing datasets we used for the multi-label prediction.

I will go through the new datasets this week to see what they look like and what I need to modify in the machine learning pipelines.

@gwaybio
Member Author

gwaybio commented Mar 7, 2022

Thanks for going through the code, determining next steps, and meeting with me this afternoon, @AdeboyeML.

I heard your concern that >5,000 GO terms is likely to make multi-label classification difficult. Therefore, in #67, I filtered out GO terms that had fewer than 20 compounds. Most GO terms had only 1 compound, so this filtering step drastically reduced the set to 772 GO terms, which is on the same order of magnitude as the MOA prediction task.
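For reference, the filtering idea can be sketched as follows. This is not the code from #67, only a minimal illustration assuming the annotations are available as (compound, GO term) pairs; the function name and the toy identifiers are made up.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def filter_terms_by_compound_count(
    pairs: Iterable[Tuple[str, str]],
    min_compounds: int = 20,
) -> Set[str]:
    """Return the terms annotated to at least min_compounds distinct compounds."""
    compounds_per_term = defaultdict(set)
    for compound, term in pairs:
        # A set ensures repeated (compound, term) pairs are counted once.
        compounds_per_term[term].add(compound)
    return {
        term
        for term, compounds in compounds_per_term.items()
        if len(compounds) >= min_compounds
    }


# Toy annotations: one term with 25 compounds, one term with a single compound.
pairs = [(f"compound_{i}", "term_common") for i in range(25)]
pairs.append(("compound_0", "term_rare"))
pairs.append(("compound_0", "term_rare"))  # duplicate pair, counted once

kept_terms = filter_terms_by_compound_count(pairs, min_compounds=20)
```

Only `term_common` survives the cutoff here. The Y matrix would then be restricted to the columns in `kept_terms`.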


2 participants