Expanded prediction analysis #60
Comments
Our current figure 5, which visualizes the results for the multi-class MOA predictions across models, is here: https://github.com/broadinstitute/lincs-profiling-complementarity/blob/master/6.paper_figures/figure5.ipynb It might be helpful for the output data to mirror the data frame we use for plotting in this notebook. (Note, we will need more metadata in the updated version!)
Yeah, expanding the multi-label predictive analysis will be great, but it will depend on the new datasets (gene targets and gene pathways) and on how similar their features are to the existing datasets we used for multi-label prediction. I will go through the new datasets this week to see how they look and what I need to modify in the machine learning pipelines.
Thanks for going through the code, determining next steps, and meeting with me this afternoon @AdeboyeML. I heard your concern that >5,000 GO terms is likely to make multi-label classification difficult. Therefore, in #67, I filtered out GO terms that had fewer than 20 compounds. Most GO terms had only 1 compound, so this filtering step drastically reduced the set to 772 GO terms, which is on the same order of magnitude as the MOA prediction task.
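A minimal sketch of this kind of label filter, for illustration only. It is not the code from #67: the function name `filter_rare_labels`, the dictionary input format, and the toy compound and GO term identifiers are all assumptions.

```python
from collections import defaultdict


def filter_rare_labels(compound_to_labels, min_compounds=20):
    """Drop labels annotated to fewer than `min_compounds` distinct compounds.

    `compound_to_labels` is a hypothetical mapping: compound id -> list of labels.
    """
    # Invert the mapping so we can count distinct compounds per label
    label_to_compounds = defaultdict(set)
    for compound, labels in compound_to_labels.items():
        for label in labels:
            label_to_compounds[label].add(compound)

    kept = {
        label
        for label, compounds in label_to_compounds.items()
        if len(compounds) >= min_compounds
    }

    # Every compound stays in the output, possibly with an empty label list
    return {
        compound: sorted(set(labels) & kept)
        for compound, labels in compound_to_labels.items()
    }


# Toy annotations (made-up ids), filtered with a small threshold for readability
toy = {
    "cpd_a": ["GO:1", "GO:2"],
    "cpd_b": ["GO:1"],
    "cpd_c": ["GO:1", "GO:3"],
}
filtered = filter_rare_labels(toy, min_compounds=2)
```

With the toy data only `GO:1` survives, since it is the only term annotated to at least two compounds. Whether compounds left with no labels should also be dropped is a separate choice that this sketch does not make.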
We received reviews back from the journal, and one suggestion was for us to expand the machine learning prediction analysis.
Currently, we are using both L1000 and Cell Painting data to predict compound MOA. The reviewer asked us to also predict:
I think this is a great idea!
I performed the first step of this analysis in #59 - generating the X and Y matrices required to train our models and evaluate predictions. For example, the updated training data for Cell Painting is here: https://github.com/broadinstitute/lincs-profiling-complementarity/tree/master/2.MOA-prediction/2.data_split/model_data/cp
Next Step
The next step in this analysis is to run these matrices through our machine learning pipeline and return results for plotting. Currently, the pipeline trains several multi-class machine learning models to predict compound MOA. We need to modify this pipeline to also predict compound gene target and compound gene target pathway.
I also think that we need to modify the pipeline to train single-class machine learning models. There are about 30,000 unique pathways, and given our sample size, training a multi-class model over all of them seems infeasible. We can then pass our three different Y matrices (per assay) through this single-class pipeline.
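One way to picture the single-class setup is to split a compound-by-label Y matrix into one binary target vector per label, each of which would feed its own model. The sketch below assumes a plain list-of-lists matrix and hypothetical label names; the helper `iter_single_label_tasks` is illustrative and not part of the existing pipeline.

```python
def iter_single_label_tasks(y_matrix, label_names, min_positives=1):
    """Yield (label, binary_target) pairs, one per column of a 0/1 Y matrix.

    Rows are compounds (or profiles) and columns are labels.
    """
    for j, label in enumerate(label_names):
        target = [row[j] for row in y_matrix]
        positives = sum(target)
        # A binary classifier needs both classes present, so skip labels
        # with too few positives or with no negatives at all
        if positives < min_positives or positives == len(target):
            continue
        yield label, target


# Toy Y matrix with made-up label names
y = [
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 1],
]
labels = ["label_a", "label_b", "label_c"]
tasks = dict(iter_single_label_tasks(y, labels))
```

Here `label_b` is skipped because it has no positive examples. The same loop could be applied to each of the three Y matrices per assay.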
Output
We need performance metrics for each model, and metadata indicating which model, data, single-class vs. multi-class, shuffled status, and prediction.
It would also be great to output matrices of predicted probabilities per compound by label (either MOA, target, or pathway) for each assay, model, single-class/multi-class setting, and shuffled status.
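As a rough sketch of what a tidy metrics output with this metadata could look like, the example below writes one row per metric value. The column names, the `MetricRecord` class, the model name, and the numbers are all placeholders chosen for illustration, not the pipeline's actual schema or real results.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class MetricRecord:
    assay: str        # e.g. "cell_painting" or "L1000"
    model: str        # placeholder model name
    prediction: str   # "moa", "target", or "pathway"
    mode: str         # "single-class" or "multi-class"
    shuffled: bool    # True if trained on shuffled data
    metric: str       # name of the performance metric
    value: float      # metric value


def records_to_csv(records):
    """Serialize metric records to CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(MetricRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder rows only; the values are not real results
records = [
    MetricRecord("cell_painting", "example_model", "moa", "multi-class", False, "example_metric", 0.0),
    MetricRecord("cell_painting", "example_model", "moa", "multi-class", True, "example_metric", 0.0),
]
csv_text = records_to_csv(records)
```

A long format like this keeps every combination of assay, model, prediction task, mode, and shuffled status in a single table, which is convenient for plotting. The probability matrices could carry the same metadata in their file names or in extra columns.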