Model coefficient stability analysis #31

jjc2718 · 2020-10-15T14:22:14Z

The goal of this analysis was to use the model coefficients of our mutation prediction classifiers to evaluate similarity between models. Since we're using elastic net logistic regression (which zeroes out coefficients for most genes), we can compare the nonzero coefficients between models, and if they are similar we say the models are similar.
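As a rough sketch of the comparison described above: the snippet below takes the set of genes with nonzero coefficients in each of two models and scores their overlap. The Jaccard index, the helper names, and the gene names and values are all assumptions for illustration; they are not taken from the analysis code in this PR.

```python
def nonzero_genes(coefs, tol=0.0):
    """Return the set of gene names whose coefficient magnitude exceeds tol."""
    return {gene for gene, value in coefs.items() if abs(value) > tol}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical coefficients from two cross-validation folds of one classifier.
fold_1 = {"GENE_A": 0.8, "GENE_B": 0.0, "GENE_C": -0.3, "GENE_D": 0.0}
fold_2 = {"GENE_A": 0.6, "GENE_B": 0.2, "GENE_C": 0.0, "GENE_D": 0.0}

similarity = jaccard(nonzero_genes(fold_1), nonzero_genes(fold_2))
print(f"Jaccard similarity of selected genes: {similarity:.2f}")
```

A set-based score like this ignores coefficient sign and magnitude; it only asks whether the same genes were selected.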

The idea was to eventually use this to define similarities for the same gene across different cancer types (e.g. if we noticed that our KRAS mutation predictor selects similar genes in thyroid cancer and colon cancer, we would hypothesize that KRAS mutations have similar effects on gene expression in those cancer types, which could be interesting biologically).

Unfortunately, this doesn't work as well as we thought it would - even for models on different cross-validation folds of the same gene and cancer type, we see considerable variation in the nonzero coefficients. This is probably due to the large amount of multicollinearity in gene expression data: in many cases there are multiple predictors/genes in the dataset conveying essentially the same information, so the model can pick one or a few of them essentially arbitrarily.

This is a fairly well-documented characteristic of feature selection in linear models on datasets with collinear features, so it isn't too surprising.
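A toy illustration of why collinearity makes the selection arbitrary. It is not the elastic net setup used here: a "pick the single feature most correlated with the label" rule stands in for sparse selection, the label is continuous rather than binary, and all data are simulated. Features 0 and 1 are near-duplicates of the same underlying signal, and feature 2 is pure noise.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_samples, n_folds = 200, 4

# One shared signal; features 0 and 1 are noisy copies of it (collinear).
signal = [rng.gauss(0, 1) for _ in range(n_samples)]
features = [
    [s + 0.1 * rng.gauss(0, 1) for s in signal],  # feature 0
    [s + 0.1 * rng.gauss(0, 1) for s in signal],  # feature 1
    [rng.gauss(0, 1) for _ in range(n_samples)],  # feature 2: unrelated noise
]
labels = [s + 0.5 * rng.gauss(0, 1) for s in signal]

# On each fold, select the one feature with the largest |correlation| to the label.
picks = []
for fold in range(n_folds):
    idx = range(fold, n_samples, n_folds)
    y = [labels[i] for i in idx]
    scores = [abs(pearson([f[i] for i in idx], y)) for f in features]
    picks.append(scores.index(max(scores)))

print("feature picked per fold:", picks)
```

Whether the pick actually flips between folds depends on the seed. The point is only that features 0 and 1 score almost identically, so small sampling differences decide which one is chosen.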

ben-heil

Looks good! One potential way to cut down on collinearity is to use a subset of genes, such as the LINCS L1000 landmark genes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5990023/). I think your plan to aggregate at the network/pathway level is probably better, though.

pancancer_evaluation/utilities/analysis_utilities.py

07_coefficient_analysis.ipynb

jjc2718 added 7 commits October 12, 2020 15:25

update results with new data + add error catching code

4f26ee1

add analysis of coefficient stability between folds

6827080

add inter-cancer, same gene coefficient overlap analysis

32d1a01

add titles and vary single-cancer/pancancer

34e2bbb

add option for single cancer vs. pan-cancer coefficient analysis

7511565

now with plots included

788b788

add some bells and whistles to coefficient analysis script

5a0e74a

jjc2718 requested a review from ben-heil October 15, 2020 14:27

remove hardcoded path

e87714e

ben-heil approved these changes Oct 16, 2020

add note to file_utilities about identifier format

652aa7e

This was referenced Oct 19, 2020

Coefficient stability: gene dropout/pathway aggregation experiments #32

Open

Cross-validation for imbalanced label case #4

Open

jjc2718 merged commit 6712de2 into greenelab:master Oct 19, 2020

jjc2718 deleted the coef_stability branch October 19, 2020 20:24
