Add script to do exploratory data analysis on TCGA data #17

jjc2718 · 2020-08-26T20:09:17Z

Closes #9.

This PR adds a notebook to answer some questions about the TCGA data before fitting any models:

If we filter cancer type/mutation combinations to >5% of samples mutated, how many are left?
How many of the genes in the gene expression data are "important"? If we filter genes in the expression matrix (say, to those with the highest mean absolute deviation), how much do we expect this to affect our predictions?

The answers are pretty much what we expected, but it will be good to have this info going forward.

ajlee21

Really nice PR. Really interesting to learn about! Love that this is your "looking at the data" notebook

Just wanted to clarify, when you say >5% mutated, >50 samples mutated this means you are looking for cancer datasets where all genes are mutated in at least 50 samples? So 5% mutated is equivalent to 50 samples being mutated? This is mutated with respect to some reference I assume. This is probably my lack of cancer background, but is there a reason you are using this filter?
Could add a hyperlink here 00_download_data.ipynb
pancancer_data is a df of dfs?
I assume pancancer_data contains the metadata associated with rnaseq_df?
Not sure if you want to update this print statement to refer to other genetic variation types in addition to mutations: print('Genes with mutation information: {}'.format(len(overlap_genes)))
I assume the numbers in cancer_types_df refer to the number of samples per cancer type
Still learning about the best way to define paths, so perhaps this is obvious. But is there a reason you're using resolve.() instead of os.join or defining this path in your cfg?
You test_df is your filtered dataset . Nice use of pandas! I am appreciating pandas more after having lectured about it!
Not sure if this is a worth while validation. But would you consider looking at the genes that passed your filter and do a quick sanity check that they correspond to those found to be highly mutated in publications?
What is your takeaway from the mutation plot trend? I can see that your unincluded diseases are clustered by the origin as expected. Though there are some points that fall below the 50 sample threshold, do you know why that is?
Nice bar plot..I assume you are comparing this distribution to some publication, but might be nice to make a note for your future self
Nice explanation on your PCA plots.
Looks like you need < 20 PCs to explain > 0.99 of the variance in the rnaseq data. Do you think this is the same number of PCs you'd need if you looked at different cancer types individually...just me being curious. I wonder if some have more complex structures than others. Do you know which genes are highly weighted in these components that explain the variance? Perhaps that is another avenue for distinguishing cancer types
A little hard to tell based because we're looking at data compressed onto 2 axis, but 100 PCs looks like it might have slightly better separation
I understand the logic of using the MAD genes. This reminds of my generic-expression-pattern repo, where we find that some genes are commonly DE. We haven't looked into the reason but I wonder if these MAD genes are those that are common across all these cancers because they're hyper-responsive. I'm not sure if you're planning to use this moving forward, but if you are trying to find a signature that distinguishes between cancer types, you could try using a supervised learning approach on your compressed PC's gene expression data?

jjc2718 · 2020-08-27T15:27:42Z

Thanks for the detailed feedback! Responses below:

Just wanted to clarify, when you say >5% mutated, >50 samples mutated this means you are looking for cancer datasets where all genes are mutated in at least 50 samples? So 5% mutated is equivalent to 50 samples being mutated? This is mutated with respect to some reference I assume. This is probably my lack of cancer background, but is there a reason you are using this filter?

No worries, this is a bit confusing if you're not familiar with the project. We're filtering gene/cancer type combinations. So in other words, say we're interested in TP53 mutations - we're looking for cancer types where at least 5% of samples and at least 50 samples have a mutation in TP53 (so the 5% and 50 samples are two separate filters, if the cancer type fails either of these criteria we drop it).

Then we do this individually for all other genes, in this case all other genes in the top 50 most mutated genes across all of TCGA (although I'll likely generalize this to more genes in the future).

The reason for this is more a ML/prediction-related reason than a cancer reason: we think it will be hard to build a good classifier if we have very few positively labeled samples, so we filter out those cases as a preprocessing step. In the future, we could do something special to handle these cases (like apply outlier detection methods or classifiers specifically designed for imbalanced labels) but for now we're just ignoring them and focusing on the cases where we have more balanced labels.

pancancer_data is a df of dfs?

It's just a tuple of dfs. I'll document this in the script, I agree that it is a bit confusing.

I assume pancancer_data contains the metadata associated with rnaseq_df?

Yep, exactly (info about mutations, mutation burden, cancer type, etc. for the samples in rnaseq_df).

Not sure if you want to update this print statement to refer to other genetic variation types in addition to mutations: print('Genes with mutation information: {}'.format(len(overlap_genes)))

At the moment, I'm just looking at point mutations. Integrating copy number information gets a bit complicated (we need to know whether each gene is an oncogene or tumor suppressor or neither to integrate that information), so I'm not using it here but I'd definitely like to add it in the future.

I assume the numbers in cancer_types_df refer to the number of samples per cancer type

Yep, exactly!

Still learning about the best way to define paths, so perhaps this is obvious. But is there a reason you're using resolve.() instead of os.join or defining this path in your cfg?

I've been using pathlib everywhere in this repo (I think Milton suggested it in one of my past reviews) so I'm just using it here (rather than os.path) for consistency. I could define it in the config, but I don't use this file that much elsewhere so I don't think it's necessary (for now).

You test_df is your filtered dataset

I meant to change this name to something more informative, thanks for reminding me (and yes, it's the filtered dataset)

Not sure if this is a worth while validation. But would you consider looking at the genes that passed your filter and do a quick sanity check that they correspond to those found to be highly mutated in publications?

What is your takeaway from the mutation plot trend? I can see that your unincluded diseases are clustered by the origin as expected. Though there are some points that fall below the 50 sample threshold, do you know why that is?

Looking at publications would definitely be smart! Created #18 to look into this.

My takeaways are: (I'll add these to the script too)

There are more gene/cancer type combinations that aren't filtered out than we expected (Casey guessed that most combinations would be filtered out). Since we're using the top 50 most mutated genes this isn't that surprising, but still good to know.
Some of the highly mutated outliers from this plot make sense based on other studies/previous knowledge (e.g. TP53 in ovarian cancer, KRAS in pancreatic cancer, TTN in melanoma as a marker of high background mutation rate), so it's a useful sanity check.

Nice bar plot..I assume you are comparing this distribution to some publication, but might be nice to make a note for your future self

It's mostly just a general sanity check - mutation distributions in lots of cancer types/previous publications have been observed as "heavy tailed" or "hockey stick-shaped" (a few common mutations and lots of less common ones). We see that here too for many cancer types, which makes sense.

Looks like you need < 20 PCs to explain > 0.99 of the variance in the rnaseq data. Do you think this is the same number of PCs you'd need if you looked at different cancer types individually...just me being curious. I wonder if some have more complex structures than others. Do you know which genes are highly weighted in these components that explain the variance? Perhaps that is another avenue for distinguishing cancer types

Good question! I haven't looked at cancer types individually. Created #19 to remind me about this.

A little hard to tell based because we're looking at data compressed onto 2 axis, but 100 PCs looks like it might have slightly better separation

Yeah, it's hard to tell. The plot does look a bit different than for higher numbers of genes, though, which tells me that at least some of the signal is being removed or altered at that point.

I understand the logic of using the MAD genes. This reminds of my generic-expression-pattern repo, where we find that some genes are commonly DE. We haven't looked into the reason but I wonder if these MAD genes are those that are common across all these cancers because they're hyper-responsive. I'm not sure if you're planning to use this moving forward, but if you are trying to find a signature that distinguishes between cancer types, you could try using a supervised learning approach on your compressed PC's gene expression data?

Yeah, this is definitely something we're thinking about. It's a very similar idea to what Greg did in his BioBombe paper, but the cross-validation he did was different (his train/test sets were both stratified by cancer type, but we're holding out single cancer types and comparing performance). It would be interesting to try his approach here as well.

Generally, his experiments showed that using raw gene expression data performed better for the mutation prediction problem than any method of compressing the expression data (e.g. PCA, NMF, VAE), but we're not sure if that would be the case here as well.

jjc2718 added 9 commits August 12, 2020 15:32

add notebook to do some exploratory data analysis

fe05eea

Merge branch 'classify_refactor' into explore_filters

2e34084

better phrasing/documentation for things

9d7bffe

Merge branch 'master' into explore_filters

693ccfd

plot mutation proportions vs. number of mutations in TCGA top 50

ba659f7

add better documentation to script for exploring data

6411f6a

add PCA of most common cancer types to EDA script

c1b0fb2

subset genes by MAD and take PCA, much more stable (unsurprisingly)

f8bbd19

clean up EDA script and add some documentation

08d7c97

jjc2718 requested a review from ajlee21 August 26, 2020 20:11

small phrasing change

89bf89f

ajlee21 approved these changes Aug 27, 2020

View reviewed changes

This was referenced Aug 27, 2020

Do highly mutated examples from explore_data script make sense? #18

Open

PCA of individual cancer types #19

Open

jjc2718 added 2 commits August 27, 2020 11:31

rename filtered dataframe and add some additional clarification

a5d6954

rerun notebook to get rid of error

8ae9ff4

jjc2718 merged commit dcefc31 into greenelab:master Aug 27, 2020

jjc2718 deleted the explore_filters branch August 27, 2020 15:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add script to do exploratory data analysis on TCGA data #17

Add script to do exploratory data analysis on TCGA data #17

jjc2718 commented Aug 26, 2020

ajlee21 left a comment

jjc2718 commented Aug 27, 2020 •

edited

Loading

Add script to do exploratory data analysis on TCGA data #17

Add script to do exploratory data analysis on TCGA data #17

Conversation

jjc2718 commented Aug 26, 2020

ajlee21 left a comment

Choose a reason for hiding this comment

jjc2718 commented Aug 27, 2020 • edited Loading

jjc2718 commented Aug 27, 2020 •

edited

Loading