Add script to do exploratory data analysis on TCGA data #17
Conversation
Really nice PR, and really interesting to learn about! Love that this is your "looking at the data" notebook.
-
Just wanted to clarify, when you say
>5% mutated, >50 samples mutated
this means you are looking for cancer datasets where all genes are mutated in at least 50 samples? So 5% mutated is equivalent to 50 samples being mutated? This is mutated with respect to some reference I assume. This is probably my lack of cancer background, but is there a reason you are using this filter? -
Could add a hyperlink here to `00_download_data.ipynb` -
`pancancer_data` is a df of dfs? -
I assume `pancancer_data` contains the metadata associated with `rnaseq_df`? -
Not sure if you want to update this print statement to refer to other genetic variation types in addition to mutations:
`print('Genes with mutation information: {}'.format(len(overlap_genes)))`
-
I assume the numbers in `cancer_types_df` refer to the number of samples per cancer type -
Still learning about the best way to define paths, so perhaps this is obvious. But is there a reason you're using `.resolve()` instead of `os.path.join`, or defining this path in your cfg? -
Your `test_df` is your filtered dataset. Nice use of pandas! I am appreciating pandas more after having lectured about it! -
Not sure if this is a worthwhile validation, but would you consider looking at the genes that passed your filter and doing a quick sanity check that they correspond to those found to be highly mutated in publications?
-
What is your takeaway from the mutation plot trend? I can see that your unincluded diseases are clustered by origin as expected. There are some points that fall below the 50-sample threshold, though; do you know why that is?
-
Nice bar plot. I assume you are comparing this distribution to some publication, but it might be nice to make a note for your future self.
-
Nice explanation on your PCA plots.
-
Looks like you need < 20 PCs to explain > 0.99 of the variance in the rnaseq data. Do you think this is the same number of PCs you'd need if you looked at different cancer types individually? Just me being curious. I wonder if some have more complex structures than others. Do you know which genes are highly weighted in these components that explain the variance? Perhaps that is another avenue for distinguishing cancer types.
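In case it's useful, here's a numpy-only sketch of how the "how many PCs to reach 99% of variance" question could be checked per cancer type (`X` would be the expression matrix for one cancer type; the function name is made up):

```python
import numpy as np

def n_pcs_for_variance(X, threshold=0.99):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`.

    X: 2D array, samples x genes.
    """
    # Center each gene, then take singular values; squared singular
    # values are proportional to per-component explained variance.
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var_ratio = s ** 2 / np.sum(s ** 2)
    cumvar = np.cumsum(var_ratio)
    # First index where the cumulative ratio reaches the threshold
    return int(np.searchsorted(cumvar, threshold) + 1)
```

Running this once per cancer type (subsetting `rnaseq_df` by the cancer-type labels) would show whether some types need more components than others.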
-
A little hard to tell because we're looking at data compressed onto 2 axes, but 100 PCs looks like it might have slightly better separation.
-
I understand the logic of using the MAD genes. This reminds me of my generic-expression-pattern repo, where we find that some genes are commonly DE. We haven't looked into the reason, but I wonder if these MAD genes are common across all these cancers because they're hyper-responsive. I'm not sure if you're planning to use this moving forward, but if you are trying to find a signature that distinguishes between cancer types, you could try using a supervised learning approach on your PC-compressed gene expression data?
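For reference, top-MAD gene selection can be sketched in a few lines (a sketch only; the DataFrame orientation, samples x genes, and the names are assumptions):

```python
import pandas as pd

def top_mad_genes(expr_df, k=5000):
    """Return the names of the k genes with the largest median
    absolute deviation (MAD) across samples.

    expr_df: DataFrame, samples as rows, genes as columns.
    """
    # MAD per gene: median of absolute deviations from the gene's median
    mad = (expr_df - expr_df.median()).abs().median()
    return mad.sort_values(ascending=False).index[:k]
```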
Thanks for the detailed feedback! Responses below:
No worries, this is a bit confusing if you're not familiar with the project. We're filtering gene/cancer type combinations. So in other words, say we're interested in TP53 mutations - we're looking for cancer types where at least 5% of samples and at least 50 samples have a mutation in TP53 (so the 5% and 50 samples are two separate filters, if the cancer type fails either of these criteria we drop it). Then we do this individually for all other genes, in this case all other genes in the top 50 most mutated genes across all of TCGA (although I'll likely generalize this to more genes in the future). The reason for this is more a ML/prediction-related reason than a cancer reason: we think it will be hard to build a good classifier if we have very few positively labeled samples, so we filter out those cases as a preprocessing step. In the future, we could do something special to handle these cases (like apply outlier detection methods or classifiers specifically designed for imbalanced labels) but for now we're just ignoring them and focusing on the cases where we have more balanced labels.
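In pandas, the two-part filter described above might look something like this (the DataFrame and column names are made up purely for illustration):

```python
import pandas as pd

# Toy per-(gene, cancer type) mutation counts; all values illustrative.
counts = pd.DataFrame({
    'gene':        ['TP53', 'TP53', 'TP53'],
    'cancer_type': ['LUAD', 'BRCA', 'THCA'],
    'n_mutated':   [230, 340, 20],
    'n_samples':   [500, 1000, 500],
})

# Keep a (gene, cancer type) pair only if BOTH criteria hold:
# at least 5% of samples mutated AND at least 50 mutated samples.
keep = ((counts['n_mutated'] / counts['n_samples'] >= 0.05)
        & (counts['n_mutated'] >= 50))
filtered = counts[keep]
```

Here THCA would be dropped for TP53 (only 4% / 20 samples mutated), while LUAD and BRCA pass both thresholds.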
It's just a tuple of dfs. I'll document this in the script, I agree that it is a bit confusing.
Yep, exactly (info about mutations, mutation burden, cancer type, etc. for the samples in `rnaseq_df`).
At the moment, I'm just looking at point mutations. Integrating copy number information gets a bit complicated (we need to know whether each gene is an oncogene or tumor suppressor or neither to integrate that information), so I'm not using it here but I'd definitely like to add it in the future.
Yep, exactly!
I've been using
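For context, the two path styles under discussion look roughly like this (the file names are illustrative, not the project's actual paths):

```python
import os
from pathlib import Path

# pathlib style: resolve() turns a relative path into an absolute one
data_dir = Path('data').resolve()
rnaseq_file = data_dir / 'rnaseq.tsv'

# os.path style: join string pieces, then make the result absolute
rnaseq_file_alt = os.path.abspath(os.path.join('data', 'rnaseq.tsv'))
```

One practical difference: `Path.resolve()` also resolves symlinks, while `os.path.abspath` only normalizes the path string. Either way, defining the base directory once in a config module avoids repeating it in every notebook.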
I meant to change this name to something more informative, thanks for reminding me (and yes, it's the filtered dataset)
Looking at publications would definitely be smart! Created #18 to look into this. My takeaways are: (I'll add these to the script too)
It's mostly just a general sanity check - mutation distributions in lots of cancer types/previous publications have been observed as "heavy tailed" or "hockey stick-shaped" (a few common mutations and lots of less common ones). We see that here too for many cancer types, which makes sense.
Good question! I haven't looked at cancer types individually. Created #19 to remind me about this.
Yeah, it's hard to tell. The plot does look a bit different than for higher numbers of genes, though, which tells me that at least some of the signal is being removed or altered at that point.
Yeah, this is definitely something we're thinking about. It's a very similar idea to what Greg did in his BioBombe paper, but the cross-validation he did was different (his train/test sets were both stratified by cancer type, but we're holding out single cancer types and comparing performance). It would be interesting to try his approach here as well. Generally, his experiments showed that using raw gene expression data performed better for the mutation prediction problem than any method of compressing the expression data (e.g. PCA, NMF, VAE), but we're not sure if that would be the case here as well.
Closes #9.
This PR adds a notebook to answer some questions about the TCGA data before fitting any models.
The answers are pretty much what we expected, but it will be good to have this info going forward.