Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

UMAPs without mitochondria #1658

Merged
merged 42 commits into from
Feb 15, 2023
Merged

Conversation

sjspielman
Copy link
Member

This PR explores, as part of manuscript revisions, how dimension reduction might change if we do not include mitochondrial genes, as inspired by this issue: #1601
Specifically, it seems that one sequencing center has some problematic mitochondrial gene expression distributions that may have resulted from using a different kit whose identity may also be lost to the ages.

There are two main changes in this module:

  1. I updated the transcriptomic-dimension-reduction module to include an option to remove mitochondrial genes. I added this run for stranded RSEM with log2 (skipping tnse) in order to generate data for this exploration. But, I did not add this run to the 02 plot script , since we don't really need those.

  2. I added a notebook to transcriptomic-dimension-reduction to explore the mitochondrial expression, first via just looking at FPKMs of mito genes for relevant diagnoses and then via UMAPs made with and without mitochondrial genes. I don't see anything too different here, but welcome thoughts on interpretation. Here's that notebook:
    05-seq-center-mitochondrial-genes.nb.html.zip

Note that I had some merge conflicts bringing master in, so I had to open and clean up some the tumor purity scripts (including re-rendering a notebook), and along the way there VS Code ate lots of EOL spaces, so that's why those diffs are here!

@sjspielman sjspielman added the revision Related to manuscript revisions label Jan 26, 2023
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me code-wise, but I am a bit concerned about whether the UMAPs are what they should be, given https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1658/files#r1089064606

I'll wait for confirmation/rerunning before approval...

tidyr::gather("Kids_First_Biospecimen_ID",
"fpkm",
tidyselect::starts_with("BS_")) %>%
dplyr::mutate(log2_fpkm = log2(fpkm+1)) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to note my minor discomfort with +1 as a zero-correction for FPKM, but it does not really matter in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, this is all up in OpenPBTA :/

@sjspielman sjspielman requested a review from jashapiro January 30, 2023 16:03
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to approve this and say that this looks good to me, but then I realized that this may not actually be sufficient for "removing" the mitochondrial genes. The FPKM calculation will still be using the mitochondrial genes in the denominator, which will affect the expression values for all of the other genes before we get to this point. So we may want to remove the mitochondrial genes from the count files, then recalculate FPKM.

I think we can do this using the count matrices that we have, though I believe the original the FPKM calculation occurs upstream, so we might not get precisely the same results if we reimplement the FPKM calculations. (So we should recalculate with and without MT) Before attempting this, we should probably discuss whether it is the right thing to spend time on at this moment.

@jashapiro
Copy link
Member

I was about to approve this and say that this looks good to me, but then I realized that this may not actually be sufficient for "removing" the mitochondrial genes. The FPKM calculation will still be using the mitochondrial genes in the denominator, which will affect the expression values for all of the other genes before we get to this point. So we may want to remove the mitochondrial genes from the count files, then recalculate FPKM.

I think we can do this using the count matrices that we have, though I believe the original the FPKM calculation occurs upstream, so we might not get precisely the same results if we reimplement the FPKM calculations. (So we should recalculate with and without MT) Before attempting this, we should probably discuss whether it is the right thing to spend time on at this moment.

My main followup thought here is that we can probably do a rough version of this using the existing FPKM or TPM matrices: With TPM, I think we can re-sum the TPM values for each sample, excluding mitochondrial genes, divide by the sum/10^6 to rescale. That won't be quite right for FPKM, but it will probably be close enough. Still, I'd probably start with TPM for this notebook (calculating UMAP from the original TPM and then from the mito-excluded & rescaled TPM) as we are just looking to see if there is an effect of removing the mitochondrial genes. If we feel we need to go back to FPKM calculations later for better consistency with previous analyses, we can still do that.

@sjspielman
Copy link
Member Author

sjspielman commented Feb 7, 2023

@jashapiro We have some TPM UMAPs 😄
Here's the re-rendered notebook exploring the results. The plots don't look wildly different from FPKM which is a nice little sanity check to see. Overall trends look roughly the same with(out) mitochondrial genes.
05-seq-center-mitochondrial-genes.nb.html.zip

For this review, please let* (edit) me know in particular any feedback on how I did the TPM prep (scripts being called from right places, and scratch/ export choice). I also kept the COUNT_THRESHOLD bash variable for filtering out low counts at 100, which seems more or less fine in the end.

@sjspielman sjspielman requested a review from jashapiro February 7, 2023 21:57
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. Glad to see that renomalizing doesn't really have much of an effect, and I like the additional comment you added about how this doesn't correct for everything that we might see from protocol differences.

The html file in the repo is not up to date, however, so that needs to be added before approving. I don't know if other data files are up to date, but I image a full rerun will address both.

Minor comments that do not be to be addressed here:
I probably wouldn't have created a separate script for the TPM matrix filtering, as you already had most of the filtering code in the main script, but I don't see any major problem for this check.

If this were more than a one-off, I might also suggest not doing the gather/spread, but instead converting to a matrix and using matrix ops for efficiency, but for this it really doesn't matter.

It also seems odd to me that the umap tsv files have all of the metadata in them, but not worth the effort to change here.

Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ${NOMITO_TPM_FILE} \
--metadata ${METADATA} \
--remove_mito_genes \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this isn't needed here? The mito genes should have already been removed, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, and the flag in general is not needed at all anymore if we're not doing this with FPKM. I'll sort that all out.

@sjspielman
Copy link
Member Author

@jashapiro As of now, CI is passing through this module, so it's ready for another look! I updated the module to wholesale remove the remove_mito_genes flag since it's not needed at all anymore. I also re-ran the module to properly generate the HTMLs with the full data (not the CI data as was there before, woops!). Here’s the notebook:
05-seq-center-mitochondrial-genes.nb.html.zip

Since the tumor purity had a diff (see my last bit here #1658 (comment)), I also decided just to be safe to re-run that module as well. All the notebook diffs are HTML miscellany.

@sjspielman sjspielman requested review from jashapiro and removed request for jaclyn-taroni February 9, 2023 19:02
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Still not 100% sure what to do with this information, but that is separate from the analysis itself.

@sjspielman sjspielman merged commit f519494 into AlexsLemonade:master Feb 15, 2023
@sjspielman sjspielman deleted the umap-sans-mito branch February 15, 2023 14:29
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
revision Related to manuscript revisions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants