This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
UMAPs without mitochondria #1658
Merged
42 commits
6bb2e6d  Module readme (sjspielman)
044b928  Initiated tumor purity module. Added run script and included in CI (sjspielman)
0170550  module to readme (sjspielman)
04784e0  bash scripts in fact do not use .R extensions (sjspielman)
f52472d  Added filtering distributions (sjspielman)
a1495ec  Merge branch 'master' into initiate-tumor-purity-module (sjspielman)
4cb29e8  Merge branch 'master' into initiate-tumor-purity-module (sjspielman)
7b2109c  Updated tumor purity to include extraction_type and output results fo… (sjspielman)
174b90c  Add results TSV into readme since this is likely to be used by other … (sjspielman)
260a123  Added another subsection and result file for thresholding PER cancer … (sjspielman)
f91caff  Add option to remove mitochondrial genes from dimension reduction and… (sjspielman)
a9fca41  Accidentally had used polya and found other bug. Fixed code and regen… (sjspielman)
a27c285  notebook to plot UMAP without mito as well as mito fpkm jitter (sjspielman)
b12383e  plot styling (sjspielman)
977a86c  merge in master and fix conflicts (sjspielman)
b818e98  remove sneaky zipped html (sjspielman)
f6fea2f  remove old result files from when I started this branch (sjspielman)
b724636  add notebook to module bash scripts (sjspielman)
845bd18  small title tweak in nb (sjspielman)
71523e3  No need to make the plots with this data, and no need to run t-sne (sjspielman)
1da693c  Merge branch 'master' into umap-sans-mito (sjspielman)
677b9f4  Apply suggestions from code review (sjspielman)
54ef86d  add conclusions and re-render (sjspielman)
cc82609  updated result file with properly used flag (sjspielman)
efb1936  small comment update (sjspielman)
f3deb28  Update analyses/transcriptomic-dimension-reduction/05-seq-center-mito… (sjspielman)
a89ec72  Merge branch 'master' into umap-sans-mito (sjspielman)
7fb7286  merge reintroduced a straggling backtick that is now re-purged (sjspielman)
62f57ca  script for TPM collapsing and filtering to nomito. We may not need to… (sjspielman)
60b71b7  updated script to no longer collapse (sjspielman)
5d0aacc  We now have TPM results, not collapsed, and an updated notebook to ex… (sjspielman)
738e303  woops they were both nomito. fixed, and conclusions are the same (sjspielman)
88a994f  Add tpm without mito to ci (sjspielman)
752850c  Merge branch 'master' into umap-sans-mito (sjspielman)
2388b73  Update analyses/transcriptomic-dimension-reduction/scripts/prepare-tp… (sjspielman)
9a8be2a  Removed remove_mito_genes flag and re-run module with correct data (sjspielman)
377375c  Add 05 notebook to README (sjspielman)
7d8023e  need to remove flag from CI script (sjspielman)
6879f0d  Removed description of collapse-rnaseq approach from comments since i… (sjspielman)
05d53f4  ACTUALLY use the right data - the scratch was outdated, now it is fixed (sjspielman)
1e73b21  reran to be safe (sjspielman)
bc61fcb  Merge branch 'master' into umap-sans-mito (sjspielman)
240 changes: 240 additions & 0 deletions
analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,240 @@ | ||
--- | ||
title: "Explore mitochondrial gene effect on UMAP plots" | ||
author: "S. Spielman for CCDL" | ||
date: "2023" | ||
output: | ||
html_notebook: | ||
toc: TRUE | ||
toc_float: TRUE | ||
--- | ||
|
||
This notebook aims to explore some potential sequencing center biases that were raised in this OpenPBTA issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601 | ||
This issue notes that some of the samples sequenced at `BGI@CHOP` had higher-than-expected expression for certain mitochondrial genes, specifically RNR1 and RNR2, and somewhat lower-than-expected expression for other mitochondrial genes. | ||
|
||
To ensure appropriate normalization, the UMAPs here were created from TPM data rather than FPKM, and for the no-mito UMAP the mitochondrial genes were removed from the TPM results themselves. | ||
|
||
## Setup | ||
|
||
```{r setup, include=FALSE} | ||
library(magrittr) | ||
library(ggplot2) | ||
|
||
set.seed(2023) | ||
|
||
# set overall theme | ||
theme_set(ggpubr::theme_pubr() + | ||
# Legend tweaks for legibility | ||
theme(legend.position = "right", | ||
legend.direction = "vertical", | ||
legend.text = element_text(size = rel(0.5)), | ||
legend.title = element_text(size = rel(0.5)), | ||
legend.key.size = unit(0.25, "cm") | ||
)) | ||
|
||
# figure settings | ||
knitr::opts_chunk$set(fig.width = 8) | ||
``` | ||
|
||
|
||
Define directories and file names: | ||
```{r} | ||
# Directories | ||
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git")) | ||
data_dir <- file.path(root_dir, "data") | ||
tumor_dir <- file.path(root_dir, "analyses", "tumor-purity-exploration") | ||
umap_dir <- file.path( | ||
root_dir, | ||
"analyses", | ||
"transcriptomic-dimension-reduction", | ||
"results" | ||
) | ||
palette_dir <- file.path(root_dir, "figures", "palettes") | ||
|
||
# UMAP files with and without mitochondrial genes: | ||
tpm_umap <- file.path(umap_dir, "tpm_stranded_all_log_umap_scores_aligned.tsv") | ||
tpm_no_mito_umap <- file.path(umap_dir, "tpm_stranded_nomito_log_umap_scores_aligned.tsv") | ||
|
||
# palette mapping file | ||
pal_file <- file.path(palette_dir, "broad_histology_cancer_group_palette.tsv") | ||
|
||
# FPKM | ||
expression_file <- file.path(data_dir, "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds") | ||
|
||
metadata_file <- file.path(data_dir, "pbta-histologies.tsv") | ||
``` | ||
|
||
```{r} | ||
# Read in palette data | ||
palette_mapping_df <- readr::read_tsv(pal_file) %>% | ||
dplyr::select(broad_histology, broad_histology_display, broad_histology_hex, | ||
cancer_group, cancer_group_display, cancer_group_hex) | ||
|
||
# Read in and prep UMAP data | ||
# helper function | ||
readr_prep_umap <- function(filename, | ||
pal_df = palette_mapping_df) { | ||
readr::read_tsv(filename) %>% | ||
dplyr::rename(UMAP1 = X1, UMAP2 = X2) %>% | ||
dplyr::inner_join(pal_df) %>% | ||
dplyr::mutate(seq_center = forcats::fct_relevel(seq_center, | ||
"BGI", "BGI@CHOP Genome Center", "NantOmics")) | ||
} | ||
|
||
umap_all <- readr_prep_umap(tpm_umap) | ||
umap_no_mito <- readr_prep_umap(tpm_no_mito_umap) | ||
|
||
# Read in expression and convert to data frame | ||
expression_df <- readr::read_rds(expression_file) %>% | ||
tibble::as_tibble(rownames = "gene_symbol") | ||
|
||
# Read in metadata | ||
metadata_df <- readr::read_tsv(metadata_file) | ||
``` | ||
|
||
|
||
## Explore mitochondrial genes | ||
|
||
For mitochondrial genes, do we see expression patterns for BGI@CHOP samples consistent with those posted in the issue linked above? | ||
|
||
```{r} | ||
# First, which diagnoses were sequenced at BGI or BGI@CHOP? | ||
relevant_groups <- metadata_df %>% | ||
dplyr::filter(RNA_library == "stranded", | ||
stringr::str_starts(seq_center, "BGI")) %>% | ||
tidyr::drop_na(broad_histology) %>% | ||
dplyr::pull(broad_histology) %>% | ||
unique() | ||
|
||
# Data frame of log2(FPKM + 1) of mitochondrial genes for relevant diagnoses | ||
mito_seq_center_df <- expression_df %>% | ||
dplyr::filter(stringr::str_starts(gene_symbol, "MT-")) %>% | ||
tidyr::gather("Kids_First_Biospecimen_ID", | ||
"fpkm", | ||
tidyselect::starts_with("BS_")) %>% | ||
dplyr::mutate(log2_fpkm = log2(fpkm+1)) %>% | ||
Just want to note my minor discomfort with
For better or worse, this is all up in OpenPBTA :/ |
||
# get diagnoses across samples | ||
dplyr::inner_join( | ||
dplyr::select(metadata_df, | ||
Kids_First_Biospecimen_ID, | ||
broad_histology, | ||
seq_center) | ||
) %>% | ||
# filter to only relevant groups | ||
dplyr::filter(broad_histology %in% relevant_groups) %>% | ||
# get display version of diagnosis | ||
dplyr::inner_join( | ||
dplyr::select(palette_mapping_df, | ||
broad_histology, | ||
broad_histology_display) | ||
) | ||
|
||
|
||
ggplot(mito_seq_center_df) + | ||
aes(x = gene_symbol, y = log2_fpkm, color = seq_center, size = seq_center) + | ||
geom_jitter(width = 0.15) + | ||
# emphasize the BGI points! | ||
scale_size_manual(values = c(2, 2, 0.5)) + | ||
facet_wrap(~broad_histology_display) + | ||
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4)) | ||
|
||
``` | ||
|
||
We see a couple of trends in the plot above that are consistent with the posted issue: | ||
|
||
- `BGI@CHOP Genome Center` samples have dramatically higher RNR1 and RNR2, but substantially lower expression for the other mitochondrial genes, across all diagnoses. | ||
- `BGI` samples tend to have lower RNR1 and RNR2 expression, and have lower expression for other genes specifically for embryonal tumors. | ||
|
||
Thus, `BGI@CHOP` (not `BGI`) samples are the main "concern" here. | ||
|
||
|
||
## UMAP | ||
|
||
How do UMAPs with and without mitochondrial genes compare? | ||
We might expect particular changes in embryonal, ependymal, and HGG since BGI samples are mostly from those diagnoses. | ||
|
||
```{r} | ||
# Function to plot UMAP and set up color palettes. | ||
plot_umap <- function(df, color_group, color_palette, title) { | ||
ggplot(df) + | ||
aes(x = UMAP1, y = UMAP2, | ||
shape = seq_center, | ||
fill = {{color_group}}, | ||
color = seq_center, | ||
alpha = seq_center) + | ||
geom_point(size = 2.5) + | ||
scale_shape_manual(values = c(22, 24, 21)) + | ||
scale_fill_manual(values = color_palette, | ||
# This "shape" override is needed for legend fill colors to actually work | ||
guide = guide_legend(override.aes = list(shape = 21, size = 2))) + | ||
scale_color_manual(values = c("grey70", "black", "black"), | ||
# These overrides ensure shapes appear all in same alpha/color in legend | ||
guide = guide_legend(override.aes = list(color = "black", alpha = 1, size = 2))) + | ||
scale_alpha_manual(values = c(0.3, 0.5, 0.5)) + | ||
ggtitle(title) + | ||
# tweaks for legibility | ||
theme(legend.position = "bottom", | ||
legend.text = element_text(size = rel(0.5)), | ||
legend.title = element_text(size = rel(0.5)), | ||
legend.key.size = unit(0.25, "cm") | ||
) | ||
} | ||
|
||
# Another plotting helper function to cowplot some plots with a shared legend | ||
combine_plots <- function(p1, p2) { | ||
p_legend <- cowplot::get_legend(p1) | ||
|
||
plot_row <- cowplot::plot_grid(p1 + theme(legend.position = "none"), | ||
p2 + theme(legend.position = "none"), | ||
nrow = 1, | ||
rel_widths = 0.95) | ||
|
||
full_grid <- cowplot::plot_grid(plot_row, p_legend, nrow = 2, rel_heights = c(1, 0.2)) | ||
|
||
full_grid | ||
} | ||
|
||
|
||
# set up color palettes | ||
bh_df <- palette_mapping_df %>% | ||
dplyr::select(broad_histology_display, broad_histology_hex) %>% | ||
dplyr::distinct() | ||
pal_bh <- bh_df$broad_histology_hex | ||
names(pal_bh) <- bh_df$broad_histology_display | ||
|
||
cg_df <- palette_mapping_df %>% | ||
dplyr::select(cancer_group_display, cancer_group_hex) %>% | ||
dplyr::distinct() %>% | ||
tidyr::drop_na() | ||
pal_cg <- cg_df$cancer_group_hex | ||
names(pal_cg) <- cg_df$cancer_group_display | ||
``` | ||
|
||
|
||
Let's plot some UMAPs! | ||
|
||
```{r, fig.width = 14, fig.height = 8} | ||
p1 <- plot_umap(umap_all, broad_histology_display, pal_bh, "Includes all genes. Colored by broad histology.") | ||
p2 <- plot_umap(umap_no_mito, broad_histology_display, pal_bh, "Includes only non-mito genes. Colored by broad histology.") | ||
combine_plots(p1, p2) | ||
|
||
p1 <- plot_umap(umap_all, cancer_group_display, pal_cg, "Includes all genes. Colored by cancer group.") | ||
p2 <- plot_umap(umap_no_mito, cancer_group_display, pal_cg, "Includes only non-mito genes. Colored by cancer group.") | ||
combine_plots(p1, p2) | ||
``` | ||
|
||
## Conclusions | ||
|
||
- Samples from the `BGI@CHOP` sequencing center have unique mitochondrial gene distributions. | ||
- Removing mitochondrial genes from the UMAP does not have a strong qualitative effect on whether broad histologies or cancer groups tend to cluster together. | ||
The UMAPs created here also look broadly similar to those made with FPKM. | ||
Visually, it seems the "mixed histology" groupings identified in the `04-explore-sequencing-center-effects.Rmd` | ||
notebook are still grouped together in these UMAPs. | ||
|
||
Overall, this suggests that mitochondrial genes alone will not have a strong influence on UMAP visualizations. | ||
However, there may still be protocol differences associated with `BGI@CHOP` samples that influence expression values more generally, and cannot be corrected by removing only mitochondrial genes. | ||
|
||
## sessionInfo | ||
|
||
```{r print session info} | ||
sessionInfo() | ||
``` |
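The wide-to-long reshaping step in the notebook's mitochondrial gene chunk uses `tidyr::gather()`, which tidyr documents as superseded by `tidyr::pivot_longer()`. As a minimal sketch only, the same step could be written as below. The toy `expression_df` and its values are hypothetical stand-ins for the real FPKM table; the column names `Kids_First_Biospecimen_ID`, `fpkm`, and `log2_fpkm` follow the notebook.

```r
library(magrittr)

# Hypothetical toy expression table standing in for the collapsed FPKM data:
# one row per gene symbol, one column per biospecimen ("BS_" prefix).
expression_df <- tibble::tibble(
  gene_symbol = c("MT-RNR1", "MT-CO1", "TP53"),
  BS_A = c(1023, 255, 7),
  BS_B = c(3, 63, 15)
)

mito_long_df <- expression_df %>%
  # Keep only mitochondrial genes, identified by the "MT-" symbol prefix
  dplyr::filter(startsWith(gene_symbol, "MT-")) %>%
  # Reshape to one row per gene/sample pair
  tidyr::pivot_longer(
    cols = tidyr::starts_with("BS_"),
    names_to = "Kids_First_Biospecimen_ID",
    values_to = "fpkm"
  ) %>%
  # Same log2(FPKM + 1) transform as in the notebook
  dplyr::mutate(log2_fpkm = log2(fpkm + 1))
```

Note that `pivot_longer()` returns rows grouped by gene rather than by sample, so the row order differs from `gather()`; this does not affect the jitter plot or the downstream joins.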
3,308 changes: 3,308 additions & 0 deletions
analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.nb.html
Large diffs are not rendered by default.
978 changes: 978 additions & 0 deletions
...es/transcriptomic-dimension-reduction/results/tpm_stranded_all_log_pca_scores_aligned.tsv
Large diffs are not rendered by default.
I hope this isn't needed here? The mito genes should have already been removed, correct?
Ah yes, and the flag in general is not needed at all anymore if we're not doing this with FPKM. I'll sort that all out.
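For readers following this thread, here is a rough sketch of what "removing the mito genes from the TPM results themselves" could look like upstream of the dimension reduction, so that no flag is needed later. It is an illustration under stated assumptions, not the code in this PR's TPM preparation script: the toy matrix `tpm` and every object name below are hypothetical, and the rescaling to one million after filtering is an assumption about how one might keep samples comparable.

```r
# Hypothetical toy TPM matrix: rows are gene symbols, columns are samples.
# matrix() fills column-wise, so each sample (column) sums to 1e6.
tpm <- matrix(
  c(500000, 300000, 200000,
    100000, 400000, 500000),
  nrow = 3,
  dimnames = list(c("MT-RNR1", "TP53", "GAPDH"), c("BS_A", "BS_B"))
)

# Flag mitochondrial genes by the "MT-" symbol prefix, as the notebook does
is_mito <- startsWith(rownames(tpm), "MT-")

# Drop the mitochondrial rows; drop = FALSE keeps the matrix shape
tpm_nomito <- tpm[!is_mito, , drop = FALSE]

# Assumption: rescale each sample so the remaining values sum to 1e6 again
tpm_nomito_rescaled <- sweep(tpm_nomito, 2, colSums(tpm_nomito), "/") * 1e6

# Assumption: log2 transform with a pseudocount of 1 before dimension reduction
log_tpm_nomito <- log2(tpm_nomito_rescaled + 1)
```

Filtering once at the data-preparation stage means every downstream step sees a matrix with no `MT-` rows, which is why a separate removal flag in the dimension reduction script becomes redundant.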