UMAPs without mitochondria (#1658)

* Module readme * Initiated tumor purity module. Added run script and included in CI * module to readme * bash scripts in fact do not use .R extensions * Added filtering distributions * Updated tumor purity to include extraction_type and output results for potential use in other modules * Add results TSV into readme since this is likely to be used by other modules * Added another subsection and result file for thresholding PER cancer group * Add option to remove mitochondrial genes from dimension reduction and generate associated files for the new 'rsem_stranded_no-mito_log' filename_lead value * Accidentally had used polya and found other bug. Fixed code and regenerated stranded rsem without mito umap * notebook to plot UMAP without mito as well as mito fpkm jitter * plot styling * remove sneaky zipped html * remove old result files from when I started this branch * add notebook to module bash scripts * small title tweak in nb * No need to make the plots with this data, and no need to run t-sne * Apply suggestions from code review Co-authored-by: Joshua Shapiro <[email protected]> * add conclusions and re-render * updated result file with properly used flag * small comment update * Update analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd Co-authored-by: Joshua Shapiro <[email protected]> * merge reintroduced a straggling backtick that is now re-purged * script for TPM collapsing and filtering to nomito. We may not need to collapse though * updated script to no longer collapse * We now have TPM results, not collapsed, and an updated notebook to explore those UMAPs * woops they were both nomito. fixed, and conclusions are the same * Add tpm without mito to ci * Update analyses/transcriptomic-dimension-reduction/scripts/prepare-tpm-for-umap.R Co-authored-by: Joshua Shapiro <[email protected]> * Removed remove_mito_genes flag and re-run module with correct data * Add 05 notebook to README * need to remove flag from CI script * Removed description of collapse-rnaseq approach from comments since it is no longer used * ACTUALLY use the right data - the scratch was outdated, now it is fixed * reran to be safe --------- Co-authored-by: Joshua Shapiro <[email protected]>
AlexsLemonade · Feb 15, 2023 · f519494 · f519494
1 parent 78e4969
commit f519494
Show file tree

Hide file tree

Showing 14 changed files with 7,692 additions and 84 deletions.
diff --git a/analyses/transcriptomic-dimension-reduction/01-dimension-reduction.sh b/analyses/transcriptomic-dimension-reduction/01-dimension-reduction.sh
@@ -74,6 +74,44 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
   --low_count_threshold ${COUNT_THRESHOLD} \
   --log2_transform
 
+
+
+#### TPM stranded, with and without mitochondrial genes and skipping t-sne ------------------------------------
+# Note that these results are only used for exploration in `05-seq_center-mitochondrial-genes.Rmd`
+
+# First, we'll need to create this input file, which contains filtered and re-normalied TPM, if it is missing:
+NOMITO_TPM_FILE=../../scratch/transcriptomic-dimension-reduction/tpm-nomito-stranded.rds
+if [ ! -f ${NOMITO_TPM_FILE} ]; then
+  Rscript --vanilla scripts/prepare-tpm-for-umap.R
+fi
+
+# Perform dimension reduction on all TPM
+Rscript --vanilla scripts/run-dimension-reduction.R \
+  --expression ../../data/pbta-gene-expression-rsem-tpm.stranded.rds \
+  --metadata ${METADATA} \
+  --filename_lead tpm_stranded_all_log \
+  --output_directory ${OUTPUT} \
+  --seed ${SEED} \
+  --perplexity ${PERPLEXITY} \
+  --neighbors ${NEIGHBORS} \
+  --low_count_threshold ${COUNT_THRESHOLD} \
+  --skip_tsne \
+  --log2_transform
+
+# Perform dimension reduction on nomito TPM
+Rscript --vanilla scripts/run-dimension-reduction.R \
+  --expression ${NOMITO_TPM_FILE} \
+  --metadata ${METADATA} \
+  --filename_lead tpm_stranded_nomito_log \
+  --output_directory ${OUTPUT} \
+  --seed ${SEED} \
+  --perplexity ${PERPLEXITY} \
+  --neighbors ${NEIGHBORS} \
+  --low_count_threshold ${COUNT_THRESHOLD} \
+  --skip_tsne \
+  --log2_transform
+
+
 #### kallisto ------------------------------------------------------------------
 
 Rscript --vanilla scripts/run-dimension-reduction.R \
@@ -118,3 +156,5 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
   --low_count_threshold ${COUNT_THRESHOLD} \
   --log2_transform
 
+
+
diff --git a/analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd b/analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd
@@ -0,0 +1,240 @@
+---
+title: "Explore mitochondrial gene effect on UMAP plots"
+author: "S. Spielman for CCDL"
+date: "2023"
+output: 
+  html_notebook:
+    toc: TRUE
+    toc_float: TRUE
+---
+
+This notebook aims to explore some potential sequencing center biases that were raised in this OpenPBTA issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601
+This issues notes that some of the samples sequenced at `BGI@CHOP` had higher-than-expected expression for certain mitochondrial genes, specifically RNR1 and RNR2, and somewhat lower-than-expected expression for other mitochondrial genes.
+
+To ensure appropriate normalization, the UMAPs here were created from TPM data rather than FPKM, such that mitochondrial genes were removed from the TPM results themselves. 
+
+## Setup
+
+```{r setup, include=FALSE}
+library(magrittr)
+library(ggplot2)
+
+set.seed(2023)
+
+# set overall theme
+theme_set(ggpubr::theme_pubr() +
+ # Legend tweaks for legibility
+  theme(legend.position = "right", 
+        legend.direction = "vertical",
+        legend.text = element_text(size = rel(0.5)),
+        legend.title = element_text(size = rel(0.5)),
+        legend.key.size = unit(0.25, "cm")
+  )) 
+
+# figure settings
+knitr::opts_chunk$set(fig.width = 8)
+```
+
+
+Define directories and file names: 
+```{r}
+# Directories
+root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
+data_dir <- file.path(root_dir, "data")
+tumor_dir <- file.path(root_dir, "analyses", "tumor-purity-exploration")
+umap_dir <- file.path(
+  root_dir,
+  "analyses",
+  "transcriptomic-dimension-reduction", 
+  "results"
+)
+palette_dir <- file.path(root_dir, "figures", "palettes")
+
+# UMAP files with different samples:
+tpm_umap <- file.path(umap_dir, "tpm_stranded_all_log_umap_scores_aligned.tsv")
+tpm_no_mito_umap <- file.path(umap_dir, "tpm_stranded_nomito_log_umap_scores_aligned.tsv")
+
+# palette mapping file
+pal_file <- file.path(palette_dir,  "broad_histology_cancer_group_palette.tsv")
+
+# FPKM
+expression_file <- file.path(data_dir, "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds")
+
+metadata_file <- file.path(data_dir, "pbta-histologies.tsv")
+```
+
+```{r}
+# Read in palette data
+palette_mapping_df <- readr::read_tsv(pal_file) %>%
+  dplyr::select(broad_histology, broad_histology_display, broad_histology_hex, 
+                cancer_group, cancer_group_display, cancer_group_hex) 
+
+# Read in and prep UMAP data
+# helper function
+readr_prep_umap <- function(filename, 
+                            pal_df = palette_mapping_df) {
+  readr::read_tsv(filename) %>% 
+    dplyr::rename(UMAP1 = X1, UMAP2 = X2) %>%
+    dplyr::inner_join(pal_df) %>%
+    dplyr::mutate(seq_center = forcats::fct_relevel(seq_center, 
+                                                    "BGI", "BGI@CHOP Genome Center", "NantOmics"))
+}
+
+umap_all <- readr_prep_umap(tpm_umap)
+umap_no_mito <- readr_prep_umap(tpm_no_mito_umap) 
+
+# Read in expression and convert to data frame
+expression_df <- readr::read_rds(expression_file) %>%
+  tibble::as_tibble(rownames = "gene_symbol")
+
+# Read in metadata
+metadata_df <- readr::read_tsv(metadata_file)
+```
+
+
+## Explore mitochondrial genes
+
+For mitochondrial genes, do we see expression patterns for BGI@CHOP samples consistent with those posted in the issue linked above?
+
+```{r}
+# First, which diagnoses are sequenced at BGI@CHOP?
+relevant_groups <- metadata_df %>%
+  dplyr::filter(RNA_library == "stranded", 
+                stringr::str_starts(seq_center, "BGI")) %>%
+  tidyr::drop_na(broad_histology) %>%
+  dplyr::pull(broad_histology) %>%
+  unique()
+
+# Data frame of log2(FPKM) of mitochondrial genes for relevant diagnoses
+mito_seq_center_df <- expression_df %>%
+  dplyr::filter(stringr::str_starts(gene_symbol, "MT-")) %>%
+  tidyr::gather("Kids_First_Biospecimen_ID", 
+                "fpkm", 
+                tidyselect::starts_with("BS_")) %>%
+  dplyr::mutate(log2_fpkm = log2(fpkm+1)) %>%
+  # get diagnoses across samples
+  dplyr::inner_join(
+    dplyr::select(metadata_df, 
+                  Kids_First_Biospecimen_ID, 
+                  broad_histology, 
+                  seq_center)
+  ) %>%
+  # filter to only relevant groups
+  dplyr::filter(broad_histology %in% relevant_groups) %>%
+  # get display version of diagnosis
+  dplyr::inner_join(
+    dplyr::select(palette_mapping_df, 
+                  broad_histology, 
+                  broad_histology_display)
+  )
+
+
+ggplot(mito_seq_center_df) + 
+  aes(x = gene_symbol, y = log2_fpkm, color = seq_center, size = seq_center) + 
+  geom_jitter(width = 0.15) + 
+  # emphasize the BGI points!
+  scale_size_manual(values = c(2, 2, 0.5)) +
+  facet_wrap(~broad_histology_display) + 
+  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))
+
+```
+
+We see a couple trends in the plot above that are consistent with the posted issue:
+
+- `BGI@CHOP Genome Center` samples have dramatically higher RNR1 and RNR2, but substantially lower expression for the other mitochondrial genes, across all diagnoses.
+- `BGI` samples tend to have lower RNR1 and RNR2 expression, and have lower expression for other genes specifically for embryonal tumors.
+
+Thus, `BIG@CHOP` (not `BGI`) samples are the main "concern" here.
+
+
+## UMAP
+
+How do UMAPs with and without mitochondrial genes compare? 
+We might expect particular changes in embryonal, ependymal, and HGG since BGI is mostly those dianogses.
+
+```{r}
+# Function to plot UMAP and set up color palettes.
+plot_umap <- function(df, color_group, color_palette, title) {
+  ggplot(df) + 
+    aes(x = UMAP1, y = UMAP2, 
+      shape = seq_center, 
+      fill = {{color_group}}, 
+      color = seq_center, 
+      alpha = seq_center) + 
+  geom_point(size = 2.5) + 
+  scale_shape_manual(values = c(22, 24, 21)) + 
+  scale_fill_manual(values  = color_palette, 
+                    # This "shape" override is needed for legend fill colors to actually work
+                    guide = guide_legend(override.aes = list(shape = 21, size = 2))) +
+  scale_color_manual(values = c("grey70", "black", "black"), 
+                     # These overrides ensure shapes appear all in same alpha/color in legend
+                     guide = guide_legend(override.aes = list(color = "black", alpha = 1, size = 2))) +
+  scale_alpha_manual(values = c(0.3, 0.5, 0.5)) +
+  ggtitle(title) +
+  # tweaks for legibility
+  theme(legend.position = "bottom", 
+        legend.text = element_text(size = rel(0.5)),
+        legend.title = element_text(size = rel(0.5)),
+        legend.key.size = unit(0.25, "cm")
+  )
+}
+
+# Another plotting helper function to cowplot some plots with a shared legend
+combine_plots <- function(p1, p2) {
+  p_legend <- cowplot::get_legend(p1)
+  
+  plot_row <- cowplot::plot_grid(p1 + theme(legend.position = "none"),
+                                 p2 + theme(legend.position = "none"),
+                                 nrow = 1, 
+                                 rel_widths = 0.95)
+  
+  full_grid <- cowplot::plot_grid(plot_row, p_legend, nrow = 2, rel_heights = c(1, 0.2))
+  
+  full_grid
+}
+
+
+# set up color palettes
+bh_df <- palette_mapping_df %>%
+  dplyr::select(broad_histology_display, broad_histology_hex) %>%
+  dplyr::distinct() 
+pal_bh<- bh_df$broad_histology_hex
+names(pal_bh) <- bh_df$broad_histology_display
+
+cg_df <- palette_mapping_df %>%
+  dplyr::select(cancer_group_display, cancer_group_hex) %>%
+  dplyr::distinct() %>%
+  tidyr::drop_na()
+pal_cg<- cg_df$cancer_group_hex
+names(pal_cg) <- cg_df$cancer_group_display
+```
+
+
+Let's plot some UMAPs!
+
+```{r, fig.width = 14, fig.height = 8}
+p1 <- plot_umap(umap_all, broad_histology_display, pal_bh, "Includes all genes. Colored by broad histology.") 
+p2 <- plot_umap(umap_no_mito, broad_histology_display, pal_bh, "Includes only non-mito genes. Colored by broad histology.")
+combine_plots(p1, p2)
+
+p1 <- plot_umap(umap_all, cancer_group_display, pal_cg, "Includes all genes. Colored by cancer group.")
+p2 <- plot_umap(umap_no_mito, cancer_group_display, pal_cg, "Includes only non-mito genes. Colored by cancer group.")
+combine_plots(p1, p2)
+```
+
+## Conclusions
+
+- Samples from `BIG@CHOP` sequencing center have unique mitochondrial gene distributions.
+- Removing mitochondrial genes from the UMAP does not have a strong qualitiative effect on whether broad histologies or cancer groups tend to cluster together.
+The UMAPs created here also look broadly similar to those made with FPKM. 
+Visually, it seems the "mixed histology" groupings identified in `04-explore-sequencing-center-effects.Rmd`
+notebook still are grouped together in these UMAPs.
+
+Overall, this suggests that mitochondrial genes alone will not have a strong influence on UMAP visualizations.
+However, there may still be protocol differences associated with BIG@CHOP samples that influence expression values more generally, and can not be corrected by removing only mitochondrial genes.
+
+## sessionInfo
+
+```{r print session info}
+sessionInfo()
+```
diff --git a/analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.nb.html b/analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.nb.html
diff --git a/analyses/transcriptomic-dimension-reduction/README.md b/analyses/transcriptomic-dimension-reduction/README.md
@@ -55,7 +55,7 @@ done
 
 ```
 
-Unfortunately, this approach will also run PCA 5 times. 
+Unfortunately, this approach will also run PCA 5 times.
 
 ##### R
 
@@ -138,6 +138,11 @@ plot_dimension_reduction(aligned_scores_df = pca_df,
 
 The `point_color` argument of `plot_dimension_reduction` can be changed to any variable that is a column in the original TSV file.
 
-### Explore potential batch effects from the sequencing center 
+### Explore potential batch effects from the sequencing center
 
-The notebook `04-explore-sequencing-center-effects.Rmd` performs exploratory analyses and visualization to roughly assess the extent to which sequencing center is expected to induce batch effects.
+The notebook `04-explore-sequencing-center-effects.Rmd` performs exploratory analyses and visualization to roughly assess the extent to which sequencing center is expected to induce batch effects.
+
+The notebook `05-seq-center-mitochondrial-genes.Rmd` performs exploratory analyses to assess whether artifacts present in mitochondrial gene RNA-Seq data influence UMAP visualization.
+To enable this analysis, this pipeline will also perform transcriptomic dimension reduction on TPM data for both all genes and all non-mitochondrial genes.
+The TPM-derived UMAPs are assessed in this notebook.
+TPM was used rather than FPKM to enable more accurate re-normalization after filtering out mitochondrial genes from the full TPM data.
diff --git a/analyses/transcriptomic-dimension-reduction/ci-dimension-reduction-plots.sh b/analyses/transcriptomic-dimension-reduction/ci-dimension-reduction-plots.sh
@@ -30,7 +30,18 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
   --skip_tsne \
   --neighbors 2
 
-# generate plot lists for both cases above
+# Run no mito
+NOMITO_TPM_FILE=../../scratch/transcriptomic-dimension-reduction/tpm-nomito-stranded.rds
+Rscript --vanilla scripts/prepare-tpm-for-umap.R
+Rscript --vanilla scripts/run-dimension-reduction.R \
+  --expression ${NOMITO_TPM_FILE} \
+  --metadata ../../data/pbta-histologies.tsv \
+  --filename_lead tpm_stranded_nomito_log \
+  --output_directory results \
+  --neighbors 2 \
+  --skip_tsne
+
+# generate plot lists for both stranded RSEM and poly-A kallisto
 Rscript --vanilla scripts/get-plot-list.R  \
   --input_directory results \
   --filename_lead rsem_stranded \
@@ -48,3 +59,6 @@ bash 03-multipanel-plots.sh
 
 # Exploration of batch effects
 Rscript --vanilla -e 'rmarkdown::render("04-explore-sequencing-center-effects.Rmd", params = list(is_ci = 1))'
+
+# Exploration of UMAPs if mitochondrial genes are removed
+Rscript --vanilla -e 'rmarkdown::render("05-seq-center-mitochondrial-genes.Rmd")'
diff --git a/analyses/transcriptomic-dimension-reduction/dimension-reduction-plots.sh b/analyses/transcriptomic-dimension-reduction/dimension-reduction-plots.sh
@@ -26,3 +26,6 @@ bash 03-multipanel-plots.sh
 
 # Exploration of batch effects
 Rscript --vanilla -e 'rmarkdown::render("04-explore-sequencing-center-effects.Rmd")'
+
+# Exploration of UMAPs if mitochondrial genes are removed
+Rscript --vanilla -e 'rmarkdown::render("05-seq-center-mitochondrial-genes.Rmd")'
diff --git a/...es/transcriptomic-dimension-reduction/results/tpm_stranded_all_log_pca_scores_aligned.tsv b/...es/transcriptomic-dimension-reduction/results/tpm_stranded_all_log_pca_scores_aligned.tsv