This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 83
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Module readme * Initiated tumor purity module. Added run script and included in CI * module to readme * bash scripts in fact do not use .R extensions * Added filtering distributions * Updated tumor purity to include extraction_type and output results for potential use in other modules * Add results TSV into readme since this is likely to be used by other modules * Added another subsection and result file for thresholding PER cancer group * Add option to remove mitochondrial genes from dimension reduction and generate associated files for the new 'rsem_stranded_no-mito_log' filename_lead value * Accidentally had used polya and found other bug. Fixed code and regenerated stranded rsem without mito umap * notebook to plot UMAP without mito as well as mito fpkm jitter * plot styling * remove sneaky zipped html * remove old result files from when I started this branch * add notebook to module bash scripts * small title tweak in nb * No need to make the plots with this data, and no need to run t-sne * Apply suggestions from code review Co-authored-by: Joshua Shapiro <[email protected]> * add conclusions and re-render * updated result file with properly used flag * small comment update * Update analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd Co-authored-by: Joshua Shapiro <[email protected]> * merge reintroduced a straggling backtick that is now re-purged * script for TPM collapsing and filtering to nomito. We may not need to collapse though * updated script to no longer collapse * We now have TPM results, not collapsed, and an updated notebook to explore those UMAPs * woops they were both nomito. fixed, and conclusions are the same * Add tpm without mito to ci * Update analyses/transcriptomic-dimension-reduction/scripts/prepare-tpm-for-umap.R Co-authored-by: Joshua Shapiro <[email protected]> * Removed remove_mito_genes flag and re-run module with correct data * Add 05 notebook to README * need to remove flag from CI script * Removed description of collapse-rnaseq approach from comments since it is no longer used * ACTUALLY use the right data - the scratch was outdated, now it is fixed * reran to be safe --------- Co-authored-by: Joshua Shapiro <[email protected]>
- Loading branch information
1 parent
78e4969
commit f519494
Showing
14 changed files
with
7,692 additions
and
84 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
240 changes: 240 additions & 0 deletions
240
analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,240 @@ | ||
--- | ||
title: "Explore mitochondrial gene effect on UMAP plots" | ||
author: "S. Spielman for CCDL" | ||
date: "2023" | ||
output: | ||
html_notebook: | ||
toc: TRUE | ||
toc_float: TRUE | ||
--- | ||
|
||
This notebook aims to explore some potential sequencing center biases that were raised in this OpenPBTA issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601 | ||
This issues notes that some of the samples sequenced at `BGI@CHOP` had higher-than-expected expression for certain mitochondrial genes, specifically RNR1 and RNR2, and somewhat lower-than-expected expression for other mitochondrial genes. | ||
|
||
To ensure appropriate normalization, the UMAPs here were created from TPM data rather than FPKM, such that mitochondrial genes were removed from the TPM results themselves. | ||
|
||
## Setup | ||
|
||
```{r setup, include=FALSE} | ||
library(magrittr) | ||
library(ggplot2) | ||
set.seed(2023) | ||
# set overall theme | ||
theme_set(ggpubr::theme_pubr() + | ||
# Legend tweaks for legibility | ||
theme(legend.position = "right", | ||
legend.direction = "vertical", | ||
legend.text = element_text(size = rel(0.5)), | ||
legend.title = element_text(size = rel(0.5)), | ||
legend.key.size = unit(0.25, "cm") | ||
)) | ||
# figure settings | ||
knitr::opts_chunk$set(fig.width = 8) | ||
``` | ||
|
||
|
||
Define directories and file names: | ||
```{r} | ||
# Directories | ||
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git")) | ||
data_dir <- file.path(root_dir, "data") | ||
tumor_dir <- file.path(root_dir, "analyses", "tumor-purity-exploration") | ||
umap_dir <- file.path( | ||
root_dir, | ||
"analyses", | ||
"transcriptomic-dimension-reduction", | ||
"results" | ||
) | ||
palette_dir <- file.path(root_dir, "figures", "palettes") | ||
# UMAP files with different samples: | ||
tpm_umap <- file.path(umap_dir, "tpm_stranded_all_log_umap_scores_aligned.tsv") | ||
tpm_no_mito_umap <- file.path(umap_dir, "tpm_stranded_nomito_log_umap_scores_aligned.tsv") | ||
# palette mapping file | ||
pal_file <- file.path(palette_dir, "broad_histology_cancer_group_palette.tsv") | ||
# FPKM | ||
expression_file <- file.path(data_dir, "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds") | ||
metadata_file <- file.path(data_dir, "pbta-histologies.tsv") | ||
``` | ||
|
||
```{r} | ||
# Read in palette data | ||
palette_mapping_df <- readr::read_tsv(pal_file) %>% | ||
dplyr::select(broad_histology, broad_histology_display, broad_histology_hex, | ||
cancer_group, cancer_group_display, cancer_group_hex) | ||
# Read in and prep UMAP data | ||
# helper function | ||
readr_prep_umap <- function(filename, | ||
pal_df = palette_mapping_df) { | ||
readr::read_tsv(filename) %>% | ||
dplyr::rename(UMAP1 = X1, UMAP2 = X2) %>% | ||
dplyr::inner_join(pal_df) %>% | ||
dplyr::mutate(seq_center = forcats::fct_relevel(seq_center, | ||
"BGI", "BGI@CHOP Genome Center", "NantOmics")) | ||
} | ||
umap_all <- readr_prep_umap(tpm_umap) | ||
umap_no_mito <- readr_prep_umap(tpm_no_mito_umap) | ||
# Read in expression and convert to data frame | ||
expression_df <- readr::read_rds(expression_file) %>% | ||
tibble::as_tibble(rownames = "gene_symbol") | ||
# Read in metadata | ||
metadata_df <- readr::read_tsv(metadata_file) | ||
``` | ||
|
||
|
||
## Explore mitochondrial genes | ||
|
||
For mitochondrial genes, do we see expression patterns for BGI@CHOP samples consistent with those posted in the issue linked above? | ||
|
||
```{r} | ||
# First, which diagnoses are sequenced at BGI@CHOP? | ||
relevant_groups <- metadata_df %>% | ||
dplyr::filter(RNA_library == "stranded", | ||
stringr::str_starts(seq_center, "BGI")) %>% | ||
tidyr::drop_na(broad_histology) %>% | ||
dplyr::pull(broad_histology) %>% | ||
unique() | ||
# Data frame of log2(FPKM) of mitochondrial genes for relevant diagnoses | ||
mito_seq_center_df <- expression_df %>% | ||
dplyr::filter(stringr::str_starts(gene_symbol, "MT-")) %>% | ||
tidyr::gather("Kids_First_Biospecimen_ID", | ||
"fpkm", | ||
tidyselect::starts_with("BS_")) %>% | ||
dplyr::mutate(log2_fpkm = log2(fpkm+1)) %>% | ||
# get diagnoses across samples | ||
dplyr::inner_join( | ||
dplyr::select(metadata_df, | ||
Kids_First_Biospecimen_ID, | ||
broad_histology, | ||
seq_center) | ||
) %>% | ||
# filter to only relevant groups | ||
dplyr::filter(broad_histology %in% relevant_groups) %>% | ||
# get display version of diagnosis | ||
dplyr::inner_join( | ||
dplyr::select(palette_mapping_df, | ||
broad_histology, | ||
broad_histology_display) | ||
) | ||
ggplot(mito_seq_center_df) + | ||
aes(x = gene_symbol, y = log2_fpkm, color = seq_center, size = seq_center) + | ||
geom_jitter(width = 0.15) + | ||
# emphasize the BGI points! | ||
scale_size_manual(values = c(2, 2, 0.5)) + | ||
facet_wrap(~broad_histology_display) + | ||
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4)) | ||
``` | ||
|
||
We see a couple trends in the plot above that are consistent with the posted issue: | ||
|
||
- `BGI@CHOP Genome Center` samples have dramatically higher RNR1 and RNR2, but substantially lower expression for the other mitochondrial genes, across all diagnoses. | ||
- `BGI` samples tend to have lower RNR1 and RNR2 expression, and have lower expression for other genes specifically for embryonal tumors. | ||
|
||
Thus, `BIG@CHOP` (not `BGI`) samples are the main "concern" here. | ||
|
||
|
||
## UMAP | ||
|
||
How do UMAPs with and without mitochondrial genes compare? | ||
We might expect particular changes in embryonal, ependymal, and HGG since BGI is mostly those dianogses. | ||
|
||
```{r} | ||
# Function to plot UMAP and set up color palettes. | ||
plot_umap <- function(df, color_group, color_palette, title) { | ||
ggplot(df) + | ||
aes(x = UMAP1, y = UMAP2, | ||
shape = seq_center, | ||
fill = {{color_group}}, | ||
color = seq_center, | ||
alpha = seq_center) + | ||
geom_point(size = 2.5) + | ||
scale_shape_manual(values = c(22, 24, 21)) + | ||
scale_fill_manual(values = color_palette, | ||
# This "shape" override is needed for legend fill colors to actually work | ||
guide = guide_legend(override.aes = list(shape = 21, size = 2))) + | ||
scale_color_manual(values = c("grey70", "black", "black"), | ||
# These overrides ensure shapes appear all in same alpha/color in legend | ||
guide = guide_legend(override.aes = list(color = "black", alpha = 1, size = 2))) + | ||
scale_alpha_manual(values = c(0.3, 0.5, 0.5)) + | ||
ggtitle(title) + | ||
# tweaks for legibility | ||
theme(legend.position = "bottom", | ||
legend.text = element_text(size = rel(0.5)), | ||
legend.title = element_text(size = rel(0.5)), | ||
legend.key.size = unit(0.25, "cm") | ||
) | ||
} | ||
# Another plotting helper function to cowplot some plots with a shared legend | ||
combine_plots <- function(p1, p2) { | ||
p_legend <- cowplot::get_legend(p1) | ||
plot_row <- cowplot::plot_grid(p1 + theme(legend.position = "none"), | ||
p2 + theme(legend.position = "none"), | ||
nrow = 1, | ||
rel_widths = 0.95) | ||
full_grid <- cowplot::plot_grid(plot_row, p_legend, nrow = 2, rel_heights = c(1, 0.2)) | ||
full_grid | ||
} | ||
# set up color palettes | ||
bh_df <- palette_mapping_df %>% | ||
dplyr::select(broad_histology_display, broad_histology_hex) %>% | ||
dplyr::distinct() | ||
pal_bh<- bh_df$broad_histology_hex | ||
names(pal_bh) <- bh_df$broad_histology_display | ||
cg_df <- palette_mapping_df %>% | ||
dplyr::select(cancer_group_display, cancer_group_hex) %>% | ||
dplyr::distinct() %>% | ||
tidyr::drop_na() | ||
pal_cg<- cg_df$cancer_group_hex | ||
names(pal_cg) <- cg_df$cancer_group_display | ||
``` | ||
|
||
|
||
Let's plot some UMAPs! | ||
|
||
```{r, fig.width = 14, fig.height = 8} | ||
p1 <- plot_umap(umap_all, broad_histology_display, pal_bh, "Includes all genes. Colored by broad histology.") | ||
p2 <- plot_umap(umap_no_mito, broad_histology_display, pal_bh, "Includes only non-mito genes. Colored by broad histology.") | ||
combine_plots(p1, p2) | ||
p1 <- plot_umap(umap_all, cancer_group_display, pal_cg, "Includes all genes. Colored by cancer group.") | ||
p2 <- plot_umap(umap_no_mito, cancer_group_display, pal_cg, "Includes only non-mito genes. Colored by cancer group.") | ||
combine_plots(p1, p2) | ||
``` | ||
|
||
## Conclusions | ||
|
||
- Samples from `BIG@CHOP` sequencing center have unique mitochondrial gene distributions. | ||
- Removing mitochondrial genes from the UMAP does not have a strong qualitiative effect on whether broad histologies or cancer groups tend to cluster together. | ||
The UMAPs created here also look broadly similar to those made with FPKM. | ||
Visually, it seems the "mixed histology" groupings identified in `04-explore-sequencing-center-effects.Rmd` | ||
notebook still are grouped together in these UMAPs. | ||
|
||
Overall, this suggests that mitochondrial genes alone will not have a strong influence on UMAP visualizations. | ||
However, there may still be protocol differences associated with BIG@CHOP samples that influence expression values more generally, and can not be corrected by removing only mitochondrial genes. | ||
|
||
## sessionInfo | ||
|
||
```{r print session info} | ||
sessionInfo() | ||
``` |
3,326 changes: 3,326 additions & 0 deletions
3,326
analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.nb.html
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
978 changes: 978 additions & 0 deletions
978
...es/transcriptomic-dimension-reduction/results/tpm_stranded_all_log_pca_scores_aligned.tsv
Large diffs are not rendered by default.
Oops, something went wrong.
Oops, something went wrong.