Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Commit

Permalink
UMAPs without mitochondria (#1658)
Browse files Browse the repository at this point in the history
* Module readme

* Initiated tumor purity module. Added run script and included in CI

* module to readme

* bash scripts in fact do not use .R extensions

* Added filtering distributions

* Updated tumor purity to include extraction_type and output results for potential use in other modules

* Add results TSV into readme since this is likely to be used by other modules

* Added another subsection and result file for thresholding PER cancer group

* Add option to remove mitochondrial genes from dimension reduction and generate associated files for the new 'rsem_stranded_no-mito_log' filename_lead value

* Accidentally had used polya and found other bug. Fixed code and regenerated stranded rsem without mito umap

* notebook to plot UMAP without mito as well as mito fpkm jitter

* plot styling

* remove sneaky zipped html

* remove old result files from when I started this branch

* add notebook to module bash scripts

* small title tweak in nb

* No need to make the plots with this data, and no need to run t-sne

* Apply suggestions from code review

Co-authored-by: Joshua Shapiro <[email protected]>

* add conclusions and re-render

* updated result file with properly used flag

* small comment update

* Update analyses/transcriptomic-dimension-reduction/05-seq-center-mitochondrial-genes.Rmd

Co-authored-by: Joshua Shapiro <[email protected]>

* merge reintroduced a straggling backtick that is now re-purged

* script for TPM collapsing and filtering to nomito. We may not need to collapse though

* updated script to no longer collapse

* We now have TPM results, not collapsed, and an updated notebook to explore those UMAPs

* woops they were both nomito. fixed, and conclusions are the same

* Add tpm without mito to ci

* Update analyses/transcriptomic-dimension-reduction/scripts/prepare-tpm-for-umap.R

Co-authored-by: Joshua Shapiro <[email protected]>

* Removed remove_mito_genes flag and re-run module with correct data

* Add 05 notebook to README

* need to remove flag from CI script

* Removed description of collapse-rnaseq approach from comments since it is no longer used

* ACTUALLY use the right data - the scratch was outdated, now it is fixed

* reran to be safe

---------

Co-authored-by: Joshua Shapiro <[email protected]>
  • Loading branch information
sjspielman and jashapiro authored Feb 15, 2023
1 parent 78e4969 commit f519494
Show file tree
Hide file tree
Showing 14 changed files with 7,692 additions and 84 deletions.
40 changes: 40 additions & 0 deletions analyses/transcriptomic-dimension-reduction/01-dimension-reduction.sh
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,44 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
--low_count_threshold ${COUNT_THRESHOLD} \
--log2_transform



#### TPM stranded, with and without mitochondrial genes and skipping t-sne ------------------------------------
# Note that these results are only used for exploration in `05-seq_center-mitochondrial-genes.Rmd`

# First, we'll need to create this input file, which contains filtered and re-normalied TPM, if it is missing:
NOMITO_TPM_FILE=../../scratch/transcriptomic-dimension-reduction/tpm-nomito-stranded.rds
if [ ! -f ${NOMITO_TPM_FILE} ]; then
Rscript --vanilla scripts/prepare-tpm-for-umap.R
fi

# Perform dimension reduction on all TPM
Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ../../data/pbta-gene-expression-rsem-tpm.stranded.rds \
--metadata ${METADATA} \
--filename_lead tpm_stranded_all_log \
--output_directory ${OUTPUT} \
--seed ${SEED} \
--perplexity ${PERPLEXITY} \
--neighbors ${NEIGHBORS} \
--low_count_threshold ${COUNT_THRESHOLD} \
--skip_tsne \
--log2_transform

# Perform dimension reduction on nomito TPM
Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ${NOMITO_TPM_FILE} \
--metadata ${METADATA} \
--filename_lead tpm_stranded_nomito_log \
--output_directory ${OUTPUT} \
--seed ${SEED} \
--perplexity ${PERPLEXITY} \
--neighbors ${NEIGHBORS} \
--low_count_threshold ${COUNT_THRESHOLD} \
--skip_tsne \
--log2_transform


#### kallisto ------------------------------------------------------------------

Rscript --vanilla scripts/run-dimension-reduction.R \
Expand Down Expand Up @@ -118,3 +156,5 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
--low_count_threshold ${COUNT_THRESHOLD} \
--log2_transform



Original file line number Diff line number Diff line change
@@ -0,0 +1,240 @@
---
title: "Explore mitochondrial gene effect on UMAP plots"
author: "S. Spielman for CCDL"
date: "2023"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
---

This notebook aims to explore some potential sequencing center biases that were raised in this OpenPBTA issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601
This issues notes that some of the samples sequenced at `BGI@CHOP` had higher-than-expected expression for certain mitochondrial genes, specifically RNR1 and RNR2, and somewhat lower-than-expected expression for other mitochondrial genes.

To ensure appropriate normalization, the UMAPs here were created from TPM data rather than FPKM, such that mitochondrial genes were removed from the TPM results themselves.

## Setup

```{r setup, include=FALSE}
library(magrittr)
library(ggplot2)
set.seed(2023)
# set overall theme
theme_set(ggpubr::theme_pubr() +
# Legend tweaks for legibility
theme(legend.position = "right",
legend.direction = "vertical",
legend.text = element_text(size = rel(0.5)),
legend.title = element_text(size = rel(0.5)),
legend.key.size = unit(0.25, "cm")
))
# figure settings
knitr::opts_chunk$set(fig.width = 8)
```


Define directories and file names:
```{r}
# Directories
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
data_dir <- file.path(root_dir, "data")
tumor_dir <- file.path(root_dir, "analyses", "tumor-purity-exploration")
umap_dir <- file.path(
root_dir,
"analyses",
"transcriptomic-dimension-reduction",
"results"
)
palette_dir <- file.path(root_dir, "figures", "palettes")
# UMAP files with different samples:
tpm_umap <- file.path(umap_dir, "tpm_stranded_all_log_umap_scores_aligned.tsv")
tpm_no_mito_umap <- file.path(umap_dir, "tpm_stranded_nomito_log_umap_scores_aligned.tsv")
# palette mapping file
pal_file <- file.path(palette_dir, "broad_histology_cancer_group_palette.tsv")
# FPKM
expression_file <- file.path(data_dir, "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds")
metadata_file <- file.path(data_dir, "pbta-histologies.tsv")
```

```{r}
# Read in palette data
palette_mapping_df <- readr::read_tsv(pal_file) %>%
dplyr::select(broad_histology, broad_histology_display, broad_histology_hex,
cancer_group, cancer_group_display, cancer_group_hex)
# Read in and prep UMAP data
# helper function
readr_prep_umap <- function(filename,
pal_df = palette_mapping_df) {
readr::read_tsv(filename) %>%
dplyr::rename(UMAP1 = X1, UMAP2 = X2) %>%
dplyr::inner_join(pal_df) %>%
dplyr::mutate(seq_center = forcats::fct_relevel(seq_center,
"BGI", "BGI@CHOP Genome Center", "NantOmics"))
}
umap_all <- readr_prep_umap(tpm_umap)
umap_no_mito <- readr_prep_umap(tpm_no_mito_umap)
# Read in expression and convert to data frame
expression_df <- readr::read_rds(expression_file) %>%
tibble::as_tibble(rownames = "gene_symbol")
# Read in metadata
metadata_df <- readr::read_tsv(metadata_file)
```


## Explore mitochondrial genes

For mitochondrial genes, do we see expression patterns for BGI@CHOP samples consistent with those posted in the issue linked above?

```{r}
# First, which diagnoses are sequenced at BGI@CHOP?
relevant_groups <- metadata_df %>%
dplyr::filter(RNA_library == "stranded",
stringr::str_starts(seq_center, "BGI")) %>%
tidyr::drop_na(broad_histology) %>%
dplyr::pull(broad_histology) %>%
unique()
# Data frame of log2(FPKM) of mitochondrial genes for relevant diagnoses
mito_seq_center_df <- expression_df %>%
dplyr::filter(stringr::str_starts(gene_symbol, "MT-")) %>%
tidyr::gather("Kids_First_Biospecimen_ID",
"fpkm",
tidyselect::starts_with("BS_")) %>%
dplyr::mutate(log2_fpkm = log2(fpkm+1)) %>%
# get diagnoses across samples
dplyr::inner_join(
dplyr::select(metadata_df,
Kids_First_Biospecimen_ID,
broad_histology,
seq_center)
) %>%
# filter to only relevant groups
dplyr::filter(broad_histology %in% relevant_groups) %>%
# get display version of diagnosis
dplyr::inner_join(
dplyr::select(palette_mapping_df,
broad_histology,
broad_histology_display)
)
ggplot(mito_seq_center_df) +
aes(x = gene_symbol, y = log2_fpkm, color = seq_center, size = seq_center) +
geom_jitter(width = 0.15) +
# emphasize the BGI points!
scale_size_manual(values = c(2, 2, 0.5)) +
facet_wrap(~broad_histology_display) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))
```

We see a couple trends in the plot above that are consistent with the posted issue:

- `BGI@CHOP Genome Center` samples have dramatically higher RNR1 and RNR2, but substantially lower expression for the other mitochondrial genes, across all diagnoses.
- `BGI` samples tend to have lower RNR1 and RNR2 expression, and have lower expression for other genes specifically for embryonal tumors.

Thus, `BIG@CHOP` (not `BGI`) samples are the main "concern" here.


## UMAP

How do UMAPs with and without mitochondrial genes compare?
We might expect particular changes in embryonal, ependymal, and HGG since BGI is mostly those dianogses.

```{r}
# Function to plot UMAP and set up color palettes.
plot_umap <- function(df, color_group, color_palette, title) {
ggplot(df) +
aes(x = UMAP1, y = UMAP2,
shape = seq_center,
fill = {{color_group}},
color = seq_center,
alpha = seq_center) +
geom_point(size = 2.5) +
scale_shape_manual(values = c(22, 24, 21)) +
scale_fill_manual(values = color_palette,
# This "shape" override is needed for legend fill colors to actually work
guide = guide_legend(override.aes = list(shape = 21, size = 2))) +
scale_color_manual(values = c("grey70", "black", "black"),
# These overrides ensure shapes appear all in same alpha/color in legend
guide = guide_legend(override.aes = list(color = "black", alpha = 1, size = 2))) +
scale_alpha_manual(values = c(0.3, 0.5, 0.5)) +
ggtitle(title) +
# tweaks for legibility
theme(legend.position = "bottom",
legend.text = element_text(size = rel(0.5)),
legend.title = element_text(size = rel(0.5)),
legend.key.size = unit(0.25, "cm")
)
}
# Another plotting helper function to cowplot some plots with a shared legend
combine_plots <- function(p1, p2) {
p_legend <- cowplot::get_legend(p1)
plot_row <- cowplot::plot_grid(p1 + theme(legend.position = "none"),
p2 + theme(legend.position = "none"),
nrow = 1,
rel_widths = 0.95)
full_grid <- cowplot::plot_grid(plot_row, p_legend, nrow = 2, rel_heights = c(1, 0.2))
full_grid
}
# set up color palettes
bh_df <- palette_mapping_df %>%
dplyr::select(broad_histology_display, broad_histology_hex) %>%
dplyr::distinct()
pal_bh<- bh_df$broad_histology_hex
names(pal_bh) <- bh_df$broad_histology_display
cg_df <- palette_mapping_df %>%
dplyr::select(cancer_group_display, cancer_group_hex) %>%
dplyr::distinct() %>%
tidyr::drop_na()
pal_cg<- cg_df$cancer_group_hex
names(pal_cg) <- cg_df$cancer_group_display
```


Let's plot some UMAPs!

```{r, fig.width = 14, fig.height = 8}
p1 <- plot_umap(umap_all, broad_histology_display, pal_bh, "Includes all genes. Colored by broad histology.")
p2 <- plot_umap(umap_no_mito, broad_histology_display, pal_bh, "Includes only non-mito genes. Colored by broad histology.")
combine_plots(p1, p2)
p1 <- plot_umap(umap_all, cancer_group_display, pal_cg, "Includes all genes. Colored by cancer group.")
p2 <- plot_umap(umap_no_mito, cancer_group_display, pal_cg, "Includes only non-mito genes. Colored by cancer group.")
combine_plots(p1, p2)
```

## Conclusions

- Samples from `BIG@CHOP` sequencing center have unique mitochondrial gene distributions.
- Removing mitochondrial genes from the UMAP does not have a strong qualitiative effect on whether broad histologies or cancer groups tend to cluster together.
The UMAPs created here also look broadly similar to those made with FPKM.
Visually, it seems the "mixed histology" groupings identified in `04-explore-sequencing-center-effects.Rmd`
notebook still are grouped together in these UMAPs.

Overall, this suggests that mitochondrial genes alone will not have a strong influence on UMAP visualizations.
However, there may still be protocol differences associated with BIG@CHOP samples that influence expression values more generally, and can not be corrected by removing only mitochondrial genes.

## sessionInfo

```{r print session info}
sessionInfo()
```

Large diffs are not rendered by default.

11 changes: 8 additions & 3 deletions analyses/transcriptomic-dimension-reduction/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ done

```

Unfortunately, this approach will also run PCA 5 times.
Unfortunately, this approach will also run PCA 5 times.

##### R

Expand Down Expand Up @@ -138,6 +138,11 @@ plot_dimension_reduction(aligned_scores_df = pca_df,

The `point_color` argument of `plot_dimension_reduction` can be changed to any variable that is a column in the original TSV file.

### Explore potential batch effects from the sequencing center
### Explore potential batch effects from the sequencing center

The notebook `04-explore-sequencing-center-effects.Rmd` performs exploratory analyses and visualization to roughly assess the extent to which sequencing center is expected to induce batch effects.
The notebook `04-explore-sequencing-center-effects.Rmd` performs exploratory analyses and visualization to roughly assess the extent to which sequencing center is expected to induce batch effects.

The notebook `05-seq-center-mitochondrial-genes.Rmd` performs exploratory analyses to assess whether artifacts present in mitochondrial gene RNA-Seq data influence UMAP visualization.
To enable this analysis, this pipeline will also perform transcriptomic dimension reduction on TPM data for both all genes and all non-mitochondrial genes.
The TPM-derived UMAPs are assessed in this notebook.
TPM was used rather than FPKM to enable more accurate re-normalization after filtering out mitochondrial genes from the full TPM data.
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,18 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
--skip_tsne \
--neighbors 2

# generate plot lists for both cases above
# Run no mito
NOMITO_TPM_FILE=../../scratch/transcriptomic-dimension-reduction/tpm-nomito-stranded.rds
Rscript --vanilla scripts/prepare-tpm-for-umap.R
Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ${NOMITO_TPM_FILE} \
--metadata ../../data/pbta-histologies.tsv \
--filename_lead tpm_stranded_nomito_log \
--output_directory results \
--neighbors 2 \
--skip_tsne

# generate plot lists for both stranded RSEM and poly-A kallisto
Rscript --vanilla scripts/get-plot-list.R \
--input_directory results \
--filename_lead rsem_stranded \
Expand All @@ -48,3 +59,6 @@ bash 03-multipanel-plots.sh

# Exploration of batch effects
Rscript --vanilla -e 'rmarkdown::render("04-explore-sequencing-center-effects.Rmd", params = list(is_ci = 1))'

# Exploration of UMAPs if mitochondrial genes are removed
Rscript --vanilla -e 'rmarkdown::render("05-seq-center-mitochondrial-genes.Rmd")'
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,6 @@ bash 03-multipanel-plots.sh

# Exploration of batch effects
Rscript --vanilla -e 'rmarkdown::render("04-explore-sequencing-center-effects.Rmd")'

# Exploration of UMAPs if mitochondrial genes are removed
Rscript --vanilla -e 'rmarkdown::render("05-seq-center-mitochondrial-genes.Rmd")'

Large diffs are not rendered by default.

Loading

0 comments on commit f519494

Please sign in to comment.