Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

UMAPs without mitochondria #1658

Merged
merged 42 commits into from
Feb 15, 2023
Merged
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
6bb2e6d
Module readme
sjspielman Nov 3, 2022
044b928
Initiated tumor purity module. Added run script and included in CI
sjspielman Nov 3, 2022
0170550
module to readme
sjspielman Nov 3, 2022
04784e0
bash scripts in fact do not use .R extensions
sjspielman Nov 3, 2022
f52472d
Added filtering distributions
sjspielman Nov 30, 2022
a1495ec
Merge branch 'master' into initiate-tumor-purity-module
sjspielman Dec 12, 2022
4cb29e8
Merge branch 'master' into initiate-tumor-purity-module
sjspielman Jan 9, 2023
7b2109c
Updated tumor purity to include extraction_type and output results fo…
sjspielman Jan 10, 2023
174b90c
Add results TSV into readme since this is likely to be used by other …
sjspielman Jan 10, 2023
260a123
Added another subsection and result file for thresholding PER cancer …
sjspielman Jan 10, 2023
f91caff
Add option to remove mitochondrial genes from dimension reduction and…
sjspielman Jan 10, 2023
a9fca41
Accidentally had used polya and found other bug. Fixed code and regen…
sjspielman Jan 10, 2023
a27c285
notebook to plot UMAP without mito as well as mito fpkm jitter
sjspielman Jan 11, 2023
b12383e
plot styling
sjspielman Jan 17, 2023
977a86c
merge in master and fix conflicts
sjspielman Jan 26, 2023
b818e98
remove sneaky zipped html
sjspielman Jan 26, 2023
f6fea2f
remove old result files from when I started this branch
sjspielman Jan 26, 2023
b724636
add notebook to module bash scripts
sjspielman Jan 26, 2023
845bd18
small title tweak in nb
sjspielman Jan 26, 2023
71523e3
No need to make the plots with this data, and no need to run t-sne
sjspielman Jan 26, 2023
1da693c
Merge branch 'master' into umap-sans-mito
sjspielman Jan 30, 2023
677b9f4
Apply suggestions from code review
sjspielman Jan 30, 2023
54ef86d
add conclusions and re-render
sjspielman Jan 30, 2023
cc82609
updated result file with properly used flag
sjspielman Jan 30, 2023
efb1936
small comment update
sjspielman Jan 30, 2023
f3deb28
Update analyses/transcriptomic-dimension-reduction/05-seq-center-mito…
sjspielman Jan 30, 2023
a89ec72
Merge branch 'master' into umap-sans-mito
sjspielman Feb 7, 2023
7fb7286
merge reintroduced a straggling backtick that is now re-purged
sjspielman Feb 7, 2023
62f57ca
script for TPM collapsing and filtering to nomito. We may not need to…
sjspielman Feb 7, 2023
60b71b7
updated script to no longer collapse
sjspielman Feb 7, 2023
5d0aacc
We now have TPM results, not collapsed, and an updated notebook to ex…
sjspielman Feb 7, 2023
738e303
woops they were both nomito. fixed, and conclusions are the same
sjspielman Feb 7, 2023
88a994f
Add tpm without mito to ci
sjspielman Feb 7, 2023
752850c
Merge branch 'master' into umap-sans-mito
sjspielman Feb 9, 2023
2388b73
Update analyses/transcriptomic-dimension-reduction/scripts/prepare-tp…
sjspielman Feb 9, 2023
9a8be2a
Removed remove_mito_genes flag and re-run module with correct data
sjspielman Feb 9, 2023
377375c
Add 05 notebook to README
sjspielman Feb 9, 2023
7d8023e
need to remove flag from CI script
sjspielman Feb 9, 2023
6879f0d
Removed description of collapse-rnaseq approach from comments since i…
sjspielman Feb 9, 2023
05d53f4
ACTUALLY use the right data - the scratch was outdated, now it is fixed
sjspielman Feb 9, 2023
1e73b21
reran to be safe
sjspielman Feb 9, 2023
bc61fcb
Merge branch 'master' into umap-sans-mito
sjspielman Feb 9, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 41 additions & 0 deletions analyses/transcriptomic-dimension-reduction/01-dimension-reduction.sh
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,45 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
--low_count_threshold ${COUNT_THRESHOLD} \
--log2_transform



#### TPM stranded, with and without mitochondrial genes and skipping t-sne ------------------------------------
# Note that these results are only used for exploration in `05-seq_center-mitochondrial-genes.Rmd`

# First, we'll need to create this input file, which contains filtered and re-normalied TPM, if it is missing:
NOMITO_TPM_FILE=../../scratch/transcriptomic-dimension-reduction/tpm-nomito-stranded.rds
if [ ! -f ${NOMITO_TPM_FILE} ]; then
Rscript --vanilla scripts/prepare-tpm-for-umap.R
fi

# Perform dimension reduction on all TPM
Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ../../data/pbta-gene-expression-rsem-tpm.stranded.rds \
--metadata ${METADATA} \
--filename_lead tpm_stranded_all_log \
--output_directory ${OUTPUT} \
--seed ${SEED} \
--perplexity ${PERPLEXITY} \
--neighbors ${NEIGHBORS} \
--low_count_threshold ${COUNT_THRESHOLD} \
--skip_tsne \
--log2_transform

# Perform dimension reduction on nomito TPM
Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ${NOMITO_TPM_FILE} \
--metadata ${METADATA} \
--remove_mito_genes \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this isn't needed here? The mito genes should have already been removed, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, and the flag in general is not needed at all anymore if we're not doing this with FPKM. I'll sort that all out.

--filename_lead tpm_stranded_nomito_log \
--output_directory ${OUTPUT} \
--seed ${SEED} \
--perplexity ${PERPLEXITY} \
--neighbors ${NEIGHBORS} \
--low_count_threshold ${COUNT_THRESHOLD} \
--skip_tsne \
--log2_transform


#### kallisto ------------------------------------------------------------------

Rscript --vanilla scripts/run-dimension-reduction.R \
Expand Down Expand Up @@ -118,3 +157,5 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
--low_count_threshold ${COUNT_THRESHOLD} \
--log2_transform



Original file line number Diff line number Diff line change
@@ -0,0 +1,240 @@
---
title: "Explore mitochondrial gene effect on UMAP plots"
author: "S. Spielman for CCDL"
date: "2023"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
---

This notebook aims to explore some potential sequencing center biases that were raised in this OpenPBTA issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601
This issues notes that some of the samples sequenced at `BGI@CHOP` had higher-than-expected expression for certain mitochondrial genes, specifically RNR1 and RNR2, and somewhat lower-than-expected expression for other mitochondrial genes.

To ensure appropriate normalization, the UMAPs here were created from TPM data rather than FPKM, such that mitochondrial genes were removed from the TPM results themselves.

## Setup

```{r setup, include=FALSE}
library(magrittr)
library(ggplot2)

set.seed(2023)

# set overall theme
theme_set(ggpubr::theme_pubr() +
# Legend tweaks for legibility
theme(legend.position = "right",
legend.direction = "vertical",
legend.text = element_text(size = rel(0.5)),
legend.title = element_text(size = rel(0.5)),
legend.key.size = unit(0.25, "cm")
))

# figure settings
knitr::opts_chunk$set(fig.width = 8)
```


Define directories and file names:
```{r}
# Directories
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
data_dir <- file.path(root_dir, "data")
tumor_dir <- file.path(root_dir, "analyses", "tumor-purity-exploration")
umap_dir <- file.path(
root_dir,
"analyses",
"transcriptomic-dimension-reduction",
"results"
)
palette_dir <- file.path(root_dir, "figures", "palettes")

# UMAP files with different samples:
tpm_umap <- file.path(umap_dir, "tpm_stranded_all_log_umap_scores_aligned.tsv")
tpm_no_mito_umap <- file.path(umap_dir, "tpm_stranded_nomito_log_umap_scores_aligned.tsv")

# palette mapping file
pal_file <- file.path(palette_dir, "broad_histology_cancer_group_palette.tsv")

# FPKM
expression_file <- file.path(data_dir, "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds")

metadata_file <- file.path(data_dir, "pbta-histologies.tsv")
```

```{r}
# Read in palette data
palette_mapping_df <- readr::read_tsv(pal_file) %>%
dplyr::select(broad_histology, broad_histology_display, broad_histology_hex,
cancer_group, cancer_group_display, cancer_group_hex)

# Read in and prep UMAP data
# helper function
readr_prep_umap <- function(filename,
pal_df = palette_mapping_df) {
readr::read_tsv(filename) %>%
dplyr::rename(UMAP1 = X1, UMAP2 = X2) %>%
dplyr::inner_join(pal_df) %>%
dplyr::mutate(seq_center = forcats::fct_relevel(seq_center,
"BGI", "BGI@CHOP Genome Center", "NantOmics"))
}

umap_all <- readr_prep_umap(tpm_umap)
umap_no_mito <- readr_prep_umap(tpm_no_mito_umap)

# Read in expression and convert to data frame
expression_df <- readr::read_rds(expression_file) %>%
tibble::as_tibble(rownames = "gene_symbol")

# Read in metadata
metadata_df <- readr::read_tsv(metadata_file)
```


## Explore mitochondrial genes

For mitochondrial genes, do we see expression patterns for BGI@CHOP samples consistent with those posted in the issue linked above?

```{r}
# First, which diagnoses are sequenced at BGI@CHOP?
relevant_groups <- metadata_df %>%
dplyr::filter(RNA_library == "stranded",
stringr::str_starts(seq_center, "BGI")) %>%
tidyr::drop_na(broad_histology) %>%
dplyr::pull(broad_histology) %>%
unique()

# Data frame of log2(FPKM) of mitochondrial genes for relevant diagnoses
mito_seq_center_df <- expression_df %>%
dplyr::filter(stringr::str_starts(gene_symbol, "MT-")) %>%
tidyr::gather("Kids_First_Biospecimen_ID",
"fpkm",
tidyselect::starts_with("BS_")) %>%
dplyr::mutate(log2_fpkm = log2(fpkm+1)) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to note my minor discomfort with +1 as a zero-correction for FPKM, but it does not really matter in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, this is all up in OpenPBTA :/

# get diagnoses across samples
dplyr::inner_join(
dplyr::select(metadata_df,
Kids_First_Biospecimen_ID,
broad_histology,
seq_center)
) %>%
# filter to only relevant groups
dplyr::filter(broad_histology %in% relevant_groups) %>%
# get display version of diagnosis
dplyr::inner_join(
dplyr::select(palette_mapping_df,
broad_histology,
broad_histology_display)
)


ggplot(mito_seq_center_df) +
aes(x = gene_symbol, y = log2_fpkm, color = seq_center, size = seq_center) +
geom_jitter(width = 0.15) +
# emphasize the BGI points!
scale_size_manual(values = c(2, 2, 0.5)) +
facet_wrap(~broad_histology_display) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))

```

We see a couple trends in the plot above that are consistent with the posted issue:

- `BGI@CHOP Genome Center` samples have dramatically higher RNR1 and RNR2, but substantially lower expression for the other mitochondrial genes, across all diagnoses.
- `BGI` samples tend to have lower RNR1 and RNR2 expression, and have lower expression for other genes specifically for embryonal tumors.

Thus, `BIG@CHOP` (not `BGI`) samples are the main "concern" here.


## UMAP

How do UMAPs with and without mitochondrial genes compare?
We might expect particular changes in embryonal, ependymal, and HGG since BGI is mostly those dianogses.

```{r}
# Function to plot UMAP and set up color palettes.
plot_umap <- function(df, color_group, color_palette, title) {
ggplot(df) +
aes(x = UMAP1, y = UMAP2,
shape = seq_center,
fill = {{color_group}},
color = seq_center,
alpha = seq_center) +
geom_point(size = 2.5) +
scale_shape_manual(values = c(22, 24, 21)) +
scale_fill_manual(values = color_palette,
# This "shape" override is needed for legend fill colors to actually work
guide = guide_legend(override.aes = list(shape = 21, size = 2))) +
scale_color_manual(values = c("grey70", "black", "black"),
# These overrides ensure shapes appear all in same alpha/color in legend
guide = guide_legend(override.aes = list(color = "black", alpha = 1, size = 2))) +
scale_alpha_manual(values = c(0.3, 0.5, 0.5)) +
ggtitle(title) +
# tweaks for legibility
theme(legend.position = "bottom",
legend.text = element_text(size = rel(0.5)),
legend.title = element_text(size = rel(0.5)),
legend.key.size = unit(0.25, "cm")
)
}

# Another plotting helper function to cowplot some plots with a shared legend
combine_plots <- function(p1, p2) {
p_legend <- cowplot::get_legend(p1)

plot_row <- cowplot::plot_grid(p1 + theme(legend.position = "none"),
p2 + theme(legend.position = "none"),
nrow = 1,
rel_widths = 0.95)

full_grid <- cowplot::plot_grid(plot_row, p_legend, nrow = 2, rel_heights = c(1, 0.2))

full_grid
}


# set up color palettes
bh_df <- palette_mapping_df %>%
dplyr::select(broad_histology_display, broad_histology_hex) %>%
dplyr::distinct()
pal_bh<- bh_df$broad_histology_hex
names(pal_bh) <- bh_df$broad_histology_display

cg_df <- palette_mapping_df %>%
dplyr::select(cancer_group_display, cancer_group_hex) %>%
dplyr::distinct() %>%
tidyr::drop_na()
pal_cg<- cg_df$cancer_group_hex
names(pal_cg) <- cg_df$cancer_group_display
```


Let's plot some UMAPs!

```{r, fig.width = 14, fig.height = 8}
p1 <- plot_umap(umap_all, broad_histology_display, pal_bh, "Includes all genes. Colored by broad histology.")
p2 <- plot_umap(umap_no_mito, broad_histology_display, pal_bh, "Includes only non-mito genes. Colored by broad histology.")
combine_plots(p1, p2)

p1 <- plot_umap(umap_all, cancer_group_display, pal_cg, "Includes all genes. Colored by cancer group.")
p2 <- plot_umap(umap_no_mito, cancer_group_display, pal_cg, "Includes only non-mito genes. Colored by cancer group.")
combine_plots(p1, p2)
```

## Conclusions

- Samples from `BIG@CHOP` sequencing center have unique mitochondrial gene distributions.
- Removing mitochondrial genes from the UMAP does not have a strong qualitiative effect on whether broad histologies or cancer groups tend to cluster together.
The UMAPs created here also look broadly similar to those made with FPKM.
Visually, it seems the "mixed histology" groupings identified in `04-explore-sequencing-center-effects.Rmd`
notebook still are grouped together in these UMAPs.

Overall, this suggests that mitochondrial genes alone will not have a strong influence on UMAP visualizations.
However, there may still be protocol differences associated with BIG@CHOP samples that influence expression values more generally, and can not be corrected by removing only mitochondrial genes.

## sessionInfo

```{r print session info}
sessionInfo()
```

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,20 @@ Rscript --vanilla scripts/run-dimension-reduction.R \
--output_directory results \
--skip_tsne \
--neighbors 2

# Run no mito
NOMITO_TPM_FILE=../../scratch/transcriptomic-dimension-reduction/tpm-nomito-stranded.rds
Rscript --vanilla scripts/prepare-tpm-for-umap.R
Rscript --vanilla scripts/run-dimension-reduction.R \
--expression ${NOMITO_TPM_FILE} \
--metadata ../../data/pbta-histologies.tsv \
--remove_mito_genes \
--filename_lead tpm_stranded_nomito_log \
--output_directory results \
--neighbors 2 \
--skip_tsne

# generate plot lists for both cases above
# generate plot lists for both stranded RSEM and poly-A kallisto
Rscript --vanilla scripts/get-plot-list.R \
--input_directory results \
--filename_lead rsem_stranded \
Expand All @@ -48,3 +60,6 @@ bash 03-multipanel-plots.sh

# Exploration of batch effects
Rscript --vanilla -e 'rmarkdown::render("04-explore-sequencing-center-effects.Rmd", params = list(is_ci = 1))'

# Exploration of UMAPs if mitochondrial genes are removed
Rscript --vanilla -e 'rmarkdown::render("05-seq-center-mitochondrial-genes.Rmd")'
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,6 @@ bash 03-multipanel-plots.sh

# Exploration of batch effects
Rscript --vanilla -e 'rmarkdown::render("04-explore-sequencing-center-effects.Rmd")'

# Exploration of UMAPs if mitochondrial genes are removed
Rscript --vanilla -e 'rmarkdown::render("05-seq-center-mitochondrial-genes.Rmd")'

Large diffs are not rendered by default.

Loading