3_pca-plots_v2.Rmd

---
title: "Section 3: PCA plots"
date: "`r Sys.Date()`"
author:
  - name: Arun Seetharam
  - name: Ha Vu
    affiliation: Tuteja Lab
    affiliation_url: https://www.tutejalab.org
output:
  rmdformats::readthedown:
    self_contained: true
    thumbnails: false
    lightbox: true
    gallery: true
    highlight: tango
---

```{r setup, include=FALSE}
options(max.print = "75")
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>",
  fig.path = "assets/",
  fig.width = 8,
  prompt = FALSE,
  tidy = FALSE,
  message = FALSE,
  warning = TRUE
)
knitr::opts_knit$set(width = 75)
```

This section uses the count data (all datasets) generated in Section 1 for cluster analyses. Briefly, the count data are imported in R, batch corrected using `ComBat_seq`, `vst` transformation and clustering is performed using `DESeq2`, and results are visualized as PCA plots.


# Prerequisites

R packages required for this section are loaded.

```{r, warnings=TRUE, message=FALSE}
setwd("~/github/BAPvsTrophoblast_Amnion")
library(sva)
library(tidyverse)
library(DESeq2)
library(vsn)
library(pheatmap)
library(ggrepel)
library(RColorBrewer)
library(plotly)
library(PCAtools)
library(scales)
library(htmlwidgets)
library(factoextra)
library(spatstat.core)
```


# PCA plots for all datasets

## Import datasets

The `counts` data and its associated metadata (`coldata`) are imported for analyses.

```{r dataset, warnings=TRUE, message=FALSE}
counts = 'assets/counts-pca-v2.txt'
groupFile = 'assets/batch-pca-v2.txt'
coldata <-
  read.csv(
    groupFile,
    row.names = 1,
    sep = "\t",
    stringsAsFactors = TRUE
  )
cts <- as.matrix(read.csv(counts, sep = "\t", row.names = "gene.ids"))
```

Inspect the `coldata`.

```{r coldata}
DT::datatable(coldata)
```

Reorder columns of `cts` according to `coldata` rows. Check if samples in both files match.

```{r order, warnings=TRUE, message=FALSE}
colnames(cts)
all(rownames(coldata) %in% colnames(cts))
cts <- cts[, rownames(coldata)]
```

## Batch correction

Using `Combat_seq` (SVA package) run batch correction - using bioproject IDs as variable (dataset origin).

```{r batchcorrect, warnings=TRUE, message=TRUE}
cov1 <- as.factor(coldata$authors)
adjusted_counts <- ComBat_seq(cts, batch = cov1, group = NULL)
all(rownames(coldata) %in% colnames(cts))
cts <- cts[, rownames(coldata)]
```

## Run DESeq2

The batch corrected read counts are then used for running DESeq2 analyses.

```{r deseq2, warnings=TRUE, message=FALSE}
dds <- DESeqDataSetFromMatrix(countData = adjusted_counts,
                              colData = coldata,
                              design = ~ group)
```

## Transformation:
```{r vst, warnings=TRUE, message=FALSE}
vst <- assay(vst(dds))
vsd <- vst(dds, blind = FALSE)
pcaData <-
  plotPCA(vsd,
          intgroup = c("group", "authors"),
          returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
```


3D PCA plot function:
```{r pca3d, warnings=TRUE, message=FALSE}
# @ Thomas W. Battaglia

#' Plot DESeq2's PCA plotting with Plotly 3D scatterplot
#'
#' The function will generate a plot_ly 3D scatter plot image for
#' a 3D exploration of the PCA.
#'
#' @param object a DESeqTransform object, with data in assay(x), produced for example by either rlog or varianceStabilizingTransformation.
#' @param intgroup interesting groups: a character vector of names in colData(x) to use for grouping
#' @param ntop number of top genes to use for principal components, selected by highest row variance
#' @param returnData should the function only return the data.frame of PC1, PC2 and PC3 with intgroup covariates for custom plotting (default is FALSE)
#' @return An object created by plot_ly, which can be assigned and further customized.
#' @export
plotPCA3D <-
  function (object,
            intgroup = "condition",
            ntop = 500,
            returnData = FALSE) {
    rv <- rowVars(assay(object))
    select <-
      order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    pca <- prcomp(t(assay(object)[select,]))
    percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
    if (!all(intgroup %in% names(colData(object)))) {
      stop("the argument 'intgroup' should specify columns of colData(dds)")
    }
    intgroup.df <-
      as.data.frame(colData(object)[, intgroup, drop = FALSE])
    group <- if (length(intgroup) > 1) {
      factor(apply(intgroup.df, 1, paste, collapse = " : "))
    }
    else {
      colData(object)[[intgroup]]
    }
    d <- data.frame(
      PC1 = pca$x[, 1],
      PC2 = pca$x[, 2],
      PC3 = pca$x[, 3],
      group = group,
      intgroup.df,
      name = colnames(object)
    )
    if (returnData) {
      attr(d, "percentVar") <- percentVar[1:3]
      return(d)
    }
    message("Generating plotly plot")
    p <- plotly::plot_ly(
      data = d,
      x = ~ PC1,
      y = ~ PC2,
      z = ~ PC3,
      color = group,
      mode = "markers",
      type = "scatter3d"
    )
    return(p)
  }
```


## Process data for PCA (all samples)

PCA plot for the dataset that includes all libraries.

```{r pcaFull_comp, warnings=TRUE, message=FALSE}
rv <- rowVars(assay(vsd))
select <-
  order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(assay(vsd)[select, ]))
percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
intgroup = c("group", "authors")
intgroup.df <- as.data.frame(colData(vsd)[, intgroup, drop = FALSE])
group <- if (length(intgroup) > 1) {
  factor(apply(intgroup.df, 1, paste, collapse = " : "))
}
d <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  PC3 = pca$x[, 3],
  PC4 = pca$x[, 4],
  PC5 = pca$x[, 5],
  group = group,
  intgroup.df,
  name = colnames(vsd)
)
```

## Components 1 and 2

```{r pcaFull_C1-C2, fig.cap="Fig 3.1: PCA plots (all samples) - PC1 & PC2", fig.width=8, fig.height=8}
g <- ggplot(d, aes(PC1, PC2, color = group.1, shape = authors)) +
  scale_shape_manual(values = 1:8) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
  ylab(paste("PC2", round(percentVar[2] * 100, 2), "% variance"))
ggplotly(g)
```

## Components 1 and 3
```{r pcaFull_C1-C3, fig.cap="Fig 3.2: PCA plots (all samples) - PC1 and PC3", fig.width=8, fig.height=8}
g <- ggplot(d, aes(PC1, PC3, color = group.1, shape = authors)) +
  scale_shape_manual(values = 1:8) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
  ylab(paste("PC3", round(percentVar[3] * 100, 2), "% variance"))
ggplotly(g)
```

## Components 2 and 3
```{r pcaFull_C2-C3, fig.cap="Fig 3.3: PCA plots (all samples) - PC2 and PC3", fig.width=8, fig.height=8}
g <- ggplot(d, aes(PC2, PC3, color = group.1, shape = authors)) +
  scale_shape_manual(values = 1:8) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  xlab(paste("PC2", round(percentVar[2] * 100, 2), "% variance")) +
  ylab(paste("PC3", round(percentVar[3] * 100, 2), "% variance"))
ggplotly(g)
```


## Scree plot
```{r pcaFull_scree, fig.cap="Fig 3.4: Scree plot showing variance contributed by each component", fig.width=8, fig.height=5}
scree_plot = data.frame(percentVar)
scree_plot[, 2] <- c(1:97)
colnames(scree_plot) <- c("variance", "component_number")
g <-
  ggplot(scree_plot, aes(component_number, variance * 100)) +
  geom_bar(stat = 'identity', fill = "slateblue") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(
      vjust = 1,
      hjust = 1,
      size = 12
    ),
    axis.text.y = element_text(size = 12),
    plot.margin = margin(10, 10, 10, 100),
    legend.position = "none",
    plot.title = element_text(color = "black", size = 18, face = "bold.italic"),
    axis.title.y = element_text(color = "black", size = 14, face = "bold"),
    axis.line.x = element_line(
      colour = 'black',
      size = 0.5,
      linetype = 'solid'
    ),
    axis.line.y = element_line(
      colour = 'black',
      size = 0.5,
      linetype = 'solid'
    ),
    axis.ticks.x = element_line(
      colour = 'black',
      size = 1,
      linetype = 'solid'
    ),
    axis.title.x = element_text(color = "black", size = 14, face = "bold")
  ) +
  xlab("Components") +
  ylab("Percent Variance") + ggtitle("All samples")
scale_y_continuous(expand = expansion(mult = c(0, .1)), breaks = pretty_breaks())
ggplotly(g)
```

## Euclidean Distance

```{r pcaFull_euc_partial, warnings=TRUE, message=FALSE, fig.cap="Fig 3.5: Heatmap of Euclidean Distance using first 3 components (all samples)", fig.width=12, fig.height=10}
pca <-
  prcomp(t(assay(vsd)[select,]),
         center = TRUE,
         scale. = FALSE,
         rank. = 3)
results <- pca$x
distance <- get_dist(results, method = "euclidean")
write.table(
  as.matrix(distance),
  "assets/all_samples-PC1-PC2-PC3.tsv",
  row.names = T,
  qmethod = "double",
  sep = "\t"
)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
```

```{r pcaFull_euc_full, warnings=TRUE, message=FALSE, fig.cap="Fig 3.6: Heatmap of Euclidean Distance using all components (all samples)", fig.width=12, fig.height=10}
pca <- prcomp(t(assay(vsd)[select,]), center = TRUE, scale. = FALSE)
results <- pca$x
distance <- get_dist(results, method = "euclidean")
write.table(
  as.matrix(distance),
  "assets/all_samples-allPC.tsv",
  row.names = T,
  qmethod = "double",
  sep = "\t"
)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
```


# PCA plot for differentiated cell libraries (no amnion datasets)

## Import datasets

The `counts` data and its associated metadata (`coldata`) are imported for analyses.

```{r import2, warnings=TRUE, message=FALSE}
setwd("~/github/BAPvsTrophoblast_Amnion")
counts2 = 'assets/counts-pca-v2.2.txt'
groupFile = 'assets/batch-pca-v2.2.txt'
coldata2 <-
  read.csv(
    groupFile,
    row.names = 1,
    sep = "\t",
    stringsAsFactors = TRUE
  )
cts2 <-
  as.matrix(read.csv(counts2, sep = "\t", row.names = "gene.ids"))
```

Inspect the `coldata`.

```{r coldata2}
DT::datatable(coldata2)
```

Reorder columns of `cts` according to `coldata` rows. Check if samples in both files match.
```{r order2}
all(rownames(coldata2) %in% colnames(cts2))
cts2 <- cts2[, rownames(coldata2)]
```

## Batch correction

Using `ComBat_seq` (SVA package) run batch correction - using bioproject IDs as variable (dataset origin).

```{r batchcorrect2, warnings=TRUE, message=TRUE}
cov1 <- as.factor(coldata2$authors)
adjusted_counts <- ComBat_seq(cts2, batch = cov1, group = NULL)
all(rownames(coldata2) %in% colnames(cts2))
cts2 <- cts2[, rownames(coldata2)]
```

## Run DESeq2
The batch corrected read counts are then used for running DESeq2 analyses.

```{r deseq2b, warnings=TRUE, message=FALSE}
dds <- DESeqDataSetFromMatrix(countData = adjusted_counts,
                              colData = coldata2,
                              design = ~ group)
```

## Transformation

```{r vst2, warnings=TRUE, message=FALSE}
vsd <- vst(dds, blind = FALSE)
pcaData <-
  plotPCA(vsd,
          intgroup = c("group", "authors"),
          returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
```

## Process data for PCA (differenticated)

```{r pcaDif_comp, warnings=TRUE, message=FALSE}
rv <- rowVars(assay(vsd))
select <-
  order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(assay(vsd)[select,]))
percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
intgroup = c("group", "authors")
intgroup.df <- as.data.frame(colData(vsd)[, intgroup, drop = FALSE])
group <- if (length(intgroup) > 1) {
  factor(apply(intgroup.df, 1, paste, collapse = " : "))
}
d <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  PC3 = pca$x[, 3],
  PC4 = pca$x[, 4],
  PC5 = pca$x[, 5],
  group = group,
  intgroup.df,
  name = colnames(vsd)
)
```

## Components 1 and 2

```{r pcaDif_C1-C2, fig.cap="Fig 3.7: PCA plots (differentiated) - PC1 and PC2", fig.width=8, fig.height=8}
g <- ggplot(d, aes(PC1, PC2, color = group.1, shape = authors)) +
  scale_shape_manual(values = 1:8) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
  ylab(paste("PC2", round(percentVar[2] * 100, 2), "% variance"))
ggplotly(g)
```

## Components 1 and 3
```{r pcaDif_C1-C3, fig.cap="Fig 3.8: PCA plots (differentiated) - PC1 and PC3", fig.width=8, fig.height=8}
g <- ggplot(d, aes(PC1, PC3, color = group.1, shape = authors)) +
  scale_shape_manual(values = 1:8) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
  ylab(paste("PC3", round(percentVar[3] * 100, 2), "% variance"))
ggplotly(g)
```

## Components 2 and 3
```{r pcaDif_C2-C3, fig.cap="Fig 3.9: PCA plots (differentiated) - PC2 and PC3", fig.width=8, fig.height=8}
g <- ggplot(d, aes(PC2, PC3, color = group.1, shape = authors)) +
  scale_shape_manual(values = 1:8) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  xlab(paste("PC2", round(percentVar[2] * 100, 2), "% variance")) +
  ylab(paste("PC3", round(percentVar[3] * 100, 2), "% variance"))
ggplotly(g)
```


## Scree plot
```{r pcaDif_scree, fig.cap="Fig 3.10: Scree plot showing variance contributed by each component (differentiated)", fig.width=8, fig.height=5}
scree_plot = data.frame(percentVar)
end = length(scree_plot$percentVar)
scree_plot[, 2] <- c(1:end)
colnames(scree_plot) <- c("variance", "component_number")


g <- ggplot(scree_plot, aes(component_number, variance * 100)) +
  geom_bar(stat = 'identity', fill = "slateblue") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(
      vjust = 1,
      hjust = 1,
      size = 12
    ),
    axis.text.y = element_text(size = 12),
    plot.margin = margin(10, 10, 10, 100),
    legend.position = "none",
    plot.title = element_text(color = "black", size = 18, face = "bold.italic"),
    axis.title.y = element_text(color = "black", size = 14, face = "bold"),
    axis.line.x = element_line(
      colour = 'black',
      size = 0.5,
      linetype = 'solid'
    ),
    axis.line.y = element_line(
      colour = 'black',
      size = 0.5,
      linetype = 'solid'
    ),
    axis.ticks.x = element_line(
      colour = 'black',
      size = 1,
      linetype = 'solid'
    ),
    axis.title.x = element_text(color = "black", size = 14, face = "bold")
  ) +
  xlab("Components") +
  ylab("Percent Variance") + ggtitle("differentiated samples")
scale_y_continuous(expand = expansion(mult = c(0, .1)), breaks = pretty_breaks())
ggplotly(g)
```
## Euclidean Distance

```{r pcadiff_euc_partial, warnings=TRUE, message=FALSE, fig.cap="Fig 3.11: Heatmap of Euclidean Distance using first 3 components (differentiated)", fig.width=12, fig.height=10}
pca <-
  prcomp(t(assay(vsd)[select,]),
         center = TRUE,
         scale. = FALSE,
         rank. = 3)
results <- pca$x
distance <- get_dist(results, method = "euclidean")
write.table(
  as.matrix(distance),
  "assets/diff_samples-PC1-PC2-PC3.tsv",
  row.names = T,
  qmethod = "double",
  sep = "\t"
)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
```

```{r pcadiff_euc_full, warnings=TRUE, message=FALSE, fig.cap="Fig 3.12: Heatmap of Euclidean Distance using all components (differentiated)", fig.width=12, fig.height=10}
pca <- prcomp(t(assay(vsd)[select,]), center = TRUE, scale. = FALSE)
results <- pca$x
distance <- get_dist(results, method = "euclidean")
write.table(
  as.matrix(distance),
  "assets/diff_samples-allPC.tsv",
  row.names = T,
  qmethod = "double",
  sep = "\t"
)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
```


# Interactive 3D PCA plot


```{r pcaFull_3D, warnings=TRUE, message=FALSE}
g <- plotPCA3D(vsd, intgroup = c("group", "authors"))
saveWidget(g, file = "PCA_full.html")
```


```{r pcaDif_3D, warnings=TRUE, message=FALSE}
g <- plotPCA3D(vsd, intgroup = c("group", "authors"))
saveWidget(g, file = "PCA_dif.html")
```
1. [Interactive 3D PCA plot (all samples)](PCA_full.html){target="_blank"}
2. [Interactive 3D PCA plot (differentiated samples)](PCA_dif.html){target="_blank"}


# Session Information

```{r sessioninfo}
sessionInfo()
```