From 2cb0568e11766e522f918799206b18a4b99b805d Mon Sep 17 00:00:00 2001
From: Emma Rand <emma.rand@york.ac.uk>
Date: Tue, 24 Oct 2023 16:09:12 +0100
Subject: [PATCH] added the heatmap for the mouse data to omics 3

---
 omics/week-5/workshop.qmd | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/omics/week-5/workshop.qmd b/omics/week-5/workshop.qmd
index 28c8ba9..ed7d378 100644
--- a/omics/week-5/workshop.qmd
+++ b/omics/week-5/workshop.qmd
@@ -600,7 +600,7 @@ prog_hspc_results <- read_csv("results/prog_hspc_results.csv")
 ```
 
 🎬 Remind yourself what is in the rows and columns and the structure of
-the dataframes (perhaps using `glimpse()`)
+the dataframe (perhaps using `glimpse()`)
 
 ```{r}
 #| include: false
@@ -855,12 +855,12 @@ ggsave("figures/prog_hspc-pca.png",
 
 ## Visualise the expression of the most significant genes using a heatmap
 
-```{r}
-library(heatmaply)
-```
+A heatmap is a common way to visualise gene expression data. Often people will create heatmaps with thousands of genes but it can be more informative to use a subset along with clustering methods. We will use the genes which are significant at the 0.01 level. 
+
+We are going to create an interactive heatmap with the **`heatmaply`** [@heatmaply] package. **`heatmaply`** takes a matrix as input so we need to convert a dataframe of the log~2~ values to a matrix. We will also set the rownames to the gene names.
 
-we will use the most significant genes on a random subset of the cells
-since \~1500 columns is a lot
+
+🎬 Convert a dataframe of the log~2~ values to a matrix. I have used `sample()` to select 70 random columns so the heatmap is generated quickly:
 
 ```{r}
 mat <- prog_hspc_results_sig0.01 |> 
@@ -869,32 +869,47 @@ mat <- prog_hspc_results_sig0.01 |>
   as.matrix()
 ```
 
+
+🎬 Set the row names to the gene names:
+
 ```{r}
 rownames(mat) <- prog_hspc_results_sig0.01$external_gene_name
 ```
 
+You might want to view the matrix by clicking on it in the environment pane. 
+
+🎬 Load the **`heatmaply`** package:
+```{r}
+library(heatmaply)
+```
+
+We need to tell the clustering algorithm how many clusters to create. We will set the number of clusters for the cell types to be 2 and the number of clusters for the genes to be the same since it makes sense to see what clusters of genes correlate with the cell types.
+
 ```{r}
 n_cell_clusters <- 2
 n_gene_clusters <- 2
 ```
 
+
+🎬 Create the heatmap:
+
 ```{r}
 
 heatmaply(mat, 
           scale = "row",
-          hide_colorbar = TRUE,
           k_col = n_cell_clusters,
           k_row = n_gene_clusters,
-          label_names = c("Gene", "Cell id", "Expression (normalised, log2)"),
           fontsize_row = 7, fontsize_col = 10,
           labCol = colnames(mat),
           labRow = rownames(mat),
           heatmap_layers = theme(axis.line = element_blank()))
 ```
 
-will take a few mins to run, and longer to appear in the viewer
-separation is not as strong as for the frog data run a few times to see
-different subset
+It will take a minute to run and display. On the vertical axis are genes which are differentially expressed at the 0.01 level. On the horizontal axis are cells. We can see that cells of the same type don't cluster that well together. We can also see two clusters of genes but the pattern of gene expression is not as clear as it was for the frogs and the correspondence with the cell clusters is not as strong.
+
+The heatmap will open in the viewer pane (rather than the plot pane) because it is HTML. You can "Show in a new window" to see it in a larger format. You can also zoom in and out, pan around the heatmap and download it as a png. You might feel the colour bar is not adding much to the plot. You can remove it by setting `hide_colorbar = TRUE` in the `heatmaply()` function.
+
+Using all the cells is worth doing but it will take a while to generate the heatmap and then show it in the viewer, so do it when you're ready for a coffee break.
 
 ## Visualise all the results with a volcano plot