From b21af2971e74460e0bde257ae0c57e2a22936120 Mon Sep 17 00:00:00 2001 From: Emma Rand Date: Sat, 7 Oct 2023 11:14:01 +0100 Subject: [PATCH] comple draft added --- omics/omics.qmd | 4 +- omics/week-3/overview.qmd | 5 +- omics/week-3/study_before_workshop.qmd | 219 +++++++++++-------------- references.bib | 28 ++++ update-notes.txt | 22 +-- 5 files changed, 137 insertions(+), 141 deletions(-) diff --git a/omics/omics.qmd b/omics/omics.qmd index 5d4af92..64c39c9 100644 --- a/omics/omics.qmd +++ b/omics/omics.qmd @@ -6,14 +6,14 @@ toc-location: right # Content -## Omics 1: Hello data! +## Omics 1: 👋 Hello data! This week you will meet your data. The independent study will concisely cover how these data were generated and how they have been processed before being given to you. There will also be an overview of the analysis we will carry out over three workshops. In the workshop, you will learn what steps to take to get a good understanding of ’omics data before you consider any statistical analysis. This is an often overlooked, but very valuable and informative, part of any data pipeline. It gives you the deep understanding of the data structures and values that you will need to code and trouble-shoot code, allows you to spot failed or problematic samples and informs your decisions on quality control. -## Omics 2: Statisitcal Analysis +## Omics 2: Statistical Analysis before diff --git a/omics/week-3/overview.qmd b/omics/week-3/overview.qmd index b9325a2..76d916a 100644 --- a/omics/week-3/overview.qmd +++ b/omics/week-3/overview.qmd @@ -1,6 +1,6 @@ --- title: "Overview" -subtitle: "Omics 1: Hello data!" +subtitle: "Omics 1: 👋 Hello data!" toc: true toc-location: right --- @@ -8,6 +8,7 @@ toc-location: right This week you will meet your data. The independent study will concisely cover how these data were generated and how they have been processed before being given to you. There will also be an overview of the analysis we will carry out over three workshops. 
In the workshop, you will learn what steps to take to get a good understanding of ’omics data before you consider any statistical analysis. This is an often overlooked, but very valuable and informative, part of any data pipeline. It gives you the deep understanding of the data structures and values that you will need to code and trouble-shoot code, allows you to spot failed or problematic samples and informs your decisions on quality control. +We suggest you sit together with your group in the workshop. ### Learning objectives @@ -23,7 +24,7 @@ The successful student will be able to: 1. [Prepare](study_before_workshop.qmd) - i. 📖 Read how the data were generated and how they have been processed so far, insstall + i. 📖 Read how the data were generated and how they have been processed so far, install 2. [Workshop](workshop.qmd) diff --git a/omics/week-3/study_before_workshop.qmd b/omics/week-3/study_before_workshop.qmd index fac0482..a5ff790 100644 --- a/omics/week-3/study_before_workshop.qmd +++ b/omics/week-3/study_before_workshop.qmd @@ -1,6 +1,6 @@ --- title: "Independent Study to prepare for workshop" -subtitle: "Omics 1: Hello data!" +subtitle: "Omics 1: 👋 Hello data!" author: "Emma Rand" format: revealjs: @@ -15,17 +15,9 @@ editor: wrap: 72 --- - -## 🚧 NB still in construction 🚧 - - - ## Overview - - ::: incremental - - Concise summary of the experimental design and aims - What the raw data consist of @@ -33,24 +25,20 @@ editor: - What has been done to the data so far - What steps we will take in the workshop - ::: ## The Data There are three datasets -- 🐸 transcriptomic data (bulk RNA-seq) from frog embryos. +- 🐸 transcriptomic data (bulk RNA-seq) from frog embryos. - 🐭 transcriptomic data (single cell RNA-seq) from stemcells - πŸ‚ ??????? 
Metabolomic / Metagenomic data from anaerobic digesters - - # Experimental design - ## 🐸 Experimental design {auto-animate="true"} ![Schematic of frog development @@ -95,198 +83,181 @@ width="200"} ## 🐸 Aim ::: incremental - - find genes important in frog development -- Important means genes that are differentially expressed between the control and the FGF treated sibling - -- Differentially expressed means the expression on one group is signifcantly higher than the other +- Important means genes that are differentially expressed between the + control and the FGF treated sibling +- Differentially expressed means the expression on one group is + significantly higher than the other ::: - ## 🐸 Guided analysis ::: incremental - -- The workshops will take you through comparing the control and FGF treated sibling at S30 +- The workshops will take you through comparing the control and FGF + treated sibling at S30 - This is the "least interesting" comparison -- You will be guided to carefully document your work so you can apply the same methods to other comparisons - +- You will be guided to carefully document your work so you can apply + the same methods to other comparisons ::: - - - ## 🐭 Experimental design {auto-animate="true"} -![Schematic of stem cell experiment](images/88H-exp-design-jillian.png){fig-align="center" +![Schematic of stem cell +experiment](images/88H-exp-design-jillian.png){fig-align="center" width="700"} ## 🐭 Experimental design {auto-animate="true"} -![Schematic of stem cell experiment](images/88H-exp-design-jillian.png){fig-align="left" +![Schematic of stem cell +experiment](images/88H-exp-design-jillian.png){fig-align="left" width="200"} ::: incremental - Cells were sorted using flow cytometry on the basis of cell surface -markers + markers -- There are three cell types: LT-HSCs, HSPCs, Progs +- There are three cell types: LT-HSCs, HSPCs, Progs - Many cells of each cell type were sequenced - -- ::: ## 🐭 Experimental design {auto-animate="true"} 
-![Schematic of stem cell experiment](images/88H-exp-design-jillian.png){fig-align="left" +![Schematic of stem cell +experiment](images/88H-exp-design-jillian.png){fig-align="left" width="200"} ::: incremental +- There are three cell types: LT-HSCs, HSPCs, Progs [These are the + "treaments"]{style="color:#009900"} -- There are three cell types: LT-HSCs, HSPCs, Progs [These are the "treaments"]{style="color:#009900"} - - -- Many cells of each cell type were sequenced: [These are the replicates]{style="color:#009900"} +- Many cells of each cell type were sequenced: [These are the + replicates]{style="color:#009900"} - [155 LT-HSCs, 701 HSPCs, 798 Progs]{style="color:#009900"} - ::: ## 🐭 Aim ::: incremental +- find genes for cell surface proteins that are important in stem cell + identity -- find genes for cell surface proteins that are important in stem cell identity - -- Important means genes that are differentially expressed between at least two cell types - -- Differentially expressed means the expression on one group is significantly higher than the other +- Important means genes that are differentially expressed between at + least two cell types +- Differentially expressed means the expression on one group is + significantly higher than the other ::: - ## 🐭 Guided analysis ::: incremental - -- The workshops will take you through comparing the HSPC and Prog cells +- The workshops will take you through comparing the HSPC and Prog + cells - This is the "least interesting" comparison -- You will be guided to carefully document your work so you can apply the same methods to other comparisons - +- You will be guided to carefully document your work so you can apply + the same methods to other comparisons ::: +# The raw data - - - - - - - +## Raw Sequence data - - - - - - - - - - - - - +::: incremental +- The raw data are "reads" from a sequencing machine. 
- +- A read is sequence of DNA or RNA shorter than the whole genome or + transcriptome - +- The length of the reads depends on the type of sequencing machine - + - Short-read technologies e.g. Illumina have higher base accuracy + but are harder to align + - Long-read technologies e.g. Nanopore have lower base accuracy + but are easier to align - +- Sequencing technology is constantly improving - +- Optional: You can read more about Sequencing technologies in + [Statistically useful experimental + design](https://cloud-span.github.io/experimental_design00-overview/) + [@rand_statistically_2022] +::: - +## Raw Sequence data - - +::: incremental +- The RNA-seq data are from an Illumina machine 150-300bp; Metagenomic + data are often Nanopore 10,000 - 30000bp - +- Reads are in FASTQ files - - - +- FASTQ files contain the sequence of each read and a quality score + for each base +::: - +# What has been done to the data so far +## General steps +::: incremental +- Reads are filtered and trimmed on the basis of the quality score +- They are then aligned/pseudo-aligned to a reference + genome/transcriptome or, in metagenomics, assembled de novo. +- Reads are then counted to quantify the expression or number of + genomes in metagenomics - - - - - - - - - - - - - +- Counts are normalised to account for differences in sequencing depth + and gene/transcript/genome length before statistical analysis +::: - +## 🐸 Data - - - +- Unpublished (so far!) - +- Expression for the whole transcriptome [*X. 
laevis* v10.1 genome + assembly](https://www.xenbase.org/xenbase/static-xenbase/ftpDatafiles.jsp) - - +- Values are raw counts - +- The statistical analysis method we will use `DESeq2` [@DESeq2] + requires raw counts and performs the normalisation itself - +## 🐭 Data - +- Published in @nestorowa2016 - - - - - - - +- Expression for a subset of genes, the surfaceome - +- Values are log2 normalised values - - - +- The statistical analysis method we will use `scran` [@scran] + requires normalised values - - - +# Workshops - - +## Workshops - - +- Omics 1: Hello data Getting to know the data. Checking the + distributions of values overall, across samples and across genes to + check things are as we expect and detect genes/samples that need to + be removed - +- Omics 2: Statistical Analysis Identifying which genes are + differentially expressed between treatments. This is the main + analysis step. We will use different methods for bulk and single + cell data. - +- Omics 3: Visualising and Interpreting Production of volcano plots + and heatmaps to visualise the results of the statistical analysis. + We will also look at how to interpret the results and how to find + out more about the genes of interest. diff --git a/references.bib b/references.bib index 8a4039c..c017b87 100644 --- a/references.bib +++ b/references.bib @@ -279,3 +279,31 @@ @article{bryan2018 url = {https://doi.org/10.1080/00031305.2017.1399928}, note = {Publisher: Taylor & Francis} } + +@misc{rand_statistically_2022, + title = {Statistically useful experimental design}, + url = {https://cloud-span.github.io/experimental_design00-overview/}, + abstract = {The Statistically useful experimental design module is a 2 - 3 hour workshop about designing ‘omics experiments. 
We consider: what influences platform choice and what influences design when platform choice is fixed replication and controls sequence coverage and depth Type I and Type II errors and multiple testing correction The site infrastructure is based on The Carpentries This course not not require any software or coding. Some principles of design will be presented followed by discussion of their application using three case studies. There will also be an opportunity for participants to discuss their own designs. This module assumes no experience with designing omics’ experiments but some previous experience experimental design and statistical analysis - such as would be covered in an undergraduate bioscience degreee - would be useful. The module is designed for a 2 - 3 hour workshop.}, + author = {Rand, Emma and Forrester, Sarah}, + year = {2022}, +} + +@article{DESeq2, + title = {Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2}, + author = {Love, Michael I. and Huber, Wolfgang and Anders, Simon}, + year = {2014}, + date = {2014}, + pages = {550}, + volume = {15}, + doi = {10.1186/s13059-014-0550-8} +} + +@article{scran, + title = {A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor}, + author = {Lun, Aaron T. L. and McCarthy, Davis J. 
and Marioni, John C.}, + year = {2016}, + date = {2016}, + pages = {2122}, + volume = {5}, + doi = {10.12688/f1000research.9501.2} +} diff --git a/update-notes.txt b/update-notes.txt index e8a7493..4be17ac 100644 --- a/update-notes.txt +++ b/update-notes.txt @@ -41,21 +41,17 @@ fs::dir_create("data-raw") fs::dir_create("data-processed") +for workshop 2 they willneed + +biocmanager + +scran BiocManager::install("scran") + +DESeq2 BiocManager::install("DESeq2") + + for workshop 2, make sure you examine which frog genes are 0 in all of one sample but present in all of the other which genes matters: first do venn diagram, present absent then to differential expressions -- Raw data: [GEO Series - GSE81682](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81682) -- Illumina HiSeq -- short reads 150-300bp -- [A single-cell resolution map of mouse hematopoietic stem and - progenitor cell - differentiation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5305050/) - [@nestorowa2016] -- 3,840 samples -- Reads were aligned using G-SNAP and the mapped reads were assigned - to Ensembl genes HTSeq -- GSE81682_HTSeq_counts.txt.gz (bottom of the page). And - [GSE81682_HTSeq_counts.txt.zip](../data/jillian/GSE81682_HTSeq_counts.zip)