Commit

Merge pull request #8 from 3mmaRand/omics-01-prepare
comple draft added
3mmaRand authored Oct 7, 2023
2 parents 4ca3f94 + b21af29 commit df065c2
Showing 5 changed files with 137 additions and 141 deletions.
4 changes: 2 additions & 2 deletions omics/omics.qmd
@@ -6,14 +6,14 @@ toc-location: right

# Content

## Omics 1: Hello data!
## Omics 1: 👋 Hello data!

This week you will meet your data. The independent study will concisely cover how these data were generated and how they have been processed before being given to you. There will also be an overview of the analysis we will carry out over three workshops.
In the workshop, you will learn what steps to take to get a good understanding of ’omics data before you consider any statistical analysis. This is an often overlooked, but very valuable and informative, part of any data pipeline. It gives you the deep understanding of the data structures and values that you will need to code and trouble-shoot code, allows you to spot failed or problematic samples and informs your decisions on quality control.



## Omics 2: Statisitcal Analysis
## Omics 2: Statistical Analysis

before

5 changes: 3 additions & 2 deletions omics/week-3/overview.qmd
@@ -1,13 +1,14 @@
---
title: "Overview"
subtitle: "Omics 1: Hello data!"
subtitle: "Omics 1: 👋 Hello data!"
toc: true
toc-location: right
---

This week you will meet your data. The independent study will concisely cover how these data were generated and how they have been processed before being given to you. There will also be an overview of the analysis we will carry out over three workshops.
In the workshop, you will learn what steps to take to get a good understanding of ’omics data before you consider any statistical analysis. This is an often overlooked, but very valuable and informative, part of any data pipeline. It gives you the deep understanding of the data structures and values that you will need to code and trouble-shoot code, allows you to spot failed or problematic samples and informs your decisions on quality control.

We suggest you sit together with your group in the workshop.

### Learning objectives

@@ -23,7 +24,7 @@ The successful student will be able to:

1. [Prepare](study_before_workshop.qmd)

i. 📖 Read how the data were generated and how they have been processed so far, insstall
i. 📖 Read how the data were generated and how they have been processed so far, install


2. [Workshop](workshop.qmd)
219 changes: 95 additions & 124 deletions omics/week-3/study_before_workshop.qmd
@@ -1,6 +1,6 @@
---
title: "Independent Study to prepare for workshop"
subtitle: "Omics 1: Hello data!"
subtitle: "Omics 1: 👋 Hello data!"
author: "Emma Rand"
format:
revealjs:
@@ -15,42 +15,30 @@ editor:
wrap: 72
---


## 🚧 NB still in construction 🚧



## Overview



::: incremental

- Concise summary of the experimental design and aims

- What the raw data consist of

- What has been done to the data so far

- What steps we will take in the workshop

:::

## The Data

There are three datasets

- 🐸 transcriptomic data (bulk RNA-seq) from frog embryos.
- 🐸 transcriptomic data (bulk RNA-seq) from frog embryos.

- 🐭 transcriptomic data (single cell RNA-seq) from stem cells

- 🍂 ??????? Metabolomic / Metagenomic data from anaerobic digesters



# Experimental design


## 🐸 Experimental design {auto-animate="true"}

![Schematic of frog development
@@ -95,198 +83,181 @@ width="200"}
## 🐸 Aim

::: incremental

- find genes important in frog development

- Important means genes that are differentially expressed between the control and the FGF treated sibling

- Differentially expressed means the expression on one group is signifcantly higher than the other
- Important means genes that are differentially expressed between the
control and the FGF treated sibling

- Differentially expressed means the expression in one group is
significantly higher than in the other
:::
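
As a rough illustration of "higher in one group than the other", the sketch below computes a log2 fold change between two groups. The expression values and the helper name `log2_fold_change` are invented for this example and are not taken from the frog data. It shows only the size of a difference; deciding whether a difference is significant needs a statistical test, which the workshops cover.

```python
import math
from statistics import mean

# Hypothetical normalised expression values for one gene (illustration only)
control = [100.0, 120.0, 110.0]
fgf_treated = [420.0, 460.0, 440.0]


def log2_fold_change(group_a, group_b):
    """Return log2 of (mean of group_b / mean of group_a)."""
    return math.log2(mean(group_b) / mean(group_a))


# Positive values mean higher expression in the second group
lfc = log2_fold_change(control, fgf_treated)
print(f"log2 fold change (FGF treated vs control): {lfc:.2f}")
```

A log2 fold change of 1 corresponds to a doubling and a value of -1 to a halving, which is why fold changes are usually reported on the log2 scale.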


## 🐸 Guided analysis

::: incremental

- The workshops will take you through comparing the control and FGF treated sibling at S30
- The workshops will take you through comparing the control and FGF
treated sibling at S30

- This is the "least interesting" comparison

- You will be guided to carefully document your work so you can apply the same methods to other comparisons

- You will be guided to carefully document your work so you can apply
the same methods to other comparisons
:::




## 🐭 Experimental design {auto-animate="true"}

![Schematic of stem cell experiment](images/88H-exp-design-jillian.png){fig-align="center"
![Schematic of stem cell
experiment](images/88H-exp-design-jillian.png){fig-align="center"
width="700"}

## 🐭 Experimental design {auto-animate="true"}

![Schematic of stem cell experiment](images/88H-exp-design-jillian.png){fig-align="left"
![Schematic of stem cell
experiment](images/88H-exp-design-jillian.png){fig-align="left"
width="200"}

::: incremental
- Cells were sorted using flow cytometry on the basis of cell surface
markers
markers

- There are three cell types: LT-HSCs, HSPCs, Progs
- There are three cell types: LT-HSCs, HSPCs, Progs

- Many cells of each cell type were sequenced

-
:::

## 🐭 Experimental design {auto-animate="true"}

![Schematic of stem cell experiment](images/88H-exp-design-jillian.png){fig-align="left"
![Schematic of stem cell
experiment](images/88H-exp-design-jillian.png){fig-align="left"
width="200"}

::: incremental
- There are three cell types: LT-HSCs, HSPCs, Progs [These are the
"treatments"]{style="color:#009900"}

- There are three cell types: LT-HSCs, HSPCs, Progs [These are the "treaments"]{style="color:#009900"}


- Many cells of each cell type were sequenced: [These are the replicates]{style="color:#009900"}
- Many cells of each cell type were sequenced: [These are the
replicates]{style="color:#009900"}

- [155 LT-HSCs, 701 HSPCs, 798 Progs]{style="color:#009900"}

:::

## 🐭 Aim

::: incremental
- find genes for cell surface proteins that are important in stem cell
identity

- find genes for cell surface proteins that are important in stem cell identity

- Important means genes that are differentially expressed between at least two cell types

- Differentially expressed means the expression on one group is significantly higher than the other
- Important means genes that are differentially expressed between at
least two cell types

- Differentially expressed means the expression in one group is
significantly higher than in the other
:::


## 🐭 Guided analysis

::: incremental

- The workshops will take you through comparing the HSPC and Prog cells
- The workshops will take you through comparing the HSPC and Prog
cells

- This is the "least interesting" comparison

- You will be guided to carefully document your work so you can apply the same methods to other comparisons

- You will be guided to carefully document your work so you can apply
the same methods to other comparisons
:::

# The raw data

<!-- ## Stem cells: Processing so far -->

<!-- - selection of cells -->
<!-- - selection of genes: subset of surfaceome -->
<!-- - log2 normalised values -->

<!-- ## Stem cells: Aims -->
## Raw Sequence data

<!-- - Find interesting **cell surface molecule genes** that vary between -->
<!-- cell types. -->


<!-- the difference between HSPC and Prog cells the difference between the -->
<!-- control and the FGF treated sibling at S30 \## Sequence Data -->

<!-- - The raw data are "reads" from a sequencing machine. Reads are short -->
<!-- sequences of DNA or RNA. -->

<!-- - The reads are aligned to a reference genome or transcriptome. The -->
<!-- reads are then counted to quantify the expression of each gene. The -->
<!-- counts are normalised to allow comparison between samples. -->
::: incremental
- The raw data are "reads" from a sequencing machine.

<!-- - reads -->
- A read is a sequence of DNA or RNA shorter than the whole genome or
transcriptome

<!-- - quality control -->
- The length of the reads depends on the type of sequencing machine

<!-- - align/pseudoalign -->
- Short-read technologies e.g. Illumina have higher base accuracy
but are harder to align
- Long-read technologies e.g. Nanopore have lower base accuracy
but are easier to align

<!-- - quantify -->
- Sequencing technology is constantly improving

<!-- - normalise -->
- Optional: You can read more about sequencing technologies in
[Statistically useful experimental
design](https://cloud-span.github.io/experimental_design00-overview/)
[@rand_statistically_2022]
:::

<!-- ## What is a read -->
## Raw Sequence data

<!-- - FASTQ format files -->
<!-- - sequences and information about each sequence's read accuracy -->
::: incremental
- The RNA-seq data are from an Illumina machine (reads of 150-300 bp);
metagenomic data are often from Nanopore machines (reads of
10,000-30,000 bp)

<!-- ## Differential expression -->
- Reads are in FASTQ files

<!-- - what is differentially expressed -->
<!-- - how are DE genes related -->
<!-- - what are DE genes involved with -->
- FASTQ files contain the sequence of each read and a quality score
for each base
:::
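
To make the FASTQ structure concrete, here is a minimal sketch that splits one four-line record into its parts and decodes the quality string. The record is invented for illustration, and the sketch assumes the Phred+33 encoding (score = ASCII code minus 33) commonly used by current Illumina files; the course files are not shown here, so treat that encoding as an assumption.

```python
# Hypothetical four-line FASTQ record, invented for illustration
record = """@read_1
GATTACA
+
IIIIH#5
"""


def parse_fastq_record(text):
    """Split one FASTQ record into read name, sequence and per-base scores."""
    header, sequence, _separator, quality = text.strip().split("\n")
    # Assumed Phred+33 encoding: score = ASCII code of the character minus 33
    scores = [ord(character) - 33 for character in quality]
    return header[1:], sequence, scores


name, sequence, scores = parse_fastq_record(record)
for base, score in zip(sequence, scores):
    print(name, base, score)
```

Each base is paired with exactly one score, so the sequence and quality lines always have the same length. Low scores such as the one decoded from `#` mark bases that quality filtering and trimming would typically remove.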

<!-- ## Stem cells: background -->
# What has been done to the data so far

## General steps

::: incremental
- Reads are filtered and trimmed on the basis of the quality score

- They are then aligned/pseudo-aligned to a reference
genome/transcriptome or, in metagenomics, assembled de novo.

- Reads are then counted to quantify expression or, in metagenomics,
the number of genomes

<!-- - Raw data: [GEO Series -->
<!-- GSE81682](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81682) -->
<!-- - Illumina HiSeq -->
<!-- - short reads 150-300bp -->
<!-- - [A single-cell resolution map of mouse hematopoietic stem and -->
<!-- progenitor cell -->
<!-- differentiation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5305050/) -->
<!-- [@nestorowa2016] -->
<!-- - 3,840 samples -->
<!-- - Reads were aligned using G-SNAP and the mapped reads were assigned -->
<!-- to Ensembl genes HTSeq -->
<!-- - GSE81682_HTSeq_counts.txt.gz (bottom of the page). And -->
<!-- [GSE81682_HTSeq_counts.txt.zip](../data/jillian/GSE81682_HTSeq_counts.zip) -->
- Counts are normalised to account for differences in sequencing depth
and gene/transcript/genome length before statistical analysis
:::
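
The idea of normalising for sequencing depth can be sketched with counts per million (CPM). This is a deliberately simple stand-in: it is not the normalisation performed by `DESeq2` or `scran`, it ignores gene length, and the sample names and counts are invented for illustration.

```python
import math

# Hypothetical raw counts: sample_b was sequenced four times as deeply
counts = {
    "sample_a": {"gene_1": 10, "gene_2": 90, "gene_3": 0},
    "sample_b": {"gene_1": 40, "gene_2": 360, "gene_3": 0},
}


def counts_per_million(sample_counts):
    """Scale each gene's count by the sample's total count, per million reads."""
    total = sum(sample_counts.values())
    return {gene: count * 1_000_000 / total for gene, count in sample_counts.items()}


cpm = {sample: counts_per_million(genes) for sample, genes in counts.items()}

# log2(CPM + 1): adding 1 avoids taking the log of zero for unexpressed genes
log2_cpm = {
    sample: {gene: math.log2(value + 1) for gene, value in genes.items()}
    for sample, genes in cpm.items()
}

for sample, genes in cpm.items():
    print(sample, genes)
```

After scaling, the two samples have the same value for every gene even though the raw counts differ fourfold, which is the point of accounting for depth before comparing samples.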

<!-- ## Frog development: Processing so far -->
## 🐸 Data

<!-- - selection of cells -->
<!-- - selection of genes: subset of surfaceome -->
<!-- - log2 normalised values -->
- Unpublished (so far!)

<!-- ## Frog development: Aims -->
- Expression for the whole transcriptome [*X. laevis* v10.1 genome
assembly](https://www.xenbase.org/xenbase/static-xenbase/ftpDatafiles.jsp)

<!-- - Find interesting **cell surface molecule genes** that vary between -->
<!-- cell types. -->
- Values are raw counts

<!-- ## -->
- The statistical analysis method we will use, `DESeq2` [@DESeq2],
requires raw counts and performs the normalisation itself

<!-- ## Deliverables -->
## 🐭 Data

<!-- 1. Describe the data -->
- Published in @nestorowa2016

<!-- - Number of cells/samples/reps/treatments -->
<!-- - number of genes -->
<!-- - type of expression values -->
<!-- - prior processing -->
<!-- - missing values -->
<!-- - overview of expression -->
<!-- - clustering of genes/samples -->
- Expression for a subset of genes, the surfaceome

<!-- 2. Report on differential expression between two groups -->
- Values are log2 normalised values

<!-- - number of DE at 1%, 5% and 10%. -->
<!-- - table of expression, fold changes, signifcance at each sig. -->
<!-- - Volcano plot -->
- The statistical analysis method we will use, `scran` [@scran],
requires normalised values

<!-- 3. Report list of marker candidate gene IDs for a cell type of choice. -->
<!-- Justify filters. Table with fold FC, p values, IDs, canonical gene -->
<!-- names -->
# Workshops

<!-- 4. Interpret the biology by reporting on a few group of genes and the -->
<!-- processes in which they are involved. -->
## Workshops

<!-- 5. Report on your chosen genes and explain why you think they are good -->
<!-- candidates for follow up work -->
- Omics 1: Hello data! Getting to know the data. Checking the
distributions of values overall, across samples and across genes to
check things are as we expect and detect genes/samples that need to
be removed.

<!-- ## revise pivot longer -->
- Omics 2: Statistical Analysis. Identifying which genes are
differentially expressed between treatments. This is the main
analysis step. We will use different methods for bulk and single
cell data.

<!-- ## revise pivot longer -->
- Omics 3: Visualising and Interpreting. Production of volcano plots
and heatmaps to visualise the results of the statistical analysis.
We will also look at how to interpret the results and how to find
out more about the genes of interest.