Remind yourself of the key columns you have in the results files:
FDR or padj
baseMean is the mean of the normalised counts for the gene across all samples
lfcSE is the standard error of the fold change
stat is the test statistic (the Wald statistic)
DESeq2 (Love, Huber, and Anders 2014)
Rows: 280
$ ensembl_gene_id <chr> "ENSMUSG00000028639", "ENSMUSG00000024053", "ENSMUSG00…
summary.logFC and logFC.hspc give the same value (in this case, since we are comparing two cell types)
scran (Lun, McCarthy, and Marioni 2016)
The gene id is difficult to interpret in plots and tables
Therefore we need to add information such as the gene name and a description to the results
For the 🐸 Frog data information comes from Xenbase (Fisher et al. 2023)
For the 🐭 Mice data information comes from Ensembl (Birney et al. 2004)
Xenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.
It took me some time to find the information you need.
It is listed as: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)
Click on the readme link to see the file format and columns
I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx
In the workshop you will import this file and merge the information with the results file
Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics
BioMart provides access to these large datasets
biomaRt (Durinck et al. 2009) is a Bioconductor package that gives you programmatic access to BioMart
In the workshop you use this package to get information you can merge with the results file
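The merge step can be sketched in base R on toy data. This is only an illustration under stated assumptions: the gene ids, statistics, names and descriptions below are made-up placeholders, and the annotation column names (external_gene_name, description) are assumed rather than taken from the workshop files.

```r
# Toy results table (placeholder ids and made-up statistics)
results <- data.frame(
  ensembl_gene_id = c("GENE0001", "GENE0002", "GENE0003"),
  summary.logFC   = c(1.8, -2.1, 0.3),
  FDR             = c(0.001, 0.0004, 0.62)
)

# Toy annotation table (hypothetical names and descriptions);
# note that GENE0003 has no annotation
gene_info <- data.frame(
  ensembl_gene_id    = c("GENE0001", "GENE0002"),
  external_gene_name = c("name_1", "name_2"),
  description        = c("placeholder description 1",
                         "placeholder description 2")
)

# all.x = TRUE keeps every row of the results, filling NA where
# no annotation was found for a gene id
annotated <- merge(results, gene_info,
                   by = "ensembl_gene_id", all.x = TRUE)
print(annotated)
```

Keeping all the rows of the results (all.x = TRUE) means genes without annotation are not silently dropped.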
Dimension reduction
Lots of variables and lots of observations
Log normalisation
In general, we plot data to help us summarise and understand it
This is especially important for omics data where we have a very large number of variables and often a large number of observations
We will look at three plots very commonly used in omics analysis: Principal Component Analysis (PCA) plots, heatmaps and volcano plots
Principal Component Analysis is an unsupervised machine learning technique
Unsupervised methods are unsupervised in that they do not use, or optimise towards, a particular output. The goal is to uncover structure; they do not test hypotheses
It is often used to visualise high dimensional data because it is a dimension reduction technique
It takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of variables (called principal components) that explain most of the variation in the data
The principal components can be plotted to see how samples cluster together
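A minimal sketch of the idea, using base R's prcomp() on simulated data rather than the workshop data. The matrix, the sample and gene names, the group labels and the size of the shift are all assumptions made for illustration.

```r
set.seed(1)
n_genes <- 50

# Simulated expression: 6 samples (rows) by 50 genes (columns)
expr <- matrix(rnorm(6 * n_genes), nrow = 6,
               dimnames = list(paste0("sample_", 1:6),
                               paste0("gene_", 1:n_genes)))

# Pretend samples 4 to 6 are treated: shift their first 10 genes
expr[4:6, 1:10] <- expr[4:6, 1:10] + 3

# PCA with samples in rows and genes (the variables) in columns
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 2))

# Scores on the first two components, ready for plotting
scores <- data.frame(pca$x[, 1:2],
                     group = rep(c("control", "treated"), each = 3))
plot(scores$PC1, scores$PC2, pch = 19,
     col = ifelse(scores$group == "treated", "red", "blue"),
     xlab = "PC1", ylab = "PC2")
```

Each point in the plot is a sample, so samples with similar expression across all the genes should sit close together.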
This gives some insight, but we have 280 (mice) or 10,000+ (frogs) genes to consider. How do we know if the pair we use is typical? How can we consider all the genes at once?
We have done PCA in Omics 3, but often PCA might be one of the first exploratory steps because it gives you an idea of whether to expect general patterns in gene expression that distinguish groups.
Heatmaps are a grid of genes on one axis and samples on the other, with each grid cell coloured by another variable
In this case the other variable is gene expression
They allow you to quickly get an overview of the expression patterns across genes and samples
We often couple them with clustering to group genes and samples with similar expression patterns together, which helps us see which genes are responsible for distinguishing groups
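The coupling of a heatmap with clustering can be sketched with base R's dist(), hclust() and heatmap(). This is an illustration on simulated values, not the workshop's heatmaply code: the matrix, the names and the size of the up and down shifts are assumptions.

```r
set.seed(2)

# Simulated expression: 8 genes (rows) by 6 samples (columns)
mat <- matrix(rnorm(8 * 6), nrow = 8,
              dimnames = list(paste0("gene_", 1:8),
                              paste0("sample_", 1:6)))

# Pretend samples 4 to 6 are treated:
# genes 1 to 4 go up, genes 5 to 8 go down
mat[1:4, 4:6] <- mat[1:4, 4:6] + 4
mat[5:8, 4:6] <- mat[5:8, 4:6] - 4

# Hierarchical clustering of genes (rows) and of samples (columns),
# each cut into two groups
gene_clusters   <- cutree(hclust(dist(mat)), k = 2)
sample_clusters <- cutree(hclust(dist(t(mat))), k = 2)
print(gene_clusters)
print(sample_clusters)

# heatmap() reorders rows and columns by the same kind of clustering
heatmap(mat, scale = "row")
```

The dendrograms drawn on the margins of the heatmap show the same grouping that cutree() reports.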
rlog is a method to reduce the bias from low count genes. https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/03_DGE_QC_analysis.html gives a good explanation of the regularized log transform (rlog)
The rlog transformation of the normalized counts is only necessary for these visualization methods during quality assessment. The transformed values are not used for DE because DESeq2 takes care of that
In the workshop we just log transformed
See next slide for information
On the vertical axis are genes which are differentially expressed at the 0.01 level
On the horizontal axis are samples
We can see that the FGF-treated samples cluster together and the control samples cluster together
We can also see two clusters of genes; one of these shows genes upregulated (more yellow) in the FGF-treated samples and the other shows genes downregulated (more blue) in the FGF-treated samples
Volcano plots are often used to visualise the results of differential expression analysis
They are just a scatter of the corrected p value against the fold change…
Almost: we actually plot the negative log of the corrected p value against the fold change
This is because just plotting the p-value makes the axis counterintuitive: small p-values (i.e., significant values) are at the bottom of the axis
And since p-values range from 1 to very tiny, the points are all squashed at the bottom of the axis
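The effect of the transformation is easy to check in base R. The values below are made up for illustration, and log base 10 is an assumption (it is the usual choice for volcano plots, but the notes above only say "negative log").

```r
# Made-up corrected p-values, from not significant to very significant
fdr <- c(1, 0.5, 0.05, 0.01, 0.0001)

# Made-up log fold changes for the same five genes
log_fc <- c(0.1, -0.4, 1.2, -2.0, 3.1)

# Negative log10: small p-values become large positive numbers
neg_log10_fdr <- -log10(fdr)
print(neg_log10_fdr)

# The volcano plot: fold change on x, -log10(corrected p) on y
plot(log_fc, neg_log10_fdr, pch = 19,
     xlab = "log fold change", ylab = "-log10(FDR)")
```

An FDR of 0.01 maps to 2 and an FDR of 0.0001 maps to 4, so the most significant genes end up at the top of the plot and are spread out rather than squashed together.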
Visualisation should be done on normalised data so meaningful comparisons can be made
The 🐭 mouse data were already log2 normalised
The 🐸 frog data were normalised by the DE method and saved to file. We will log2 transform before doing visualisations
Install heatmaply and ggrepel from CRAN in the normal way
Omics 1: Hello data. Getting to know the data. Checking the distributions of values
Omics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments.
Omics 3: Visualising and Interpreting. PCA, Volcano plots and heatmaps to visualise results. Interpreting the results and finding out more about genes of interest.