diff --git a/Rplot001.jpg b/Rplot001.jpg new file mode 100644 index 0000000..e69de29 diff --git a/_site/omics/week-5/figures/frog-s30-pca.png b/_site/omics/week-5/figures/frog-s30-pca.png new file mode 100644 index 0000000..f4f1799 Binary files /dev/null and b/_site/omics/week-5/figures/frog-s30-pca.png differ diff --git a/_site/omics/week-5/figures/prog-hspc-volcano.png b/_site/omics/week-5/figures/prog-hspc-volcano.png new file mode 100644 index 0000000..4122bc5 Binary files /dev/null and b/_site/omics/week-5/figures/prog-hspc-volcano.png differ diff --git a/_site/omics/week-5/figures/prog_hspc-pca.png b/_site/omics/week-5/figures/prog_hspc-pca.png new file mode 100644 index 0000000..315fc8d Binary files /dev/null and b/_site/omics/week-5/figures/prog_hspc-pca.png differ diff --git a/_site/omics/week-5/images/Xenbase-Logo-Medium.png b/_site/omics/week-5/images/Xenbase-Logo-Medium.png new file mode 100644 index 0000000..2121bb0 Binary files /dev/null and b/_site/omics/week-5/images/Xenbase-Logo-Medium.png differ diff --git a/_site/omics/week-5/images/frog-heat.png b/_site/omics/week-5/images/frog-heat.png new file mode 100644 index 0000000..8cf1b5f Binary files /dev/null and b/_site/omics/week-5/images/frog-heat.png differ diff --git a/_site/omics/week-5/images/volcano-why.png b/_site/omics/week-5/images/volcano-why.png new file mode 100644 index 0000000..26b786b Binary files /dev/null and b/_site/omics/week-5/images/volcano-why.png differ diff --git a/_site/omics/week-5/images/why_pca_frog.png b/_site/omics/week-5/images/why_pca_frog.png new file mode 100644 index 0000000..14bb9f6 Binary files /dev/null and b/_site/omics/week-5/images/why_pca_frog.png differ diff --git a/_site/omics/week-5/images/why_pca_mouse.png b/_site/omics/week-5/images/why_pca_mouse.png new file mode 100644 index 0000000..04ef843 Binary files /dev/null and b/_site/omics/week-5/images/why_pca_mouse.png differ diff --git a/_site/omics/week-5/meta/xenbase_info.xlsx b/_site/omics/week-5/meta/xenbase_info.xlsx new file mode 100644 index 0000000..940c933 Binary files /dev/null and b/_site/omics/week-5/meta/xenbase_info.xlsx differ diff --git a/_site/omics/week-5/study_before_workshop.html b/_site/omics/week-5/study_before_workshop.html new file mode 100644 index 0000000..ed3f7a3 --- /dev/null +++ b/_site/omics/week-5/study_before_workshop.html @@ -0,0 +1,1123 @@ + + + + + + + + + + + + +Data Analysis for Group Project - Independent Study to prepare for workshop + + + + + + + + + + + + + + + + + +
+
+ +

Independent Study to prepare for workshop

+

Omics 3: Visualising and Interpreting

+ +
+
+
+Emma Rand +
+
+
+ +

23 October, 2023

+

Overview

+

In these slides we will:

+
+
    +
  • Check where you are

  • +
  • +

    learn some concepts used omics visualisation

    +
      +
    • Principle Component Analysis (PCA)
    • +
    • Volcano plots
    • +
    • Heatmaps
    • +
    +
  • +
  • Find out what packages to install before the workshop

  • +
+
+

Where should you be?

+ +

What we did in Omics 2: Statistical Analysis

+
+
    +
  • carried out differential expression analysis

  • +
  • found genes not expressed at all, or expressed in one group only

  • +
  • Saved results files

  • +
+
+

Where should you be?

+

After the Omics 2: 👋 Statistical Analysis Workshop including:

+

🐸 Frogs

+
+
    +
  • An RStudio Project called frogs-88H which contains: +
      +
    • Raw data (S14, S20 and S30)
    • +
    • Processed data (s30_filtered.csv, s30_summary_gene.csv, s30_summary_gene_filtered.csv, s30_summary_samp.csv and equivalents for S14 OR S20)
    • +
    • Results files (s30_fgf_only.csv, S30_normalised_counts.csv, S30_results.csv and equivalents for S14 OR S20)
      +
    • +
    • Two scripts called cont-fgf-s30.R and either cont-fgf-s20.R OR cont-fgf-s14.R +
    • +
    +
  • +
+
+

Files should be organised into folders. Code should well commented and easy to read.

+

🐭 Mice

+
+
    +
  • An RStudio Project called mice-88H which contains +
      +
    • Raw data (hspc, prog, lthsc)
    • +
    • Processed data (hspc_summary_gene.csv, hspc_summary_samp.csv, prog_summary_gene.csv, prog_summary_samp.csv, lthsc_summary_gene.csv, lthsc_summary_samp.csv)
    • +
    +
  • +
  • Results files (prog_hspc_results.csv and an equivalent for lthsc vs prog or hspc vs lthsc)
  • +
  • Two scripts called hspc-prog.R and either hspc-lthsc.R OR prog-lthsc.R +
  • +
+
+

Files should be organised into folders. Code should well commented and easy to read.

+

🍂

+

Either of the other examples.

+

If you do not have those

+

Go through:

+

Examine the results files

+ +

Examine the results files

+

Remind yourself of the key columns you have in the results files:

+
    +
  • a log2 fold change
  • +
  • an unadjusted p-value
  • +
  • a p value adjusted for multiple testing (FDR or padj)
  • +
  • a gene id
  • +

🐸 Frogs

+
+
+
Rows: 10,136
+Columns: 7
+$ baseMean        <dbl> 237.553928, 531.565700, 86.392830, 49.813502, 419.9983…
+$ log2FoldChange  <dbl> 0.096601855, -0.089588528, -0.192811203, -0.008858703,…
+$ lfcSE           <dbl> 0.2079396, 0.1557384, 0.3253216, 0.4342614, 0.1685420,…
+$ stat            <dbl> 0.46456683, -0.57525007, -0.59267874, -0.02039947, -0.…
+$ pvalue          <dbl> 0.64224169, 0.56512218, 0.55339617, 0.98372471, 0.8699…
+$ padj            <dbl> 0.9998970, 0.9998970, 0.9998970, 0.9998970, 0.9998970,…
+$ xenbase_gene_id <chr> "XB-GENE-1000007", "XB-GENE-1000023", "XB-GENE-1000062…
+
+
+
+
    +
  • +baseMean is the mean of the normalised counts for the gene across all samples
  • +
  • +lfcSE standard error of the fold change
  • +
  • +stat is the test statistic (the Wald statistic)
  • +
  • Generated by DESeq2 (Love, Huber, and Anders 2014) +
  • +
+
+

🐭 Mice

+
+
+
Rows: 280
+Columns: 6
+$ Top             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
+$ p.value         <dbl> 7.038138e-117, 4.736622e-90, 1.832630e-88, 4.211954e-7…
+$ FDR             <dbl> 1.970679e-114, 6.631271e-88, 1.710455e-86, 2.948368e-7…
+$ summary.logFC   <dbl> 1.596910, 3.035165, 3.261056, -2.146491, -3.056730, 3.…
+$ logFC.hspc      <dbl> 1.596910, 3.035165, 3.261056, -2.146491, -3.056730, 3.…
+$ ensembl_gene_id <chr> "ENSMUSG00000028639", "ENSMUSG00000024053", "ENSMUSG00…
+
+
+
+
    +
  • Top is the rank of the gene ordered by the p-value (smallest first)
  • +
  • +summary.logFC and logFC.hspc give the same value (in this case since comparing two cell types)
  • +
  • generated by scran (Lun, McCarthy, and Marioni 2016) +
  • +
+
+

Adding gene information

+ +

Adding gene information

+
+
    +
  • The gene id is difficult to interpret in plots/tables

  • +
  • Therefore we need to add information such as the gene name and a description to the results

  • +
  • For the 🐸 Frog data information comes from Xenbase (Fisher et al. 2023)

  • +
  • For the 🐭 Mice data information comes from Ensembl (Birney et al. 2004)

  • +
+
+

🐸 Xenbase

+ +

xenbase logo

Xenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.

+
+

It took me some time to find the information you need.

+
+

🐸 Xenbase

+
+
    +
  • I got the information from the Xenbase information pages under Data Reports | Gene Information

  • +
  • This is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)

  • +
  • Click on the readme link to see the file format and columns

  • +
  • I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx

  • +
  • In the workshop you will import this file and merge the information with the results file

  • +
+
+

🐭 Ensembl

+
+
    +
  • Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics

  • +
  • BioMart provides a access to these large datasets

  • +
  • biomaRt (Durinck et al. 2009) is a Bioconductor package gives you programmatic access to BioMart.

  • +
  • In the workshop you use this package to get information you can merge with the results file

  • +
+
+

Plots

+ +

What is the purpose of an Omics plot?

+
+
    +
  • In general, we plot data to help us summarise and understand it

  • +
  • This is especially import for omics data where we have a very large number of variables and often a large number of observations

  • +
  • We will look at three plots very commonly used in omics analysis: Principal Component Analysis (PCA) plot, Heatmaps and Volcano Plots

  • +
+
+

Principal Component Analysis (PCA)

+ +

PCA

+
+
    +
  • Principal Component Analysis is an unsupervised machine learning technique

  • +
  • Unsupervised methods1 are unsupervised in that they do not use/optimise to a particular output. The goal is to uncover structure. They do not test hypotheses

  • +
  • It is often used to visualise high dimensional data because it is a dimension reduction technique

  • +
+
+

PCA

+
+
    +
  • It takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of variables (called principal components) that explain most of the variation in the data

  • +
  • The principal components can be plotted to see how samples cluster together

  • +
+
+

PCA

+
    +
  • To see if samples cluster as we would expect, we might plot the expression of one gene against another
  • +
+
+
+
+
+

+

Samples

+
+
+
+
+
+

+

Cells

+
+
+
+
+
+

This gives some insight but we have 280 (mice) or 10,000+(frogs) genes to consider. How do we know if the pair we use is typical? How can we consider al the genes at once?

+

PCA

+
    +
  • PCA is a solution for this - It takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of “principal components” that explain most of the variation in the data.
  • +
+
+
+
+
+

+

Samples

+
+
+
+
+
+

+

Cells

+
+
+
+
+
+

We have done PCA in Omics 3, but often PCA might be one of the first exploratory steps because it gives you an idea whether you expect general patterns in gene expression that distinguish groups.

+

Heatmaps

+ +

Heatmaps

+
+
    +
  • are a grid of genes on one axis and samples on the other with each grid cell coloured by another variable

  • +
  • in this case the other variable is gene expression

  • +
  • they allow you to quickly get an overview of the expression patterns across genes and samples

  • +
  • we often couple them with clustering to group genes and samples with similar expression patterns together which helps us see which genes are responsible for distinguishing groups

  • +
+
+

+
+

+

Heat map for the frog data

+
+
+

See next slide for information

+

Heatmaps

+
+
    +
  • On the vertical axis are genes which are differentially expressed at the 0.01 level

  • +
  • On the horizontal axis are samples

  • +
  • We can see that the FGF-treated samples cluster together and the control samples cluster together

  • +
  • We can also see two clusters of genes; one of these shows genes upregulated (more yellow) in the FGF-treated samples and the other shows genes downregulated (more blue) in the FGF-treated samples

  • +
+
+

Volcano plots

+ +

Volcano plots

+
+
    +
  • Volcano plots often used to visualise the results of differential expression analysis

  • +
  • They are just a scatter of the corrected p value against the fold change….

  • +
  • almost - the we actually plot the negative log of the corrected p value against the fold change

  • +
+
+

Volcano plots

+
+
    +
  • This is because just plotting the p-value means the axis is counter intuitive. Small p-values (i.e., significant values) are at the bottom of the axis)

  • +
  • And since p-values range from 1 to very tiny the points are all squashed at the bottom of the axis

  • +
+
+ +

Volcano plot FDR against fold change

Volcano plots

+
+
    +
  • Plotting the negative log of the corrected p-value means that the values are spread out and the significant values are at the top of the axis
  • +
+
+ +

Volcano plot -log(FDR) against fold change

Visualisations

+
    +
  • Should be done on normalised data so meaningful comparisons can be made

  • +
  • The 🐭 mouse data were already log2normalised

  • +
  • The 🐸 frog data were normalised by the DE method and saved to file. We will log2 transform before doing visualisations

  • +

Packages to install before the workshop

+

heatmaply ggrepel from CRAN in the the normal way:

+
+
install.packages("heatmaply")
+install.packages("ggrepel")
+
+

biomaRt from Bioconductor using BiocManager:

+
+
BiocManager::install("biomaRt")
+
+

Workshops

+ +

Workshops

+
    +
  • Omics 1: Hello data Getting to know the data. Checking the distributions of values

  • +
  • Omics 2: Statistical Analysis Identifying which genes are differentially expressed between treatments.

  • +
  • Omics 3: Visualising and Interpreting. PCA, Volcano plots and heatmaps to visualise results. Interpreting the results and finding out more about genes of interest.

  • +

References

+ + +
+
+Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. “An Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604. +
+
+Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt” 4. +
+
+Fisher, Malcolm, Christina James-Zorn, Virgilio Ponferrada, Andrew J Bell, Nivitha Sundararaj, Erik Segerdell, Praneet Chaturvedi, et al. 2023. “Xenbase: Key Features and Resources of the Xenopus Model Organism Knowledgebase.” Genetics 224 (1): iyad018. https://doi.org/10.1093/genetics/iyad018. +
+
+Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2” 15: 550. https://doi.org/10.1186/s13059-014-0550-8. +
+
+Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor” 5: 2122. https://doi.org/10.12688/f1000research.9501.2. +
+
+Rand, Emma. 2021. Data Science Strand of BIO00058M. https://doi.org/10.5281/zenodo.5527705. +
+
+
+
+
+ + + + + \ No newline at end of file diff --git a/_site/search.json b/_site/search.json index 20eaa12..41513ad 100644 --- a/_site/search.json +++ b/_site/search.json @@ -1417,5 +1417,250 @@ "Week 4: Statistical Analysis", "Prepare!" ] + }, + { + "objectID": "omics/week-5/study_before_workshop.html#overview", + "href": "omics/week-5/study_before_workshop.html#overview", + "title": "Independent Study to prepare for workshop", + "section": "Overview", + "text": "Overview\nIn these slides we will:\n\n\nCheck where you are\n\nlearn some concepts used omics visualisation\n\nPrinciple Component Analysis (PCA)\nVolcano plots\nHeatmaps\n\n\nFind out what packages to install before the workshop" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#what-we-did-in-omics-2-statistical-analysis", + "href": "omics/week-5/study_before_workshop.html#what-we-did-in-omics-2-statistical-analysis", + "title": "Independent Study to prepare for workshop", + "section": "What we did in Omics 2: Statistical Analysis", + "text": "What we did in Omics 2: Statistical Analysis\n\n\ncarried out differential expression analysis\nfound genes not expressed at all, or expressed in one group only\nSaved results files" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#where-should-you-be-1", + "href": "omics/week-5/study_before_workshop.html#where-should-you-be-1", + "title": "Independent Study to prepare for workshop", + "section": "Where should you be?", + "text": "Where should you be?\nAfter the Omics 2: 👋 Statistical Analysis Workshop including:\n\n🤗 Look after future you! and\nthe Independent Study to consolidate, you should have:" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#frogs", + "href": "omics/week-5/study_before_workshop.html#frogs", + "title": "Independent Study to prepare for workshop", + "section": "🐸 Frogs", + "text": "🐸 Frogs\n\n\nAn RStudio Project called frogs-88H which contains:\n\nRaw data (S14, S20 and S30)\nProcessed data (s30_filtered.csv, s30_summary_gene.csv, s30_summary_gene_filtered.csv, s30_summary_samp.csv and equivalents for S14 OR S20)\nResults files (s30_fgf_only.csv, S30_normalised_counts.csv, S30_results.csv and equivalents for S14 OR S20)\n\nTwo scripts called cont-fgf-s30.R and either cont-fgf-s20.R OR cont-fgf-s14.R\n\n\n\n\n\nFiles should be organised into folders. Code should well commented and easy to read." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#mice", + "href": "omics/week-5/study_before_workshop.html#mice", + "title": "Independent Study to prepare for workshop", + "section": "🐭 Mice", + "text": "🐭 Mice\n\n\nAn RStudio Project called mice-88H which contains\n\nRaw data (hspc, prog, lthsc)\nProcessed data (hspc_summary_gene.csv, hspc_summary_samp.csv, prog_summary_gene.csv, prog_summary_samp.csv, lthsc_summary_gene.csv, lthsc_summary_samp.csv)\n\n\nResults files (prog_hspc_results.csv and an equivalent for lthsc vs prog or hspc vs lthsc)\nTwo scripts called hspc-prog.R and either hspc-lthsc.R OR prog-lthsc.R\n\n\n\nFiles should be organised into folders. Code should well commented and easy to read." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#section", + "href": "omics/week-5/study_before_workshop.html#section", + "title": "Independent Study to prepare for workshop", + "section": "🍂", + "text": "🍂\nEither of the other examples." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#if-you-do-not-have-those", + "href": "omics/week-5/study_before_workshop.html#if-you-do-not-have-those", + "title": "Independent Study to prepare for workshop", + "section": "If you do not have those", + "text": "If you do not have those\nGo through:\n\nOmics 2: Statistical Analysis including:\n🤗 Look after future you! and\nthe Independent Study to consolidate" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#examine-the-results-files-1", + "href": "omics/week-5/study_before_workshop.html#examine-the-results-files-1", + "title": "Independent Study to prepare for workshop", + "section": "Examine the results files", + "text": "Examine the results files\nRemind yourself of the key columns you have in the results files:\n\na log2 fold change\nan unadjusted p-value\na p value adjusted for multiple testing (FDR or padj)\na gene id" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#frogs-1", + "href": "omics/week-5/study_before_workshop.html#frogs-1", + "title": "Independent Study to prepare for workshop", + "section": "🐸 Frogs", + "text": "🐸 Frogs\n\n\nRows: 10,136\nColumns: 7\n$ baseMean 237.553928, 531.565700, 86.392830, 49.813502, 419.9983…\n$ log2FoldChange 0.096601855, -0.089588528, -0.192811203, -0.008858703,…\n$ lfcSE 0.2079396, 0.1557384, 0.3253216, 0.4342614, 0.1685420,…\n$ stat 0.46456683, -0.57525007, -0.59267874, -0.02039947, -0.…\n$ pvalue 0.64224169, 0.56512218, 0.55339617, 0.98372471, 0.8699…\n$ padj 0.9998970, 0.9998970, 0.9998970, 0.9998970, 0.9998970,…\n$ xenbase_gene_id \"XB-GENE-1000007\", \"XB-GENE-1000023\", \"XB-GENE-1000062…\n\n\n\n\n\nbaseMean is the mean of the normalised counts for the gene across all samples\n\nlfcSE standard error of the fold change\n\nstat is the test statistic (the Wald statistic)\nGenerated by DESeq2 (Love, Huber, and Anders 2014)" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#mice-1", + "href": "omics/week-5/study_before_workshop.html#mice-1", + "title": "Independent Study to prepare for workshop", + "section": "🐭 Mice", + "text": "🐭 Mice\n\n\nRows: 280\nColumns: 6\n$ Top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…\n$ p.value 7.038138e-117, 4.736622e-90, 1.832630e-88, 4.211954e-7…\n$ FDR 1.970679e-114, 6.631271e-88, 1.710455e-86, 2.948368e-7…\n$ summary.logFC 1.596910, 3.035165, 3.261056, -2.146491, -3.056730, 3.…\n$ logFC.hspc 1.596910, 3.035165, 3.261056, -2.146491, -3.056730, 3.…\n$ ensembl_gene_id \"ENSMUSG00000028639\", \"ENSMUSG00000024053\", \"ENSMUSG00…\n\n\n\n\nTop is the rank of the gene ordered by the p-value (smallest first)\n\nsummary.logFC and logFC.hspc give the same value (in this case since comparing two cell types)\ngenerated by scran (Lun, McCarthy, and Marioni 2016)" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#from-xenbase", + "href": "omics/week-5/study_before_workshop.html#from-xenbase", + "title": "Independent Study to prepare for workshop", + "section": "from xenbase", + "text": "from xenbase\n\nxenbase logoXenbase (http://www.xenbase.org/, RRID:SCR_003280)\nXenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis. Xenbase is funded by the National Institutes of Health (NIH) and the National Science Foundation (NSF).\nour data gives the xenbase gene id so we are using xenbase to get the information a lot of the information would also be in the ncbi" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#from-the-ncbi", + "href": "omics/week-5/study_before_workshop.html#from-the-ncbi", + "title": "Independent Study to prepare for workshop", + "section": "from the ncbi", + "text": "from the ncbi\nbiomart is a package that allows you to get information from the ncbi database such as gene names and descriptions" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#plots-purpose", + "href": "omics/week-5/study_before_workshop.html#plots-purpose", + "title": "Independent Study to prepare for workshop", + "section": "plots purpose", + "text": "plots purpose\ndimsenion reduction" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#pca", + "href": "omics/week-5/study_before_workshop.html#pca", + "title": "Independent Study to prepare for workshop", + "section": "PCA", + "text": "PCA\n\n\nPrincipal Component Analysis is an unsupervised machine learning technique\nUnsupervised methods1 are unsupervised in that they do not use/optimise to a particular output. The goal is to uncover structure. They do not test hypotheses\nIt is often used to visualise high dimensional data because it is a dimension reduction technique\n\n\nYou may wish to read a previous introduction to unsupervised methods I have written An introduction to Machine Learning: Unsupervised methods (Rand 2021)" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#tsne", + "href": "omics/week-5/study_before_workshop.html#tsne", + "title": "Independent Study to prepare for workshop", + "section": "tsne", + "text": "tsne\nlots of variables and lots of observations" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#normalising-before-plotting", + "href": "omics/week-5/study_before_workshop.html#normalising-before-plotting", + "title": "Independent Study to prepare for workshop", + "section": "normalising before plotting", + "text": "normalising before plotting\nlog\nnormalisation regularised log is a method to bias from low count genes. https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/03_DGE_QC_analysis.html\n\n\nT\n\n\nrlog is a method to bias from low count genes. https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/03_DGE_QC_analysis.html gives a good explanation of regularized the log transform (rlog)\nThe rlog transformation of the normalized counts is only necessary for these visualization methods during this quality assessment. They are not used for DE because DESeq2 takes care of that\nin the workshop we just to log transformed\n\nThe 🐭 mouse data have been normalised to simplify the analysis for you; the 🐸 frog data have not but the DE method will do this for you." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#packages-to-install-before-the-workshop", + "href": "omics/week-5/study_before_workshop.html#packages-to-install-before-the-workshop", + "title": "Independent Study to prepare for workshop", + "section": "Packages to install before the workshop", + "text": "Packages to install before the workshop\nheatmaply ggrepel from CRAN in the the normal way:\n\ninstall.packages(\"heatmaply\")\ninstall.packages(\"ggrepel\")\n\nbiomaRt from Bioconductor using BiocManager:\n\nBiocManager::install(\"biomaRt\")" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#workshops-1", + "href": "omics/week-5/study_before_workshop.html#workshops-1", + "title": "Independent Study to prepare for workshop", + "section": "Workshops", + "text": "Workshops\n\nOmics 1: Hello data Getting to know the data. Checking the distributions of values\nOmics 2: Statistical Analysis Identifying which genes are differentially expressed between treatments.\nOmics 3: Visualising and Interpreting. PCA, Volcano plots and heatmaps to visualise results. Interpreting the results and finding out more about genes of interest." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#references", + "href": "omics/week-5/study_before_workshop.html#references", + "title": "Independent Study to prepare for workshop", + "section": "References", + "text": "References\n\n\n🔗 About Omics 3: Visualising and Interpreting\n\n\n\nBirney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. “An Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.\n\n\nDurinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt” 4.\n\n\nFisher, Malcolm, Christina James-Zorn, Virgilio Ponferrada, Andrew J Bell, Nivitha Sundararaj, Erik Segerdell, Praneet Chaturvedi, et al. 2023. “Xenbase: Key Features and Resources of the Xenopus Model Organism Knowledgebase.” Genetics 224 (1): iyad018. https://doi.org/10.1093/genetics/iyad018.\n\n\nLove, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2” 15: 550. https://doi.org/10.1186/s13059-014-0550-8.\n\n\nLun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor” 5: 2122. https://doi.org/10.12688/f1000research.9501.2.\n\n\nRand, Emma. 2021. Data Science Strand of BIO00058M. https://doi.org/10.5281/zenodo.5527705." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#adding-gene-information-1", + "href": "omics/week-5/study_before_workshop.html#adding-gene-information-1", + "title": "Independent Study to prepare for workshop", + "section": "Adding gene information", + "text": "Adding gene information\n\n\nThe gene id is difficult to interpret in plots/tables\nTherefore we need to add information such as the gene name and a description to the results\nFor the 🐸 Frog data information comes from Xenbase (Fisher et al. 2023)\nFor the 🐭 Mice data information comes from Ensembl (Birney et al. 2004)" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#xenbase", + "href": "omics/week-5/study_before_workshop.html#xenbase", + "title": "Independent Study to prepare for workshop", + "section": "🐸 Xenbase", + "text": "🐸 Xenbase\n\nxenbase logoXenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.\n\nIt took me some time to find the information you need." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#xenbase-1", + "href": "omics/week-5/study_before_workshop.html#xenbase-1", + "title": "Independent Study to prepare for workshop", + "section": "🐸 Xenbase", + "text": "🐸 Xenbase\n\n\nI got the information from the Xenbase information pages under Data Reports | Gene Information\nThis is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)\nClick on the readme link to see the file format and columns\nI downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx\nIn the workshop you will import this file and merge the information with the results file" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#ensembl", + "href": "omics/week-5/study_before_workshop.html#ensembl", + "title": "Independent Study to prepare for workshop", + "section": "🐭 Ensembl", + "text": "🐭 Ensembl\n\n\nEnsembl creates, integrates and distributes reference datasets and analysis tools that enable genomics\nBioMart provides a access to these large datasets\nbiomaRt (Durinck et al. 2009) is a Bioconductor package gives you programmatic access to BioMart.\nIn the workshop you use this package to get information you can merge with the results file" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#what-is-the-purpose-of-an-omics-plot", + "href": "omics/week-5/study_before_workshop.html#what-is-the-purpose-of-an-omics-plot", + "title": "Independent Study to prepare for workshop", + "section": "What is the purpose of an Omics plot?", + "text": "What is the purpose of an Omics plot?\n\n\nIn general, we plot data to help us summarise and understand it\nThis is especially import for omics data where we have a very large number of variables and often a large number of observations\nWe will look at three plots very commonly used in omics analysis: Principal Component Analysis (PCA) plot, Heatmaps and Volcano Plots" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#pca-1", + "href": "omics/week-5/study_before_workshop.html#pca-1", + "title": "Independent Study to prepare for workshop", + "section": "PCA", + "text": "PCA\n\n\nIt takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of variables (called principal components) that explain most of the variation in the data\nThe principal components can be plotted to see how samples cluster together" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#pca-2", + "href": "omics/week-5/study_before_workshop.html#pca-2", + "title": "Independent Study to prepare for workshop", + "section": "PCA", + "text": "PCA\n\nTo see if samples cluster as we would expect, we might plot the expression of one gene against another\n\n\n\n\n\n\nSamples\n\n\n\n\n\n\nCells\n\n\n\n\n\nThis gives some insight but we have 280 (mice) or 10,000+(frogs) genes to consider. How do we know if the pair we use is typical? How can we consider al the genes at once?" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#heatmaps-1", + "href": "omics/week-5/study_before_workshop.html#heatmaps-1", + "title": "Independent Study to prepare for workshop", + "section": "Heatmaps", + "text": "Heatmaps\n\n\nare a grid of genes on one axis and samples on the other with each grid cell coloured by another variable\nin this case the other variable is gene expression\nthey allow you to quickly get an overview of the expression patterns across genes and samples\nwe often couple them with clustering to group genes and samples with similar expression patterns together which helps us see which genes are responsible for distinguishing groups" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#volcano-plots-1", + "href": "omics/week-5/study_before_workshop.html#volcano-plots-1", + "title": "Independent Study to prepare for workshop", + "section": "Volcano plots", + "text": "Volcano plots\n\n\nVolcano plots often used to visualise the results of differential expression analysis\nThey are just a scatter of the corrected p value against the fold change….\nalmost - the we actually plot the negative log of the corrected p value against the fold change" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#volcano-plots-2", + "href": "omics/week-5/study_before_workshop.html#volcano-plots-2", + "title": "Independent Study to prepare for workshop", + "section": "Volcano plots", + "text": "Volcano plots\n\n\nThis is because just plotting the p-value means the axis is counter intuitive. Small p-values (i.e., significant values) are at the bottom of the axis)\nAnd since p-values range from 1 to very tiny the points are all squashed at the bottom of the axis\n\n\n\nVolcano plot FDR against fold change" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#volcano-plots-3", + "href": "omics/week-5/study_before_workshop.html#volcano-plots-3", + "title": "Independent Study to prepare for workshop", + "section": "Volcano plots", + "text": "Volcano plots\n\n\nPlotting the negative log of the corrected p-value means that the values are spread out and the significant values are at the top of the axis\n\n\n\nVolcano plot -log(FDR) against fold change" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#visualisations", + "href": "omics/week-5/study_before_workshop.html#visualisations", + "title": "Independent Study to prepare for workshop", + "section": "Visualisations", + "text": "Visualisations\n\nShould be done on normalised data so meaningful comparisons can be made\nThe 🐭 mouse data were already log2normalised\nThe 🐸 frog data were normalised by the DE method and saved to file. We will log2 transform before doing visualisations" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#heatmaps-2", + "href": "omics/week-5/study_before_workshop.html#heatmaps-2", + "title": "Independent Study to prepare for workshop", + "section": "Heatmaps", + "text": "Heatmaps\n\n\nOn the vertical axis are genes which are differentially expressed at the 0.01 level\nOn the horizontal axis are samples\nWe can see that the FGF-treated samples cluster together and the control samples cluster together\nWe can also see two clusters of genes; one of these shows genes upregulated (more yellow) in the FGF-treated samples and the other shows genes downregulated (more blue) in the FGF-treated samples" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#heatmaps-3", + "href": "omics/week-5/study_before_workshop.html#heatmaps-3", + "title": "Independent Study to prepare for workshop", + "section": "Heatmaps", + "text": "Heatmaps\n\n\nOn the vertical axis are genes which are differentially expressed at the 0.01 level\nOn the horizontal axis are samples\nWe can see that the FGF-treated samples cluster together and the control samples cluster together\nWe can also see two clusters of genes; one of these shows genes upregulated (more yellow) in the FGF-treated samples and the other shows genes downregulated (more blue) in the FGF-treated samples" + }, + { + "objectID": "omics/week-5/study_before_workshop.html#pca-3", + "href": "omics/week-5/study_before_workshop.html#pca-3", + "title": "Independent Study to prepare for workshop", + "section": "PCA", + "text": "PCA\n\nPCA is a solution for this - It takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of “principal components” that explain most of the variation in the data.\n\n\n\n\n\n\nSamples\n\n\n\n\n\n\nCells\n\n\n\n\n\nWe have done PCA in Omics 3, but often PCA might be one of the first exploratory steps because it gives you an idea whether you expect general patterns in gene expression that distinguish groups." + }, + { + "objectID": "omics/week-5/study_before_workshop.html#section-1", + "href": "omics/week-5/study_before_workshop.html#section-1", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "Heat map for the frog data\n\n\nSee next slide for information" } ] \ No newline at end of file diff --git a/_site/site_libs/quarto-html/quarto-html.min.css b/_site/site_libs/quarto-html/quarto-html.min.css index 8b13789..c2857c3 100644 --- a/_site/site_libs/quarto-html/quarto-html.min.css +++ b/_site/site_libs/quarto-html/quarto-html.min.css @@ -1 +1 @@ - +/*# sourceMappingURL=0a6b880beb84f9b6f36107a76f82c5b1.css.map */ diff --git a/omics/crib/cont-fgf-s30.R b/omics/crib/cont-fgf-s30.R index c648943..18e4bb4 100644 --- a/omics/crib/cont-fgf-s30.R +++ b/omics/crib/cont-fgf-s30.R @@ -458,50 +458,20 @@ s30_log2_trans <- s30_results |> colnames(s30_log2_trans) <- s30_results$xenbase_gene_id # just for indep study before -# s30_log2_trans$sample <- row.names(s30_log2_trans) -# a <- s30_log2_trans |> ggplot(aes(x = `XB-GENE-1000007`, +# sample_id <- row.names(s30_log2_trans) |> str_remove("log2_") +# fig <- s30_log2_trans |> ggplot(aes(x = `XB-GENE-1000007`, # y = `XB-GENE-1000023`)) + # geom_point() + -# geom_text(aes(label = sample), +# geom_text(aes(label = sample_id), # vjust = -1, size = 3) + # scale_x_continuous(expand = c(0.05,0.05)) + # scale_y_continuous(expand = c(0.05,0.05)) + # theme_classic() # # -# b <- s30_log2_trans |> ggplot(aes(x = `XB-GENE-1000062`, -# y = `XB-GENE-1000072`)) + -# geom_point() + -# geom_text(aes(label = sample), -# vjust = -1, size = 3) + -# scale_x_continuous(expand = c(0.05,0.05)) + -# scale_y_continuous(expand = c(0.05,0.05)) + -# theme_classic() -# -# c <- s30_log2_trans |> ggplot(aes(x = `XB-GENE-1000113`, -# y = `XB-GENE-1000132`)) + -# geom_point() + -# geom_text(aes(label = sample), -# vjust = -1, size = 3) + -# scale_x_continuous(expand = c(0.05,0.05)) + -# scale_y_continuous(expand = c(0.05,0.05)) + -# theme_classic() -# -# d <- s30_log2_trans |> ggplot(aes(x = `XB-GENE-1000149`, -# y = `XB-GENE-1000251`)) + -# geom_point() + -# geom_text(aes(label = sample), -# vjust = -1, size = 3) + -# scale_x_continuous(expand = c(0.05,0.05)) + -# scale_y_continuous(expand = c(0.05,0.05)) + -# theme_classic() -# -# library(patchwork) -# fig <- (a + b) / (c + d) -# -# ggsave("omics/week-5/images/why_pca.png", +# ggsave("omics/week-5/images/why_pca_frog.png", # plot = fig, -# width = 6, height = 6) +# width = 4, height = 4) # perform PCA using standard functions pca <- s30_log2_trans |> diff --git a/omics/crib/hspc-prog.R b/omics/crib/hspc-prog.R index 637a7c4..bd2cf4c 100644 --- a/omics/crib/hspc-prog.R +++ b/omics/crib/hspc-prog.R @@ -509,6 +509,30 @@ prog_hspc_trans <- prog_hspc_results |> colnames(prog_hspc_trans) <- prog_hspc_results$ensembl_gene_id +# just for indep study before +# prog_hspc_trans$cell_id <- row.names(prog_hspc_trans) +# prog_hspc_trans <- prog_hspc_trans |> +# extract(cell_id, +# remove = FALSE, +# c("cell_type", "cell_number"), +# "([a-zA-Z]{4})_([0-9]{3})") +# +# fig <- prog_hspc_trans |> ggplot(aes(x = ENSMUSG00000028639, +# y = ENSMUSG00000024053, colour = cell_type)) + +# geom_point() + +# # geom_text(aes(label = cell_id), +# # vjust = -1, size = 3) + +# scale_x_continuous(expand = c(0.05,0.05)) + +# scale_y_continuous(expand = c(0.05,0.05)) + +# theme_classic() + +# theme(legend.position = "none") +# +# +# ggsave("omics/week-5/images/why_pca_mouse.png", +# plot = fig, +# width = 4, height = 4) + + # perform PCA using standard functions pca <- prog_hspc_trans |> prcomp(scale. = TRUE) @@ -625,3 +649,16 @@ ggsave("omics/week-5/figures/prog-hspc-volcano.png", units = "in", device = "png") + +# # just for the independent study slides +# vol <- prog_hspc_results |> +# ggplot(aes(x = summary.logFC, +# y = FDR)) + +# geom_point() + +# theme_classic() +# ggsave("omics/week-5/images/volcano-why.png", +# plot = vol, +# height = 4.5, +# width = 4.5, +# units = "in", +# device = "png") \ No newline at end of file diff --git a/omics/week-5/images/frog-heat.png b/omics/week-5/images/frog-heat.png new file mode 100644 index 0000000..8cf1b5f Binary files /dev/null and b/omics/week-5/images/frog-heat.png differ diff --git a/omics/week-5/images/volcano-why.png b/omics/week-5/images/volcano-why.png new file mode 100644 index 0000000..26b786b Binary files /dev/null and b/omics/week-5/images/volcano-why.png differ diff --git a/omics/week-5/images/why_pca.png b/omics/week-5/images/why_pca.png deleted file mode 100644 index dd45e6c..0000000 Binary files a/omics/week-5/images/why_pca.png and /dev/null differ diff --git a/omics/week-5/images/why_pca_frog.png b/omics/week-5/images/why_pca_frog.png new file mode 100644 index 0000000..14bb9f6 Binary files /dev/null and b/omics/week-5/images/why_pca_frog.png differ diff --git a/omics/week-5/images/why_pca_mouse.png b/omics/week-5/images/why_pca_mouse.png new file mode 100644 index 0000000..04ef843 Binary files /dev/null and b/omics/week-5/images/why_pca_mouse.png differ diff --git a/omics/week-5/overview.qmd b/omics/week-5/overview.qmd index 60313b4..80982a4 100644 --- a/omics/week-5/overview.qmd +++ b/omics/week-5/overview.qmd @@ -5,7 +5,7 @@ toc: true toc-location: right --- -This week we cover how to visualise and interpret the results of your differential expression analysis. The independent study will allow you to check you have what you should have following the [Omics 2: Statistical Analysis workshop](../week-4/workshop.html) and [Consolidation study](../week-4/study_after_workshop.html). It will also summarise the the methods and plots we will go through in the workshop. In the workshop, we will learn how to conduct a Principle Component Analysis (PCA) and plot the results as well as how to create a nicely formatted Volcano plot and heatmap. We will also consider three factors that help us choose an interesting/important gene: the absolute expression, the fold change and the adjusted p-value. +This week we cover how to visualise and interpret the results of your differential expression analysis. The independent study will allow you to check you have what you should have following the [Omics 2: Statistical Analysis workshop](../week-4/workshop.html) and [Consolidation study](../week-4/study_after_workshop.html). It will also summarise the the methods and plots we will go through in the workshop. In the workshop, we will learn how to conduct a Principle Component Analysis (PCA) and plot the results as well as how to create a nicely formatted Volcano plot and heatmap. We suggest you sit together with your group in the workshop. @@ -14,12 +14,11 @@ We suggest you sit together with your group in the workshop. The successful student will be able to: - verify they have the required RStudio Project set up and the data and code files from the previous Workshop and Consolidation study -- explain -- -- -- -- -- , +- explain where gene information came from and add it to their results +- perform a PCA and understand how to interpret them +- create a heatmap and understand how to interpret them +- create a volcano plot and understand how to interpret them + ### Instructions @@ -29,11 +28,13 @@ The successful student will be able to: 2. [Workshop](workshop.qmd) - i. 💻 .... + i. 💻 Add gene information to the results of DE - ii. 💻 ... + ii. 💻 Perform and plot a PCA - iii. 💻 .... + iii. 💻 Visualise results with a heatmap + + iv. 💻 Visualise all the results with a volcano plot iv. Look after future you! diff --git a/omics/week-5/study_before_workshop.qmd b/omics/week-5/study_before_workshop.qmd index 9a84a68..68198fd 100644 --- a/omics/week-5/study_before_workshop.qmd +++ b/omics/week-5/study_before_workshop.qmd @@ -15,6 +15,12 @@ editor: wrap: 72 --- +```{r} +#| include: false +library(tidyverse) + +``` + ## Overview In these slides we will: @@ -22,9 +28,11 @@ In these slides we will: ::: incremental - Check where you are -- learn some concepts +- learn some concepts used omics visualisation - - + - Principle Component Analysis (PCA) + - Volcano plots + - Heatmaps - Find out what packages to install before the workshop ::: @@ -34,13 +42,11 @@ In these slides we will: ## What we did in Omics 2: Statistical Analysis ::: incremental -::: {style="font-size: 90%;"} -- +- carried out differential expression analysis -- +- found genes not expressed at all, or expressed in one group only -- Saved files . -::: +- Saved results files ::: ## Where should you be? @@ -56,14 +62,16 @@ Workshop](../week-4/workshop.html) including: ## 🐸 Frogs -::: {style="font-size: 90%;"} +::: {style="font-size: 70%;"} - An RStudio Project called `frogs-88H` which contains: - Raw data (S14, S20 and S30) - Processed data (`s30_filtered.csv`, `s30_summary_gene.csv`, `s30_summary_gene_filtered.csv`, `s30_summary_samp.csv` and equivalents for S14 *OR* S20) - - Two scripts called `cont-fgf-s30.R` and `cont-fgf-s20.R` *OR* - `cont-fgf-s14.R` + - Results files (`s30_fgf_only.csv`, `S30_normalised_counts.csv`, + `S30_results.csv` and equivalents for S14 *OR* S20)\ + - Two scripts called `cont-fgf-s30.R` and either `cont-fgf-s20.R` + *OR* `cont-fgf-s14.R` ::: Files should be organised into folders. Code should well commented and @@ -71,12 +79,18 @@ easy to read. ## 🐭 Mice +::: {style="font-size: 70%;"} - An RStudio Project called `mice-88H` which contains - Raw data (hspc, prog, lthsc) - Processed data (`hspc_summary_gene.csv`, `hspc_summary_samp.csv`, `prog_summary_gene.csv`, - `prog_summary_samp.csv`) -- One script called `hspc-prog.R` + `prog_summary_samp.csv`, `lthsc_summary_gene.csv`, + `lthsc_summary_samp.csv`) +- Results files (`prog_hspc_results.csv` and an equivalent for lthsc + vs prog or hspc vs lthsc) +- Two scripts called `hspc-prog.R` and either `hspc-lthsc.R` *OR* + `prog-lthsc.R` +::: Files should be organised into folders. Code should well commented and easy to read. @@ -101,89 +115,294 @@ Go through: ## Examine the results files +Remind yourself of the key columns you have in the results files: + +- a log~2~ fold change +- an unadjusted *p*-value +- a *p* value adjusted for multiple testing (`FDR` or `padj`) +- a gene id + +## 🐸 Frogs + +```{r} +#| echo: false +read_csv("results/s30_results.csv") |> glimpse() + +``` + +. . . + +- `baseMean` is the mean of the normalised counts for the gene across + all samples +- `lfcSE` standard error of the fold change +- `stat` is the test statistic (the Wald statistic) +- Generated by **`DESeq2`** [@DESeq2] + +## 🐭 Mice + +```{r} +#| echo: false +read_csv("results/prog_hspc_results.csv") |> glimpse() + + +``` + +. . . +- Top is the rank of the gene ordered by the *p*-value (smallest + first) +- `summary.logFC` and `logFC.hspc` give the same value (in this case + since comparing two cell types) +- generated by **`scran`** [@scran] # Adding gene information -## from xenbase +## Adding gene information + +::: incremental +- The gene id is difficult to interpret in plots/tables + +- Therefore we need to add information such as the gene name and a + description to the results + +- For the 🐸 Frog data information comes from Xenbase [@fisher2023] + +- For the 🐭 Mice data information comes from Ensembl [@birney2004] +::: + +## 🐸 Xenbase + +![xenbase logo](images/Xenbase-Logo-Medium.png){width="800"} + +[Xenbase](http://www.xenbase.org/) is a model organism database that +provides genomic, molecular, and developmental biology information about +*Xenopus laevis* and *Xenopus tropicalis*. + +. . . + +It took me some time to find the information you need. + +## 🐸 Xenbase + +::: incremental +- I got the information from the [Xenbase information + pages](https://www.xenbase.org/xenbase/static-xenbase/ftpDatafiles.jsp) + under Data Reports \| Gene Information + +- This is listed: Xenbase Gene Product Information \[readme\] [gzipped + gpi (tab + separated)](https://download.xenbase.org/xenbase/GenePageReports/xenbase.gpi.gz) + +- Click on the readme link to see the file format and columns + +- I downloaded + [xenbase.gpi.gz](https://download.xenbase.org/xenbase/GenePageReports/xenbase.gpi.gz), + unzipped it, removed header lines and the *Xenopus tropicalis* + (taxon:8364) entries and saved it as + [xenbase_info.xlsx](meta/xenbase_info.xlsx) + +- In the workshop you will import this file and merge the information + with the results file +::: + +## 🐭 Ensembl + +::: incremental +- [Ensembl](https://www.ensembl.org/index.html) creates, integrates + and distributes reference datasets and analysis tools that enable + genomics + +- [BioMart](https://grch37.ensembl.org/info/data/biomart/index.html) + provides a access to these large datasets + +- **`biomaRt`** [@biomaRt] is a Bioconductor package gives you + programmatic access to BioMart. + +- In the workshop you use this package to get information you can + merge with the results file +::: + +# Plots + +## What is the purpose of an Omics plot? + +::: incremental +- In general, we plot data to help us summarise and understand it + +- This is especially import for omics data where we have a very large + number of variables and often a large number of observations + +- We will look at three plots very commonly used in omics analysis: + Principal Component Analysis (PCA) plot, Heatmaps and Volcano Plots +::: + +# Principal Component Analysis (PCA) + +## PCA + +::: incremental +- Principal Component Analysis is an unsupervised machine learning + technique + +- Unsupervised methods[^1] are unsupervised in that they do not + use/optimise to a particular output. The goal is to uncover + structure. They do not test hypotheses + +- It is often used to visualise high dimensional data because it is a + dimension reduction technique +::: + +[^1]: You may wish to read a previous introduction to unsupervised + methods I have written [An introduction to Machine Learning: + Unsupervised + methods](https://3mmarand.github.io/BIO00058M-Data-science-2020/slides/05_intro_to_ML_unsupervised.html#1) + [@rand2021] + +## PCA + +::: incremental +- It takes a large number of continuous variables (like gene + expression) and reduces them to a smaller number of variables + (called principal components) that explain most of the variation in + the data +- The principal components can be plotted to see how samples cluster + together +::: -![xenbase logo](images/Xenbase-Logo-Medium.png){width="700"} +## PCA +::: {style="font-size: 70%;"} +- To see if samples cluster as we would expect, we might plot the + expression of one gene against another +::: {layout-ncol="2"} +![Samples](images/why_pca_frog.png){width="300"} -Xenbase (http://www.xenbase.org/, RRID:SCR_003280) +![Cells](images/why_pca_mouse.png){width="300"} +::: -Xenbase is a model organism database that provides genomic, molecular, -and developmental biology information about Xenopus laevis and Xenopus -tropicalis. Xenbase is funded by the National Institutes of Health -(NIH) and the National Science Foundation (NSF). +This gives some insight but we have 280 (mice) or 10,000+(frogs) genes +to consider. How do we know if the pair we use is typical? How can we +consider al the genes at once? -our data gives the xenbase gene id so we are using xenbase to get the information -a lot of the information would also be in the ncbi +::: -## from the ncbi +## PCA -biomart is a package that allows you to get information from the ncbi -database such as gene names and descriptions +::: {style="font-size: 70%;"} +- PCA is a solution for this - It takes a large number of continuous + variables (like gene expression) and reduces them to a smaller + number of "principal components" that explain most of the variation + in the data. +::: {layout-ncol="2"} +![Samples](figures/frog-s30-pca.png){width="300"} +![Cells](figures/prog_hspc-pca.png){width="300"} +::: +::: +## PCA -# Plots +We have done PCA in Omics 3, but often PCA might be one of the first +exploratory steps because it gives you an idea whether you expect +general patterns in gene expression that distinguish groups. -## plots purpose +# Heatmaps -dimsenion reduction +## Heatmaps -## pca +::: incremental +- are a grid of genes on one axis and samples on the other with each + grid cell coloured by another variable + +- in this case the other variable is gene expression -lots of variables +- they allow you to quickly get an overview of the expression patterns + across genes and samples -## tsne +- we often couple them with clustering to group genes and samples with + similar expression patterns together which helps us see which genes + are responsible for distinguishing groups +::: -lots of variables and lots of observations +## +![Heat map for the frog data](images/frog-heat.png){height="800"} +See next slide for information -# normalsing before plotting +## Heatmaps + +::: incremental +- On the vertical axis are genes which are differentially expressed at + the 0.01 level -## normalising before plotting +- On the horizontal axis are samples -log +- We can see that the FGF-treated samples cluster together and the + control samples cluster together -normalisation -regularised log is a method to bias from low count genes. -https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/03_DGE_QC_analysis.html +- We can also see two clusters of genes; one of these shows genes + upregulated (more yellow) in the FGF-treated samples and the other + shows genes downregulated (more blue) in the FGF-treated samples +::: +# Volcano plots +## Volcano plots ::: incremental -- T +- Volcano plots often used to visualise the results of differential + expression analysis + +- They are just a scatter of the corrected p value against the fold + change.... + +- almost - the we actually plot the negative log of the corrected p + value against the fold change ::: -rlog is a method to bias from low count genes. -https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/03_DGE_QC_analysis.html -gives a good explanation of regularized the log transform (rlog) +## Volcano plots -The rlog transformation of the normalized counts is only necessary for -these visualization methods during this quality assessment. They are not -used for DE because DESeq2 takes care of that +::: {style="font-size: 70%;"} +- This is because just plotting the *p*-value means the axis is + counter intuitive. Small *p*-values (i.e., significant values) are + at the bottom of the axis) -in the workshop we just to log transformed +- And since *p*-values range from 1 to very tiny the points are all + squashed at the bottom of the axis +::: + +![Volcano plot FDR against fold change](images/volcano-why.png) -- The 🐭 mouse data have been normalised to simplify the analysis for - you; the 🐸 frog data have not but the DE method will do this for - you. - +## Volcano plots +::: {style="font-size: 70%;"} +- Plotting the negative log of the corrected *p*-value means that the + values are spread out and the significant values are at the top of + the axis +::: +![Volcano plot -log(FDR) against fold +change](figures/prog-hspc-volcano.png) +## Visualisations + +- Should be done on normalised data so meaningful comparisons can be + made + +- The 🐭 mouse data were already log~2~normalised + +- The 🐸 frog data were normalised by the DE method and saved to file. + We will log~2~ transform before doing visualisations ## Packages to install before the workshop -**`heatmaply`** **`ggrepel`** from CRAN in the the normal way: +**`heatmaply`** [@heatmapply] and **`ggrepel`** [@ggrepel] from CRAN in +the the normal way: ```{r} #| eval: false @@ -193,15 +412,14 @@ install.packages("ggrepel") ``` +**`biomaRt`** [@biomaRt] from Bioconductor using **`BiocManager`** +[@BiocManager] -**`biomaRt`** from Bioconductor using **`BiocManager`**: ```{r} #| eval: false BiocManager::install("biomaRt") ``` - - # Workshops ## Workshops diff --git a/omics/week-5/workshop.qmd b/omics/week-5/workshop.qmd index 21bd7c3..69ee793 100644 --- a/omics/week-5/workshop.qmd +++ b/omics/week-5/workshop.qmd @@ -136,9 +136,9 @@ The [xenbase.gpi.gz](https://download.xenbase.org/xenbase/GenePageReports/xenbas If you click on the readme link you can see information telling you that the file is in the Gene Product information 2.1 format and is provided with gzip compression. gene product information for both *Xenopus tropicalis* (taxon:8364) and *Xenopus laevis* (taxon:8355) -🎬 ...... +🎬 If you want to -```{bash} +```bash gunzip xenbase.gpi.gz less xenbase.gpi q @@ -346,6 +346,7 @@ n_gene_clusters <- 2 ``` ```{r} +#| fig-height: 8 heatmaply(mat, scale = "row", hide_colorbar = TRUE, @@ -361,7 +362,7 @@ heatmaply(mat, -## Visual all the results with a volcano plot +## Visualise all the results with a volcano plot colour the points if padj < 0.05 @@ -706,7 +707,7 @@ separation is not as strong as for the frog data run a few times to see different subset -## Visual all the results with a volcano plot +## Visualise all the results with a volcano plot colour the points if FDR < 0.05 and prog_hspc_results > 1 diff --git a/references.bib b/references.bib index 52f2ca1..fa3ab82 100644 --- a/references.bib +++ b/references.bib @@ -476,3 +476,67 @@ @article{benjamini1995 url = {http://www.jstor.org/stable/2346101}, note = {Publisher: [Royal Statistical Society, Wiley]} } + +@article{fisher2023, + title = {Xenbase: key features and resources of the Xenopus model organism knowledgebase}, + author = {Fisher, Malcolm and James-Zorn, Christina and Ponferrada, Virgilio and Bell, Andrew J and Sundararaj, Nivitha and Segerdell, Erik and Chaturvedi, Praneet and Bayyari, Nadia and Chu, Stanley and Pells, Troy and Lotay, Vaneet and Agalakov, Sergei and Wang, Dong Zhuo and Arshinoff, Bradley I and Foley, Saoirse and Karimi, Kamran and Vize, Peter D and Zorn, Aaron M}, + year = {2023}, + month = {05}, + date = {2023-05-02}, + journal = {Genetics}, + pages = {iyad018}, + volume = {224}, + number = {1}, + doi = {10.1093/genetics/iyad018}, + url = {https://doi.org/10.1093/genetics/iyad018} +} + +@article{birney2004, + title = {An Overview of Ensembl}, + author = {Birney, Ewan and Andrews, T. Daniel and Bevan, Paul and Caccamo, Mario and Chen, Yuan and Clarke, Laura and Coates, Guy and Cuff, James and Curwen, Val and Cutts, Tim and Down, Thomas and Eyras, Eduardo and Fernandez-Suarez, Xose M. and Gane, Paul and Gibbins, Brian and Gilbert, James and Hammond, Martin and Hotz, Hans-Rudolf and Iyer, Vivek and Jekosch, Kerstin and Kahari, Andreas and Kasprzyk, Arek and Keefe, Damian and Keenan, Stephen and Lehvaslaiho, Heikki and McVicker, Graham and Melsopp, Craig and Meidl, Patrick and Mongin, Emmanuel and Pettett, Roger and Potter, Simon and Proctor, Glenn and Rae, Mark and Searle, Steve and Slater, Guy and Smedley, Damian and Smith, James and Spooner, Will and Stabenau, Arne and Stalker, James and Storey, Roy and Ureta-Vidal, Abel and Woodwark, K. Cara and Cameron, Graham and Durbin, Richard and Cox, Anthony and Hubbard, Tim and Clamp, Michele}, + year = {2004}, + month = {05}, + date = {2004-05}, + journal = {Genome Research}, + pages = {925--928}, + volume = {14}, + number = {5}, + doi = {10.1101/gr.1860604}, + url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC479121/}, + note = {PMID: 15078858 +PMCID: PMC479121} +} + +@article{biomaRt, + title = {Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt}, + author = {Durinck, Steffen and Spellman, Paul T. and Birney, Ewan and Huber, Wolfgang}, + year = {2009}, + date = {2009}, + volume = {4} +} +@book{rand2021, + title = {Data Science strand of BIO00058M}, + author = {Rand, Emma}, + year = {2021}, + month = {09}, + date = {2021-09}, + doi = {10.5281/zenodo.5527705}, + url = {https://github.com/3mmaRand/BIO00058M-Data-science-2020} +} + + @Manual{BiocManager, + title = {BiocManager: Access the Bioconductor Project Package Repository}, + author = {Martin Morgan and Marcel Ramos}, + year = {2023}, + note = {R package version 1.30.22}, + url = {https://bioconductor.github.io/BiocManager/}, + } + + @Manual{ggrepel, + title = {ggrepel: Automatically Position Non-Overlapping Text Labels with +'ggplot2'}, + author = {Kamil Slowikowski}, + year = {2023}, + note = {R package version 0.9.4}, + url = {https://github.com/slowkow/ggrepel}, + } \ No newline at end of file