Skip to content

Commit

Permalink
update wastewater materials
Browse files Browse the repository at this point in the history
  • Loading branch information
tavareshugo committed Mar 18, 2024
1 parent 09afdc8 commit 2f6959c
Show file tree
Hide file tree
Showing 38 changed files with 2,860 additions and 72 deletions.

Large diffs are not rendered by default.

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
"hash": "cefd3e3c107bf0e3726fcc6575106179",
"result": {
"engine": "knitr",
"markdown": "---\npagetitle: \"SARS Genomic Surveillance\"\n---\n\n\n# Mutation Analysis\n\n\n::: {.cell}\n\n:::\n\n\n:::{.callout-tip}\n## Learning Objectives\n\n- Perform exploratory analysis of mutations detected in wastewater samples.\n- Identify the most common types of mutations identified and their occurrence in different genes.\n- Produce a heatmap-style visualisation of mutation abundances over time.\n- Analyse individual mutations and estimate frequency confidence intervals from read counts.\n:::\n\n\n## Data import\n\nIn the previous chapter we investigated the variant/lineage abundances estimated by _Freyja_. \nA complementary analysis is to look at how the frequency of individual mutations changes over time. \nThis analysis is agnostic to which lineages those mutations occur in, and may even reveal new emerging mutations not yet characterised in the known _Pango_ lineages. \n\nFor this analysis, we can use the mutation table generated by `viralrecon`, which is saved in its output directory under `variants/ivar/variants_long_table.csv`.\n(Note: \"variants\" here is used to mean \"mutations\"; see the information box in @sec-lineages where this terminology is clarified. To avoid confusion we use the term \"mutation\" in this chapter). \n\nWe will do our analysis in R, so we start by loading the packages being used: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(janitor)\ntheme_set(theme_classic())\n```\n:::\n\n\nWe then import the data into R. \nIn this case we also \"clean\" the column names of this table, to simplify them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- read_csv(\"preprocessed/viralrecon/variants/ivar/variants_long_table.csv\")\nmutations <- clean_names(mutations)\n```\n:::\n\n\n\n\nThis is what the table looks like: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(mutations)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 16\n sample chrom pos ref alt filter dp ref_dp alt_dp af gene effect\n <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> \n1 SRR1854… MN90… 14 A AC PASS 17 17 14 0.82 orf1… upstr…\n2 SRR1854… MN90… 241 C T PASS 45440 12 45428 1 orf1… upstr…\n3 SRR1854… MN90… 1549 C T PASS 2576 1747 827 0.32 orf1… synon…\n4 SRR1854… MN90… 1862 C T ft 10 7 3 0.3 orf1… misse…\n5 SRR1854… MN90… 2470 C T PASS 1105 641 464 0.42 orf1… synon…\n6 SRR1854… MN90… 2832 A G PASS 85 26 59 0.69 orf1… misse…\n# ℹ 4 more variables: hgvs_c <chr>, hgvs_p <chr>, hgvs_p_1letter <chr>,\n# caller <chr>\n```\n\n\n:::\n:::\n\n\nThere are many columns in this table, their meaning is detailed in @sec-mutations. \nWe will only retain a few columns of interest to us: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- mutations |> \n # select columns of interest\n select(sample, pos, dp, alt_dp, af, gene, effect, hgvs_p_1letter) |> \n # rename one of the columns\n rename(aa_change = hgvs_p_1letter)\n```\n:::\n\n\nThe columns we retained are: \n\n- `sample` contains the sample name.\n- `pos` the position of the mutation in the reference genome.\n- `dp` the depth of sequencing (number of reads) at that position.\n- `alt_dp` the depth of sequencing of the non-reference (alternative) allele at that position.\n- `af` the allele frequency of the non-reference allele (equal to `alt_dp`/`dp`).\n- `gene` is the gene name in the reference genome.\n- `effect` is the predicted mutation effect. \n- `aa_change` (which we renamed from `hgvs_p_1letter`) is the amino acid change at that position, for non-synonymous mutations, following the [HGVS Nomenclature](https://hgvs-nomenclature.org/stable/) system.\n\nOur next step is to _merge_ this table with our metadata table, so we have information about the date of collection of each sample. \nWe start by importing the metadata table: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmetadata <- read_csv(\"sample_info.csv\")\n\n# look at the top few rows\nhead(metadata)\n```\n:::\n\n\n\n\nAs both this table and the table of mutations contain a column called \"sample\", we will _join_ the two tables based on it: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- full_join(mutations, metadata, by = \"sample\")\n\nhead(mutations)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 13\n sample pos dp alt_dp af gene effect aa_change date country\n <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <date> <chr> \n1 SRR18541027 14 17 14 0.82 orf1… upstr… . 2022-01-04 United…\n2 SRR18541027 241 45440 45428 1 orf1… upstr… . 2022-01-04 United…\n3 SRR18541027 1549 2576 827 0.32 orf1… synon… p.S428S 2022-01-04 United…\n4 SRR18541027 1862 10 3 0.3 orf1… misse… p.L533F 2022-01-04 United…\n5 SRR18541027 2470 1105 464 0.42 orf1… synon… p.A735A 2022-01-04 United…\n6 SRR18541027 2832 85 59 0.69 orf1… misse… p.K856R 2022-01-04 United…\n# ℹ 3 more variables: city <chr>, latitude <dbl>, longitude <dbl>\n```\n\n\n:::\n:::\n\n\nWe now have our mutations along with the relevant metadata for each sample. \n\nFinally, we will give the values of our sample IDs and mutations an ordering based on the date (instead of the default alphabetical order): \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- mutations |> \n mutate(sample = fct_reorder(sample, date), \n aa_change = fct_reorder(aa_change, date))\n```\n:::\n\n\n\n## Exploratory analysis\n\nWe start by exploring our data with some simple summaries. \nFor example, how many mutations do we have of each effect?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations |> \n count(effect) |> \n mutate(effect = fct_reorder(effect, n)) |> \n ggplot(aes(x = n, y = effect)) + \n geom_col() +\n geom_label(aes(label = n))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nWe can see that the most common mutations are _missense_, i.e. causing an amino acid change. \nSeveral mutations are _synonymous_, which should be less impactful for the evolution of the virus. \nOther mutations are less common, and we will not focus on them in this analysis (although you may want to investigate them for other purposes). \n\nFor now, we will focus on missense mutations, as these have the potential to change the properties of the virus and new emerging lineages may be due to a novel adaptive mutation that changes an amino acid in one of the genes. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense <- mutations |> \n filter(effect == \"missense_variant\")\n```\n:::\n\n\nHow many of these mutations do we have in each gene?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n count(gene) |> \n mutate(gene = fct_reorder(gene, n)) |> \n ggplot(aes(x = n, y = gene)) + \n geom_col() +\n geom_label(aes(label = n))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nThe majority of mutations are in the _S_ and _ORF1ab_ genes. \nLet's investigate how mutations change over time.\n\n## Mutation frequency analysis\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nFrom this plot, we can see a \"step\" change in the observed mutations, which is likely due to the change in lineages over time. \nWe can also see some mutations that are quite frequent across many samples (they appear as horizontal strips in the plot). \nThese are likely mutations shared across several lineages. \nFinally, we can see a \"block\" of mutations appearing around Dec 2021, which are likely the Omicron mutations rising in frequency.\n\nNote that, unlike with _Freyja_, this analysis does not rely on prior knowledge of the lineages, making it suitable for detecting new emerging mutations.\nThis kind of visualisation is therefore useful to identify emerging mutations, as they would be visible as a new horizontal \"strip\" appearing on the plot. \n\n### Exercise\n\n:::{.callout-exercise}\n\nOne issue with the above plot is that we cannot see all the mutation names on the x-axis, as their names were overlapping. \n\n- Modify the code shown above to show the mutations present in the _Spike_ gene only.\n- Then do the same for _orf1ab_ gene.\n\nWhich of these genes do you think helps distinguish sub-lineages of Omicron more effectively?\n\n:::{.callout-hint}\nRemember that you can use the `filter()` function to subset the table to keep only rows of interest. \n:::\n\n:::{.callout-answer}\n\nTo look at the mutations in the _Spike_ gene only, we can use the `filter()` function, to retain the rows that match this gene only. \nWe then pipe the output to the same code we used before.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(gene == \"S\") |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nThe same code can be reused to also look at the mutations from _orf1ab_:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(gene == \"orf1ab\") |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nWe can see that the mutations in _orf1ab_ change more often than those of the _Spike_ gene. \nThese mutations likely distinguish individual lineages, whereas the profile of the _Spike_ is generally more similar across all the lineages belonging to the Omicron variant.\n\n:::\n:::\n\n## Individual mutations\n\nWe may also be interested in looking at more details about the frequencies of individual mutations. \nFor this, it may help to calculate a _confidence interval_ for the mutation frequency, based on the counts of reads observed (i.e. the sequencing depth). \nOne way to calculate such a confidence interval is to use the so-called [Jeffreys interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Jeffreys_interval), which is based on the _Beta_ distribution. \nIn R, we can calculate this as follows: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense <- missense |> \n mutate(af_lo = qbeta(0.025, alt_dp + 0.5, (dp - alt_dp) + 0.5),\n af_hi = qbeta(0.975, alt_dp + 0.5, (dp - alt_dp) + 0.5))\n```\n:::\n\n\nOne possible visualisation is to consider the mutations in individual samples, shown as a plot across the genome:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(sample == \"SRR18541114\") |> \n ggplot(aes(pos, af, colour = gene)) + \n geom_pointrange(aes(ymin = af_lo, ymax = af_hi)) +\n scale_y_continuous(limits = c(0, 1))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nOr, we can focus on an individual mutation and plot it over time:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(aa_change == \"p.K856R\") |> \n ggplot(aes(date, af)) +\n geom_pointrange(aes(ymin = af_lo, ymax = af_hi)) +\n scale_y_continuous(limits = c(0, 1))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\n:::{.callout-note}\n#### Missing data\n\nThe way `viralrecon` performs variant (mutation) calling, does not allow distinguishing whether the absence of a mutation from the table is due to missing data or not. \nTwo things may have happened:\n\n- There was no coverage in a position (a \"gap\" in the sequencing), meaning there is missing data for that position.\n- All the reads mapped to that position carried the reference genome allele, so no mutation was reported. This could mean either the mutation trully had a frequency of zero or was low-frequency and was missed by chance. \n\nAs we cannot distinguish between these two possibilities, it is important to keep in mind that the absence of a mutation from our table does not necessarily mean the mutation was not present in the sample. \nIt only means that we were not able to detect it. \n:::\n\n\n## Summary\n\n:::{.callout-tip}\n## Key Points\n\n- The software _R_ can be used to import the mutations CSV file generated by `viralrecon`. \n- The mutations file can be _joined_ with metadata table, to assess the occurrence of mutations over time. \n- Barplots can be useful to visualise the counts of mutation types detected and the genes they occur in. \n- Heatmaps can be used to visualise mutation abundances over time. Mutations that become more frequent over time appear as \"blocks\" in the plot.\n:::\n",
"supporting": [
"04-ww_mutations_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 2f6959c

Please sign in to comment.