diff --git a/_freeze/materials/04-wastewater/03-ww_visualisation/execute-results/html.json b/_freeze/materials/04-wastewater/03-ww_visualisation/execute-results/html.json index 254d578..d7bf2ce 100644 --- a/_freeze/materials/04-wastewater/03-ww_visualisation/execute-results/html.json +++ b/_freeze/materials/04-wastewater/03-ww_visualisation/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "7225a412a1aabeae37a5c9d1b8813c25", + "hash": "ff044dacf53338ff560e12e5d9645c87", "result": { "engine": "knitr", - "markdown": "---\npagetitle: \"SARS Genomic Surveillance\"\n---\n\n\n# Abundance Visualisation\n\n\n::: {.cell}\n\n:::\n\n\n:::{.callout-tip}\n## Learning Objectives\n\n- Perform basic exploratory analysis of variant/lineage abundance data using the software _R_.\n- Generate several plots showing the change in variant abundance over time.\n- Explore which lineages are detected and assess the uncertainty in their estimates due to sequencing.\n- Recognise the pros and cons of analysing data from individual lineages or summarised across variants of concern.\n:::\n\n\n## Data exploration with R\n\nIn the previous section we have generated a CSV file that aggregated all the lineage/variant abundances into a single table. \nAlthough some basic exploration of these data can be done in _Excel_, we can perform more advanced and customised visualisations using the _R_ software (see the [R fundamentals](../appendices/quick_r.Rmd) appendix for a quick introduction to this software).\n\n:::{.callout-note collapse=true}\n#### Summary of R functions used\n\nThe main functions used in this section are: \n\n- Data import ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/data-import.html)):\n - `read_csv()` to import a CSV file as a data.frame/tibble object.\n- Data manipulation ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/data-transformation.html)):\n - `filter()` to subset rows of the table that match a particular condition.\n - `arrange()` to sort the table by the values of selected columns.\n - `count()` to count the unique values of selected columns.\n - `mutate()` to add new columns to the table or modify the values of existing ones. \n- Working with categorical variables ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/factors.html)):\n - `fct_reorder()` to order categorical values based on a numeric variable (rather than the default alphabetical ordering).\n- Working with dates ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/lubridate.html)):\n - `floor_date()` to round a date down to the time unit specified (e.g. 
\"week\" or \"month\").\n- Visualisation with `ggplot2` ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/data-visualization.html)).\n:::\n\nWe start by loading the `tidyverse` package, which contains several functions for data manipulation and visualisation:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ntheme_set(theme_bw())\n```\n:::\n\n\nWe have also set a \"black-and-white\" theme for our `ggplot` plots, instead of the default \"grey\" theme.\n\nThe next step is to read our data in:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs <- read_csv(\"results/tidyfreyja/vocs_abundances.csv\")\n```\n:::\n\n\n\n\nWe can check the first few rows of our data, to check that is was imported correctly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(vocs)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 10\n sample name abundance boot_lo boot_hi date country city latitude\n \n1 SRR18541074 Delta 0.964 0.960 0.968 2021-12-01 United… San … 32.7\n2 SRR18541074 Omicr… 0.0197 0.0132 0.0228 2021-12-01 United… San … 32.7\n3 SRR18541074 Other 0.00122 0 0.00872 2021-12-01 United… San … 32.7\n4 SRR18541043 Omicr… 0.876 0.873 0.881 2021-12-26 United… San … 32.7\n5 SRR18541043 Other 0.0786 0.0683 0.0807 2021-12-26 United… San … 32.7\n6 SRR18541043 Delta 0.0338 0.0343 0.0447 2021-12-26 United… San … 32.7\n# ℹ 1 more variable: longitude \n```\n\n\n:::\n:::\n\n\nWe can start with some basic exploration of our data, answering some simple questions.\nSome examples are given here.\n\nHow many VOCs have >75% frequency?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # keep rows with >= 75% abundance\n filter(abundance >= 0.75) |> \n # sort them by date\n arrange(date)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 10\n sample name abundance boot_lo boot_hi date country city latitude\n \n 1 SRR18541092 Delta 0.966 0.936 0.971 2021-09-05 United… San … 32.7\n 2 SRR18541114 Delta 0.982 0.976 0.984 2021-09-06 United… San … 32.7\n 3 SRR18541096 Delta 0.905 0.898 0.923 2021-09-07 United… Caer… 40.1\n 4 SRR18541099 Delta 0.888 0.876 0.960 2021-09-07 United… Caer… 40.1\n 5 SRR18541098 Delta 0.973 0.961 0.977 2021-09-13 United… Caer… 40.1\n 6 SRR18541049 Delta 0.962 0.960 0.973 2021-11-28 United… San … 32.7\n 7 SRR18541074 Delta 0.964 0.960 0.968 2021-12-01 United… San … 32.7\n 8 SRR18541042 Delta 0.795 0.782 0.820 2021-12-06 United… San … 32.7\n 9 SRR18541043 Omic… 0.876 0.873 0.881 2021-12-26 United… San … 32.7\n10 SRR18541028 Omic… 0.932 0.909 0.945 2021-12-27 United… San … 32.7\n11 SRR18541027 Omic… 0.886 0.883 0.890 2022-01-04 United… San … 32.7\n12 SRR18541030 Omic… 0.915 0.905 0.946 2022-01-23 United… San … 32.7\n# ℹ 1 more variable: longitude \n```\n\n\n:::\n:::\n\n\nWhat was the count of each detected variant?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n count(name)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 8 × 2\n name n\n \n1 A 7\n2 Alpha 6\n3 Beta 7\n4 Delta 17\n5 Epsilon 1\n6 Gamma 8\n7 Omicron 9\n8 Other 15\n```\n\n\n:::\n:::\n\n\nWe can also start doing some visualisations. 
\nFor example, the previous question can be visualised with a barplot: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # count occurrence of each VOC\n count(name) |> \n # visualise\n ggplot(aes(x = n, y = name)) +\n geom_col()\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\nEven better, we can sort our variants by count rather than alphabetically: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # count occurrence of each VOC\n count(name) |> \n # reorder the names by their count\n mutate(name = fct_reorder(name, n)) |>\n # visualise\n ggplot(aes(x = n, y = name)) +\n geom_col()\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nWe may also want to break this down by city: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # count by city and name\n count(city, name) |> \n # reorder the names by their count\n mutate(name = fct_reorder(name, n)) |> \n # visualise\n ggplot(aes(x = n, y = name)) +\n geom_col(aes(fill = city))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nIt seems like Omicron was only detected in San Diego. \nCould this be because of the dates when the samples were collected in each city?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # create a variable with the start of the week for each sample\n mutate(week = floor_date(date, \"week\")) |> \n # count how many samples per week and city\n count(week, city) |> \n # barplot\n ggplot(aes(week, n)) +\n geom_col() +\n facet_grid(rows = vars(city))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nIndeed, it seems like San Diego has a wider coverage across time. \nIt is also clear from this plot that we have a time gap in our sampling, missing samples in Oct and Nov. \n\n:::{.callout-exercise}\n\nGiven that San Diego has better sampling through time, let's create a new table for our downstream visualisations.\n\n1. Create a new table called `sandiego`:\n - Retain observations from San Diego only.\n - Add a new column with the start of the month that each sample was collected in.\n - Order the sample IDs and variant IDs by their date of collection.\n2. 
Make a new barplot with the counts of each variant observed in this city.\n\n:::{.callout-hint}\nFor the first task, the following functions can be used: \n\n- `filter()` to subset rows\n- `mutate()` to add or modify columns\n- `fct_reorder()` to order categorical variables based on the date\n:::\n\n:::{.callout-answer}\n\nThe following code creates a new table as requested:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego <- vocs |> \n filter(city == \"San Diego\") |> \n mutate(month = floor_date(date, \"month\"),\n sample = fct_reorder(sample, date),\n name = fct_reorder(name, date))\n```\n:::\n\n\nWe can produce a barplot of variant counts as we did before:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n count(name) |> \n mutate(name = fct_reorder(name, n)) |> \n ggplot(aes(n, name)) +\n geom_col()\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n:::\n:::\n\n## Variant abundances\n\nRelative lineage abundances can be visualised as a barplot, where the samples are ordered by their date of collection.\nUsing `ggplot`: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(x = sample, y = abundance)) +\n geom_col(aes(fill = name)) +\n scale_x_discrete(guide = guide_axis(angle = 45))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nTo add further information about the date of collection, we can also \"facet\" the plot by the month when the samples were collected:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(x = sample, y = abundance)) +\n geom_col(aes(fill = name)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n facet_grid(cols = vars(month), scales = \"free_x\", space = \"free_x\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nYou may notice that some of the bars don't quite add up to 1. \nThis is simply a rounding issue from Freyja output.\n\nWe can also visualise the abundances in a heatmap-style plot, which may be useful if the number of samples is very large: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(sample, name)) +\n geom_tile(aes(fill = abundance)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n facet_grid(~ month, scales = \"free_x\", space = \"free_x\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nWe can clearly see the transition between Delta and Omicron. \nDelta is dominant in Sep and Nov, then samples start to appear more mixed in Dec, and finally replaced by Omicron by Jan.\n\nAnother way to visualise these data is using a line plot, with a line for each variant: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(date, abundance, colour = name)) +\n geom_point(size = 2) +\n geom_line(linewidth = 1)\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\nThis plot also shows the clear shift between Delta and Omicron. \nWe can also see the other variants are much less frequent in these samples. 
\n\n## Lineage abundances\n\nSo far, we have been analysing the frequencies summarised by Variant of Concern (VOC).\nWe could do similar analyses using the individual lineages, however these tend to be less clear.\n\nLet's start by importing our data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages <- read_csv(\"results/tidyfreyja/lineage_abundances.csv\")\n```\n:::\n\n\n\n\nHere is an example of the heatmap-style visualisation for the San Diego samples: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n # keep only samples from San Diego\n filter(city == \"San Diego\") |> \n # order the samples and lineages by date\n mutate(sample = fct_reorder(sample, date),\n name = fct_reorder(name, date)) |> \n # visualise\n ggplot(aes(sample, name)) +\n geom_tile(aes(fill = abundance)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n theme_classic() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\nIn this case, there are too many lineages to be easily visible on a plot like this (not all lineage names are shown on the y-axis, as they were overlapping each other).\nTherefore, using the VOCs summaries is more suited for these types of visualisations. \n\nWe can also see that the abundance of these lineages is generally very low, most lineages have a low frequency.\nThis may be due to a mixture of sub-lineages being present in the sample, or even due to imprecisions in the estimates from our data.\nThis can happen for lineages that have very similar mutation profiles. \n\nHere is a histogram showing the distribution of lineage abundances:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n ggplot(aes(abundance)) +\n geom_histogram(binwidth = 0.01)\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\nAs we can see, most lineages have an abundance of less than 1% (the first bar in the plot). \nSince many of these lineages are, in fact, part of the same clade, the summarised VOCs table gives a clearler picture for these types of visualisation.\n\nHowever, the lineage data may be useful to **investigate specific samples in more detail**. 
\nFor example, what were the lineages present in the latest sample collected?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n filter(sample == \"SRR18541030\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 10\n sample name abundance boot_lo boot_hi date country city latitude\n \n 1 SRR18541030 BA.1… 0.738 0.693 0.799 2022-01-23 United… San … 32.7\n 2 SRR18541030 BA.1… 0.0991 0.0674 0.102 2022-01-23 United… San … 32.7\n 3 SRR18541030 XD 0.0626 0.0282 0.0769 2022-01-23 United… San … 32.7\n 4 SRR18541030 BA.1… 0.0604 0.0513 0.0690 2022-01-23 United… San … 32.7\n 5 SRR18541030 BA.1… 0.0161 0 0.0349 2022-01-23 United… San … 32.7\n 6 SRR18541030 XS 0.00684 0 0.0137 2022-01-23 United… San … 32.7\n 7 SRR18541030 AY.1… 0.00260 0 0.00540 2022-01-23 United… San … 32.7\n 8 SRR18541030 BA.1… 0.00129 0 0.00389 2022-01-23 United… San … 32.7\n 9 SRR18541030 AY.2… 0.00128 0 0.00278 2022-01-23 United… San … 32.7\n10 SRR18541030 AY.1… 0.00125 0 0.00177 2022-01-23 United… San … 32.7\n# ℹ 1 more variable: longitude \n```\n\n\n:::\n:::\n\n\nWe can plot their frequency and bootstrap uncertainty interval:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n # keep rows for sample of interest only\n filter(sample == \"SRR18541030\") |> \n # sort lineage by abundance\n mutate(name = fct_reorder(name, abundance)) |> \n # make the plot\n ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi)) +\n labs(x = \"Lineage\", y = \"Abundance (95% CI)\", colour = \"< 5%\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-22-1.png){width=672}\n:::\n:::\n\n\nWe can see several lineages with low frequency (less than 5%), which we should interpret carefully as those tend to less precise (see [Sutcliffe et al. 2023](https://doi.org/10.1101/2023.12.20.572426)). \nFor higher frequency lineages the confidence intervals are relatively narrow, suggesting the uncertainty due to sequencing depth is not problematic for this sample. \n\n\n### Exercise\n\n:::{.callout-exercise}\n#### Lineages abundance uncertainty\n\n- Based on the code just shown above, make a similar lineage abundance plot for the each of the following samples: \"SRR18541092\" and \"SRR18541114\". \n- What do you think about the uncertainty in the lineage estimates of these two samples? Can you hypothesise the reason for this difference? (Hint: go back to the _MultiQC_ report and check these samples' quality).\n- For both samples, make a similar plot but for the summaried VOC abundances. What do you think about the uncertainty in this case? 
Discuss with your neighbours.\n\n\n:::{.callout-answer}\n\nHere is the code for these two variants:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n filter(sample == \"SRR18541092\") |> \n mutate(name = fct_reorder(name, abundance)) |> \n ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n labs(x = \"Lineage\", y = \"Abundance (95% CI)\", colour = \"< 5%\", \n main = \"Sample: SRR18541092\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n\n```{.r .cell-code}\nlineages |> \n filter(sample == \"SRR18541114\") |> \n mutate(name = fct_reorder(name, abundance)) |> \n ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n labs(x = \"Lineage\", y = \"Abundance (95% CI)\", colour = \"< 5%\",\n main = \"Sample: SRR18541114\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-23-2.png){width=672}\n:::\n:::\n\n\nWe can see that the uncertainty in sample _SRR18541114_ is substantially higher compared to _SRR18541092_. \n\nThe reason is possibly the **difference in sequencing depth** between these samples.\nIf we go back to the _MultiQC_ report from `viralrecon` we will see that the median coverage for _SRR18541114_ is only 94x, compared to 377x for the other sample. \nLooking at the _Mosdepth_ cumulative coverage graph on the report is even more illustrative: \n\n![](images/ww_mosdepth.png)\n\nWe can see from this plot that, for example, only ~50% of the genome in _SRR18541114_ is covered at >100x, compared to ~70% on the other sample.\nThis likely leads to several mutations that are missed during sequencing, leading to less stable abundance estimates.\n\nA similar code could be used to visualise the abundance of the VOCs in these two samples. \nHowever, we show a modified version of the code, combining both samples in the same plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n # keep rows for sample of interest only\n filter(sample %in% c(\"SRR18541092\", \"SRR18541114\")) |> \n # make the plot\n ggplot(aes(x = name, y = abundance, colour = sample)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi), \n position = position_dodge(width = 0.5))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-24-1.png){width=672}\n:::\n:::\n\n\nWe've used a few tricks for this visualisation: \n\n- The `%in%` operator is used to select rows that match either of those samples.\n- We've added `colour = sample` aesthetic, to colour points based on the sample.\n- `position_dodge()` is used to shift the points of the two samples so they are not overlapping.\n\nThis visualisation reveals much less uncertainty when summarising the lineages into variants of concern. \nThis makes some sense, as multiple lineages are combined together for a given VOC, so even though the uncertainty of individual lineages may be high, this uncertainty is reduced when looking at the summarised results. \n\nThis reveals a clear tradeoff between looking at individual lineages compared to summarised VOC abundances. \nOn the one hand we get more information about specific lineages detected, but their abundances may have high uncertainty due to sequencing. 
\nOn the other hand we get more precise abundance estimates for VOCs, but loose the detail of which specific lineages of those variants were present in our samples. \n\nIn conclusion, looking at both results is useful. \nWe can start with an analysis of VOCs to get the bigger picture of which variants are circulating, and later explore individual lineages detected in the samples. \nWhen looking at individual lineages, we should be mindful of considering samples with higher sequencing depth, to get a more precise picture.\n:::\n:::\n\n\n## Freyja dashboard\n\nIn this chapter we covered the use of _R_ to generate these visualisations. \nThis is because of the flexibility of this language, which allows us to perform many different types of visualisation and data exploration. \n\nHowever, the _Freyja_ developers provide a command to generate a dashboard with a barplot of VOC abundances. \nWe have found this solution less flexible, and it requires configuring the metadata file in a very specific format.\n\nThere are two steps involved: \n\n- `freyja aggregate` is used to combine the results of multiple samples into a single file. \n- `freyja dash` is then used to create the dashboard.\n\nThe [_Freyja_ documentation](https://github.com/andersen-lab/Freyja#additional-options) gives more details about how to use these commands. \nYou will need to pay attention to the specific file format for metadata, which requires certain columns to be present in order for the commands to work. \n\n\n## Summary \n\n:::{.callout-tip}\n## Key points\n\n- There are several useful exploratory analyses that can be done on variant abundances:\n - Which variants were detected at high frequency.\n - How many times each variant was detected across all samples.\n - How many samples were collected weekly in different regions.\n- Variant abundance over time can be displayed as barplots, heatmaps or line plots.\n- Samples with higher sequencing depth usually have narrower confidence intervals compared to those with low sequencing depth. \n- Analysis of both lineage and VOC abundance is useful:\n - VOC abundance gives more stable estimates and provides a \"big picture\" of the variants circulating the population. \n - Lineage abundance gives more detail about which exact lineages of a variant are circulating in the population, however the estimates are less precise with wider confidence intervals. \n:::\n", + "markdown": "---\npagetitle: \"SARS Genomic Surveillance\"\n---\n\n\n# Abundance Visualisation\n\n\n::: {.cell}\n\n:::\n\n\n:::{.callout-tip}\n## Learning Objectives\n\n- Perform basic exploratory analysis of variant/lineage abundance data using the software _R_.\n- Generate several plots showing the change in variant abundance over time.\n- Explore which lineages are detected and assess the uncertainty in their estimates due to sequencing.\n- Recognise the pros and cons of analysing data from individual lineages or summarised across variants of concern.\n:::\n\n\n## Data exploration with R\n\nIn the previous section we have generated a CSV file that aggregated all the lineage/variant abundances into a single table. 
\nAlthough some basic exploration of these data can be done in _Excel_, we can perform more advanced and customised visualisations using the _R_ software (see the [R fundamentals](../appendices/quick_r.Rmd) appendix for a quick introduction to this software).\n\n:::{.callout-note collapse=true}\n#### Summary of R functions used\n\nThe main functions used in this section are: \n\n- Data import ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/data-import.html)):\n - `read_csv()` to import a CSV file as a data.frame/tibble object.\n- Data manipulation ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/data-transformation.html)):\n - `filter()` to subset rows of the table that match a particular condition.\n - `arrange()` to sort the table by the values of selected columns.\n - `count()` to count the unique values of selected columns.\n - `mutate()` to add new columns to the table or modify the values of existing ones. \n- Working with categorical variables ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/factors.html)):\n - `fct_reorder()` to order categorical values based on a numeric variable (rather than the default alphabetical ordering).\n- Working with dates ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/lubridate.html)):\n - `floor_date()` to round a date down to the time unit specified (e.g. \"week\" or \"month\").\n- Visualisation with `ggplot2` ([Posit cheatsheet](https://rstudio.github.io/cheatsheets/html/data-visualization.html)).\n:::\n\nWe start by loading the `tidyverse` package, which contains several functions for data manipulation and visualisation:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ntheme_set(theme_bw())\n```\n:::\n\n\nWe have also set a \"black-and-white\" theme for our `ggplot` plots, instead of the default \"grey\" theme.\n\nThe next step is to read our data in:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs <- read_csv(\"results/tidyfreyja/vocs_abundances.csv\")\n```\n:::\n\n\n\n\nWe can check the first few rows of our data, to check that is was imported correctly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(vocs)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 10\n sample name abundance boot_lo boot_up date country city latitude\n \n1 SRR18541074 Delta 0.964 0.960 0.968 2021-12-01 United… San … 32.7\n2 SRR18541074 Omicr… 0.0197 0.0132 0.0228 2021-12-01 United… San … 32.7\n3 SRR18541074 Other 0.00122 0 0.00872 2021-12-01 United… San … 32.7\n4 SRR18541043 Omicr… 0.876 0.873 0.881 2021-12-26 United… San … 32.7\n5 SRR18541043 Other 0.0786 0.0683 0.0807 2021-12-26 United… San … 32.7\n6 SRR18541043 Delta 0.0338 0.0343 0.0447 2021-12-26 United… San … 32.7\n# ℹ 1 more variable: longitude \n```\n\n\n:::\n:::\n\n\nWe can start with some basic exploration of our data, answering some simple questions.\nSome examples are given here.\n\nHow many VOCs have >75% frequency?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # keep rows with >= 75% abundance\n filter(abundance >= 0.75) |> \n # sort them by date\n arrange(date)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 10\n sample name abundance boot_lo boot_up date country city latitude\n \n 1 SRR18541092 Delta 0.966 0.936 0.971 2021-09-05 United… San … 32.7\n 2 SRR18541114 Delta 0.982 0.976 0.984 2021-09-06 United… San … 32.7\n 3 SRR18541096 Delta 0.905 0.898 0.923 2021-09-07 United… Caer… 40.1\n 4 SRR18541099 Delta 0.888 0.876 0.960 2021-09-07 United… Caer… 40.1\n 5 SRR18541098 Delta 0.973 0.961 0.977 
2021-09-13 United… Caer… 40.1\n 6 SRR18541049 Delta 0.962 0.960 0.973 2021-11-28 United… San … 32.7\n 7 SRR18541074 Delta 0.964 0.960 0.968 2021-12-01 United… San … 32.7\n 8 SRR18541042 Delta 0.795 0.782 0.820 2021-12-06 United… San … 32.7\n 9 SRR18541043 Omic… 0.876 0.873 0.881 2021-12-26 United… San … 32.7\n10 SRR18541028 Omic… 0.932 0.909 0.945 2021-12-27 United… San … 32.7\n11 SRR18541027 Omic… 0.886 0.883 0.890 2022-01-04 United… San … 32.7\n12 SRR18541030 Omic… 0.915 0.905 0.946 2022-01-23 United… San … 32.7\n# ℹ 1 more variable: longitude \n```\n\n\n:::\n:::\n\n\nWhat was the count of each detected variant?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n count(name)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 8 × 2\n name n\n \n1 A 7\n2 Alpha 6\n3 Beta 7\n4 Delta 17\n5 Epsilon 1\n6 Gamma 8\n7 Omicron 9\n8 Other 15\n```\n\n\n:::\n:::\n\n\nWe can also start doing some visualisations. \nFor example, the previous question can be visualised with a barplot: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # count occurrence of each VOC\n count(name) |> \n # visualise\n ggplot(aes(x = n, y = name)) +\n geom_col()\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\nEven better, we can sort our variants by count rather than alphabetically: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # count occurrence of each VOC\n count(name) |> \n # reorder the names by their count\n mutate(name = fct_reorder(name, n)) |>\n # visualise\n ggplot(aes(x = n, y = name)) +\n geom_col()\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nWe may also want to break this down by city: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # count by city and name\n count(city, name) |> \n # reorder the names by their count\n mutate(name = fct_reorder(name, n)) |> \n # visualise\n ggplot(aes(x = n, y = name)) +\n geom_col(aes(fill = city))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nIt seems like Omicron was only detected in San Diego. \nCould this be because of the dates when the samples were collected in each city?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvocs |> \n # create a variable with the start of the week for each sample\n mutate(week = floor_date(date, \"week\")) |> \n # count how many samples per week and city\n count(week, city) |> \n # barplot\n ggplot(aes(week, n)) +\n geom_col() +\n facet_grid(rows = vars(city))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nIndeed, it seems like San Diego has a wider coverage across time. \nIt is also clear from this plot that we have a time gap in our sampling, missing samples in Oct and Nov. \n\n:::{.callout-exercise}\n\nGiven that San Diego has better sampling through time, let's create a new table for our downstream visualisations.\n\n1. Create a new table called `sandiego`:\n - Retain observations from San Diego only.\n - Add a new column with the start of the month that each sample was collected in.\n - Order the sample IDs and variant IDs by their date of collection.\n2. 
Make a new barplot with the counts of each variant observed in this city.\n\n:::{.callout-hint}\nFor the first task, the following functions can be used: \n\n- `filter()` to subset rows\n- `mutate()` to add or modify columns\n- `fct_reorder()` to order categorical variables based on the date\n:::\n\n:::{.callout-answer}\n\nThe following code creates a new table as requested:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego <- vocs |> \n filter(city == \"San Diego\") |> \n mutate(month = floor_date(date, \"month\"),\n sample = fct_reorder(sample, date),\n name = fct_reorder(name, date))\n```\n:::\n\n\nWe can produce a barplot of variant counts as we did before:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n count(name) |> \n mutate(name = fct_reorder(name, n)) |> \n ggplot(aes(n, name)) +\n geom_col()\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n:::\n:::\n\n## Variant abundances\n\nRelative lineage abundances can be visualised as a barplot, where the samples are ordered by their date of collection.\nUsing `ggplot`: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(x = sample, y = abundance)) +\n geom_col(aes(fill = name)) +\n scale_x_discrete(guide = guide_axis(angle = 45))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nTo add further information about the date of collection, we can also \"facet\" the plot by the month when the samples were collected:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(x = sample, y = abundance)) +\n geom_col(aes(fill = name)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n facet_grid(cols = vars(month), scales = \"free_x\", space = \"free_x\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nYou may notice that some of the bars don't quite add up to 1. \nThis is simply a rounding issue from Freyja output.\n\nWe can also visualise the abundances in a heatmap-style plot, which may be useful if the number of samples is very large: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(sample, name)) +\n geom_tile(aes(fill = abundance)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n facet_grid(~ month, scales = \"free_x\", space = \"free_x\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nWe can clearly see the transition between Delta and Omicron. \nDelta is dominant in Sep and Nov, then samples start to appear more mixed in Dec, and finally replaced by Omicron by Jan.\n\nAnother way to visualise these data is using a line plot, with a line for each variant: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n ggplot(aes(date, abundance, colour = name)) +\n geom_point(size = 2) +\n geom_line(linewidth = 1)\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\nThis plot also shows the clear shift between Delta and Omicron. \nWe can also see the other variants are much less frequent in these samples. 
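\n\nAs a complement to these plots, it can also help to summarise the shift numerically. Below is a minimal sketch, assuming the `sandiego` table created in the earlier exercise (including its `month` column); the `mean_abundance` column name is just illustrative:\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n # average abundance of each variant within each month\n group_by(month, name) |> \n summarise(mean_abundance = mean(abundance), .groups = \"drop\") |> \n # show the most abundant variants first within each month\n arrange(month, desc(mean_abundance))\n```\n:::\n\n\nSorting by mean abundance within each month makes the Delta-to-Omicron turnover easy to read directly from the table. 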
\n\n## Lineage abundances\n\nSo far, we have been analysing the frequencies summarised by Variant of Concern (VOC).\nWe could do similar analyses using the individual lineages; however, these tend to be less clear.\n\nLet's start by importing our data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages <- read_csv(\"results/tidyfreyja/lineage_abundances.csv\")\n```\n:::\n\n\n\n\nHere is an example of the heatmap-style visualisation for the San Diego samples: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n # keep only samples from San Diego\n filter(city == \"San Diego\") |> \n # order the samples and lineages by date\n mutate(sample = fct_reorder(sample, date),\n name = fct_reorder(name, date)) |> \n # visualise\n ggplot(aes(sample, name)) +\n geom_tile(aes(fill = abundance)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n theme_classic() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\nIn this case, there are too many lineages to be easily visible on a plot like this (not all lineage names are shown on the y-axis, as they were overlapping each other).\nTherefore, the VOC summaries are better suited for these types of visualisations. \n\nWe can also see that the abundance of these lineages is generally very low, with most lineages detected at low frequency.\nThis may be due to a mixture of sub-lineages being present in the sample, or even due to imprecision in the estimates from our data.\nThis can happen for lineages that have very similar mutation profiles. \n\nHere is a histogram showing the distribution of lineage abundances:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n ggplot(aes(abundance)) +\n geom_histogram(binwidth = 0.01)\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\nAs we can see, most lineages have an abundance of less than 1% (the first bar in the plot). \nSince many of these lineages are, in fact, part of the same clade, the summarised VOC table gives a clearer picture for these types of visualisation.\n\nHowever, the lineage data may be useful to **investigate specific samples in more detail**. 
\nFor example, what were the lineages present in the latest sample collected?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n filter(sample == \"SRR18541030\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 10\n sample name abundance boot_lo boot_up date country city latitude\n \n 1 SRR18541030 BA.1… 0.738 0.693 0.799 2022-01-23 United… San … 32.7\n 2 SRR18541030 BA.1… 0.0991 0.0674 0.102 2022-01-23 United… San … 32.7\n 3 SRR18541030 XD 0.0626 0.0282 0.0769 2022-01-23 United… San … 32.7\n 4 SRR18541030 BA.1… 0.0604 0.0513 0.0690 2022-01-23 United… San … 32.7\n 5 SRR18541030 BA.1… 0.0161 0 0.0349 2022-01-23 United… San … 32.7\n 6 SRR18541030 XS 0.00684 0 0.0137 2022-01-23 United… San … 32.7\n 7 SRR18541030 AY.1… 0.00260 0 0.00540 2022-01-23 United… San … 32.7\n 8 SRR18541030 BA.1… 0.00129 0 0.00389 2022-01-23 United… San … 32.7\n 9 SRR18541030 AY.2… 0.00128 0 0.00278 2022-01-23 United… San … 32.7\n10 SRR18541030 AY.1… 0.00125 0 0.00177 2022-01-23 United… San … 32.7\n# ℹ 1 more variable: longitude \n```\n\n\n:::\n:::\n\n\nWe can plot their frequency and bootstrap uncertainty interval:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n # keep rows for sample of interest only\n filter(sample == \"SRR18541030\") |> \n # sort lineage by abundance\n mutate(name = fct_reorder(name, abundance)) |> \n # make the plot\n ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) +\n labs(x = \"Lineage\", y = \"Abundance (95% CI)\", colour = \"< 5%\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-22-1.png){width=672}\n:::\n:::\n\n\nWe can see several lineages with low frequency (less than 5%), which we should interpret carefully, as those tend to be less precise (see [Sutcliffe et al. 2023](https://doi.org/10.1101/2023.12.20.572426)). \nFor higher frequency lineages the confidence intervals are relatively narrow, suggesting the uncertainty due to sequencing depth is not problematic for this sample. \n\n\n### Exercise\n\n:::{.callout-exercise}\n#### Lineage abundance uncertainty\n\n- Based on the code just shown above, make a similar lineage abundance plot for each of the following samples: \"SRR18541092\" and \"SRR18541114\". \n- What do you think about the uncertainty in the lineage estimates of these two samples? Can you hypothesise the reason for this difference? (Hint: go back to the _MultiQC_ report and check these samples' quality).\n- For both samples, make a similar plot but for the summarised VOC abundances. What do you think about the uncertainty in this case? 
Discuss with your neighbours.\n\n\n:::{.callout-answer}\n\nHere is the code for these two samples:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlineages |> \n filter(sample == \"SRR18541092\") |> \n mutate(name = fct_reorder(name, abundance)) |> \n ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n labs(x = \"Lineage\", y = \"Abundance (95% CI)\", colour = \"< 5%\", \n title = \"Sample: SRR18541092\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n\n```{.r .cell-code}\nlineages |> \n filter(sample == \"SRR18541114\") |> \n mutate(name = fct_reorder(name, abundance)) |> \n ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n labs(x = \"Lineage\", y = \"Abundance (95% CI)\", colour = \"< 5%\",\n title = \"Sample: SRR18541114\")\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-23-2.png){width=672}\n:::\n:::\n\n\nWe can see that the uncertainty in sample _SRR18541114_ is substantially higher compared to _SRR18541092_. \n\nThe reason is possibly the **difference in sequencing depth** between these samples.\nIf we go back to the _MultiQC_ report from `viralrecon` we will see that the median coverage for _SRR18541114_ is only 94x, compared to 377x for the other sample. \nLooking at the _Mosdepth_ cumulative coverage graph on the report is even more illustrative: \n\n![](images/ww_mosdepth.png)\n\nWe can see from this plot that, for example, only ~50% of the genome in _SRR18541114_ is covered at >100x, compared to ~70% on the other sample.\nThis likely leads to several mutations that are missed during sequencing, leading to less stable abundance estimates.\n\nSimilar code could be used to visualise the abundance of the VOCs in these two samples. \nHowever, we show a modified version of the code, combining both samples in the same plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsandiego |> \n # keep rows for sample of interest only\n filter(sample %in% c(\"SRR18541092\", \"SRR18541114\")) |> \n # make the plot\n ggplot(aes(x = name, y = abundance, colour = sample)) +\n geom_pointrange(aes(ymin = boot_lo, ymax = boot_up), \n position = position_dodge(width = 0.5))\n```\n\n::: {.cell-output-display}\n![](03-ww_visualisation_files/figure-html/unnamed-chunk-24-1.png){width=672}\n:::\n:::\n\n\nWe've used a few tricks for this visualisation: \n\n- The `%in%` operator is used to select rows that match either of those samples.\n- We've added the `colour = sample` aesthetic, to colour points based on the sample.\n- `position_dodge()` is used to shift the points of the two samples so they are not overlapping.\n\nThis visualisation reveals much less uncertainty when summarising the lineages into variants of concern. \nThis makes some sense, as multiple lineages are combined together for a given VOC, so even though the uncertainty of individual lineages may be high, this uncertainty is reduced when looking at the summarised results. \n\nThis reveals a clear tradeoff between looking at individual lineages and summarised VOC abundances. \nOn the one hand we get more information about specific lineages detected, but their abundances may have high uncertainty due to sequencing. 
\nOn the other hand we get more precise abundance estimates for VOCs, but lose the detail of which specific lineages of those variants were present in our samples. \n\nIn conclusion, looking at both results is useful. \nWe can start with an analysis of VOCs to get the bigger picture of which variants are circulating, and later explore individual lineages detected in the samples. \nWhen looking at individual lineages, we should prioritise samples with higher sequencing depth, to get a more precise picture.\n:::\n:::\n\n\n## Freyja dashboard\n\nIn this chapter we covered the use of _R_ to generate these visualisations. \nThis is because of the flexibility of this language, which allows us to perform many different types of visualisation and data exploration. \n\nHowever, the _Freyja_ developers provide a command to generate a dashboard with a barplot of VOC abundances. \nWe have found this solution less flexible, and it requires configuring the metadata file in a very specific format.\n\nThere are two steps involved: \n\n- `freyja aggregate` is used to combine the results of multiple samples into a single file. \n- `freyja dash` is then used to create the dashboard.\n\nThe [_Freyja_ documentation](https://github.com/andersen-lab/Freyja#additional-options) gives more details about how to use these commands. \nYou will need to pay attention to the specific file format for metadata, which requires certain columns to be present in order for the commands to work. \n\n\n## Summary \n\n:::{.callout-tip}\n## Key points\n\n- There are several useful exploratory analyses that can be done on variant abundances:\n - Which variants were detected at high frequency.\n - How many times each variant was detected across all samples.\n - How many samples were collected weekly in different regions.\n- Variant abundance over time can be displayed as barplots, heatmaps or line plots.\n- Samples with higher sequencing depth usually have narrower confidence intervals compared to those with low sequencing depth. \n- Analysis of both lineage and VOC abundance is useful:\n - VOC abundance gives more stable estimates and provides a \"big picture\" of the variants circulating in the population. \n - Lineage abundance gives more detail about which exact lineages of a variant are circulating in the population; however, the estimates are less precise, with wider confidence intervals. 
\n:::\n", "supporting": [ "03-ww_visualisation_files" ], diff --git a/_freeze/materials/04-wastewater/04-ww_mutations/execute-results/html.json b/_freeze/materials/04-wastewater/04-ww_mutations/execute-results/html.json index 7e50bcd..2017398 100644 --- a/_freeze/materials/04-wastewater/04-ww_mutations/execute-results/html.json +++ b/_freeze/materials/04-wastewater/04-ww_mutations/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "cefd3e3c107bf0e3726fcc6575106179", + "hash": "33e6bb9134c2d6871c2bb949401450f2", "result": { "engine": "knitr", - "markdown": "---\npagetitle: \"SARS Genomic Surveillance\"\n---\n\n\n# Mutation Analysis\n\n\n::: {.cell}\n\n:::\n\n\n:::{.callout-tip}\n## Learning Objectives\n\n- Perform exploratory analysis of mutations detected in wastewater samples.\n- Identify the most common types of mutations identified and their occurrence in different genes.\n- Produce a heatmap-style visualisation of mutation abundances over time.\n- Analyse individual mutations and estimate frequency confidence intervals from read counts.\n:::\n\n\n## Data import\n\nIn the previous chapter we investigated the variant/lineage abundances estimated by _Freyja_. \nA complementary analysis is to look at how the frequency of individual mutations changes over time. \nThis analysis is agnostic to which lineages those mutations occur in, and may even reveal new emerging mutations not yet characterised in the known _Pango_ lineages. \n\nFor this analysis, we can use the mutation table generated by `viralrecon`, which is saved in its output directory under `variants/ivar/variants_long_table.csv`.\n(Note: \"variants\" here is used to mean \"mutations\"; see the information box in @sec-lineages where this terminology is clarified. To avoid confusion we use the term \"mutation\" in this chapter). \n\nWe will do our analysis in R, so we start by loading the packages being used: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(janitor)\ntheme_set(theme_classic())\n```\n:::\n\n\nWe then import the data into R. \nIn this case we also \"clean\" the column names of this table, to simplify them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- read_csv(\"preprocessed/viralrecon/variants/ivar/variants_long_table.csv\")\nmutations <- clean_names(mutations)\n```\n:::\n\n\n\n\nThis is what the table looks like: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(mutations)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 16\n sample chrom pos ref alt filter dp ref_dp alt_dp af gene effect\n \n1 SRR1854… MN90… 14 A AC PASS 17 17 14 0.82 orf1… upstr…\n2 SRR1854… MN90… 241 C T PASS 45440 12 45428 1 orf1… upstr…\n3 SRR1854… MN90… 1549 C T PASS 2576 1747 827 0.32 orf1… synon…\n4 SRR1854… MN90… 1862 C T ft 10 7 3 0.3 orf1… misse…\n5 SRR1854… MN90… 2470 C T PASS 1105 641 464 0.42 orf1… synon…\n6 SRR1854… MN90… 2832 A G PASS 85 26 59 0.69 orf1… misse…\n# ℹ 4 more variables: hgvs_c , hgvs_p , hgvs_p_1letter ,\n# caller \n```\n\n\n:::\n:::\n\n\nThere are many columns in this table, their meaning is detailed in @sec-mutations. 
\nWe will only retain a few columns of interest to us: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- mutations |> \n # select columns of interest\n select(sample, pos, dp, alt_dp, af, gene, effect, hgvs_p_1letter) |> \n # rename one of the columns\n rename(aa_change = hgvs_p_1letter)\n```\n:::\n\n\nThe columns we retained are: \n\n- `sample` contains the sample name.\n- `pos` the position of the mutation in the reference genome.\n- `dp` the depth of sequencing (number of reads) at that position.\n- `alt_dp` the depth of sequencing of the non-reference (alternative) allele at that position.\n- `af` the allele frequency of the non-reference allele (equal to `alt_dp`/`dp`).\n- `gene` is the gene name in the reference genome.\n- `effect` is the predicted mutation effect. \n- `aa_change` (which we renamed from `hgvs_p_1letter`) is the amino acid change at that position, for non-synonymous mutations, following the [HGVS Nomenclature](https://hgvs-nomenclature.org/stable/) system.\n\nOur next step is to _merge_ this table with our metadata table, so we have information about the date of collection of each sample. \nWe start by importing the metadata table: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmetadata <- read_csv(\"sample_info.csv\")\n\n# look at the top few rows\nhead(metadata)\n```\n:::\n\n\n\n\nAs both this table and the table of mutations contain a column called \"sample\", we will _join_ the two tables based on it: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- full_join(mutations, metadata, by = \"sample\")\n\nhead(mutations)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 13\n sample pos dp alt_dp af gene effect aa_change date country\n \n1 SRR18541027 14 17 14 0.82 orf1… upstr… . 2022-01-04 United…\n2 SRR18541027 241 45440 45428 1 orf1… upstr… . 2022-01-04 United…\n3 SRR18541027 1549 2576 827 0.32 orf1… synon… p.S428S 2022-01-04 United…\n4 SRR18541027 1862 10 3 0.3 orf1… misse… p.L533F 2022-01-04 United…\n5 SRR18541027 2470 1105 464 0.42 orf1… synon… p.A735A 2022-01-04 United…\n6 SRR18541027 2832 85 59 0.69 orf1… misse… p.K856R 2022-01-04 United…\n# ℹ 3 more variables: city , latitude , longitude \n```\n\n\n:::\n:::\n\n\nWe now have our mutations along with the relevant metadata for each sample. \n\nFinally, we will give the values of our sample IDs and mutations an ordering based on the date (instead of the default alphabetical order): \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- mutations |> \n mutate(sample = fct_reorder(sample, date), \n aa_change = fct_reorder(aa_change, date))\n```\n:::\n\n\n\n## Exploratory analysis\n\nWe start by exploring our data with some simple summaries. \nFor example, how many mutations do we have of each effect?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations |> \n count(effect) |> \n mutate(effect = fct_reorder(effect, n)) |> \n ggplot(aes(x = n, y = effect)) + \n geom_col() +\n geom_label(aes(label = n))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nWe can see that the most common mutations are _missense_, i.e. causing an amino acid change. \nSeveral mutations are _synonymous_, which should be less impactful for the evolution of the virus. \nOther mutations are less common, and we will not focus on them in this analysis (although you may want to investigate them for other purposes). 
\n\nFor now, we will focus on missense mutations, as these have the potential to change the properties of the virus and new emerging lineages may be due to a novel adaptive mutation that changes an amino acid in one of the genes. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense <- mutations |> \n filter(effect == \"missense_variant\")\n```\n:::\n\n\nHow many of these mutations do we have in each gene?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n count(gene) |> \n mutate(gene = fct_reorder(gene, n)) |> \n ggplot(aes(x = n, y = gene)) + \n geom_col() +\n geom_label(aes(label = n))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nThe majority of mutations are in the _S_ and _ORF1ab_ genes. \nLet's investigate how mutations change over time.\n\n## Mutation frequency analysis\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nFrom this plot, we can see a \"step\" change in the observed mutations, which is likely due to the change in lineages over time. \nWe can also see some mutations that are quite frequent across many samples (they appear as horizontal strips in the plot). \nThese are likely mutations shared across several lineages. \nFinally, we can see a \"block\" of mutations appearing around Dec 2021, which are likely the Omicron mutations rising in frequency.\n\nNote that, unlike with _Freyja_, this analysis does not rely on prior knowledge of the lineages, making it suitable for detecting new emerging mutations.\nThis kind of visualisation is therefore useful to identify emerging mutations, as they would be visible as a new horizontal \"strip\" appearing on the plot. \n\n### Exercise\n\n:::{.callout-exercise}\n\nOne issue with the above plot is that we cannot see all the mutation names on the x-axis, as their names were overlapping. \n\n- Modify the code shown above to show the mutations present in the _Spike_ gene only.\n- Then do the same for _orf1ab_ gene.\n\nWhich of these genes do you think helps distinguish sub-lineages of Omicron more effectively?\n\n:::{.callout-hint}\nRemember that you can use the `filter()` function to subset the table to keep only rows of interest. \n:::\n\n:::{.callout-answer}\n\nTo look at the mutations in the _Spike_ gene only, we can use the `filter()` function, to retain the rows that match this gene only. 
\nWe then pipe the output to the same code we used before.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(gene == \"S\") |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nThe same code can be reused to also look at the mutations from _orf1ab_:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(gene == \"orf1ab\") |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nWe can see that the mutations in _orf1ab_ change more often than those of the _Spike_ gene. \nThese mutations likely distinguish individual lineages, whereas the profile of the _Spike_ is generally more similar across all the lineages belonging to the Omicron variant.\n\n:::\n:::\n\n## Individual mutations\n\nWe may also be interested in looking at more details about the frequencies of individual mutations. \nFor this, it may help to calculate a _confidence interval_ for the mutation frequency, based on the counts of reads observed (i.e. the sequencing depth). \nOne way to calculate such a confidence interval is to use the so-called [Jeffreys interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Jeffreys_interval), which is based on the _Beta_ distribution. \nIn R, we can calculate this as follows: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense <- missense |> \n mutate(af_lo = qbeta(0.025, alt_dp + 0.5, (dp - alt_dp) + 0.5),\n af_hi = qbeta(0.975, alt_dp + 0.5, (dp - alt_dp) + 0.5))\n```\n:::\n\n\nOne possible visualisation is to consider the mutations in individual samples, shown as a plot across the genome:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(sample == \"SRR18541114\") |> \n ggplot(aes(pos, af, colour = gene)) + \n geom_pointrange(aes(ymin = af_lo, ymax = af_hi)) +\n scale_y_continuous(limits = c(0, 1))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nOr, we can focus on an individual mutation and plot it over time:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(aa_change == \"p.K856R\") |> \n ggplot(aes(date, af)) +\n geom_pointrange(aes(ymin = af_lo, ymax = af_hi)) +\n scale_y_continuous(limits = c(0, 1))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\n:::{.callout-note}\n#### Missing data\n\nThe way `viralrecon` performs variant (mutation) calling, does not allow distinguishing whether the absence of a mutation from the table is due to missing data or not. \nTwo things may have happened:\n\n- There was no coverage in a position (a \"gap\" in the sequencing), meaning there is missing data for that position.\n- All the reads mapped to that position carried the reference genome allele, so no mutation was reported. 
This could mean either the mutation trully had a frequency of zero or was low-frequency and was missed by chance. \n\nAs we cannot distinguish between these two possibilities, it is important to keep in mind that the absence of a mutation from our table does not necessarily mean the mutation was not present in the sample. \nIt only means that we were not able to detect it. \n:::\n\n\n## Summary\n\n:::{.callout-tip}\n## Key Points\n\n- The software _R_ can be used to import the mutations CSV file generated by `viralrecon`. \n- The mutations file can be _joined_ with metadata table, to assess the occurrence of mutations over time. \n- Barplots can be useful to visualise the counts of mutation types detected and the genes they occur in. \n- Heatmaps can be used to visualise mutation abundances over time. Mutations that become more frequent over time appear as \"blocks\" in the plot.\n:::\n", + "markdown": "---\npagetitle: \"SARS Genomic Surveillance\"\n---\n\n\n# Mutation Analysis\n\n\n::: {.cell}\n\n:::\n\n\n:::{.callout-tip}\n## Learning Objectives\n\n- Perform exploratory analysis of mutations detected in wastewater samples.\n- Identify the most common types of mutations identified and their occurrence in different genes.\n- Produce a heatmap-style visualisation of mutation abundances over time.\n- Analyse individual mutations and estimate frequency confidence intervals from read counts.\n:::\n\n\n## Data import\n\nIn the previous chapter we investigated the variant/lineage abundances estimated by _Freyja_. \nA complementary analysis is to look at how the frequency of individual mutations changes over time. \nThis analysis is agnostic to which lineages those mutations occur in, and may even reveal new emerging mutations not yet characterised in the known _Pango_ lineages. \n\nFor this analysis, we can use the mutation table generated by `viralrecon`, which is saved in its output directory under `variants/ivar/variants_long_table.csv`.\n(Note: \"variants\" here is used to mean \"mutations\"; see the information box in @sec-lineages where this terminology is clarified. To avoid confusion we use the term \"mutation\" in this chapter). \n\nWe will do our analysis in R, so we start by loading the packages being used: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(janitor)\ntheme_set(theme_classic())\n```\n:::\n\n\nWe then import the data into R. \nIn this case we also \"clean\" the column names of this table, to simplify them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- read_csv(\"preprocessed/viralrecon/variants/ivar/variants_long_table.csv\")\nmutations <- clean_names(mutations)\n```\n:::\n\n\n\n\nThis is what the table looks like: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(mutations)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 16\n sample chrom pos ref alt filter dp ref_dp alt_dp af gene effect\n \n1 SRR1854… MN90… 14 A AC PASS 17 17 14 0.82 orf1… upstr…\n2 SRR1854… MN90… 241 C T PASS 45440 12 45428 1 orf1… upstr…\n3 SRR1854… MN90… 1549 C T PASS 2576 1747 827 0.32 orf1… synon…\n4 SRR1854… MN90… 1862 C T ft 10 7 3 0.3 orf1… misse…\n5 SRR1854… MN90… 2470 C T PASS 1105 641 464 0.42 orf1… synon…\n6 SRR1854… MN90… 2832 A G PASS 85 26 59 0.69 orf1… misse…\n# ℹ 4 more variables: hgvs_c , hgvs_p , hgvs_p_1letter ,\n# caller \n```\n\n\n:::\n:::\n\n\nThere are many columns in this table, their meaning is detailed in @sec-mutations. 
\nWe will only retain a few columns of interest to us: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- mutations |> \n # select columns of interest\n select(sample, pos, dp, alt_dp, af, gene, effect, hgvs_p_1letter) |> \n # rename one of the columns\n rename(aa_change = hgvs_p_1letter)\n```\n:::\n\n\nThe columns we retained are: \n\n- `sample` contains the sample name.\n- `pos` the position of the mutation in the reference genome.\n- `dp` the depth of sequencing (number of reads) at that position.\n- `alt_dp` the depth of sequencing of the non-reference (alternative) allele at that position.\n- `af` the allele frequency of the non-reference allele (equal to `alt_dp`/`dp`).\n- `gene` is the gene name in the reference genome.\n- `effect` is the predicted mutation effect. \n- `aa_change` (which we renamed from `hgvs_p_1letter`) is the amino acid change at that position, for non-synonymous mutations, following the [HGVS Nomenclature](https://hgvs-nomenclature.org/stable/) system.\n\nOur next step is to _merge_ this table with our metadata table, so we have information about the date of collection of each sample. \nWe start by importing the metadata table: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmetadata <- read_csv(\"sample_info.csv\")\n\n# look at the top few rows\nhead(metadata)\n```\n:::\n\n\n\n\nAs both this table and the table of mutations contain a column called \"sample\", we will _join_ the two tables based on it: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- full_join(mutations, metadata, by = \"sample\")\n\nhead(mutations)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 13\n sample pos dp alt_dp af gene effect aa_change date country\n \n1 SRR18541027 14 17 14 0.82 orf1… upstr… . 2022-01-04 United…\n2 SRR18541027 241 45440 45428 1 orf1… upstr… . 2022-01-04 United…\n3 SRR18541027 1549 2576 827 0.32 orf1… synon… p.S428S 2022-01-04 United…\n4 SRR18541027 1862 10 3 0.3 orf1… misse… p.L533F 2022-01-04 United…\n5 SRR18541027 2470 1105 464 0.42 orf1… synon… p.A735A 2022-01-04 United…\n6 SRR18541027 2832 85 59 0.69 orf1… misse… p.K856R 2022-01-04 United…\n# ℹ 3 more variables: city , latitude , longitude \n```\n\n\n:::\n:::\n\n\nWe now have our mutations along with the relevant metadata for each sample. \n\nFinally, we will give the values of our sample IDs and mutations an ordering based on the date (instead of the default alphabetical order): \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations <- mutations |> \n mutate(sample = fct_reorder(sample, date), \n aa_change = fct_reorder(aa_change, date))\n```\n:::\n\n\n\n## Exploratory analysis\n\nWe start by exploring our data with some simple summaries. \nFor example, how many mutations do we have of each effect?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations |> \n count(effect) |> \n mutate(effect = fct_reorder(effect, n)) |> \n ggplot(aes(x = n, y = effect)) + \n geom_col() +\n geom_label(aes(label = n))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nWe can see that the most common mutations are _missense_, i.e. causing an amino acid change. \nSeveral mutations are _synonymous_, which should be less impactful for the evolution of the virus. \nOther mutations are less common, and we will not focus on them in this analysis (although you may want to investigate them for other purposes). 
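\n\nIf you prefer to see these counts as a table rather than a barplot, a small sketch of the same summary (reusing the `count()` function) also adds the percentage of mutations in each category:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutations |> \n count(effect, sort = TRUE) |> \n # percentage of the total number of mutation records\n mutate(percent = round(n / sum(n) * 100, 1))\n```\n:::\n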
\n\nFor now, we will focus on missense mutations, as these have the potential to change the properties of the virus; a newly emerging lineage may be driven by a novel adaptive mutation that changes an amino acid in one of the genes. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense <- mutations |> \n filter(effect == \"missense_variant\")\n```\n:::\n\n\nHow many of these mutations do we have in each gene?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n count(gene) |> \n mutate(gene = fct_reorder(gene, n)) |> \n ggplot(aes(x = n, y = gene)) + \n geom_col() +\n geom_label(aes(label = n))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nThe majority of mutations are in the _S_ and _ORF1ab_ genes. \nLet's investigate how mutations change over time.\n\n## Mutation frequency analysis\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nFrom this plot, we can see a \"step\" change in the observed mutations, which is likely due to the change in lineages over time. \nWe can also see some mutations that are quite frequent across many samples (they appear as horizontal strips in the plot). \nThese are likely mutations shared across several lineages. \nFinally, we can see a \"block\" of mutations appearing around Dec 2021, which are likely the Omicron mutations rising in frequency.\n\nNote that, unlike with _Freyja_, this analysis does not rely on prior knowledge of the lineages.\nThis makes it useful for identifying new emerging mutations, which would be visible as a new horizontal \"strip\" appearing on the plot. \n\n### Exercise\n\n:::{.callout-exercise}\n\nOne issue with the above plot is that we cannot see all the mutation names on the y-axis, as the labels overlap. \n\n- Modify the code shown above to show the mutations present in the _Spike_ gene only.\n- Then do the same for the _orf1ab_ gene.\n\nWhich of these genes do you think helps distinguish sub-lineages of Omicron more effectively?\n\n:::{.callout-hint}\nRemember that you can use the `filter()` function to subset the table to keep only rows of interest. \n:::\n\n:::{.callout-answer}\n\nTo look at the mutations in the _Spike_ gene only, we can use the `filter()` function to retain only the rows that match this gene. 
\nWe then pipe the output to the same code we used before.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(gene == \"S\") |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nThe same code can be reused to also look at the mutations from _orf1ab_:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(gene == \"orf1ab\") |> \n ggplot(aes(factor(date), aa_change, fill = af)) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 45)) +\n scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +\n scale_fill_viridis_c(limits = c(0, 1)) +\n labs(x = \"Sample (by date)\", y = \"AA change\", fill = \"Frequency\")\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nWe can see that the mutations in _orf1ab_ change more often than those in the _Spike_ gene. \nThese mutations likely distinguish individual lineages, whereas the _Spike_ profile is generally more similar across all the lineages belonging to the Omicron variant.\n\n:::\n:::\n\n## Individual mutations\n\nWe may also be interested in looking at the frequencies of individual mutations in more detail. \nFor this, it may help to calculate a _confidence interval_ for the mutation frequency, based on the counts of reads observed (i.e. the sequencing depth). \nOne way to calculate such a confidence interval is to use the so-called [Jeffreys interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Jeffreys_interval), which is based on quantiles of the _Beta_ distribution. \nIn R, we can calculate this as follows: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense <- missense |> \n mutate(af_lo = qbeta(0.025, alt_dp + 0.5, (dp - alt_dp) + 0.5),\n af_up = qbeta(0.975, alt_dp + 0.5, (dp - alt_dp) + 0.5))\n```\n:::\n\n\nOne possible visualisation is to consider the mutations in individual samples, shown as a plot across the genome:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(sample == \"SRR18541114\") |> \n ggplot(aes(pos, af, colour = gene)) + \n geom_pointrange(aes(ymin = af_lo, ymax = af_up)) +\n scale_y_continuous(limits = c(0, 1))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nOr, we can focus on an individual mutation and plot it over time:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmissense |> \n filter(aa_change == \"p.K856R\") |> \n ggplot(aes(date, af)) +\n geom_pointrange(aes(ymin = af_lo, ymax = af_up)) +\n scale_y_continuous(limits = c(0, 1))\n```\n\n::: {.cell-output-display}\n![](04-ww_mutations_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\n:::{.callout-note}\n#### Missing data\n\nThe way `viralrecon` performs variant (mutation) calling does not allow us to distinguish whether the absence of a mutation from the table is due to missing data. \nTwo things may have happened:\n\n- There was no coverage in a position (a \"gap\" in the sequencing), meaning there is missing data for that position.\n- All the reads mapped to that position carried the reference genome allele, so no mutation was reported. 
This could mean either the mutation trully had a frequency of zero or was low-frequency and was missed by chance. \n\nAs we cannot distinguish between these two possibilities, it is important to keep in mind that the absence of a mutation from our table does not necessarily mean the mutation was not present in the sample. \nIt only means that we were not able to detect it. \n:::\n\n\n## Summary\n\n:::{.callout-tip}\n## Key Points\n\n- The software _R_ can be used to import the mutations CSV file generated by `viralrecon`. \n- The mutations file can be _joined_ with metadata table, to assess the occurrence of mutations over time. \n- Barplots can be useful to visualise the counts of mutation types detected and the genes they occur in. \n- Heatmaps can be used to visualise mutation abundances over time. Mutations that become more frequent over time appear as \"blocks\" in the plot.\n:::\n", "supporting": [ "04-ww_mutations_files" ], diff --git a/course_files/r_demo/lineage_abundances.csv b/course_files/r_demo/lineage_abundances.csv index 4297fe8..f0d87cd 100644 --- a/course_files/r_demo/lineage_abundances.csv +++ b/course_files/r_demo/lineage_abundances.csv @@ -1,4 +1,4 @@ -sample,name,abundance,boot_lo,boot_hi,date,country,city,latitude,longitude +sample,name,abundance,boot_lo,boot_up,date,country,city,latitude,longitude SRR18541074,AY.44,0.26583286,0.22062022336636045,0.2971052371373705,2021-12-01,United States,San Diego,32.719875,-117.170082 SRR18541074,AY.3,0.17287380,0.15684817412629423,0.19134365770536377,2021-12-01,United States,San Diego,32.719875,-117.170082 SRR18541074,AY.100,0.06591169,0.0621177963944036,0.07690953917545072,2021-12-01,United States,San Diego,32.719875,-117.170082 diff --git a/course_files/r_demo/vocs_abundances.csv b/course_files/r_demo/vocs_abundances.csv index ac92082..2ed42e0 100644 --- a/course_files/r_demo/vocs_abundances.csv +++ b/course_files/r_demo/vocs_abundances.csv @@ -1,4 +1,4 @@ -sample,name,abundance,boot_lo,boot_hi,date,country,city,latitude,longitude +sample,name,abundance,boot_lo,boot_up,date,country,city,latitude,longitude SRR18541074,Delta,0.9640087907444956,0.9596984293917038,0.9682214940819226,2021-12-01,United States,San Diego,32.719875,-117.170082 SRR18541074,Omicron,0.019682409750709555,0.013233836060878133,0.02276906228234253,2021-12-01,United States,San Diego,32.719875,-117.170082 SRR18541074,Other,0.0012202492036713667,0.0,0.008722550206162935,2021-12-01,United States,San Diego,32.719875,-117.170082 diff --git a/materials/04-wastewater/03-ww_visualisation.qmd b/materials/04-wastewater/03-ww_visualisation.qmd index 55b9ea4..eae03a9 100644 --- a/materials/04-wastewater/03-ww_visualisation.qmd +++ b/materials/04-wastewater/03-ww_visualisation.qmd @@ -313,7 +313,7 @@ lineages |> mutate(name = fct_reorder(name, abundance)) |> # make the plot ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) + - geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi)) + + geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) + labs(x = "Lineage", y = "Abundance (95% CI)", colour = "< 5%") ``` @@ -340,7 +340,7 @@ lineages |> filter(sample == "SRR18541092") |> mutate(name = fct_reorder(name, abundance)) |> ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) + - geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi)) + + geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) + scale_x_discrete(guide = guide_axis(angle = 45)) + labs(x = "Lineage", y = "Abundance (95% CI)", colour = "< 5%", main = "Sample: SRR18541092") @@ -349,7 +349,7 @@ lineages 
|> filter(sample == "SRR18541114") |> mutate(name = fct_reorder(name, abundance)) |> ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) + - geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi)) + + geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) + scale_x_discrete(guide = guide_axis(angle = 45)) + labs(x = "Lineage", y = "Abundance (95% CI)", colour = "< 5%", main = "Sample: SRR18541114") @@ -375,7 +375,7 @@ sandiego |> filter(sample %in% c("SRR18541092", "SRR18541114")) |> # make the plot ggplot(aes(x = name, y = abundance, colour = sample)) + - geom_pointrange(aes(ymin = boot_lo, ymax = boot_hi), + geom_pointrange(aes(ymin = boot_lo, ymax = boot_up), position = position_dodge(width = 0.5)) ``` diff --git a/materials/04-wastewater/04-ww_mutations.qmd b/materials/04-wastewater/04-ww_mutations.qmd index 812840b..63f36ef 100644 --- a/materials/04-wastewater/04-ww_mutations.qmd +++ b/materials/04-wastewater/04-ww_mutations.qmd @@ -236,7 +236,7 @@ In R, we can calculate this as follows: ```{r} missense <- missense |> mutate(af_lo = qbeta(0.025, alt_dp + 0.5, (dp - alt_dp) + 0.5), - af_hi = qbeta(0.975, alt_dp + 0.5, (dp - alt_dp) + 0.5)) + af_up = qbeta(0.975, alt_dp + 0.5, (dp - alt_dp) + 0.5)) ``` One possible visualisation is to consider the mutations in individual samples, shown as a plot across the genome: @@ -245,7 +245,7 @@ One possible visualisation is to consider the mutations in individual samples, s missense |> filter(sample == "SRR18541114") |> ggplot(aes(pos, af, colour = gene)) + - geom_pointrange(aes(ymin = af_lo, ymax = af_hi)) + + geom_pointrange(aes(ymin = af_lo, ymax = af_up)) + scale_y_continuous(limits = c(0, 1)) ``` @@ -255,7 +255,7 @@ Or, we can focus on an individual mutation and plot it over time: missense |> filter(aa_change == "p.K856R") |> ggplot(aes(date, af)) + - geom_pointrange(aes(ymin = af_lo, ymax = af_hi)) + + geom_pointrange(aes(ymin = af_lo, ymax = af_up)) + scale_y_continuous(limits = c(0, 1)) ``` diff --git a/materials/04-wastewater/ww_notes.md b/materials/04-wastewater/ww_notes.md index 5502ea8..382781e 100644 --- a/materials/04-wastewater/ww_notes.md +++ b/materials/04-wastewater/ww_notes.md @@ -199,4 +199,4 @@ pagetitle: "SARS Genomic Surveillance" Trainers: - Amanda Qvesel (bioinformatics) - amqv@ssi.dk -- \ No newline at end of file + diff --git a/utils/wastewater/tidy_freyja_viralrecon.py b/utils/wastewater/tidy_freyja_viralrecon.py index 3ec618e..3353e93 100644 --- a/utils/wastewater/tidy_freyja_viralrecon.py +++ b/utils/wastewater/tidy_freyja_viralrecon.py @@ -135,7 +135,7 @@ def parse_bootstrap(file_path): # zip them up boot = list(zip(lineage_names, q1, q7)) # convert to data frame - boot_df = pd.DataFrame(boot, columns=["name", "boot_lo", "boot_hi"]) + boot_df = pd.DataFrame(boot, columns=["name", "boot_lo", "boot_up"]) return boot_df