diff --git a/materials/04-wastewater/01-wastewater_surveillance.md b/materials/04-wastewater/00-ww_temp.md similarity index 95% rename from materials/04-wastewater/01-wastewater_surveillance.md rename to materials/04-wastewater/00-ww_temp.md index 0f7484c..f31e643 100644 --- a/materials/04-wastewater/01-wastewater_surveillance.md +++ b/materials/04-wastewater/00-ww_temp.md @@ -8,4 +8,5 @@ pagetitle: "SARS Genomic Surveillance" Under development -::: \ No newline at end of file +::: + diff --git a/materials/04-wastewater/01-ww_surveillance.md b/materials/04-wastewater/01-ww_surveillance.md new file mode 100644 index 0000000..6e53d6f --- /dev/null +++ b/materials/04-wastewater/01-ww_surveillance.md @@ -0,0 +1,62 @@ +--- +pagetitle: "SARS Genomic Surveillance" +--- + +# Wastewater surveillance + +:::{.callout-caution} + +Under development + +::: + +Several studies demonstrate a high correlation between wastewater incidence and clinical sampling: + +- [Predicting COVID-19 Incidence Using Wastewater Surveillance Data, Denmark, October 2021–June 2022](https://doi.org/10.3201/eid2908.221634) +- [Wastewater-Based Surveillance Is an Effective Tool for Trending COVID-19 Prevalence in Communities: A Study of 10 Major Communities for 17 Months in Alberta](https://doi.org/10.1021%2Facsestwater.2c00143) +- [Computational analysis of SARS-CoV-2/COVID-19 surveillance by wastewater-based epidemiology locally and globally: Feasibility, economy, opportunities and challenges](https://doi.org/10.1016/j.scitotenv.2020.138875) + - Computational modelling of factors affecting WBE + - + +For lineage relative abundance estimates: + +- [Correlation between Clinical and Wastewater SARS-CoV-2 Genomic Surveillance, Oregon, USA](https://doi.org/10.3201%2Feid2809.220938) +- + +Cons: + +- Heterogenous samples + - challenges in protocols for concentrating/cleaning the material +- RNA degradation + - During travel time in the sewage system + - Seasonal temperature - not adjusting for seasonal variation can lead to very misleading figures (that's why normalisation with an independent marker is essential) +- Requires access to central wastewater collection systems (wastewater treatment plants). Not suitable for populations using septic tanks or lacking access to sewage systems. + + +Two main methods of WBE: + +- Incidence: RT-qPCR +- Relative lineage abundance: WGS + +The two methods are complementary. +There is no easy way to estimate the incidence of the virus based on sequencing data alone. +Conversely, there is no way to determine the variants present in a sample using RT-qPCR. + +The two methods can be used independently (depending on the objectives of the monitoring) or in combination with each other. +For example, the relative lineage abundance can be converted to viral numbers if RT-qPCR data is available to do that calculation. + + +## Sample preparation + +The virus is very diluted. +Therefore, an important step is required where the virus is concentrated in the sample. +There are several methods, none of which is deemed the "best". +Different methods may work better or worse for different samples and situations. +Also depends on resources available (equipment and reagents). + + +## Glossary + +- WBE - wastewater-based epidemiology +- WWTP - wastewater treatment plants +- WGS - whole-genome sequencing \ No newline at end of file diff --git a/materials/04-wastewater/02-ww_abundance.md b/materials/04-wastewater/02-ww_abundance.md new file mode 100644 index 0000000..2096d6f --- /dev/null +++ b/materials/04-wastewater/02-ww_abundance.md @@ -0,0 +1,115 @@ +--- +pagetitle: "SARS Genomic Surveillance" +--- + +# Wastewater surveillance + +:::{.callout-caution} + +Under development + +::: + + +## Bioinformatic analysis + +![](images/workflow_wastewater_overview.svg) + +The command to run our analysis is very similar to what was explained in the [consensus assembly of clinical isolates](../02-isolates/01-consensus.md): + +```bash +nextflow run nf-core/viralrecon \ + -r dev -profile singularity \ + --platform illumina \ + --input SAMPLESHEET_CSV \ + --outdir OUTPUT_DIRECTORY \ + --protocol amplicon \ + --genome 'MN908947.3' \ + --primer_set artic \ + --primer_set_version PRIMER_VERSION \ + --skip_assembly --skip_asciigenome \ + --skip_consensus +``` + +The key differences are: + +- `-r dev`: we are using the **development version** of the pipeline, which is not yet released as an official version. + The reason is that Freyja is currently (Feb 2024) only available in the development version. + Once a new version of the pipleine is released (>= 2.7), it should no longer be necessary to use the development version. +- `--skip_consensus`: in this case we skip the generation of a consensus FASTA file. + Since wastewater samples are mixed, it doesn't make sense to generate a consensus FASTA file, as it would potentially include mutations from different lineages in the final consensus, creating artificial recombinant genomes. + +### Depth threshold + +There are [other options](https://nf-co.re/viralrecon/dev/parameters#skip_freyja) that can be specified with `viralrecon` to modify the default behaviour of Freyja. +A couple that are worth highlighting are: + +- `--freyja_depthcutoff` determines the minimum depth of sequencing for a site to be used in the Freyja analysis. The default '0', i.e. Freyja uses all the sites in its analysis, as long as at least 1 read is mapped to the reference. + This may seem insufficient, however recall that Freyja will use all the mutations across the genome, so as long as the average depth of sequencing is high enough (e.g. >100x), then even if some of the sites have low depth the accuracy of the final estimates might still be good. + +### Confidence intervals + +- `--freyja_repeats` determines the number of bootstrap repeats used to estimate confidence intervals for the abundance estimates. The default is 100 bootstrap replicates, which should be fine to get an idea of the uncertainty of those estimates. However, this step is very computationally demanding, substantially increasing the time to run the analysis (it is, essentially, running the Freyja step 100 times for each sample). Therefore, it can sometimes be advantageous to "turn off" this option by setting `--freyja_repeats 1`. In that case, make sure to ignore the confidence intervals that are output by the pipeline. + + +## Output Files + +After running the pipeline, we get several output files. +These are very similar to what has been described for [clinical isolates](../02-isolates/02-qc.md). +As a reminder, here are some files of interest: + +- `multiqc/multiqc_report.html`: a MultiQC quality report, including information about average depth of coverage, fraction of the genome covered, number of mutations identified, etc. +- `variants/bowtie2/`: directory with individual BAM files, which can be visualised with _IGV_ if we want to look at the reads mapped to the reference genome. +- `variants/ivar/variants_long_table.csv`: a CSV file with the aggregated results of all the mutations detected in each sample. + +The new directories of particular interest for wastewater analysis are: + +- `variants/freyja/demix`: directory containing the output files from the `freyja demix` command, used to estimate lineage abundances. +- `variants/freyja/bootstrap`: directory containing the output files from the `freyja boot` command, used to get confidence intervals for the estimated lineage abundances. +- `variants/freyja/variants`: directory containing the output files from the `freyja variants` command, which includes all the variants (mutations) considered by Freyja in its lineage abundance calculations. These are somewhat redundant with the files in `variants/ivar/variants_long_table.csv`. + + +## Freyja Output + +Freyja outputs files in a non-standard format. +Here is an example from one of our samples: + +``` + SRR18541029.variants.tsv +summarized [('Omicron', 0.9074885644836951), ('Delta', 0.04843510729864781), ('Other', 0.03151409250566481)] +lineages BA.1 BA.1.1.8 BA.1.20 AY.116 BA.1.8 B.1.1.529 XS BA.1.1.17 XP XD BA.1.1.5 BA.1.1.10 AY.39 AY.46.1 AY.3.1 +abundances 0.42763980 0.33859152 0.04909740 0.04170587 0.03905280 0.02662011 0.02634620 0.00904814 0.00811069 0.00516789 0.00476190 0.00456621 0.00277376 0.00206299 0.00189249 +resid 13.531715333760909 +coverage 98.86633448149016 +``` + +Here is the meaning of each line from this file: + +1. The name of the file +2. The frequency of variants of concern, which is added up based on the frequency of individual lineages. +3. The name of each individual lineage detected in the sample. +4. The corresponding frequencies of each lineage from the previous line, in descending order. +5. The "residual" variation left from the statistical model used to estimate the frequencies; this value does not have an easy interpretation. +6. The percentage of the genome covered at 10x depth. We can also obtain this information from the regular MultiQC report. + + +## Mutation co-occurrence + +As mentioned, low-frequency lineages pose a challenge in the analysis of mixed samples, as it becomes difficult to distinguish their presence from technical background noise (e.g. due to sequencing errors). +An approach to help distinguish errors from the true presence of a rare lineage is to look for mutations that co-occur in the same read or read pair. +As it would be very unlikely that two (or more) errors would occur in exactly the same read pair, the presence of co-occurring mutations provide a high-confidence signal that the respective lineage is present in the sample. + +This approach requires that the lineages of interest have enough distinguishing mutations within close proximity to each other (i.e. in the same PCR amplicon). + +Another application of co-occurring mutations is to provide a stronger evidence for new emerging lineages. +If the same pattern of co-occurring mutations is observed across multiple samples and at increasing frequencies over time, it may be cause for concern and require close monitoring. +See, for example, the study of [Jahn et al 2022](https://doi.org/10.1038/s41564-022-01185-x), where they demonstrate that co-occurring mutations characteristic of the Alpha variant were detected in wastewater samples from two Swiss cities around 2 weeks before its first report from clinical samples. + + + + +## Glossary + +- WBE - wastewater-based epidemiology +- WWTP - wastewater treatment plants +- WGS - whole-genome sequencing \ No newline at end of file diff --git a/materials/04-wastewater/sampling_variation.r b/materials/04-wastewater/sampling_variation.r new file mode 100644 index 0000000..20e76ce --- /dev/null +++ b/materials/04-wastewater/sampling_variation.r @@ -0,0 +1,188 @@ +# load packages +library(tidyverse) +library(patchwork) + +# real frequencies +x <- letters[c(rep(1, 5), rep(2, 10), rep(3, 10), rep(4, 25), rep(5, 50))] + +# Take 10 samples of size N each +set.seed(20240216) +N <- 10 +sim <- lapply(1:10, function(i){ + out <- x |> + sample(N, replace = TRUE) |> + table() |> prop.table() |> + as.data.frame() + out$sample <- as.character(stringr::str_pad(i, 2, pad = 0)) + names(out) <- c("variant", "frequency", "sample") + return(out) + }) +sim <- do.call(rbind, sim) + +# add the truth +temp <- x |> + table() |> + prop.table() |> + as.data.frame() |> + transform(rep = "truth") +names(temp) <- c("variant", "frequency", "sample") +sim <- rbind(temp, sim) + +# visualise +sim |> + ggplot(aes(sample, frequency)) + + geom_col(aes(fill = variant)) + + scale_fill_manual(values = c("#95AAD3", "#C195C4", "#3A68AE", "#C25757", "#94C47D")) + + theme_classic() + + +# Illustrating sampling +example <- data.frame(sample = x) +example <- crossing(x = 1:10, y = 1:10) |> + cbind(example) + +set.seed(20240216) +p1 <- example |> + ggplot(aes(x, y, colour = sample)) + + geom_point() + +p2 <- example |> + sample_frac(0.5) |> + ggplot(aes(x, y, colour = sample)) + + geom_point() + +p3 <- example |> + sample_frac(0.5) |> + sample_frac(0.8) |> + ggplot(aes(x, y, colour = sample)) + + geom_point() + +p4 <- example |> + sample_frac(0.5) |> + sample_frac(0.8) |> + sample_frac(0.8) |> + ggplot(aes(x, y, colour = sample)) + + geom_point() + +# assemble +(p1 | p2 | p3 | p4) + + plot_layout(guides = "collect") & + coord_equal(xlim = c(1, 10), ylim = c(1, 10)) & + scale_colour_manual(values = c("a" = "#95AAD3", + "b" = "#C195C4", + "c" = "#3A68AE", + "d" = "#C25757", + "e" = "#94C47D")) & + theme_void() & + theme(panel.border = element_rect(fill = NA)) + + +# uncertainty in the estimated frequencies +# using Jeffreys interval estimation using the Beta +crossing(freq = c(0.01, 0.05, 0.1, 0.5, 0.9), depth = seq(10, 1000, 1)) |> + mutate(lo = qbeta(0.025, depth*freq+0.5, depth*(1-freq)+0.5), + hi = qbeta(0.975, depth*freq+0.5, depth*(1-freq)+0.5)) |> + ggplot(aes(depth, freq)) + + geom_line(aes(colour = factor(freq))) + + geom_ribbon(aes(ymin = lo, ymax = hi, fill = factor(freq)), alpha = 0.3) + + theme_bw() + + scale_fill_viridis_d() + scale_colour_viridis_d() + + scale_x_continuous(breaks = seq(0, 10000, 200)) + + scale_y_continuous(breaks = seq(0, 1, 0.2)) + + labs(x = "Sequencing depth", y = "Estimated frequency\n(95% confidence interval)", + subtitle = "Decrease in uncertainty as sequencing depth increases", + colour = "Starting\nfrequency", fill = "Starting\nfrequency") + +# uncertainty interval for a variant at 50% frequency +crossing(freq = 0.5, depth = seq(10, 2000, 1)) |> + mutate(lo = qbeta(0.025, depth*freq+0.5, depth*(1-freq)+0.5), + hi = qbeta(0.975, depth*freq+0.5, depth*(1-freq)+0.5)) |> #View() + ggplot(aes(depth, hi - lo)) + + geom_line(aes(group = factor(freq)), linewidth = 2) + + theme_bw() + + scale_colour_viridis_d() + + scale_x_continuous(breaks = seq(0, 10000, 200)) + + labs(x = "Sequencing depth", y = "Uncertainty\n(95% interval range)", + subtitle = "Non-linear decrease in uncertainty as sequencing depth increases") + +# probability of detecting a variant with at least 5 reads +# assuming different starting frequencies and total read depth +crossing(freq = c(0.01, 0.05, 0.1, 0.5, 0.9), depth = seq(10, 2000, 1)) |> + mutate(detect = 1 - pbinom(4, depth, freq)) |> + # filter(depth < 400) |> + ggplot(aes(depth, detect)) + + geom_line(aes(colour = factor(freq)), linewidth = 2) + + theme_bw() + + scale_colour_viridis_d() + + scale_x_continuous(breaks = seq(0, 2000, 200)) + + scale_y_continuous(breaks = seq(0, 1, 0.2)) + + labs(x = "Sequencing depth", y = "Probability of detection\n(minimum 5 reads)", + subtitle = "High sequencing depth is required to detect rare variants", + colour = "Starting\nfrequency") + + +#### Simulation #### + +# real frequencies +x <- letters[c(rep(1, 1), rep(2, 5), rep(3, 10), rep(4, 25), rep(5, 59))] + +# Each sample of size N each +set.seed(20240216) +depths <- c(250, 500, 1000, 2000, 4000) + +# function to estimate the variant frequencies from a vector +# freqs is a vector of variant frequencies to sample from +# depth is the total sequencing depth - i.e. sample size +sample_variants <- function(freqs, depth){ + if (length(freqs) < 1 | sum(freqs) != 1){ + stop("freqs has to be of length > 1 and add up to 1.") + } + # variants is a vector of same length as the frequencies provided + variants <- letters[1:length(freqs)] + freqs <- sample(variants, depth, replace = TRUE, prob = freqs) |> + table() |> prop.table() |> + as.data.frame() + names(freqs) <- c("variant", "frequency") + return(freqs) +} + +# simulate +nsim <- 10000 # number of simulations +freqs <- c(0.01, 0.05, 0.1, 0.25, 0.59) # starting frequencies +depths <- c(50, 250, 500, 1000, 2000) +sim <- lapply(1:nsim, function(i){ + out <- lapply(depths, function(d){ + out <- sample_variants(freqs, d) + out$sim <- i + out$depth <- d + return(out) + }) + out <- do.call(rbind, out) +}) +sim <- do.call(rbind, sim) +sim$variant <- as.character(sim$variant) + +sim |> + ggplot(aes(frequency)) + + geom_density(aes(colour = factor(variant))) + + geom_vline(xintercept = freqs) + + facet_grid(~ depth) + +sim |> + group_by(depth, variant) |> + summarise(detected = sum(frequency > 0)/1000, + lo = quantile(frequency, 0.025), + hi = quantile(frequency, 0.975)) |> + ggplot(aes(factor(depth), detected)) + + geom_col(aes(fill = factor(variant)), position = "dodge") + +sim |> + group_by(depth, variant) |> + summarise(detected = sum(frequency > 0)/1000, + lo = quantile(frequency, 0.025), + hi = quantile(frequency, 0.975)) |> + ggplot(aes(factor(depth))) + + geom_errorbar(aes(ymin = lo, ymax = hi, colour = factor(variant)), + position = "dodge") + + geom_hline(yintercept = freqs, colour = "grey") + + theme_classic() diff --git a/materials/_chapters.yml b/materials/_chapters.yml index 793a190..2ec2ce3 100644 --- a/materials/_chapters.yml +++ b/materials/_chapters.yml @@ -18,7 +18,7 @@ book: - materials/03-case_studies/03-eqa.md - part: "Wastewater Samples" chapters: - - materials/04-wastewater/01-wastewater_surveillance.md + - materials/04-wastewater/00-ww_temp.md - part: "Software" chapters: - materials/05-software/01-managing_software.md diff --git a/materials/images/workflow_ww_overview.svg b/materials/images/workflow_ww_overview.svg new file mode 100644 index 0000000..53b1983 --- /dev/null +++ b/materials/images/workflow_ww_overview.svg @@ -0,0 +1,1724 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + Reference Genome + + + Mapping + Frequency and depth + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + SNP in a primer site + + Sequencing error + + + + + + + + + SNP within anamplicon + + + Reference Genome + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + SNP frequency will be correctly estimated after primer trimming + + + + + + + + + Primer Trimming + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Grey shows the primer site used in PCR amplification + Sequencing Reads + 0.5 + 0.9 + + + + + + + + + + + + 0.5 + + + + + + + + + + + + Infer lineage abundance + + 6x + 10x + 6x + + + + + + + + + + + + + + + Lineage 1 ~ 50% + Lineage 2 ~ 40% + Lineage 3 ~ 10% + +