-
Notifications
You must be signed in to change notification settings - Fork 2
Interpreting Wochenende output
Our pipeline includes flexible quality control, trimming, mapping, read filtering and normalization. The overall flow of information through the Wochenende pipeline including all modular steps and tools is illustrated in the following figure:
The core Wochenende pipeline produces many output files:
-
.ndp.
# Removal of non-unique reads with Perldup -
.lc.
# Removal of low-complexity sequences by Prinseq -
.trm.
# Trimmed with Trimmomatic -
.trm.bam
# Alignment: Initial, unsorted BAM. Can usually be deleted. -
.unmapped.fastq
# get unmapped reads from BAM in fastq format, can be further analyzed -
.s.
# Sorted BAM output file -
.nosec.
# No secondary alignments, i.e secondary alignments from minimap2 have been removed from the bam file -
flagstat.txt
# Information on number of QC-passed and -failed, duplicate and mapped reads
-
.ns.
# Paired end (PE) files only, not sorted. Sorting follows the .fix step -
.fix.
# PE files only, fixed Paired End reads
-
.bam.txt
# Samtools index stats - chromosome/species counts -
.mm.
# Mismatch filter applied as configured, keep only reads with less than x mismatches -
.dup.
# Samtools mark duplicates, duplicates excluded -
.mq20.
/.mq30.
# Mapping quality cutoff, retain only MQ20+/MQ30+ reads -
.calmd.
# Samtools calmd, calculate MD tags on a BAM file (to enable easily viewing SNPs in some genome browsers like JBrowse).
The core Wochenende pipeline is followed by optional normalization, integration and visualization steps (haybaler) as well as by three optional modules (raspir, growth rate and plots).
Sequencing reads are normalized to the microbial length and million sequencing reads in the experiment.
The reporting
tool reports chromosome length
, GC
content of the sequence, raw counts
of reads attributed to the species and various normalized
read count parameters. Normalizations
are for:
- 1 - reads normalized to the idealized length of a bacterial chromosome (
normalization to 1 million reference base pairs
) - 2 - total reads in the sequencing library (
normalization to 1 million reads
) - 3 - the above two normalizations combined (
RPMM, so Reads Per Million reads per Million base pairs
) - 4 - (bacterial)
reads per human cell
(only works for metagenomes from human hosts. An estimate of absolute abundance.)
As an example for normalization for case 1
(chromosome length normalization), we expect more reads to be sampled from bacteria with larger genomes, and genome sizes can vary widely, from roughly 1 Mbp
to 12 Mbp
or more (average roughly 2.5 Mbp). So if there are 600 reads mapped to a genome of 6 Mbp (we need to divide by 6) in size, then the normalized reads per 1 million base pairs will be 600 / 6 = 100
.
Code for this function is here: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L210
For case 2
or library normalization, if we sequence more reads, than we expect to find more reads from each bacterium. Each of our samples will naturally have differing numbers of total reads due to sample or sequencer output variability. Therefore we normalize to a situation where we expect just 1m reads per sequencing library. If we have two samples, each with 1000 normalized reads of the bacterium E. coli, but library 1 has 4m reads sequenced and library 2 8m reads, then we need to correct for this. Normalized E. coli reads in library 1 are thus 1000 * (1m / 4m) = 250
, whereas library 2 E. coli normalized reads are then 1000 * (1m / 8m) = 125
.
Normalization case 3 - RPMM normalization which combines chromosome length and library size normalization
RPMM, so Reads Per Million reads per Million base pairs
is the combination by multiplication of the above two normalizations, so for chromosome length and library size. It is important to note that RPMM
is a relative normalization
, whereas the fourth normalization option, bacteria per human cell
, is an estimate of an absolute normalization
.
RPMM = read_count / (chromosome_length_normalization) * ( library_size_normalization)
Or more precisely:
RPMM = read_count / (chromosome_length / 1000000) * ( sum of read_counts / 1000000 )
RPMM, so Reads Per Million reads per Million base pairs
is literally only
This normalization is best described at length in our preprint at https://www.biorxiv.org/content/10.1101/2022.03.18.484377v3.full in the section normalization or paper at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08985-9.
Our integration module Haybaler
compiles the various samples into comprehensive tables per normalization statistic. The haybaler
output is ordered by read count: the organisms/chromosomes with the highest read count among all samples will be at the top of the haybaler output file. You can easily compute your own graphs or perform analyses using R, python, or your favourite graphing tool using the haybaler outputs, where all samples are present.
Heatmaps and heat trees are produced in R from the calculated abundance information.
Genome coverage plots. Wochenende_plot creates one subdirectory per input file. These contain png images of taxa which are probabably (high score, largely based on consistent evenness of coverage and high mean coverage) or perhaps present (need manual review). Confident attributions to taxa depends strongly on the number of reads assigned to bacterial taxa (low in airway metagenomes, higher in for example stool samples).
5. Raspir:
Mathematical approach based on Fourier transforms and spectral comparisons to discern rare but present species from false positives (closely related) species. It works on BAM files created by Wochenende and creates another (more conservative) estimation of which species are present in the metagenomic reads. All bacteria found in the raspir output are considered to be in the metagenome, so you don't need to further filter these data. In other words, bacteria that are not considered present are not output.
Species | r_value | p_value | stError | euclidean | distribution |
---|---|---|---|---|---|
1_AP011540_1_Rothia_mucilaginosa_DY_18 | 0.9443 | 0 | 0 | 0.005 | uniform |
NZ_CP040006.1_Schaalia_odontolytica_strain_XH001 | 0.9314 | 0 | 1.00E-05 | 0.007 | uniform |
NZ_CP031251.1_Neisseria_subflava_strain_M18660 | 0.8985 | 0 | 1.00E-05 | 0.009 | uniform |
NZ_LR134375.1_Veillonella_dispar_strain_NCTC11831 | 0.8874 | 0 | 1.00E-05 | 0.015 | uniform |
1_CP002843_1_Streptococcus_parasanguinis_ATCC_15912 | 0.8913 | 0 | 1.00E-05 | 0.015 | uniform |
NZ_CP046313.1_Gemella_sanguinis_strain_FDAARGOS_742 | 0.9204 | 0 | 0 | 0.035 | uniform |
1_FQ312002_1_Haemophilus_parainfluenzae_T3T1 | 0.8484 | 0 | 0 | 0.03 | uniform |
1_FN568063_1_Streptococcus_mitis_B6 | 0.6834 | 0 | 1.00E-05 | 0.029 | uniform |
1_CP002925_1_Streptococcus_pseudopneumoniae_IS7493 | 0.6703 | 0 | 1.00E-05 | 0.028 | uniform |
1_FR720602_1_Streptococcus_oralis_Uo5 | 0.8151 | 0 | 1.00E-05 | 0.016 | uniform |
1_CP024698_1_Fusobacterium_periodonticum_strain_KCOM_1283 | 0.9936 | 0 | 0 | 0.028 | uniform |
1_CP020566_1_Veillonella_atypica_strain_OK5 | 0.8512 | 0 | 0 | 0.03 | uniform |
The tools in the subfolder growth_rate estimate the speed at which bacteria are growing
. Possible values are no growth
, slow
, medium
and fast
. This is based on the observation by Korem et al 2015, namely that the ratio of read copy number at the bacterial genomic origin (ori) to the read copy number at the terminus (ter) can be used to infer growing species in a microbiome. Growth rate is determined for bacteria which are attributed sufficient numbers of reads during the alignment process. Results are obtained in a csv format.
Error codes explanation:
[] : no error
-1 : insufficient median coverage
-2 : not enough reads left after filtering
-3 : not enough bins left after filtering
-4 : growth rate way too high, probably corrupt fitting
-5 : fit error too large
Name | Growth_class | Growth_Rate | No_Reads | Initial_Bins | Used_Bins | Fit_Err | Error_Codes |
---|---|---|---|---|---|---|---|
1_AP011540_1_Rothia_mucilaginosa_DY_18 | moderate | 1.42 | 895141 | 50 | 48 | 2.11 | [] |
NZ_CP040006.1_Schaalia_odontolytica_strain_XH001 | slow | 1.14 | 727909 | 50 | 45 | 3.18 | [] |
NZ_CP031251.1_Neisseria_subflava_strain_M18660 | slow | 1.12 | 655831 | 50 | 36 | 4.81 | [] |
NZ_LR134375.1_Veillonella_dispar_strain_NCTC11831 | slow | 1.22 | 317693 | 50 | 37 | 5.86 | [] |
1_CP002843_1_Streptococcus_parasanguinis_ATCC_15912 | slow | 1.11 | 313012 | 50 | 37 | 4.9 | [] |
NZ_CP046313.1_Gemella_sanguinis_strain_FDAARGOS_742 | no growth | 1.07 | 131906 | 50 | 46 | 2.59 | [] |
1_FQ312002_1_Haemophilus_parainfluenzae_T3T1 | moderate | 1.46 | 104090 | 50 | 44 | 2.06 | [] |
1_FN568063_1_Streptococcus_mitis_B6_ | moderate | 1.93 | 73600 | 50 | 36 | 3.54 | [] |
1_CP002925_1_Streptococcus_pseudopneumoniae_IS7493 | moderate | 1.39 | 38169 | 50 | 34 | 4.8 | [] |
1_FR720602_1_Streptococcus_oralis_Uo5 | failed | 1.25 | 31721 | 50 | 24 | 7.54 | [-3] |
1_CP024698_1_Fusobacterium_periodonticum_strain_KCOM_1283 | moderate | 1.46 | 29803 | 50 | 48 | 2.46 | [] |
1_CP020566_1_Veillonella_atypica_strain_OK5 | failed | 3.06 | 24904 | 50 | 25 | 9.34 | [-5] |