
Interpreting Wochenende output


Wochenende output files

Our pipeline includes flexible quality control, trimming, mapping, read filtering and normalization. The overall flow of information through the Wochenende pipeline, including all modular steps and tools, is illustrated in the following figure:

The core Wochenende pipeline produces many output files:

  • .ndp. # Removal of non-unique reads with Perldup
  • .lc. # Removal of low-complexity sequences by Prinseq
  • .trm. # Trimmed with Trimmomatic
  • .trm.bam # Alignment: Initial, unsorted BAM. Can usually be deleted.
  • .unmapped.fastq # Unmapped reads extracted from the BAM file in FASTQ format; these can be analyzed further
  • .s. # Sorted BAM output file
  • .nosec. # No secondary alignments, i.e. secondary alignments from minimap2 have been removed from the BAM file
  • flagstat.txt # Information on number of QC-passed and -failed, duplicate and mapped reads
  • .ns. # Paired end (PE) files only, not sorted. Sorting follows the .fix step
  • .fix. # PE files only, fixed Paired End reads
  • .bam.txt # Samtools index stats - chromosome/species counts
  • .mm. # Mismatch filter applied as configured; keep only reads with fewer than x mismatches
  • .dup. # Samtools mark duplicates, duplicates excluded
  • .mq20./.mq30. # Mapping quality cutoff, retain only MQ20+/MQ30+ reads
  • .calmd. # Samtools calmd, calculate MD tags on a BAM file (to enable easily viewing SNPs in some genome browsers like JBrowse).

The core Wochenende pipeline is followed by optional normalization, integration and visualization steps (haybaler) as well as by three optional modules (raspir, growth rate and plots).

1. Normalization:

Sequencing reads are normalized to the microbial chromosome length and to the millions of sequencing reads in the experiment. The reporting tool reports the chromosome length, the GC content of the sequence, the raw count of reads attributed to the species, and various normalized read count parameters. The normalizations are:

  • 1 - reads normalized to the idealized length of a bacterial chromosome (normalization to 1 million reference base pairs)
  • 2 - total reads in the sequencing library (normalization to 1 million reads)
  • 3 - the above two normalizations combined (RPMM, so Reads Per Million reads per Million base pairs)
  • 4 - (bacterial) reads per human cell (only works for metagenomes from human hosts. An estimate of absolute abundance.)

Normalization case 1 - chromosome length normalization

As an example of case 1 (chromosome length normalization): we expect more reads to be sampled from bacteria with larger genomes, and bacterial genome sizes vary widely, from roughly 1 Mbp to 12 Mbp or more (average roughly 2.5 Mbp). If 600 reads are mapped to a genome 6 Mbp in size, we divide by 6, so the normalized count of reads per 1 million reference base pairs is 600 / 6 = 100.

Code for this function is here: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L210
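A minimal sketch of this calculation (illustrative function and variable names, not the pipeline's actual implementation; see the linked basic_reporting.py for the real code):

```python
# Sketch of chromosome-length normalization: reads per 1 million reference base pairs.
def normalize_per_million_ref_bp(read_count: int, chromosome_length_bp: int) -> float:
    return read_count / (chromosome_length_bp / 1_000_000)

# Example from the text: 600 reads mapped to a 6 Mbp genome.
print(normalize_per_million_ref_bp(600, 6_000_000))  # 100.0
```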

Normalization case 2 - library size normalization

For case 2, library size normalization: if we sequence more reads, then we expect to find more reads from each bacterium. Each of our samples will naturally have a differing number of total reads due to sample or sequencer output variability. Therefore we normalize to a situation where we expect just 1 million reads per sequencing library. If we have two samples, each with 1000 normalized reads of the bacterium E. coli, but library 1 has 4 million reads sequenced and library 2 has 8 million, then we need to correct for this. Normalized E. coli reads in library 1 are thus 1000 * (1m / 4m) = 250, whereas library 2 E. coli normalized reads are 1000 * (1m / 8m) = 125.

Code: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L215
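A minimal sketch of the library-size normalization (again with illustrative names, not the pipeline's actual code):

```python
# Sketch of library-size normalization: reads per 1 million sequenced reads.
def normalize_per_million_reads(read_count: int, total_reads_in_library: int) -> float:
    return read_count / (total_reads_in_library / 1_000_000)

# Example from the text: 1000 E. coli reads in a 4 million read library vs an 8 million read library.
print(normalize_per_million_reads(1000, 4_000_000))  # 250.0
print(normalize_per_million_reads(1000, 8_000_000))  # 125.0
```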

Normalization case 3 - RPMM normalization which combines chromosome length and library size normalization

RPMM (Reads Per Million reads per Million reference base pairs) combines the two normalizations above, chromosome length and library size. It is important to note that RPMM is a relative normalization, whereas the fourth normalization option, bacteria per human cell, is an estimate of absolute abundance.

RPMM = read_count / (chromosome length normalization factor) / (library size normalization factor)

Or more precisely: RPMM = read_count / (chromosome_length / 1000000) / (sum of read_counts in the library / 1000000)

Code: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L220
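Combining the two factors gives a minimal RPMM sketch (illustrative only, not the pipeline's actual code):

```python
# Sketch of RPMM: reads per million sequenced reads per million reference base pairs.
def rpmm(read_count: int, chromosome_length_bp: int, total_reads_in_library: int) -> float:
    return read_count / (chromosome_length_bp / 1_000_000) / (total_reads_in_library / 1_000_000)

# 600 reads on a 6 Mbp chromosome in a 4 million read library.
print(rpmm(600, 6_000_000, 4_000_000))  # 25.0
```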

Normalization case 4 - Bacteria per human cell

This normalization is described at length in the normalization section of our preprint at https://www.biorxiv.org/content/10.1101/2022.03.18.484377v3.full and in the published paper at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08985-9.

2. Integration:

Our integration module Haybaler compiles the various samples into comprehensive tables, one per normalization statistic. The haybaler output is ordered by read count: the organisms/chromosomes with the highest read count across all samples appear at the top of the haybaler output file. Because all samples are present in the haybaler outputs, you can easily compute your own graphs or perform analyses with R, Python, or your favourite graphing tool, for example as in the sketch below.
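As a sketch of such a downstream analysis in Python, assuming the haybaler table is a tab-separated file with taxa as rows and samples as columns (the file name below is hypothetical; check your own haybaler output for the exact name and delimiter):

```python
# Load a haybaler table and plot the most abundant taxa across all samples.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("RPMM_haybaler.csv", sep="\t", index_col=0)  # hypothetical file name

# Haybaler output is already ordered by read count, so the first rows
# are the organisms with the highest counts across all samples.
top = df.head(10)
top.plot(kind="bar", figsize=(10, 4))
plt.ylabel("RPMM")
plt.tight_layout()
plt.savefig("top_taxa_rpmm.png")
```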

3. Visualization:

Heatmaps and heat trees are produced in R from the calculated abundance information.

4. Plots:

Genome coverage plots. Wochenende_plot creates one subdirectory per input file. These contain PNG images of taxa which are probably present (high score, based largely on consistently even coverage and high mean coverage) or perhaps present (these need manual review). Confident attribution of reads to taxa depends strongly on the number of reads assigned to bacterial taxa (low in airway metagenomes, higher in, for example, stool samples).

5. Raspir:

Raspir is a mathematical approach based on Fourier transforms and spectral comparisons that discerns rare but present species from false-positive (closely related) species. It works on BAM files created by Wochenende and produces another, more conservative, estimate of which species are present in the metagenomic reads. All bacteria found in the raspir output are considered to be present in the metagenome, so you do not need to filter these data further; in other words, bacteria that are not considered present are simply not output. Example raspir output:

Species r_value p_value stError euclidean distribution
1_AP011540_1_Rothia_mucilaginosa_DY_18 0.9443 0 0 0.005 uniform
NZ_CP040006.1_Schaalia_odontolytica_strain_XH001 0.9314 0 1.00E-05 0.007 uniform
NZ_CP031251.1_Neisseria_subflava_strain_M18660 0.8985 0 1.00E-05 0.009 uniform
NZ_LR134375.1_Veillonella_dispar_strain_NCTC11831 0.8874 0 1.00E-05 0.015 uniform
1_CP002843_1_Streptococcus_parasanguinis_ATCC_15912 0.8913 0 1.00E-05 0.015 uniform
NZ_CP046313.1_Gemella_sanguinis_strain_FDAARGOS_742 0.9204 0 0 0.035 uniform
1_FQ312002_1_Haemophilus_parainfluenzae_T3T1 0.8484 0 0 0.03 uniform
1_FN568063_1_Streptococcus_mitis_B6 0.6834 0 1.00E-05 0.029 uniform
1_CP002925_1_Streptococcus_pseudopneumoniae_IS7493 0.6703 0 1.00E-05 0.028 uniform
1_FR720602_1_Streptococcus_oralis_Uo5 0.8151 0 1.00E-05 0.016 uniform
1_CP024698_1_Fusobacterium_periodonticum_strain_KCOM_1283 0.9936 0 0 0.028 uniform
1_CP020566_1_Veillonella_atypica_strain_OK5 0.8512 0 0 0.03 uniform
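As a small sketch, a raspir table like the one above can be loaded with pandas and the detected species listed directly, since no further filtering is needed (the file name is hypothetical; adjust to your actual raspir output):

```python
# Read a raspir results table and list the species it reports as present.
import pandas as pd

# sep=None with the python engine lets pandas sniff the delimiter (e.g. comma or tab).
raspir = pd.read_csv("sample_raspir_output.csv", sep=None, engine="python")

present_species = raspir["Species"].tolist()
print(f"{len(present_species)} species detected by raspir")
for name in present_species:
    print(name)
```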

6. Growth rate:

The tools in the subfolder growth_rate estimate the speed at which bacteria are growing. Possible values are no growth, slow, moderate and fast. This is based on the observation by Korem et al. 2015 that the ratio of read copy number at the bacterial genomic origin of replication (ori) to the read copy number at the terminus (ter) can be used to infer which species in a microbiome are growing. Growth rate is determined for bacteria to which sufficient numbers of reads are attributed during the alignment process. Results are provided in CSV format (see the example output and parsing sketch below).

Error codes explanation:
[] : no error
-1 : insufficient median coverage
-2 : not enough reads left after filtering
-3 : not enough bins left after filtering
-4 : growth rate way too high, probably corrupt fitting
-5 : fit error too large
Name Growth_class Growth_Rate No_Reads Initial_Bins Used_Bins Fit_Err Error_Codes
1_AP011540_1_Rothia_mucilaginosa_DY_18 moderate 1.42 895141 50 48 2.11 []
NZ_CP040006.1_Schaalia_odontolytica_strain_XH001 slow 1.14 727909 50 45 3.18 []
NZ_CP031251.1_Neisseria_subflava_strain_M18660 slow 1.12 655831 50 36 4.81 []
NZ_LR134375.1_Veillonella_dispar_strain_NCTC11831 slow 1.22 317693 50 37 5.86 []
1_CP002843_1_Streptococcus_parasanguinis_ATCC_15912 slow 1.11 313012 50 37 4.9 []
NZ_CP046313.1_Gemella_sanguinis_strain_FDAARGOS_742 no growth 1.07 131906 50 46 2.59 []
1_FQ312002_1_Haemophilus_parainfluenzae_T3T1 moderate 1.46 104090 50 44 2.06 []
1_FN568063_1_Streptococcus_mitis_B6_ moderate 1.93 73600 50 36 3.54 []
1_CP002925_1_Streptococcus_pseudopneumoniae_IS7493 moderate 1.39 38169 50 34 4.8 []
1_FR720602_1_Streptococcus_oralis_Uo5 failed 1.25 31721 50 24 7.54 [-3]
1_CP024698_1_Fusobacterium_periodonticum_strain_KCOM_1283 moderate 1.46 29803 50 48 2.46 []
1_CP020566_1_Veillonella_atypica_strain_OK5 failed 3.06 24904 50 25 9.34 [-5]
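As a sketch, a growth-rate table like the one above can be parsed with pandas and split into successful and failed estimates using the error codes (the file name is hypothetical; the column names follow the example output above):

```python
# Read a growth_rate results table and report failures with human-readable error codes.
import pandas as pd

ERROR_CODES = {
    -1: "insufficient median coverage",
    -2: "not enough reads left after filtering",
    -3: "not enough bins left after filtering",
    -4: "growth rate way too high, probably corrupt fitting",
    -5: "fit error too large",
}

gr = pd.read_csv("sample_growth_rates.csv")  # hypothetical file name

ok = gr[gr["Error_Codes"] == "[]"]
failed = gr[gr["Error_Codes"] != "[]"]

print(ok[["Name", "Growth_class", "Growth_Rate"]])
for _, row in failed.iterrows():
    codes = [int(c) for c in row["Error_Codes"].strip("[]").split(",") if c.strip()]
    reasons = ", ".join(ERROR_CODES.get(c, str(c)) for c in codes)
    print(f"{row['Name']}: failed ({reasons})")
```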