
Interpreting Wochenende output


Wochenende output files

Our pipeline includes flexible quality control, trimming, mapping, read filtering and normalization. The overall flow of information through the Wochenende pipeline, including all modular steps and tools, is illustrated in the following figure:

The core Wochenende pipeline produces many output files:

  • .ndp. # Removal of non-unique reads with Perldup
  • .lc. # Removal of low-complexity sequences by Prinseq
  • .trm. # Trimmed with Trimmomatic
  • .trm.bam # Alignment: Initial, unsorted BAM. Can usually be deleted.
  • .unmapped.fastq # Unmapped reads extracted from the BAM file in FASTQ format; these can be analyzed further
  • .s. # Sorted BAM output file
  • .nosec. # No secondary alignments, i.e. secondary alignments from minimap2 have been removed from the BAM file
  • flagstat.txt # Information on number of QC-passed and -failed, duplicate and mapped reads
  • .ns. # Paired end (PE) files only, not sorted. Sorting follows the .fix step
  • .fix. # PE files only, fixed Paired End reads
  • .bam.txt # Samtools index stats - chromosome/species counts
  • .mm. # Mismatch filter applied as configured; keep only reads with fewer than x mismatches
  • .dup. # Samtools mark duplicates, duplicates excluded
  • .mq20./.mq30. # Mapping quality cutoff, retain only MQ20+/MQ30+ reads
  • .calmd. # Samtools calmd, calculate MD tags on a BAM file (to enable easily viewing SNPs in some genome browsers like JBrowse).

The core Wochenende pipeline is followed by optional normalization, integration and visualization steps (haybaler) as well as by three optional modules (raspir, growth rate and plots).

1. Normalization:

Sequencing reads are normalized to the microbial chromosome length and to the millions of sequencing reads in the experiment. The reporting tool reports the chromosome length, the GC content of the sequence, the raw count of reads attributed to the species, and various normalized read count parameters. The normalizations are:

  • 1 - reads normalized to the idealized length of a bacterial chromosome (normalization to 1 million reference base pairs)
  • 2 - total reads in the sequencing library (normalization to 1 million reads)
  • 3 - the above two normalizations combined (RPMM, so Reads Per Million reads per Million base pairs)
  • 4 - (bacterial) reads per human cell (only works for metagenomes from human hosts. An estimate of absolute abundance.)

Normalization case 1 - chromosome length normalization

As an example of case 1 (chromosome length normalization): we expect more reads to be sampled from bacteria with larger genomes, and bacterial genome sizes vary widely, from roughly 1 Mbp to 12 Mbp or more (average roughly 2.5 Mbp). If 600 reads are mapped to a genome 6 Mbp in size, we divide by 6, so the normalized count of reads per 1 million reference base pairs is 600 / 6 = 100.

Code for this function is here: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L210
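A minimal sketch of this calculation (illustrative function and variable names, not the pipeline's actual implementation; see the linked basic_reporting.py for the real code):

```python
# Sketch of chromosome-length normalization: reads per 1 million reference base pairs.
def normalize_per_million_ref_bp(read_count: int, chromosome_length_bp: int) -> float:
    return read_count / (chromosome_length_bp / 1_000_000)

# Example from the text: 600 reads mapped to a 6 Mbp genome.
print(normalize_per_million_ref_bp(600, 6_000_000))  # 100.0
```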

Normalization case 2 - library size normalization

For case 2, library size normalization: if we sequence more reads, then we expect to find more reads from each bacterium. Each of our samples will naturally have a differing number of total reads due to sample or sequencer output variability. Therefore we normalize to a situation where we expect just 1 million reads per sequencing library. If we have two samples, each with 1000 normalized reads of the bacterium E. coli, but library 1 has 4 million reads sequenced and library 2 has 8 million, then we need to correct for this. Normalized E. coli reads in library 1 are thus 1000 * (1m / 4m) = 250, whereas library 2 E. coli normalized reads are 1000 * (1m / 8m) = 125.

Code: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L215
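A minimal sketch of the library-size normalization (again with illustrative names, not the pipeline's actual code):

```python
# Sketch of library-size normalization: reads per 1 million sequenced reads.
def normalize_per_million_reads(read_count: int, total_reads_in_library: int) -> float:
    return read_count / (total_reads_in_library / 1_000_000)

# Example from the text: 1000 E. coli reads in a 4 million read library vs an 8 million read library.
print(normalize_per_million_reads(1000, 4_000_000))  # 250.0
print(normalize_per_million_reads(1000, 8_000_000))  # 125.0
```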

Normalization case 3 - RPMM normalization which combines chromosome length and library size normalization

RPMM (Reads Per Million reads per Million reference base pairs) combines the two normalizations above, chromosome length and library size. It is important to note that RPMM is a relative normalization, whereas the fourth normalization option, bacteria per human cell, is an estimate of absolute abundance.

RPMM = read_count / (chromosome length normalization factor) / (library size normalization factor)

Or more precisely: RPMM = read_count / (chromosome_length / 1000000) / (sum of read_counts in the library / 1000000)

Code: https://github.com/MHH-RCUG/nf_wochenende/blob/d23803a06366861e5a7a259cff12a89db8630e74/reporting/basic_reporting.py#L220
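Combining the two factors gives a minimal RPMM sketch (illustrative only, not the pipeline's actual code):

```python
# Sketch of RPMM: reads per million sequenced reads per million reference base pairs.
def rpmm(read_count: int, chromosome_length_bp: int, total_reads_in_library: int) -> float:
    return read_count / (chromosome_length_bp / 1_000_000) / (total_reads_in_library / 1_000_000)

# 600 reads on a 6 Mbp chromosome in a 4 million read library.
print(rpmm(600, 6_000_000, 4_000_000))  # 25.0
```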

Normalization case 4 - Bacteria per human cell

This normalization is described at length in the normalization section of our preprint at https://www.biorxiv.org/content/10.1101/2022.03.18.484377v3.full and in the published paper at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08985-9.

2. Integration:

Our integration module Haybaler compiles the various samples into comprehensive tables, one per normalization statistic. The haybaler output is ordered by read count: the organisms/chromosomes with the highest read count across all samples appear at the top of the haybaler output file. Because all samples are present in the haybaler outputs, you can easily compute your own graphs or perform analyses with R, Python, or your favourite graphing tool, for example as in the sketch below.
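As a sketch of such a downstream analysis in Python, assuming the haybaler table is a tab-separated file with taxa as rows and samples as columns (the file name below is hypothetical; check your own haybaler output for the exact name and delimiter):

```python
# Load a haybaler table and plot the most abundant taxa across all samples.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("RPMM_haybaler.csv", sep="\t", index_col=0)  # hypothetical file name

# Haybaler output is already ordered by read count, so the first rows
# are the organisms with the highest counts across all samples.
top = df.head(10)
top.plot(kind="bar", figsize=(10, 4))
plt.ylabel("RPMM")
plt.tight_layout()
plt.savefig("top_taxa_rpmm.png")
```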

3. Visualization:

Heatmaps and heat trees are produced in R from the calculated abundance information.

4. Plots:

Genome coverage plots. Wochenende_plot creates one subdirectory per input file. These contain PNG images of taxa which are probably present (high score, based largely on consistently even coverage and high mean coverage) or perhaps present (these need manual review). Confident attribution of reads to taxa depends strongly on the number of reads assigned to bacterial taxa (low in airway metagenomes, higher in, for example, stool samples).

5. Raspir:

Raspir is a mathematical approach based on Fourier transforms and spectral comparisons that discerns rare but present species from false-positive (closely related) species. It works on BAM files created by Wochenende and produces another, more conservative, estimate of which species are present in the metagenomic reads. All bacteria found in the raspir output are considered to be present in the metagenome, so you do not need to filter these data further; in other words, bacteria that are not considered present are simply not output. Example raspir output:

Species r_value p_value stError euclidean distribution
1_AP011540_1_Rothia_mucilaginosa_DY_18 0.9443 0 0 0.005 uniform
NZ_CP040006.1_Schaalia_odontolytica_strain_XH001 0.9314 0 1.00E-05 0.007 uniform
NZ_CP031251.1_Neisseria_subflava_strain_M18660 0.8985 0 1.00E-05 0.009 uniform
NZ_LR134375.1_Veillonella_dispar_strain_NCTC11831 0.8874 0 1.00E-05 0.015 uniform
1_CP002843_1_Streptococcus_parasanguinis_ATCC_15912 0.8913 0 1.00E-05 0.015 uniform
NZ_CP046313.1_Gemella_sanguinis_strain_FDAARGOS_742 0.9204 0 0 0.035 uniform
1_FQ312002_1_Haemophilus_parainfluenzae_T3T1 0.8484 0 0 0.03 uniform
1_FN568063_1_Streptococcus_mitis_B6 0.6834 0 1.00E-05 0.029 uniform
1_CP002925_1_Streptococcus_pseudopneumoniae_IS7493 0.6703 0 1.00E-05 0.028 uniform
1_FR720602_1_Streptococcus_oralis_Uo5 0.8151 0 1.00E-05 0.016 uniform
1_CP024698_1_Fusobacterium_periodonticum_strain_KCOM_1283 0.9936 0 0 0.028 uniform
1_CP020566_1_Veillonella_atypica_strain_OK5 0.8512 0 0 0.03 uniform
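As a small sketch, a raspir table like the one above can be loaded with pandas and the detected species listed directly, since no further filtering is needed (the file name is hypothetical; adjust to your actual raspir output):

```python
# Read a raspir results table and list the species it reports as present.
import pandas as pd

# sep=None with the python engine lets pandas sniff the delimiter (e.g. comma or tab).
raspir = pd.read_csv("sample_raspir_output.csv", sep=None, engine="python")

present_species = raspir["Species"].tolist()
print(f"{len(present_species)} species detected by raspir")
for name in present_species:
    print(name)
```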

6. Growth rate:

The tools in the subfolder growth_rate estimate the speed at which bacteria are growing. Possible values are no growth, slow, moderate and fast. This is based on the observation by Korem et al. 2015 that the ratio of read copy number at the bacterial genomic origin of replication (ori) to the read copy number at the terminus (ter) can be used to infer which species in a microbiome are growing. Growth rate is determined for bacteria to which sufficient numbers of reads are attributed during the alignment process. Results are provided in CSV format (see the example output and parsing sketch below).

Error codes explanation:
[] : no error
-1 : insufficient median coverage
-2 : not enough reads left after filtering
-3 : not enough bins left after filtering
-4 : growth rate way too high, probably corrupt fitting
-5 : fit error too large
Name Growth_class Growth_Rate No_Reads Initial_Bins Used_Bins Fit_Err Error_Codes
1_AP011540_1_Rothia_mucilaginosa_DY_18 moderate 1.42 895141 50 48 2.11 []
NZ_CP040006.1_Schaalia_odontolytica_strain_XH001 slow 1.14 727909 50 45 3.18 []
NZ_CP031251.1_Neisseria_subflava_strain_M18660 slow 1.12 655831 50 36 4.81 []
NZ_LR134375.1_Veillonella_dispar_strain_NCTC11831 slow 1.22 317693 50 37 5.86 []
1_CP002843_1_Streptococcus_parasanguinis_ATCC_15912 slow 1.11 313012 50 37 4.9 []
NZ_CP046313.1_Gemella_sanguinis_strain_FDAARGOS_742 no growth 1.07 131906 50 46 2.59 []
1_FQ312002_1_Haemophilus_parainfluenzae_T3T1 moderate 1.46 104090 50 44 2.06 []
1_FN568063_1_Streptococcus_mitis_B6_ moderate 1.93 73600 50 36 3.54 []
1_CP002925_1_Streptococcus_pseudopneumoniae_IS7493 moderate 1.39 38169 50 34 4.8 []
1_FR720602_1_Streptococcus_oralis_Uo5 failed 1.25 31721 50 24 7.54 [-3]
1_CP024698_1_Fusobacterium_periodonticum_strain_KCOM_1283 moderate 1.46 29803 50 48 2.46 []
1_CP020566_1_Veillonella_atypica_strain_OK5 failed 3.06 24904 50 25 9.34 [-5]
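As a sketch, a growth-rate table like the one above can be parsed with pandas and split into successful and failed estimates using the error codes (the file name is hypothetical; the column names follow the example output above):

```python
# Read a growth_rate results table and report failures with human-readable error codes.
import pandas as pd

ERROR_CODES = {
    -1: "insufficient median coverage",
    -2: "not enough reads left after filtering",
    -3: "not enough bins left after filtering",
    -4: "growth rate way too high, probably corrupt fitting",
    -5: "fit error too large",
}

gr = pd.read_csv("sample_growth_rates.csv")  # hypothetical file name

ok = gr[gr["Error_Codes"] == "[]"]
failed = gr[gr["Error_Codes"] != "[]"]

print(ok[["Name", "Growth_class", "Growth_Rate"]])
for _, row in failed.iterrows():
    codes = [int(c) for c in row["Error_Codes"].strip("[]").split(",") if c.strip()]
    reasons = ", ".join(ERROR_CODES.get(c, str(c)) for c in codes)
    print(f"{row['Name']}: failed ({reasons})")
```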