Releases: phac-nml/ecoli_serotyping
E.coli serotyping with QC module and adaptive thresholding
Major improvements:
- Incorporation of a Quality Control module that makes results easier to interpret and flags any need for corrective measures (re-sequencing, wet-lab serotyping). Unique thresholding at the allele level determines whether a given allele and the query quality parameters (`%identity` and `%coverage`) are sufficient to resolve an antigen call unambiguously.
- Cluster-friendly behaviour supporting multiple instances via a `.lock` file that prevents race conditions and simultaneous database updates by several instances
- An updated database of alleles with the removal of duplicated or truncated alleles (e.g. O157 antigen)
- Improved species identification resolution for highly similar non-E. coli species such as Shigella and E. albertii. Species identification is now done solely via the MASH NCBI RefSeq sketch (https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh)
- Users can add new alleles to an existing allele database and make serotype predictions from a custom allele database via the `--dbpath` parameter
- Improved O and H antigen call rates and accuracy thanks to decoupling of the `%identity` and `%coverage` thresholds for each antigen. Global thresholds can now be specified separately. This is especially important if one of the antigen genes (e.g. `wzx`/`wzy` or fliC) is truncated or has low coverage
- Improved adaptive O antigen calling when only a single O antigen candidate is available in the preliminary BLAST results, allowing an accurate O antigen call even in poorly sequenced samples with minimal coverage.
- Addition of mixed O antigen calls for highly similar O antigens (e.g. O17/O77)
- Allele names/keys used to make antigen calls are also reported, which makes it easier to troubleshoot dubious alleles and to clean the allele database
- More detailed error messages and support for 16 high similarity O-antigens (%identity > 99%) based on the reference publication PMID: 25428893
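
The decoupled `%identity` and `%coverage` thresholds mentioned in the list above can be illustrated with a minimal sketch. The names `Thresholds`, `GLOBAL_THRESHOLDS` and `hit_passes`, as well as the numeric cut-offs, are assumptions made for illustration; they are not taken from the tool's code or its defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_identity: float  # minimum percent identity of the BLAST hit
    min_coverage: float  # minimum percent of the allele length covered


# Hypothetical per-antigen cut-offs (illustrative values only).
# Keeping O and H separate lets one antigen tolerate truncated alleles
# without loosening the requirements for the other.
GLOBAL_THRESHOLDS = {
    "O": Thresholds(min_identity=90.0, min_coverage=90.0),
    "H": Thresholds(min_identity=95.0, min_coverage=50.0),
}


def hit_passes(antigen, identity, coverage, thresholds=GLOBAL_THRESHOLDS):
    """Return True if a hit meets both cut-offs for its antigen type."""
    t = thresholds[antigen]
    return identity >= t.min_identity and coverage >= t.min_coverage


# The same hit (96% identity, 60% coverage) is accepted as an H antigen
# candidate but rejected as an O antigen candidate.
print(hit_passes("H", identity=96.0, coverage=60.0))
print(hit_passes("O", identity=96.0, coverage=60.0))
```

With a single shared pair of thresholds, accommodating a truncated gene for one antigen would also relax the requirement for the other; separate entries avoid that coupling.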
Minor bug fixes in species identification and increased robustness of the --verify switch
Merge pull request #78 from kbessonov1984/master Version 0.9.1 addressing minor issues on species identification and fasta files handling
E.coli serotyping with ability to differentiate between Shigella and other Escherichia cryptic species
- improved O-antigen serotyping coverage of complex samples that lack some O-antigen signatures
- better handling of complex cases and error recovery in cases of poor reference allele coverage
- improved O-antigen identification precision favoring the presence of both alleles (e.g. `wzx` and `wzy`) to support the final call. The sum of scores for both alleles of the same antigen is now used in ranking
- automatic download and update of RefSeq genome sketches every 6 months
- addition of Quality Control flags in the output (as an extra column in the results.tsv) for ease of results interpretation
- improved species identification for the FASTQ files. All raw reads are used for species identification
- query length coverage default threshold lowered from 50% to 10% to account for truncated alleles. This greatly improved the sensitivity of the tool without significantly changing specificity
- wrote additional unit tests to cover all aspects of the program
- file lock application when updating RefSeq sketch and assembly stats files
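
The file-lock behaviour in the last item can be sketched roughly as follows. This is a generic lock-file pattern based on atomic file creation, not the tool's actual implementation; the function name `file_lock`, the timeout values and the `refseq.lock` file name are hypothetical.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, timeout=10.0, poll_interval=0.1):
    """Hold an exclusive lock represented by the existence of lock_path."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails atomically if the file already exists,
            # so only one process can create the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {lock_path}")
            time.sleep(poll_interval)
    try:
        yield
    finally:
        # Release the lock so that waiting instances can proceed.
        os.close(fd)
        os.remove(lock_path)


if __name__ == "__main__":
    # Hypothetical usage: guard a shared-file update with the lock.
    with file_lock("refseq.lock"):
        print("updating sketch and assembly stats files")
```

Instances that find the lock file already present wait and retry until the timeout expires, so only one of them rewrites the shared files at a time.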