An NGS data processing pipeline for species-agnostic and species-specific analysis of short paired-end (PE) bacterial reads.
The pipeline is organized into modules. Each module corresponds to a Snakemake script (snakefile) and a Python module object. Snakefiles define the data processing rules.
Module class objects store the inputs and expected outputs of these rules and handle file placement, module configuration, information transfer between modules, and other tasks that are not directly involved in running analyses on files.
Module | Type | Description | Tools |
---|---|---|---|
bact_core | agnostic | QC, host filtering, de novo assembly, taxonomic classification | fastp, kraken2, shovill, krakentools, krona |
bact_shell | agnostic | assembly QC, resistance profiling, plasmid reconstruction & typing | quast, rgi-card, amr++v2.0, resfinder, mob-suite |
bact_tip | specific | species-dependent sub-typing | hicap, meningotype, legsta, Kleborate, AgrVATE, spaTyper, Staphopia-sccmec, emmtyper, seqsero, sistr, lissero, PubMLST database API, Institut Pasteur MLST database API, Legionella pneumophila in silico Serogroup Prediction, ectyper, seroba |
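
The split between a snakefile and its module object can be pictured with a small sketch. This is only an illustration of the idea described above, not the pipeline's actual class: `ModuleSketch`, its fields, the `bact_core.smk` file name, and the output suffix are all hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class ModuleSketch:
    """Hypothetical stand-in for a module object (not the real ardetype class)."""

    name: str        # module name, e.g. "bact_core"
    snakefile: Path  # Snakemake script that holds the processing rules
    # Assumed naming scheme: one output file per sample and suffix.
    output_suffixes: List[str] = field(default_factory=list)

    def expected_outputs(self, sample_ids, out_dir):
        """List the files the rules are expected to produce for each sample."""
        out_dir = Path(out_dir)
        return [
            out_dir / f"{sample_id}{suffix}"
            for sample_id in sample_ids
            for suffix in self.output_suffixes
        ]

    def missing_outputs(self, sample_ids, out_dir):
        """Return the expected files that do not exist on disk."""
        return [p for p in self.expected_outputs(sample_ids, out_dir) if not p.exists()]
```

Keeping this bookkeeping outside the snakefile means the rules only describe how files are processed, while the Python object decides which files should exist before and after a run.
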
The pipeline is designed to be run by NMRL users on the RTU HPC cluster, where HPC-level configuration is available out of the box.
Configuration level | Dependencies |
---|---|
HPC | Torque/PBS, Conda, Singularity |
Conda | Snakemake (should be installed for each user by passing the -s flag to the ardetype.py script) |
Python | numpy==1.22.3, pandas==1.4.2, PyYAML==6.0, requests==2.27.1, bs4==0.0.1 |
Kraken2 | Pre-built or custom databases for human and bacteria |
Resfinder4 | Database |
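
The pinned Python packages in the table above can be checked against the active environment before a run. The following is a minimal sketch using the standard-library `importlib.metadata`; the function name `version_mismatches` and its `lookup` parameter are illustrative and not part of the pipeline.

```python
from importlib import metadata

# Versions copied from the "Python" row of the dependency table.
PINNED = {
    "numpy": "1.22.3",
    "pandas": "1.4.2",
    "PyYAML": "6.0",
    "requests": "2.27.1",
    "bs4": "0.0.1",
}


def version_mismatches(pinned=PINNED, lookup=metadata.version):
    """Return {package: installed_version_or_None} for every unmet pin."""
    problems = {}
    for package, wanted in pinned.items():
        try:
            found = lookup(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[package] = found
    return problems
```

Calling `version_mismatches()` in the environment used for the pipeline returns an empty dictionary when every pin is satisfied.
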
To install from scratch, you need a Linux system with root access and Singularity installed in order to build the containers. WSL or a virtual machine should also work.
Clone the repository to your local machine and use the Singularity recipe files to build the containers,
then copy them to the HPC cluster so that the pipeline scripts can access them.
Clone the repository to the cluster and edit the files in the config_files folder to match your local setup:
File | Scope |
---|---|
module_data.json | paths to cluster_config file and snakefiles |
config_modular.yaml | paths to singularity image files, kraken2 databases, resfinder database, path to Legionella pneumophila in silico Serogroup Prediction tool |
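
After editing these files, it can help to confirm that the paths they contain actually exist on the cluster. The sketch below walks an already-loaded configuration (for example, the result of `yaml.safe_load` or `json.load`) and reports absolute paths that are missing. It makes no assumption about the real key names in `config_modular.yaml` or `module_data.json`; `find_missing_paths` is a hypothetical helper, and treating every string that starts with `/` as a path is a simplification.

```python
from pathlib import Path


def find_missing_paths(config, prefix=""):
    """Return (key, value) pairs for absolute-path strings that do not exist.

    `config` is a nested structure of dicts and lists, as produced by a
    YAML or JSON loader. Keys are joined with dots to show where each
    missing path was found.
    """
    if isinstance(config, dict):
        items = config.items()
    elif isinstance(config, list):
        items = enumerate(config)
    else:
        items = []

    missing = []
    for key, value in items:
        label = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            missing.extend(find_missing_paths(value, label))
        elif isinstance(value, str) and value.startswith("/") and not Path(value).exists():
            missing.append((label, value))
    return missing
```
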
-
Note: the pipeline accepts only fastq files that are named according to Illumina conventions (sample_id_R{1,2}_001.fastq.gz).
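
Input folders can be screened against this naming convention before submitting jobs. The following is a minimal sketch: the regular expression is one reading of the pattern in the note, and `pair_fastq_names` is an illustrative helper, not a function of the pipeline. How the pipeline itself derives the sample ID is not shown here.

```python
import re
from collections import defaultdict

# One interpretation of sample_id_R{1,2}_001.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample_id>.+)_R(?P<read>[12])_001\.fastq\.gz$")


def pair_fastq_names(file_names):
    """Split file names into complete pairs, unpaired samples, and rejects."""
    reads = defaultdict(set)
    rejected = []
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match:
            reads[match["sample_id"]].add(match["read"])
        else:
            rejected.append(name)  # does not follow the convention
    complete = sorted(s for s, r in reads.items() if r == {"1", "2"})
    incomplete = sorted(s for s, r in reads.items() if r != {"1", "2"})
    return complete, incomplete, rejected
```

Passing the file names found in the input folder (for example, via `os.listdir`) gives the sample IDs with both reads present, the sample IDs missing a mate, and the files that would need renaming.
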
-
Testing (to see which jobs will be executed):
python ardetype.py -t -i path_to_folder_with_fastq/ -o path_to_output_folder -m all
-
Running:
python ardetype.py -i path_to_folder_with_fastq/ -o path_to_output_folder -m all