Instantaneous Metagenomic Taxonomic Profiling with MetaKSSD

MetaKSSD is the second version of KSSD (K-mer Substring Space Sampling/Shuffling Decomposition), designed for instantaneous metagenome taxonomic profiling using WGS fastq data.

K-mer Substring Space Decomposition (KSSD) facilitates highly efficient genome sketching and enables lossless sketch operations, including union, intersection, and subtraction [doi.org/10.1186/s13059-021-02303-4]. Building upon the KSSD framework, MetaKSSD introduces a novel feature that tracks k-mer counts within the sketch. Leveraging these foundational functionalities, MetaKSSD further innovates methods for constructing a taxonomic marker database (MarkerDB), metagenome taxonomic profiling, and profile searching.

The philosophy behind MetaKSSD: https://github.com/yhg926/MetaKSSD/wiki/Philosophy-behind-MetaKSSD

1. Installation

git clone https://github.com/yhg926/MetaKSSD.git &&
cd MetaKSSD &&
make
export PATH=$PATH:$(pwd)/bin

1.1 (Optional) Get pre-built MarkerDB (L3K11)

wget https://zenodo.org/records/14353354/files/markerdb.gtdb_r214v2_L3K11_sketch.tar.gz
tar xf markerdb.gtdb_r214v2_L3K11_sketch.tar.gz

This step is only needed when you do not have a MarkerDB. You can also prepare your own MarkerDB, see build custom MarkerDB.

1.2 (Optional) Prepare gtdbr214 to ncbi taxonomy convertion tables

gunzip -d data/best.gtdbr214_psid2ncbi_specid.tsv.gz;
gunzip -d data/scienficaname.ncbitaxid_rank_parentnode_name.gtdbr214_pseudoidrelated.tsv.gz

These files are only needed when you have to convert gtdb to ncbi taxonomy.

1.3 (Optional) Get pre-build Abundance Vector Database (L3K11)

wget https://zenodo.org/records/11437234/files/markerdb.abvdb231227.L3K11_gtdb_r214.tar.gz
tar xf markerdb.abvdb231227.L3K11_gtdb_r214.tar.gz

The Abundance Vector Database is only needed when you perform abundance vector searching. You can also prepare your own Abundance Vector Database, see Index abundance vector database.

2. Metagenome profiling

#sketching with k-mer counts tracking
metakssd dist -L shuf_files/L3K11.shuf -A -o <sample1_sketch> <sample1.fastq>
#generate raw profile
metakssd composite -r <markerdb> -q <sample1_sketch> > <species_coverage.tsv>
#abundance normalization
perl src/possion.kssd2out.pl <species_coverage.tsv> <minimum overlapped k-mer S (default:18)> > <species_relative_abundance_profile>

If need to covert species abundaces to full gtdb taxonomy profile, using

perl src/kssd2out2gtdb_taxonomy_profile.pl <species_relative_abundance_profile> data/gtdbr214_psid2krona_taxonomy.tsv

If need to covert species_coverage.tsv to CAMI format profile with NCBI taxonomy, using

#format convertion 
perl src/possion.kssdcomposite2taxonomy_profilefmt.pl <species_coverage.tsv> data/best.gtdbr214_psid2ncbi_specid.tsv data/scienficaname.ncbitaxid_rank_parentnode_name.gtdbr214_pseudoidrelated.tsv 18 > sample1.profile

If need to covert species_coverage.tsv to Krona format profile, using

perl src/kssdcomposite2gtdb_tax_kronafmt.pl <species_coverage.tsv> data/gtdbr214_psid2krona_taxonomy.tsv <outdir>

3. Abundance Vector Searching

3.1 Generate your abundance vector in a given path

metakssd composite -r <markerdb> -q <metagenome sketch> -b -o <path>

3.2 Abundance Vector Searching

To retrieve abundance vectors similar to an abundance vector "input.abv" from a markerdb with indexed abv:

metakssd composite -r <markerdb.abvdb> -s<0 or 1> <path/input.abv>

Here, the options -s0 and -s1 enable searching based on L1 norm and cosine similarity, respectively.

4. Index abundance vector database

Suppose you have many *.abv generated following section 3.1. Then the abundance vector database could be indexed as follow:

#make folder named abundance_Vec under your markerdb path
mkdir -p <markerdb path>/abundance_Vec
#collect all *.abv to the folder
cp *.abv <markerdb path>/abundance_Vec
#index 
metakssd composite -r <markerdb path> -i

5. build custom MarkerDB

# sketching reference genomes
metakssd dist -L <*.shuf> -o <L3K11_sketch> <all genomes Dir>
# print genome name
metakssd set -P <L3K11_sketch> > <genome_name.txt>

The genome names within the sketch were print line by line and direct the output to a file named "genome_name.txt" . Then a grouping file named "group_name.txt" were prepared as follow: The "group_name.txt" file should contain the same lines as "genome_name.txt", where each line specifies the species name of the genome corresponding to the line in "genome_name.txt". Each line follows the format: "IDspecies_name\n", where the ID is a non-negative integer uniquely labeling the species_name, and the species name of the genome can be found in the metadata files ".*_metadata_r214.tar.gz". For example, "1 Escherichia_coli\n" represents the species name "Escherichia coli" labeled with ID 1. The ID 0 is reserved for excluding genomes from the resulting sketch. Once "group_name.txt" is prepared, genomes can be grouped by their originating species using the following command:

metakssd set -g <group_name.txt> -o <L3K11_pan-sketch> <L3K11_sketch>

Here, "L3K11_pan-sketch" represents the consolidated ‘pangenome’ sketches for all species.

Subsequently, the union sketch of all species-specific marker, named "L3K11_union_sp-sketch", was obtained using this command:

metakssd set -q -o <L3K11_union_sp-sketch> <L3K11_pan-sketch>

Finally, the MarkerDB named “markerdb_L3K11” was generated by overlapping "L3K11_union_sp-sketch" with "L3K11_pan-sketch" using this command:

metakssd set -i <L3K11_union_sp-sketch> -o <markerdb_L3K11> <L3K11_pan-sketch>

Alternatively, we have provided an all-in-one workflow for MarKerDB construction based on GTDB r214 genomes.

sh build_MarkerDB.sh <all_gtdbr214_genomes_dir>

6. MetaKSSD benchmarking results

OPAL benchmarking results on five datasets are available:

7. Related papers

Yi, H. MetaKSSD: Boosting the Scalability of Reference Taxonomic Marker Database and the Performance of Metagenomic Profiling Using Sketch Operations. bioRxiv 2024.06.21.600011 (2024) doi:10.1101/2024.06.21.600011.

Name		Name	Last commit message	Last commit date
Latest commit History 287 Commits
bin		bin
data		data
scripts		scripts
shuf_files		shuf_files
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
COPYRIGHT.txt		COPYRIGHT.txt
LICENSE.txt		LICENSE.txt
Makefile		Makefile
README.md		README.md
build_MarkerDB.sh		build_MarkerDB.sh
gitlfs.log		gitlfs.log
run_profiling.sh		run_profiling.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Instantaneous Metagenomic Taxonomic Profiling with MetaKSSD

1. Installation

1.1 (Optional) Get pre-built MarkerDB (L3K11)

1.2 (Optional) Prepare gtdbr214 to ncbi taxonomy convertion tables

1.3 (Optional) Get pre-build Abundance Vector Database (L3K11)

2. Metagenome profiling

3. Abundance Vector Searching

3.1 Generate your abundance vector in a given path

3.2 Abundance Vector Searching

4. Index abundance vector database

5. build custom MarkerDB

6. MetaKSSD benchmarking results

7. Related papers

About

Releases 2

Packages

Languages

License

yhg926/MetaKSSD

Folders and files

Latest commit

History

Repository files navigation

Instantaneous Metagenomic Taxonomic Profiling with MetaKSSD

1. Installation

1.1 (Optional) Get pre-built MarkerDB (L3K11)

1.2 (Optional) Prepare gtdbr214 to ncbi taxonomy convertion tables

1.3 (Optional) Get pre-build Abundance Vector Database (L3K11)

2. Metagenome profiling

3. Abundance Vector Searching

3.1 Generate your abundance vector in a given path

3.2 Abundance Vector Searching

4. Index abundance vector database

5. build custom MarkerDB

6. MetaKSSD benchmarking results

7. Related papers

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages