diff --git a/download/index.html b/download/index.html
index c7a3058..5dd8e98 100644
--- a/download/index.html
+++ b/download/index.html
@@ -602,8 +602,8 @@
Current Version
TaxonKit v0.15.0
taxonkit reformat
:
-- For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. #82
- - The flag
-T/--trim
also does not add the prefix for missing ranks lower than the current rank. #82
+- For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. #82
+- The flag
-T/--trim
also does not add the prefix for missing ranks lower than the current rank. #82
- New flag
-s/--miss-rank-repl-suffix
to set the suffix for estimated taxon names. #85
diff --git a/search/search_index.json b/search/search_index.json
index 5c914ad..30f8cb9 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit Documents: https://bioinf.shenwei.me/taxonkit ( Usage&Examples , Tutorial , \u4e2d\u6587\u4ecb\u7ecd ) Source code: https://github.com/shenwei356/taxonkit Latest version: Please cite : https://doi.org/10.1016/j.jgg.2021.03.006 pytaxonkit , Python bindings for TaxonKit. Related projects: Taxid-Changelog : Tracking all changes of TaxIds, including deletion, new adding, merge, reuse, and rank/name changes. GTDB taxdump : GTDB taxonomy taxdump files with trackable TaxIds. ICTV taxdump : NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV) Table of Contents Features Subcommands Benchmark Dataset Installation Command-line completion Citation Contact License Features Easy to install ( download ) Statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64) Light weight and out-of-the-box, no dependencies, no compilation, no configuration No database building, just download NCBI taxonomy data and uncompress to $HOME/.taxonkit Easy to use ( usages and examples ) Supporting bash-completion Fast (see benchmark ), multiple-CPUs supported, most operations cost 2-10s. Detailed usages and examples Supporting STDIN and (gzipped) input/output file, easily integrated in pipe Versatile commands Usage and examples Featured command: tracking monthly changelog of all TaxIds Featured command: reformating lineage into format of seven-level (\"superkingdom/kingdom, phylum, class, order, family, genus, species\" Featured command: filtering taxiDs by a rank range , e.g., at or below genus rank. 
Featured command: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Subcommands Subcommand Function list List taxonomic subtrees (TaxIds) bellow given TaxIds lineage Query taxonomic lineage of given TaxIds reformat Reformat lineage in canonical ranks name2taxid Convert scientific names to TaxIds filter Filter TaxIds by taxonomic rank range lca Compute lowest common ancestor (LCA) for TaxIds taxid-changelog Create TaxId changelog from dump archives profile2cami * Convert metagenomic profile table to CAMI format cami-filter * Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump * Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Note: * New commands since the publication. Benchmark Getting complete lineage for given TaxIds Versions: ETE=3.1.2, taxopy=0.5.0 ( faster since 0.6.0 ), TaxonKit=0.7.2. Dataset Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones. Installation Go to Download Page for more download options and changelogs. TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page. Method 1: Download binaries (latest stable/dev version) Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. 
And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For Windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 . Method 2: Install via conda (latest stable version) conda install -c bioconda taxonkit Method 3: Install via homebrew (out of date) brew install brewsci/bio/taxonkit Method 4: Compile from source (latest stable/dev version) Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/ Bash-completion Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Citation If you use TaxonKit in your work, please cite: Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006 Contact Create an issue to report bugs, propose new functions or ask for help. License MIT License Starchart","title":"Home"},{"location":"#taxonkit-a-practical-and-efficient-ncbi-taxonomy-toolkit","text":"Documents: https://bioinf.shenwei.me/taxonkit ( Usage&Examples , Tutorial , \u4e2d\u6587\u4ecb\u7ecd ) Source code: https://github.com/shenwei356/taxonkit Latest version: Please cite : https://doi.org/10.1016/j.jgg.2021.03.006 pytaxonkit , Python bindings for TaxonKit. Related projects: Taxid-Changelog : Tracking all changes of TaxIds, including deletion, new adding, merge, reuse, and rank/name changes. GTDB taxdump : GTDB taxonomy taxdump files with trackable TaxIds. 
ICTV taxdump : NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV)","title":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit"},{"location":"#table-of-contents","text":"Features Subcommands Benchmark Dataset Installation Command-line completion Citation Contact License","title":"Table of Contents"},{"location":"#features","text":"Easy to install ( download ) Statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64) Light weight and out-of-the-box, no dependencies, no compilation, no configuration No database building, just download NCBI taxonomy data and uncompress to $HOME/.taxonkit Easy to use ( usages and examples ) Supporting bash-completion Fast (see benchmark ), multiple-CPUs supported, most operations cost 2-10s. Detailed usages and examples Supporting STDIN and (gzipped) input/output file, easily integrated in pipe Versatile commands Usage and examples Featured command: tracking monthly changelog of all TaxIds Featured command: reformating lineage into format of seven-level (\"superkingdom/kingdom, phylum, class, order, family, genus, species\" Featured command: filtering taxiDs by a rank range , e.g., at or below genus rank. 
Featured command: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV","title":"Features"},{"location":"#subcommands","text":"Subcommand Function list List taxonomic subtrees (TaxIds) bellow given TaxIds lineage Query taxonomic lineage of given TaxIds reformat Reformat lineage in canonical ranks name2taxid Convert scientific names to TaxIds filter Filter TaxIds by taxonomic rank range lca Compute lowest common ancestor (LCA) for TaxIds taxid-changelog Create TaxId changelog from dump archives profile2cami * Convert metagenomic profile table to CAMI format cami-filter * Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump * Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Note: * New commands since the publication.","title":"Subcommands"},{"location":"#benchmark","text":"Getting complete lineage for given TaxIds Versions: ETE=3.1.2, taxopy=0.5.0 ( faster since 0.6.0 ), TaxonKit=0.7.2.","title":"Benchmark"},{"location":"#dataset","text":"Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones.","title":"Dataset"},{"location":"#installation","text":"Go to Download Page for more download options and changelogs. 
TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.","title":"Installation"},{"location":"#method-1-download-binaries-latest-stabledev-version","text":"Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For Windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 .","title":"Method 1: Download binaries (latest stable/dev version)"},{"location":"#method-2-install-via-conda-latest-stable-version","text":"conda install -c bioconda taxonkit","title":"Method 2: Install via conda (latest stable version)"},{"location":"#method-3-install-via-homebrew-out-of-date","text":"brew install brewsci/bio/taxonkit","title":"Method 3: Install via homebrew (out of date)"},{"location":"#method-4-compile-from-source-latest-stabledev-version","text":"Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/","title":"Method 4: Compile from source (latest stable/dev 
version)"},{"location":"#bash-completion","text":"Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish","title":"Bash-completion"},{"location":"#citation","text":"If you use TaxonKit in your work, please cite: Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006","title":"Citation"},{"location":"#contact","text":"Create an issue to report bugs, propose new functions or ask for help.","title":"Contact"},{"location":"#license","text":"MIT License","title":"License"},{"location":"#starchart","text":"","title":"Starchart"},{"location":"bioinf/","text":"","title":"Bioinf"},{"location":"chinese-dev/","text":"\u73b0\u6709\u5de5\u5177\u6bd4\u8f83 \u60f3\u8981\u4eceNCBI\u83b7\u53d6\u751f\u7269\u7684\u8c31\u7cfb\u4fe1\u606f\uff0c\u53ef\u4ee5\u5728 NCBI Taxonomy\u7f51\u7ad9\u4e0a\u7528TaxID\u6216\u8005\u540d\u79f0\u67e5\u8be2\u3002 \u6bd4\u5982\u53ef\u4ee5\u7528 Homo sapiens \u6216 9606 \u641c\u7d22\u201c\u4eba\u201d\u7684\u5206\u7c7b\u5b66\u4fe1\u606f\uff0c\u4ee5\u53ca\u5bc6\u7801\u5b50\u8868\uff0cEntrez\u8bb0\u5f55\u7edf\u8ba1\u7b49\u3002 \u540c\u65f6\u4e5f\u53ef\u4ee5\u901a\u8fc7NCBI\u7684\u5b98\u65b9\u5de5\u5177\u5305 E-utilities ( ftp )\u3002 $ esearch -db taxonomy -query \"txid9606 [Organism]\" \\ | efetch -format xml \\ | xtract -pattern Lineage -element Lineage 
\u6b64\u5916\u4e5f\u6709\u4e00\u4e9b\u5de5\u5177\u63d0\u4f9b\u7c7b\u4f3c\u7684\u529f\u80fd\uff0c\u90e8\u5206\u8f6f\u4ef6\uff1a \u5de5\u5177 \u7f16\u7a0b\u8bed\u8a00 \u6570\u636e\u83b7\u53d6\u65b9\u5f0f \u4f7f\u7528\u65b9\u5f0f \u5907\u6ce8 E-utilities shell/Perl/C++ \u8fdc\u7a0bWeb\u8c03\u7528 \u547d\u4ee4\u884c \u5b98\u65b9\u7a0b\u5e8f\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd BioPython Python \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c \u5305\u88c5entrez\u63a5\u53e3\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd ETE Toolkit Python \u672c\u5730\u6570\u636e\u5e93 \u811a\u672c/\u547d\u4ee4\u884c Taxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd Taxize R \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c ropensci\uff1b\u652f\u6301\u591a\u79cd\u6570\u636e\u5e93\uff1b\u529f\u80fd\u8f83\u4e30\u5bcc Taxopy Python \u672c\u5730\u6570\u636e\u6587\u4ef6 \u811a\u672c/\u547d\u4ee4\u884c \u4ec5\u57fa\u672c\u529f\u80fd \u9009\u62e9\u5de5\u5177\u4e00\u822c\u8003\u8651\u51e0\u4e2a\u65b9\u9762\uff1a \u662f\u5426\u6ee1\u8db3\u529f\u80fd\u9700\u6c42\u3002\u5927\u591a\u5de5\u5177\u4ec5\u6709\u57fa\u672c\u7684\u67e5\u8be2\u3001\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u7684\u529f\u80fd\uff0c\u90fd\u6ca1\u6cd5\u5c06\u5b8c\u6574\u8c31\u7cfb\u683c\u5f0f\u5316\u4e3a\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u7684\u683c\u5f0f\uff1b \u8f6f\u4ef6\u5b89\u88c5\u4fbf\u5229\u6027\u3002\u4e0a\u8ff0\u5de5\u5177\u90fd\u4e0d\u9700\u8981\u624b\u52a8\u7f16\u8bd1\u5b89\u88c5\uff0c\u9664\u4e86E-utilities\u7684\u90e8\u5206\u7ec4\u4ef6\u9700\u8981\u624b\u52a8\u5b8c\u6210\uff0c\u5176\u5b83\u57fa\u672c\u90fd\u80fd\u7528\u5bf9\u5e94\u7f16\u7a0b\u8bed\u8a00\u7684\u5305\u7ba1\u7406\u5de5\u5177\u5b89\u88c5\uff1b 
\u914d\u7f6e\u4fbf\u5229\u6027\u3002\u90e8\u5206\u5efa\u7acb\u672c\u5730\u6570\u636e\u5e93\u7684\u8f6f\u4ef6\u5219\u9700\u8981\u5148\u6784\u5efa\u6570\u636e\u5e93\uff0c\u4e0d\u8fc7\u57fa\u672c\u90fd\u662f\u5d4c\u5165\u5f0f\u7684sqlite\uff0c\u6bd4\u8f83\u7b80\u5355\u5feb\u6377\uff0c\u7a7a\u95f4\u5360\u7528\u4e5f\u80fd\u63a5\u53d7\uff1b \u4f7f\u7528\u4fbf\u5229\u6027\u3002\u63d0\u4f9b\u547d\u4ee4\u884c\u63a5\u53e3\u7684\u5de5\u5177\u5b9e\u7528\u8f83\u4e3a\u4fbf\u6377\uff0c\u4e5f\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\uff1b \u800c\u4ec5\u63d0\u4f9b\u5305/\u5e93\u7684\u5de5\u5177\uff0c\u9700\u8981\u4f7f\u7528\u8005\u5728\u8bed\u8a00\u7ec8\u7aef\u6216\u7f16\u5199\u811a\u672c\u8fdb\u884c\u8c03\u7528\uff0c\u7075\u6d3b\u4f46\u9700\u8981\u4e00\u5b9a\u7f16\u7a0b\u57fa\u7840\u3002 \u8ba1\u7b97\u6548\u7387\u3002\u901a\u8fc7\u7f51\u7edc\u8c03\u7528\u7684\u8f6f\u4ef6\u53d7\u7f51\u7edc\u72b6\u6001\u5f71\u54cd\u5927\uff0c\u4e14\u5728\u5927\u6279\u91cf\u8c03\u7528\u7684\u65f6\u5019\u901f\u5ea6\u8f83\u6162\uff1b\u5b9e\u7528\u672c\u5730\u6570\u636e\u5e93\u5219\u8f83\u4e3a\u9ad8\u6548\u3002 \u6700\u521d\u6211\u60f3\u8981\u7684\u529f\u80fd\u53ea\u662f\u6839\u636e\u83b7\u53d6\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u683c\u5f0f\u7684\u8c31\u7cfb\uff0c\u53d1\u73b0\u6ca1\u6709\u73b0\u6210\u5de5\u5177\uff0c\u800c\u540e\u53c8\u6709\u65b0\u7684\u9700\u6c42\u65e0\u6cd5\u6ee1\u8db3\uff0c\u5373\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\u6240\u6709\u7684TaxID\u3002 \u6545\u5f00\u59cb\u7f16\u5199\u5de5\u5177\u6765\u5b9e\u73b0\uff0c\u5e76\u9010\u6b65\u6269\u5c55\u5176\u529f\u80fd\u3002 \u5176\u5b9e\u6700\u7b80\u5355\u7684\u65b9\u6cd5\u5c31\u662f\u81ea\u5df1\u4e0b\u8f7d\u6570\u636e\u6587\u4ef6\u8fdb\u884c\u89e3\u6790\u3002 NCBI Taxonomy \u6570\u636e\u6587\u4ef6 NCBI Taxonomy\u6570\u636e\u5e93\u5c06\u6240\u6709\u751f\u7269\u7684 \u5206\u7c7b\u5b66\u5173\u7cfb \u7ec4\u7ec7\u4e3a\u4e00\u68f5\u201c\u6709\u6839\u6811\u201d\uff08rooted tree\uff09, \u4e0e\u8fdb\u5316\u6811\uff08Phylogenetic 
tree\uff09\u4e0d\u540c: \u8fdb\u5316\u6811\u662f\u6309 \u8fdb\u5316\u5173\u7cfb \u201d\u7ec4\u7ec7\uff0c\u4e14\u53ef\u4ee5\u4e3a\u201c\u65e0\u6839\u6811\u201d(unrooted tree)\u3002 NCBI Taxonomy\u516c\u5f00\u6570\u636e\u683c\u5f0f\u6709\u4e24\u79cd\uff0c\u65e7\u7684\u540d\u79f0\u4e3a taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea650Mb\uff0c\u5185\u542b\u4ee5\u4e0b\u6587\u4ef6\u3002 nodes.dmp # [\u5f53\u524d\u7248\u672c] \u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, parent tax_id, rank names.dmp # [\u5f53\u524d\u7248\u672c] \u540d\u79f0\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, name_txt merged.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5408\u5e76\u7684\u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a old_tax_id, new_tax_id delnodes.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5220\u9664\u7684nodes\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id citations.dmp # \u5f15\u7528\u4fe1\u606f division.dmp # division\u4fe1\u606f gencode.dmp # \u9057\u4f20\u7f16\u7801\u4fe1\u606f gc.prt # \u9057\u4f20\u7f16\u7801\u8868 readme.txt # \u8bf4\u660e\u6587\u6863 \u5176\u4e2d\u6700\u4e3b\u8981\u7684\u662f\u524d4\u4e2a\u6587\u4ef6\uff1a nodes.dmp \u4e3b\u8981\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709\u5206\u7c7b\u5b66\u5355\u5143\u8282\u70b9\uff08taxon\uff09 \u7684\u552f\u4e00\u6807\u8bc6\u7b26\uff08taxonomic identifier, \u7b80\u79f0TaxId, taxid, tax_id)\uff0c \u5206\u7c7b\u5b66\u6c34\u5e73(rank\uff09\uff0c\u53ca\u5176\u7236\u8282\u70b9\u7684TaxID\u3002 names.dmp \u4e3b\u8981\u5305\u542b\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709TaxID\u53ca\u5176\u7edf\u4e00\u79d1\u5b66\u540d\u79f0\uff08scientific name\uff09\u548c\u522b\u540d\u3002 merged.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5408\u5e76\u7684TaxID\u4e0e\u5408\u5e76\u5230\u7684\u65b0TaxID\u3002 delnodes.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5220\u9664\u7684TaxID\u3002 
\u57282018\u5e742\u6708\u7684\u65f6\u5019\uff0c \u63a8\u51fa\u4e86\u65b0\u7684\u683c\u5f0f \uff0c \u989d\u5916\u5305\u542b\u4e86\u8c31\u7cfb\uff08lineage\uff09\uff0c\u7c7b\u578b\uff08type\uff09\u548c\u5bbf\u4e3b\uff08host\uff09\u4fe1\u606f\u3002 \u6587\u4ef6\u540d\u79f0\u4e3a new_taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea6110Mb\u3002 \u76f8\u5bf9\u65e7\u7248\uff0c\u65b0\u7248\u672c\u6587\u4ef6\u6570\u91cf\u548c\u5185\u5bb9\u66f4\u591a\uff0c\u4e3b\u8981\u662f\u56e0\u4e3a\u589e\u52a0\u4e86lineage\u548c\u7c7b\u578b\u4fe1\u606f\u3002 \u4e8b\u5b9e\u4e0alineage\u662f\u53ef\u4ee5\u4ece nodes.dmp \u548c names.dmp \u8ba1\u7b97\u800c\u6765\u3002 \u65b0\u7248\u683c\u5f0f\u6240\u542b\u6587\u4ef6\u5982\u4e0b\uff1a nodes.dmp names.dmp merged.dmp delnodes.dmp fullnamelineage.dmp TaxIDlineage.dmp rankedlineage.dmp host.dmp typeoftype.dmp typematerial.dmp citations.dmp division.dmp gencode.dmp readme.txt NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002 TaxonKit \u5f00\u53d1\u601d\u8def \u5927\u5bb6\u5e94\u8be5\u90fd\u6709\u5b89\u88c5\u751f\u7269\u4fe1\u606f\u8f6f\u4ef6\u7684\u75db\u82e6\u56de\u5fc6\uff0c\u5728conda\u51fa\u73b0\u4e4b\u524d\uff0c\u5f88\u591a\u8f6f\u4ef6\u90fd\u9700\u8981\u624b\u52a8\u5b89\u88c5\u4f9d\u8d56\u3001\u518d\u7f16\u8bd1\u5b89\u88c5\u3002 \u4e0d\u540c\u64cd\u4f5c\u7cfb\u7edf\uff0c\u64cd\u4f5c\u7cfb\u7edf\u7248\u672c\uff0c\u7f16\u8bd1\u5668\u7248\u672c\u7ed9\u8f6f\u4ef6\u5b89\u88c5\u5e26\u6765\u4e86\u5de8\u5927\u7684\u56f0\u96be\u3002 \u5982\u679c\u5f00\u53d1\u8005\u6ca1\u6ce8\u610f\u8f6f\u4ef6\u7684\u8de8\u5e73\u53f0\u3001\u53ef\u79fb\u690d\u6027\u66f4\u662f\u5982\u6b64\u3002 
\u597d\u7684\u8f6f\u4ef6\u4e00\u5b9a\u8981\u8003\u8651\u4ee5\u4e0b\u51e0\u4e2a\u65b9\u9762\uff1a \u5b89\u88c5\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u5b89\u88c5\u6b65\u9aa4\uff0c\u751a\u81f3\u4e00\u952e/\u4e00\u6761\u547d\u4ee4\u5b89\u88c5\u3002 \u51cf\u5c11\u5bf9\u5916\u90e8\u8f6f\u4ef6/\u5305\u7684\u4f9d\u8d56\u3002 \u5bf9\u591a\u5e73\u53f0\uff08windows/linux\uff09\u7684\u517c\u5bb9\u6027\u3002 \u5c3d\u91cf\u63d0\u4f9b\u7f16\u8bd1\u597d\u7684 \u9759\u6001\u94fe\u63a5\u53ef\u6267\u884c\u7a0b\u5e8f\uff08Statically linked executable binaries\uff09\u3002 \u914d\u7f6e\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u914d\u7f6e\uff0c\u81ea\u52a8\u5316\u914d\u7f6e\uff0c\u751a\u81f3\u96f6\u914d\u7f6e\u3002 \u4f7f\u7528\u4fbf\u5229\u6027\u3002 \u4e30\u5bcc\u7684\u6587\u6863\uff1a\u5b89\u88c5\uff0c\u4f7f\u7528\uff0c\u4f8b\u5b50\u3002 \u8f6f\u4ef6\u7ed3\u6784\u5408\u7406\uff0c\u6a21\u5757\u5316\u3002 \u53cb\u597d\u7684\u62a5\u9519\u4fe1\u606f\uff0c\u6307\u51fa\u8be6\u7ec6\u7684\u9519\u8bef\u539f\u56e0\uff0c\u800c\u4e0d\u662f\u53ea\u62a5segmentation fault\uff0c\u6216\u6254\u51fa\u4e00\u5806\u9519\u8bef\u4fe1\u606f\u3002 \u4e30\u5bcc\u7684\u547d\u4ee4\u884c\u53c2\u6570\uff0c\u6ee1\u8db3\u4e0d\u540c\u529f\u80fd\u9700\u6c42\u3002 \u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4ece\u800c\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\u3002 \u53ef\u9009\u652f\u6301shell\u8865\u5168\uff0c\u4fbf\u4e8e\u5feb\u901f\u8c03\u7528\u5b50\u547d\u4ee4\u548c\u53c2\u6570\u3002 \u8ba1\u7b97\u6548\u7387\u3002 \u5c3d\u53ef\u80fd\u5360\u7528\u4f4e\u5185\u5b58\u3001\u4f4e\u5b58\u50a8\u3002 \u5c3d\u91cf\u51cf\u5c11\u8ba1\u7b97\u65f6\u95f4\uff0c\u5145\u5206\u5229\u7528\u591aCPU\u3002 \u6301\u7eed\u7684\u652f\u6301\u3002 \u6839\u636e\u7528\u6237\u9700\u6c42\u4fee\u590dbug\u3001\u589e\u52a0\u65b0\u529f\u80fd\u3002 \u5b9a\u671f\u66f4\u65b0\u53d1\u5e03\u65b0\u7248\u672c\u3002 
\u5728\u5b9e\u73b0TaxonKit\u7684\u65f6\u5019\uff0c\u6211\u5df2\u7ecf\u5f00\u59cb\u7f16\u5199seqkit\u548ccsvtk\u8f6f\u4ef6\uff0c\u6709\u4e86\u4e00\u5b9a\u7684\u7ecf\u9a8c\uff0c\u4e5f\u57fa\u672c\u80fd\u8fbe\u5230\u4e0a\u8ff0\u6240\u6709\u8981\u6c42\u3002 TaxonKit\u4f7f\u7528Go\u8bed\u8a00\u7f16\u5199\uff0c\u8fd9\u6837\u53ef\u4ee5\u8f7b\u677e\u7f16\u8bd1\u51fa\u652f\u6301Linux, Windows, macOS\u7b49\u64cd\u4f5c\u7cfb\u7edf\u7684\u4e0d\u540c\u67b6\u6784\uff08x86/arm\uff09\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u7531\u4e8eGo\u662f\u7f16\u8bd1\u578b\u8bed\u8a00\uff0c\u5728\u8fd0\u884c\u6548\u7387\u4e0a\u4e5f\u6709\u4fdd\u8bc1\u3002 \u81f3\u4e8e\u914d\u7f6e\u3001\u4f7f\u7528\u7b49\u4fbf\u5229\u6027\u5219\u4f9d\u8d56\u4e8e\u5f00\u53d1\u8005\u3002 \u5206\u7c7b\u5b66\u6570\u636e\u4f7f\u7528NCBI taxonomy\u7684\u516c\u5f00\u6570\u636e\u3002 \u6570\u636e\u8bbf\u95ee\u65b9\u5f0f\u7684\u9009\u62e9\uff1a\u901a\u8fc7\u7f51\u7edc\u8bbf\u95ee\u5b98\u65b9Web\u63a5\u53e3\u7684\u65b9\u5f0f\u592a\u6162\uff0c\u53ea\u8003\u8651\u672c\u5730\u8bbf\u95ee\u3002 \u672c\u5730\u8bbf\u95ee\u6709\u51e0\u79cd\u65b9\u5f0f\uff1a \u76f4\u63a5\u8bbf\u95ee\u6570\u636e\u5e93\uff1a\u53c8\u5206\u5d4c\u5165\u5f0f\u6570\u636e\u5e93\u5982SQLite\uff0c\u7b2c\u4e09\u65b9\u6570\u636e\u5e93\u5165MySQL\u3002\u540e\u8005\u4e0d\u8003\u8651\uff0c\u914d\u7f6e\u592a\u9ebb\u70e6\u3002 Client-Server\u6a21\u5f0f\uff1a Web\u63a5\u53e3\uff1a\u670d\u52a1\u7aef\u542f\u52a8\u5b88\u62a4\u8fdb\u7a0b\uff0c\u957f\u671f\u4fdd\u6301\u6570\u636e\u5e93\u8fde\u63a5\uff0c\u5bf9\u5916\u63d0\u4f9bWeb\uff08RESTful\uff09\u63a5\u53e3\uff0c \u5ba2\u6237\u7aef\u672c\u5730\u6216\u8fdc\u7a0b\u8c03\u7528\u3002\u5148\u524d\u5df2\u7ecf\u5f00\u53d1\u4e86\u4e00\u4e2a\u539f\u578b\uff08https://github.com/shenwei356/gtaxon\uff09\uff0c \u4f46\u901a\u8fc7RESTful\u63a5\u53e3\uff08HTTP\uff09\u5927\u6279\u91cf\u8c03\u7528\uff0c\u8bbf\u95ee\u901f\u5ea6\u8f83\u6162\u3002 
Socket\u63a5\u53e3\uff1a\u4e0eWeb\u501f\u53e3\u7c7b\u4f3c\uff0c\u56e0\u4e3a\u6ca1\u6709\u4f7f\u7528http\u534f\u8bae\uff0c\u901f\u5ea6\u5e94\u8be5\u4f1a\u9ad8\u4e00\u4e9b\u3002\u4f46\u6ca1\u6709\u5c1d\u8bd5\u3002 \u6700\u540e\u6d4b\u8bd5\u53d1\u73b0\uff0c\u76f4\u63a5\u89e3\u6790\u6570\u636e\u6587\u4ef6\u7684\u901f\u5ea6\u4e5f\u5f88\u5feb\uff0c5\u79d2\u5de6\u53f3\uff08\u5b58\u50a8\u4e3aNVMe SSD\uff09\uff0c\u5b8c\u5168\u6ee1\u8db3\u8981\u6c42\u3002 \u5b8c\u5168\u4e0d\u7528\u642d\u5efa\u6570\u636e\u5e93\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\u3002 \u8fd1\u65e5\u53c8\u8fdb\u4e00\u6b65\u4f18\u5316\u52302\u79d2\u5de6\u53f3\uff0c\u975e\u5e38\u5feb\u901f\u3002\u5185\u5b58\u4e5f\u5728500Mb-1.5G\u5de6\u53f3\uff0c\u5b8c\u5168\u53ef\u4ee5\u63a5\u53d7\u3002 TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c\u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\u3002 \u5c40\u9650\u6027 \u5206\u7c7b\u5b66\u6570\u636e\u5e93\u6709\u5f88\u591a\uff0cTaxonKit\u76ee\u524d\u53ea\u652f\u6301\u5e94\u7528\u6700\u5e7f\u6cdb\u7684NCBI Taxonomy\u3002 \u5bf9\u4e8eGTDB Taxonomy\uff0c\u53ef\u4ee5\u901a\u8fc7\u73b0\u6709\u5de5\u5177\uff0c\u5982 gtdb_to_taxdump \uff0c \u5c06\u5176\u6570\u636e\u8f6c\u6362\u4e3aNCBI taxdump\u6587\u4ef6\u3002","title":"\u5f00\u53d1\u7b14\u8bb0"},{"location":"chinese-dev/#_1","text":"\u60f3\u8981\u4eceNCBI\u83b7\u53d6\u751f\u7269\u7684\u8c31\u7cfb\u4fe1\u606f\uff0c\u53ef\u4ee5\u5728 NCBI Taxonomy\u7f51\u7ad9\u4e0a\u7528TaxID\u6216\u8005\u540d\u79f0\u67e5\u8be2\u3002 \u6bd4\u5982\u53ef\u4ee5\u7528 Homo sapiens \u6216 9606 \u641c\u7d22\u201c\u4eba\u201d\u7684\u5206\u7c7b\u5b66\u4fe1\u606f\uff0c\u4ee5\u53ca\u5bc6\u7801\u5b50\u8868\uff0cEntrez\u8bb0\u5f55\u7edf\u8ba1\u7b49\u3002 
\u540c\u65f6\u4e5f\u53ef\u4ee5\u901a\u8fc7NCBI\u7684\u5b98\u65b9\u5de5\u5177\u5305 E-utilities ( ftp )\u3002 $ esearch -db taxonomy -query \"txid9606 [Organism]\" \\ | efetch -format xml \\ | xtract -pattern Lineage -element Lineage \u6b64\u5916\u4e5f\u6709\u4e00\u4e9b\u5de5\u5177\u63d0\u4f9b\u7c7b\u4f3c\u7684\u529f\u80fd\uff0c\u90e8\u5206\u8f6f\u4ef6\uff1a \u5de5\u5177 \u7f16\u7a0b\u8bed\u8a00 \u6570\u636e\u83b7\u53d6\u65b9\u5f0f \u4f7f\u7528\u65b9\u5f0f \u5907\u6ce8 E-utilities shell/Perl/C++ \u8fdc\u7a0bWeb\u8c03\u7528 \u547d\u4ee4\u884c \u5b98\u65b9\u7a0b\u5e8f\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd BioPython Python \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c \u5305\u88c5entrez\u63a5\u53e3\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd ETE Toolkit Python \u672c\u5730\u6570\u636e\u5e93 \u811a\u672c/\u547d\u4ee4\u884c Taxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd Taxize R \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c ropensci\uff1b\u652f\u6301\u591a\u79cd\u6570\u636e\u5e93\uff1b\u529f\u80fd\u8f83\u4e30\u5bcc Taxopy Python \u672c\u5730\u6570\u636e\u6587\u4ef6 \u811a\u672c/\u547d\u4ee4\u884c \u4ec5\u57fa\u672c\u529f\u80fd \u9009\u62e9\u5de5\u5177\u4e00\u822c\u8003\u8651\u51e0\u4e2a\u65b9\u9762\uff1a \u662f\u5426\u6ee1\u8db3\u529f\u80fd\u9700\u6c42\u3002\u5927\u591a\u5de5\u5177\u4ec5\u6709\u57fa\u672c\u7684\u67e5\u8be2\u3001\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u7684\u529f\u80fd\uff0c\u90fd\u6ca1\u6cd5\u5c06\u5b8c\u6574\u8c31\u7cfb\u683c\u5f0f\u5316\u4e3a\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u7684\u683c\u5f0f\uff1b \u8f6f\u4ef6\u5b89\u88c5\u4fbf\u5229\u6027\u3002\u4e0a\u8ff0\u5de5\u5177\u90fd\u4e0d\u9700\u8981\u624b\u52a8\u7f16\u8bd1\u5b89\u88c5\uff0c\u9664\u4e86E-utilities\u7684\u90e8\u5206\u7ec4\u4ef6\u9700\u8981\u624b\u52a8\u5b8c\u6210\uff0c\u5176\u5b83\u57fa\u672c\u90fd\u80fd\u7528\u5bf9\u5e94\u7f16\u7a0b\u8bed\u8a00\u7684\u5305\u7ba1\u7406\u5de5\u5177\u5b89\u88c5\uff1b 
\u914d\u7f6e\u4fbf\u5229\u6027\u3002\u90e8\u5206\u5efa\u7acb\u672c\u5730\u6570\u636e\u5e93\u7684\u8f6f\u4ef6\u5219\u9700\u8981\u5148\u6784\u5efa\u6570\u636e\u5e93\uff0c\u4e0d\u8fc7\u57fa\u672c\u90fd\u662f\u5d4c\u5165\u5f0f\u7684sqlite\uff0c\u6bd4\u8f83\u7b80\u5355\u5feb\u6377\uff0c\u7a7a\u95f4\u5360\u7528\u4e5f\u80fd\u63a5\u53d7\uff1b \u4f7f\u7528\u4fbf\u5229\u6027\u3002\u63d0\u4f9b\u547d\u4ee4\u884c\u63a5\u53e3\u7684\u5de5\u5177\u5b9e\u7528\u8f83\u4e3a\u4fbf\u6377\uff0c\u4e5f\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\uff1b \u800c\u4ec5\u63d0\u4f9b\u5305/\u5e93\u7684\u5de5\u5177\uff0c\u9700\u8981\u4f7f\u7528\u8005\u5728\u8bed\u8a00\u7ec8\u7aef\u6216\u7f16\u5199\u811a\u672c\u8fdb\u884c\u8c03\u7528\uff0c\u7075\u6d3b\u4f46\u9700\u8981\u4e00\u5b9a\u7f16\u7a0b\u57fa\u7840\u3002 \u8ba1\u7b97\u6548\u7387\u3002\u901a\u8fc7\u7f51\u7edc\u8c03\u7528\u7684\u8f6f\u4ef6\u53d7\u7f51\u7edc\u72b6\u6001\u5f71\u54cd\u5927\uff0c\u4e14\u5728\u5927\u6279\u91cf\u8c03\u7528\u7684\u65f6\u5019\u901f\u5ea6\u8f83\u6162\uff1b\u5b9e\u7528\u672c\u5730\u6570\u636e\u5e93\u5219\u8f83\u4e3a\u9ad8\u6548\u3002 \u6700\u521d\u6211\u60f3\u8981\u7684\u529f\u80fd\u53ea\u662f\u6839\u636e\u83b7\u53d6\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u683c\u5f0f\u7684\u8c31\u7cfb\uff0c\u53d1\u73b0\u6ca1\u6709\u73b0\u6210\u5de5\u5177\uff0c\u800c\u540e\u53c8\u6709\u65b0\u7684\u9700\u6c42\u65e0\u6cd5\u6ee1\u8db3\uff0c\u5373\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\u6240\u6709\u7684TaxID\u3002 \u6545\u5f00\u59cb\u7f16\u5199\u5de5\u5177\u6765\u5b9e\u73b0\uff0c\u5e76\u9010\u6b65\u6269\u5c55\u5176\u529f\u80fd\u3002 \u5176\u5b9e\u6700\u7b80\u5355\u7684\u65b9\u6cd5\u5c31\u662f\u81ea\u5df1\u4e0b\u8f7d\u6570\u636e\u6587\u4ef6\u8fdb\u884c\u89e3\u6790\u3002","title":"\u73b0\u6709\u5de5\u5177\u6bd4\u8f83"},{"location":"chinese-dev/#ncbi-taxonomy","text":"NCBI Taxonomy\u6570\u636e\u5e93\u5c06\u6240\u6709\u751f\u7269\u7684 \u5206\u7c7b\u5b66\u5173\u7cfb 
\u7ec4\u7ec7\u4e3a\u4e00\u68f5\u201c\u6709\u6839\u6811\u201d\uff08rooted tree\uff09, \u4e0e\u8fdb\u5316\u6811\uff08Phylogenetic tree\uff09\u4e0d\u540c: \u8fdb\u5316\u6811\u662f\u6309 \u8fdb\u5316\u5173\u7cfb \u201d\u7ec4\u7ec7\uff0c\u4e14\u53ef\u4ee5\u4e3a\u201c\u65e0\u6839\u6811\u201d(unrooted tree)\u3002 NCBI Taxonomy\u516c\u5f00\u6570\u636e\u683c\u5f0f\u6709\u4e24\u79cd\uff0c\u65e7\u7684\u540d\u79f0\u4e3a taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea650Mb\uff0c\u5185\u542b\u4ee5\u4e0b\u6587\u4ef6\u3002 nodes.dmp # [\u5f53\u524d\u7248\u672c] \u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, parent tax_id, rank names.dmp # [\u5f53\u524d\u7248\u672c] \u540d\u79f0\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, name_txt merged.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5408\u5e76\u7684\u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a old_tax_id, new_tax_id delnodes.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5220\u9664\u7684nodes\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id citations.dmp # \u5f15\u7528\u4fe1\u606f division.dmp # division\u4fe1\u606f gencode.dmp # \u9057\u4f20\u7f16\u7801\u4fe1\u606f gc.prt # \u9057\u4f20\u7f16\u7801\u8868 readme.txt # \u8bf4\u660e\u6587\u6863 \u5176\u4e2d\u6700\u4e3b\u8981\u7684\u662f\u524d4\u4e2a\u6587\u4ef6\uff1a nodes.dmp \u4e3b\u8981\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709\u5206\u7c7b\u5b66\u5355\u5143\u8282\u70b9\uff08taxon\uff09 \u7684\u552f\u4e00\u6807\u8bc6\u7b26\uff08taxonomic identifier, \u7b80\u79f0TaxId, taxid, tax_id)\uff0c \u5206\u7c7b\u5b66\u6c34\u5e73(rank\uff09\uff0c\u53ca\u5176\u7236\u8282\u70b9\u7684TaxID\u3002 names.dmp \u4e3b\u8981\u5305\u542b\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709TaxID\u53ca\u5176\u7edf\u4e00\u79d1\u5b66\u540d\u79f0\uff08scientific name\uff09\u548c\u522b\u540d\u3002 merged.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5408\u5e76\u7684TaxID\u4e0e\u5408\u5e76\u5230\u7684\u65b0TaxID\u3002 
delnodes.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5220\u9664\u7684TaxID\u3002 \u57282018\u5e742\u6708\u7684\u65f6\u5019\uff0c \u63a8\u51fa\u4e86\u65b0\u7684\u683c\u5f0f \uff0c \u989d\u5916\u5305\u542b\u4e86\u8c31\u7cfb\uff08lineage\uff09\uff0c\u7c7b\u578b\uff08type\uff09\u548c\u5bbf\u4e3b\uff08host\uff09\u4fe1\u606f\u3002 \u6587\u4ef6\u540d\u79f0\u4e3a new_taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea6110Mb\u3002 \u76f8\u5bf9\u65e7\u7248\uff0c\u65b0\u7248\u672c\u6587\u4ef6\u6570\u91cf\u548c\u5185\u5bb9\u66f4\u591a\uff0c\u4e3b\u8981\u662f\u56e0\u4e3a\u589e\u52a0\u4e86lineage\u548c\u7c7b\u578b\u4fe1\u606f\u3002 \u4e8b\u5b9e\u4e0alineage\u662f\u53ef\u4ee5\u4ece nodes.dmp \u548c names.dmp \u8ba1\u7b97\u800c\u6765\u3002 \u65b0\u7248\u683c\u5f0f\u6240\u542b\u6587\u4ef6\u5982\u4e0b\uff1a nodes.dmp names.dmp merged.dmp delnodes.dmp fullnamelineage.dmp TaxIDlineage.dmp rankedlineage.dmp host.dmp typeoftype.dmp typematerial.dmp citations.dmp division.dmp gencode.dmp readme.txt NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002","title":"NCBI Taxonomy \u6570\u636e\u6587\u4ef6"},{"location":"chinese-dev/#taxonkit","text":"\u5927\u5bb6\u5e94\u8be5\u90fd\u6709\u5b89\u88c5\u751f\u7269\u4fe1\u606f\u8f6f\u4ef6\u7684\u75db\u82e6\u56de\u5fc6\uff0c\u5728conda\u51fa\u73b0\u4e4b\u524d\uff0c\u5f88\u591a\u8f6f\u4ef6\u90fd\u9700\u8981\u624b\u52a8\u5b89\u88c5\u4f9d\u8d56\u3001\u518d\u7f16\u8bd1\u5b89\u88c5\u3002 \u4e0d\u540c\u64cd\u4f5c\u7cfb\u7edf\uff0c\u64cd\u4f5c\u7cfb\u7edf\u7248\u672c\uff0c\u7f16\u8bd1\u5668\u7248\u672c\u7ed9\u8f6f\u4ef6\u5b89\u88c5\u5e26\u6765\u4e86\u5de8\u5927\u7684\u56f0\u96be\u3002 
\u5982\u679c\u5f00\u53d1\u8005\u6ca1\u6ce8\u610f\u8f6f\u4ef6\u7684\u8de8\u5e73\u53f0\u3001\u53ef\u79fb\u690d\u6027\u66f4\u662f\u5982\u6b64\u3002 \u597d\u7684\u8f6f\u4ef6\u4e00\u5b9a\u8981\u8003\u8651\u4ee5\u4e0b\u51e0\u4e2a\u65b9\u9762\uff1a \u5b89\u88c5\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u5b89\u88c5\u6b65\u9aa4\uff0c\u751a\u81f3\u4e00\u952e/\u4e00\u6761\u547d\u4ee4\u5b89\u88c5\u3002 \u51cf\u5c11\u5bf9\u5916\u90e8\u8f6f\u4ef6/\u5305\u7684\u4f9d\u8d56\u3002 \u5bf9\u591a\u5e73\u53f0\uff08windows/linux\uff09\u7684\u517c\u5bb9\u6027\u3002 \u5c3d\u91cf\u63d0\u4f9b\u7f16\u8bd1\u597d\u7684 \u9759\u6001\u94fe\u63a5\u53ef\u6267\u884c\u7a0b\u5e8f\uff08Statically linked executable binaries\uff09\u3002 \u914d\u7f6e\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u914d\u7f6e\uff0c\u81ea\u52a8\u5316\u914d\u7f6e\uff0c\u751a\u81f3\u96f6\u914d\u7f6e\u3002 \u4f7f\u7528\u4fbf\u5229\u6027\u3002 \u4e30\u5bcc\u7684\u6587\u6863\uff1a\u5b89\u88c5\uff0c\u4f7f\u7528\uff0c\u4f8b\u5b50\u3002 \u8f6f\u4ef6\u7ed3\u6784\u5408\u7406\uff0c\u6a21\u5757\u5316\u3002 \u53cb\u597d\u7684\u62a5\u9519\u4fe1\u606f\uff0c\u6307\u51fa\u8be6\u7ec6\u7684\u9519\u8bef\u539f\u56e0\uff0c\u800c\u4e0d\u662f\u53ea\u62a5segmentation fault\uff0c\u6216\u6254\u51fa\u4e00\u5806\u9519\u8bef\u4fe1\u606f\u3002 \u4e30\u5bcc\u7684\u547d\u4ee4\u884c\u53c2\u6570\uff0c\u6ee1\u8db3\u4e0d\u540c\u529f\u80fd\u9700\u6c42\u3002 \u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4ece\u800c\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\u3002 \u53ef\u9009\u652f\u6301shell\u8865\u5168\uff0c\u4fbf\u4e8e\u5feb\u901f\u8c03\u7528\u5b50\u547d\u4ee4\u548c\u53c2\u6570\u3002 \u8ba1\u7b97\u6548\u7387\u3002 \u5c3d\u53ef\u80fd\u5360\u7528\u4f4e\u5185\u5b58\u3001\u4f4e\u5b58\u50a8\u3002 \u5c3d\u91cf\u51cf\u5c11\u8ba1\u7b97\u65f6\u95f4\uff0c\u5145\u5206\u5229\u7528\u591aCPU\u3002 \u6301\u7eed\u7684\u652f\u6301\u3002 \u6839\u636e\u7528\u6237\u9700\u6c42\u4fee\u590dbug\u3001\u589e\u52a0\u65b0\u529f\u80fd\u3002 
\u5b9a\u671f\u66f4\u65b0\u53d1\u5e03\u65b0\u7248\u672c\u3002 \u5728\u5b9e\u73b0TaxonKit\u7684\u65f6\u5019\uff0c\u6211\u5df2\u7ecf\u5f00\u59cb\u7f16\u5199seqkit\u548ccsvtk\u8f6f\u4ef6\uff0c\u6709\u4e86\u4e00\u5b9a\u7684\u7ecf\u9a8c\uff0c\u4e5f\u57fa\u672c\u80fd\u8fbe\u5230\u4e0a\u8ff0\u6240\u6709\u8981\u6c42\u3002 TaxonKit\u4f7f\u7528Go\u8bed\u8a00\u7f16\u5199\uff0c\u8fd9\u6837\u53ef\u4ee5\u8f7b\u677e\u7f16\u8bd1\u51fa\u652f\u6301Linux, Windows, macOS\u7b49\u64cd\u4f5c\u7cfb\u7edf\u7684\u4e0d\u540c\u67b6\u6784\uff08x86/arm\uff09\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u7531\u4e8eGo\u662f\u7f16\u8bd1\u578b\u8bed\u8a00\uff0c\u5728\u8fd0\u884c\u6548\u7387\u4e0a\u4e5f\u6709\u4fdd\u8bc1\u3002 \u81f3\u4e8e\u914d\u7f6e\u3001\u4f7f\u7528\u7b49\u4fbf\u5229\u6027\u5219\u4f9d\u8d56\u4e8e\u5f00\u53d1\u8005\u3002 \u5206\u7c7b\u5b66\u6570\u636e\u4f7f\u7528NCBI taxonomy\u7684\u516c\u5f00\u6570\u636e\u3002 \u6570\u636e\u8bbf\u95ee\u65b9\u5f0f\u7684\u9009\u62e9\uff1a\u901a\u8fc7\u7f51\u7edc\u8bbf\u95ee\u5b98\u65b9Web\u63a5\u53e3\u7684\u65b9\u5f0f\u592a\u6162\uff0c\u53ea\u8003\u8651\u672c\u5730\u8bbf\u95ee\u3002 \u672c\u5730\u8bbf\u95ee\u6709\u51e0\u79cd\u65b9\u5f0f\uff1a \u76f4\u63a5\u8bbf\u95ee\u6570\u636e\u5e93\uff1a\u53c8\u5206\u5d4c\u5165\u5f0f\u6570\u636e\u5e93\u5982SQLite\uff0c\u7b2c\u4e09\u65b9\u6570\u636e\u5e93\u5982MySQL\u3002\u540e\u8005\u4e0d\u8003\u8651\uff0c\u914d\u7f6e\u592a\u9ebb\u70e6\u3002 Client-Server\u6a21\u5f0f\uff1a Web\u63a5\u53e3\uff1a\u670d\u52a1\u7aef\u542f\u52a8\u5b88\u62a4\u8fdb\u7a0b\uff0c\u957f\u671f\u4fdd\u6301\u6570\u636e\u5e93\u8fde\u63a5\uff0c\u5bf9\u5916\u63d0\u4f9bWeb\uff08RESTful\uff09\u63a5\u53e3\uff0c \u5ba2\u6237\u7aef\u672c\u5730\u6216\u8fdc\u7a0b\u8c03\u7528\u3002\u5148\u524d\u5df2\u7ecf\u5f00\u53d1\u4e86\u4e00\u4e2a\u539f\u578b\uff08https://github.com/shenwei356/gtaxon\uff09\uff0c \u4f46\u901a\u8fc7RESTful\u63a5\u53e3\uff08HTTP\uff09\u5927\u6279\u91cf\u8c03\u7528\uff0c\u8bbf\u95ee\u901f\u5ea6\u8f83\u6162\u3002 
Socket\u63a5\u53e3\uff1a\u4e0eWeb\u63a5\u53e3\u7c7b\u4f3c\uff0c\u56e0\u4e3a\u6ca1\u6709\u4f7f\u7528http\u534f\u8bae\uff0c\u901f\u5ea6\u5e94\u8be5\u4f1a\u9ad8\u4e00\u4e9b\u3002\u4f46\u6ca1\u6709\u5c1d\u8bd5\u3002 \u6700\u540e\u6d4b\u8bd5\u53d1\u73b0\uff0c\u76f4\u63a5\u89e3\u6790\u6570\u636e\u6587\u4ef6\u7684\u901f\u5ea6\u4e5f\u5f88\u5feb\uff0c5\u79d2\u5de6\u53f3\uff08\u5b58\u50a8\u4e3aNVMe SSD\uff09\uff0c\u5b8c\u5168\u6ee1\u8db3\u8981\u6c42\u3002 \u5b8c\u5168\u4e0d\u7528\u642d\u5efa\u6570\u636e\u5e93\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\u3002 \u8fd1\u65e5\u53c8\u8fdb\u4e00\u6b65\u4f18\u5316\u52302\u79d2\u5de6\u53f3\uff0c\u975e\u5e38\u5feb\u901f\u3002\u5185\u5b58\u4e5f\u5728500Mb-1.5G\u5de6\u53f3\uff0c\u5b8c\u5168\u53ef\u4ee5\u63a5\u53d7\u3002 TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c\u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\u3002","title":"TaxonKit \u5f00\u53d1\u601d\u8def"},{"location":"chinese-dev/#_2","text":"\u5206\u7c7b\u5b66\u6570\u636e\u5e93\u6709\u5f88\u591a\uff0cTaxonKit\u76ee\u524d\u53ea\u652f\u6301\u5e94\u7528\u6700\u5e7f\u6cdb\u7684NCBI Taxonomy\u3002 \u5bf9\u4e8eGTDB Taxonomy\uff0c\u53ef\u4ee5\u901a\u8fc7\u73b0\u6709\u5de5\u5177\uff0c\u5982 gtdb_to_taxdump \uff0c \u5c06\u5176\u6570\u636e\u8f6c\u6362\u4e3aNCBI taxdump\u6587\u4ef6\u3002","title":"\u5c40\u9650\u6027"},{"location":"chinese/","text":"TaxonKit: \u5c0f\u5de7\u3001\u9ad8\u6548\u3001\u5b9e\u7528\u7684NCBI\u5206\u7c7b\u5b66\u6570\u636e\u547d\u4ee4\u884c\u5de5\u5177\u96c6 NCBI Taxonomy \u6570\u636e\u5e93 \u4ece\u4e8b\u751f\u7269\u591a\u6837\u6027\u7684\u7814\u7a76\u8005\u5bf9NCBI Taxonomy\u6570\u636e\u5e93\u4e00\u5b9a\u4e0d\u4f1a\u964c\u751f\uff0c 
\u5b83\u5305\u542b\u4e86NCBI\u6240\u6709\u6838\u9178\u548c\u86cb\u767d\u5e8f\u5217\u6570\u636e\u5e93\u4e2d\u6bcf\u6761\u5e8f\u5217\u5bf9\u5e94\u7684\u7269\u79cd\u540d\u79f0\u4e0e\u5206\u7c7b\u5b66\u4fe1\u606f\u3002 \u5927\u591a\u6570\u751f\u6001\u5b66\u7814\u7a76\u5bf9\u7269\u79cd\u7ec4\u6210\u7684\u63cf\u8ff0\u90fd\u662f\u57fa\u4e8eNCBI Taxonomy\u6570\u636e\u5e93\uff0c \u5f53\u7136\u76ee\u524d\u4e5f\u5f00\u59cb\u4f7f\u7528\u5176\u4ed6\u6570\u636e\u5e93\uff0c\u5982GTDB\u7b49\u3002 NCBI Taxonomy\u6570\u636e\u5e93\u59cb\u4e8e1991\u5e74\uff0c\u4e00\u76f4\u968f\u7740Entrez\u6570\u636e\u5e93\u548c\u5176\u4ed6\u6570\u636e\u5e93\u66f4\u65b0\uff0c 1996\u5e74\u63a8\u51fa\u7f51\u9875\u7248\u3002NCBI Taxonomy\u6570\u636e\u5e93\u5b98\u65b9\u5730\u5740\u4e3a https://www.ncbi.nlm.nih.gov/taxonomy \uff0c \u516c\u5f00\u6570\u636e\u4e0b\u8f7d\u5730\u5740\u4e3a https://ftp.ncbi.nih.gov/pub/taxonomy/ \uff0c \u6570\u636e\u6bcf\u5c0f\u65f6\u66f4\u65b0\uff0c\u6bcf\u4e2a\u6708\u521d\u751f\u6210\u4e00\u4efd\u6570\u636e\u5f52\u6863\u5b58\u4e8e taxdump_archive \u76ee\u5f55\uff0c\u6700\u65e9\u53ef\u8ffd\u6eaf\u52302014\u5e748\u6708\u3002 TaxonKit \u4f7f\u7528 TaxonKit\u662f\u91c7\u7528Go\u8bed\u8a00\u7f16\u5199\u7684\u547d\u4ee4\u884c\u5de5\u5177\uff0c \u63d0\u4f9bLinux, Windows, macOS\u64cd\u4f5c\u7cfb\u7edf\u4e0d\u540c\u67b6\u6784\uff08x86-64/arm64\uff09\u7684\u9759\u6001\u7f16\u8bd1\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u53d1\u5e03\u7684\u538b\u7f29\u5305\u4e0d\u8db33Mb\uff0c\u9664\u4e86Github\u6258\u7ba1\u5916\uff0c\u8fd8\u63d0\u4f9b\u56fd\u5185\u955c\u50cf\u4f9b\u4e0b\u8f7d\uff0c\u540c\u65f6\u8fd8\u652f\u6301conda\u548chomebrew\u5b89\u88c5\u3002 \u7528\u6237\u53ea\u9700\u8981 \u4e0b\u8f7d\u3001\u89e3\u538b\uff0c\u5f00\u7bb1\u5373\u7528\uff0c\u65e0\u9700\u914d\u7f6e \uff0c\u4ec5\u9700\u4e0b\u8f7d\u89e3\u538bNCBI Taxonomy\u6570\u636e\u6587\u4ef6\u89e3\u538b\u5230\u6307\u5b9a\u76ee\u5f55\u5373\u53ef\u3002 \u6e90\u4ee3\u7801 https://github.com/shenwei356/taxonkit 
\uff0c \u6587\u6863 http://bioinf.shenwei.me/taxonkit \uff08\u4ecb\u7ecd\u3001\u4f7f\u7528\u8bf4\u660e\u3001\u4f8b\u5b50\u3001\u6559\u7a0b\uff09 \u9009\u62e9\u7cfb\u7edf\u5bf9\u5e94\u7684\u7248\u672c\u4e0b\u8f7d\u6700\u65b0\u7248 https://github.com/shenwei356/taxonkit/releases \uff0c\u89e3\u538b\u540e\u6dfb\u52a0\u73af\u5883\u53d8\u91cf\u5373\u53ef\u4f7f\u7528\u3002\u6216\u53ef\u9009conda\u5b89\u88c5 conda install taxonkit -c bioconda -y # \u8868\u683c\u6570\u636e\u5904\u7406\uff0c\u63a8\u8350\u4f7f\u7528 csvtk \u66f4\u9ad8\u6548 conda install csvtk -c bioconda -y \u6d4b\u8bd5\u6570\u636e\u4e0b\u8f7d\u53ef\u76f4\u63a5 https://github.com/shenwei356/taxonkit \u4e0b\u8f7d\u9879\u76ee\u538b\u7f29\u5305\uff0c\u6216\u4f7f\u7528git clone\u4e0b\u8f7d\u9879\u76ee\u6587\u4ef6\u5939\uff0c\u5176\u4e2d\u7684example\u4e3a\u6d4b\u8bd5\u6570\u636e git clone https://github.com/shenwei356/taxonkit TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c \u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\uff0c \u8f7b\u677e\u6574\u5408\u8fdb\u5206\u6790\u6d41\u7a0b\u4e2d\u3002 \u5b50\u547d\u4ee4 \u529f\u80fd list \u5217\u51fa\u6307\u5b9aTaxId\u4e0b\u6240\u6709\u5b50\u5355\u5143\u7684TaxID lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\uff08lineage\uff09 reformat \u5c06\u5b8c\u6574\u8c31\u7cfb\u8f6c\u5316\u4e3a\u201c\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u682a\"\u7684\u81ea\u5b9a\u4e49\u683c\u5f0f name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA) taxid-changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55 version \u663e\u793a\u7248\u672c\u4fe1\u606f\u3001\u68c0\u6d4b\u65b0\u7248\u672c 
genautocomplete \u751f\u6210shell\u81ea\u52a8\u8865\u5168\u914d\u7f6e\u811a\u672c \u5907\u6ce8\uff1a \u8f93\u51fa\uff1a \u6240\u6709\u547d\u4ee4\u8f93\u51fa\u4e2d\u5305\u542b\u8f93\u5165\u6570\u636e\u5185\u5bb9\uff0c\u5728\u6b64\u57fa\u7840\u4e0a\u589e\u52a0\u5217\u3002 \u6240\u6709\u547d\u4ee4\u9ed8\u8ba4\u8f93\u51fa\u5230\u6807\u51c6\u8f93\u51fa\uff08stdout\uff09\uff0c\u53ef\u901a\u8fc7\u91cd\u5b9a\u5411\uff08 > \uff09\u5199\u5165\u6587\u4ef6\u3002 \u6216\u901a\u8fc7\u5168\u5c40\u53c2\u6570 -o \u6216 --out-file \u6307\u5b9a\u8f93\u51fa\u6587\u4ef6\uff0c\u4e14\u53ef\u81ea\u52a8\u8bc6\u522b\u8f93\u51fa\u6587\u4ef6\u540e\u7f00\uff08 .gz \uff09\u8f93\u51fagzip\u683c\u5f0f\u3002 \u8f93\u5165\uff1a \u9664\u4e86 list \u4e0e taxid-changelog \u4e4b\u5916\uff0c lineage , reformat , name2taxid , filter \u4e0e lca \u5747\u53ef\u4ece\u6807\u51c6\u8f93\u5165\uff08stdin\uff09\u8bfb\u53d6\u8f93\u5165\u6570\u636e\uff0c\u4e5f\u53ef\u901a\u8fc7\u4f4d\u7f6e\u53c2\u6570\uff08positional arguments\uff09\u8f93\u5165\uff0c\u5373\u547d\u4ee4\u540e\u9762\u4e0d\u5e26 \u4efb\u4f55flag\u7684\u53c2\u6570\uff0c\u5982 taxonkit lineage taxids.txt \u8f93\u5165\u683c\u5f0f\u4e3a\u5355\u5217\uff0c\u6216\u8005\u5236\u8868\u7b26\u5206\u9694\u7684\u683c\u5f0f\uff0c\u8f93\u5165\u6570\u636e\u6240\u5728\u5217\u7528 -i \u6216 --taxid-field \u6307\u5b9a\u3002 TaxonKit\u76f4\u63a5\u89e3\u6790NCBI Taxonomy\u6570\u636e\u6587\u4ef6\uff082\u79d2\u5de6\u53f3\uff09\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\uff0c\u5360\u7528\u5185\u5b58\u5728500Mb-1.5G\u5de6\u53f3\u3002 \u6570\u636e\u4e0b\u8f7d\uff1a # \u6709\u65f6\u4e0b\u8f7d\u5931\u8d25\uff0c\u53ef\u591a\u8bd5\u51e0\u6b21\uff1b\u6216\u5c1d\u8bd5\u6d4f\u89c8\u5668\u4e0b\u8f7d\u6b64\u94fe\u63a5 wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz # \u89e3\u538b\u6587\u4ef6\u5b58\u4e8e\u5bb6\u76ee\u5f55\u4e2d.taxonkit/\uff0c\u7a0b\u5e8f\u9ed8\u8ba4\u6570\u636e\u5e93\u9ed8\u8ba4\u76ee\u5f55 
mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit list \u5217\u51fa\u6307\u5b9aTaxId\u6240\u5728\u5b50\u6811\u7684\u6240\u6709TaxID taxonkit list \u7528\u4e8e\u5217\u51fa\u6307\u5b9aTaxID\u6240\u5728\u5206\u7c7b\u5b66\u5355\u5143\uff08taxon\uff09\u7684\u5b50\u6811\uff08subtree\uff09\u7684\u6240\u6709taxon\u7684TaxID\uff0c\u53ef\u9009\u663e\u793a\u540d\u79f0\u548c\u5206\u7c7b\u5b66\u6c34\u5e73\u3002 \u6b64\u529f\u80fd\u4e0eNCBI Taxonomy\u7f51\u9875\u7248\u7c7b\u4f3c\u3002 \u5982\uff0c # \u4ee5\u4eba\u5c5e(9605)\u548c\u80a0\u9053\u4e2d\u8457\u540d\u7684Akk\u83cc\u5c5e(239934)\u4e3a\u4f8b $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 ... 
list\u4f7f\u7528\u6700\u5e7f\u6cdb\u7684\u529f\u80fd\u662f\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\uff08\u6bd4\u5982\u7ec6\u83cc\u3001\u75c5\u6bd2\u3001\u67d0\u4e2a\u5c5e\u7b49\uff09\u4e0b\u6240\u6709\u7684TaxID\uff0c \u7528\u6765\u4eceNCBI nt/nr\u4e2d\u83b7\u53d6\u5bf9\u5e94\u7684\u6838\u9178/\u86cb\u767d\u5e8f\u5217\uff0c\u4ece\u800c\u642d\u5efa\u7279\u5f02\u6027\u7684BLAST\u6570\u636e\u5e93\u3002 \u5b98\u7f51\u63d0\u4f9b\u4e86\u76f8\u5e94\u7684\u8be6\u7ec6\u6b65\u9aa4\uff1a http://bioinf.shenwei.me/taxonkit/tutorial \u3002 # \u6240\u6709\u7ec6\u83cc\u7684TaxID $ taxonkit list --show-rank --show-name --ids 2 > /dev/null lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb \u5206\u7c7b\u5b66\u6570\u636e\u76f8\u5173\u6700\u5e38\u89c1\u7684\u529f\u80fd\u5c31\u662f\u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u3002 TaxonKit\u53ef\u6839\u636e\u8f93\u5165\u6587\u4ef6\u63d0\u4f9b\u7684TaxID\u5217\u8868\u5feb\u901f\u8ba1\u7b97lineage\uff0c\u5e76\u53ef\u9009\u63d0\u4f9b\u540d\u79f0\uff0c\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u4ee5\u53ca\u8c31\u7cfb\u5bf9\u5e94\u7684TaxID\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u968f\u7740Taxonomy\u6570\u636e\u7684\u9891\u7e41\u66f4\u65b0\uff0c\u6709\u7684TaxID\u53ef\u80fd\u88ab\u5220\u9664\u3001\u6216\u5408\u5e76\uff08merge\uff09\u5230\u5176\u5b83TaxID\u4e2d\uff0c TaxonKit\u4f1a\u81ea\u52a8\u8bc6\u522b\uff0c\u5e76\u8fdb\u884c\u63d0\u793a\uff0c\u5bf9\u4e8e\u88ab\u5408\u5e76\u7684TaxID\uff0cTaxonKit\u4f1a\u6309\u65b0TaxID\u8fdb\u884c\u8ba1\u7b97\u3002 # \u4f7f\u7528example\u4e2d\u7684\u6d4b\u8bd5\u6570\u636e $ head taxids.txt 9606 9913 376619 # \u67e5\u627e\u6307\u5b9ataxids\u5217\u8868\u7684\u7269\u79cd\u4fe1\u606f\uff0ctee\u53ef\u8f93\u51fa\u5c4f\u5e55\u5e76\u5199\u5165\u6587\u4ef6 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 
was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas 
raichei \u4e0e\u5176\u5b83\u8f6f\u4ef6\u7684\u6027\u80fd\u76f8\u6bd4\uff0c\u5f53\u67e5\u8be2\u6570\u91cf\u8f83\u5c11\u65f6ETE\u8f83\u5feb\uff0c\u6570\u91cf\u8f83\u591a\u65f6\u5219TaxonKit\u66f4\u5feb\u3002 \u5728\u4e0d\u540c\u6570\u636e\u91cf\u89c4\u6a21\u4e0a TaxonKit\u901f\u5ea6\u4e00\u76f4\u5f88\u7a33\u5b9a\uff0c\u5747\u4e3a2-3\u79d2\uff0c\u65f6\u95f4\u4e3b\u8981\u82b1\u5728\u89e3\u6790Taxonomy\u6570\u636e\u6587\u4ef6\u4e0a\u3002 \u5217\u51falineage\u6bcf\u4e2a\u5206\u7c7b\u5b66\u5355\u5143\u7684TaxId\u548crank\u548c\u540d\u79f0\uff0c\u6bd4\u5982SARS-COV-2\u3002 # lineage\u63d0\u53d6SARS-COV-2\u7684\u4e16\u7cfb $ echo \"2697049\" \\ | taxonkit lineage -t -R \\ | sed \"s/\\t/\\n/g\" 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank reformat \u751f\u6210\u6807\u51c6\u5c42\u7ea7\u7269\u79cd\u6ce8\u91ca \u6709\u65f6\u5019\uff0c\u6211\u4eec\u5e76\u4e0d\u9700\u8981\u5b8c\u6574\u7684\u5206\u7c7b\u5b66\u8c31\u7cfb\uff08complete lineage\uff09\uff0c\u56e0\u4e3a\u5f88\u591a\u7ea7\u522b\u65e2\u4e0d\u5e38\u7528\uff0c\u800c\u4e14\u4e0d\u5b8c\u6574\u3002\u901a\u5e38\u53ea\u60f3\u4fdd\u7559\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c \u4e0d\u662f\u6240\u6709\u7269\u79cd\u90fd\u6709\u5b8c\u6574\u7684\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u6c34\u5e73\uff0c\u7279\u522b\u662f\u75c5\u6bd2\u4ee5\u53ca\u4e00\u4e9b\u73af\u5883\u6837\u54c1 \u3002 
TaxonKit\u53ef\u4ee5\u7528\u81ea\u5b9a\u4e49\u5185\u5bb9\u66ff\u4ee3\u7f3a\u5931\u7684\u5206\u7c7b\u5355\u5143\uff0c\u5982\u7528\u201c__\u201d\u66ff\u4ee3\u3002 \u66f4 \u5389\u5bb3 \u6709\u7528\u7684\u662f\uff0c TaxonKit\u8fd8\u53ef\u4ee5\u7528\u66f4\u9ad8\u5c42\u7ea7\u7684\u5206\u7c7b\u5355\u5143\u4fe1\u606f\u6765\u8865\u9f50\u7f3a\u5931\u7684\u5c42\u7ea7 ( -F/--fill-miss-rank )\uff0c\u6bd4\u5982 # \u6ca1\u6709genus\u7684\u75c5\u6bd2 $ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y # -F\u53c2\u6570\u4f1a\u7528family\u4fe1\u606f\u6765\u8865\u9f50genus\u4fe1\u606f $ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y \u8f93\u51fa\u683c\u5f0f\u53ef\u9009\u53ea\u8f93\u51fa\u90e8\u5206\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u8fd8\u652f\u6301\u5236\u8868\u7b26\uff08 \"\\t\" \uff09\uff0c\u518d\u914d\u5408\u4f5c\u8005\u7684\u53e6\u4e00\u4e2a\u5de5\u5177csvtk\uff0c\u53ef\u4ee5\u8f93\u51fa\u6f02\u4eae\u7684\u7ed3\u679c\u3002 \u5176\u5b83\u6709\u7528\u7684\u9009\u9879\uff1a -P/--add-prefix \uff1a\u7ed9\u6bcf\u4e2a\u5206\u7c7b\u5b66\u6c34\u5e73\u6dfb\u52a0\u524d\u7f00\uff0c\u6bd4\u5982 s__species \u3002 -t/--show-lineage-taxids \uff1a\u8f93\u51fa\u5206\u7c7b\u5b66\u5355\u5143\u5bf9\u5e94\u7684TaxID\u3002 -r/--miss-rank-repl : \u66ff\u4ee3\u6ca1\u6709\u5bf9\u5e94rank\u7684taxon\u540d\u79f0 -S/--pseudo-strain : \u5bf9\u4e8e\u4f4e\u4e8especies\u4e14rank\u65e2\u4e0d\u662fsubspecies\u4e5f\u4e0d\u662fstrain\u7684taxid\uff0c\u4f7f\u7528\u6c34\u5e73\u6700\u4f4etaxon\u540d\u79f0\u505a\u4e3a\u83cc\u682a\u540d\u79f0\u3002 \u4f8b\uff0c $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n 
taxid,kingdom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kingdom phylum class order family genus species 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y # \u4fbf\u4e8e\u5c0f\u5c4f\u5e55\u67e5\u770b\uff0c\u7528csvtk\u8fdb\u884c\u8f6c\u7f6e $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 349741 1327037 kingdom k__Bacteria k__Viruses phylum p__Verrucomicrobia p__Uroviricota class c__Verrucomicrobiae c__Caudoviricetes order o__Verrucomicrobiales o__Caudovirales family f__Akkermansiaceae f__Siphoviridae genus g__Akkermansia g__unclassified Siphoviridae genus species s__Akkermansia muciniphila s__Croceibacter phage P2559Y # \u5230\u682a\u6c34\u5e73\uff0c\u4ee5sars-cov-2\u4e3a\u4f8b $ echo -ne \"2697049\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" -F -P -S \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species,strain \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 kingdom k__Viruses phylum p__Pisuviricota class c__Pisoniviricetes order o__Nidovirales family f__Coronaviridae genus g__Betacoronavirus species s__Severe acute respiratory syndrome-related coronavirus strain t__Severe acute respiratory syndrome coronavirus 2 name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID\u975e\u5e38\u5bb9\u6613\u7406\u89e3\uff0c\u552f\u4e00\u8981\u6ce8\u610f\u7684\u662f 
\u67d0\u4e9bTaxId\u5bf9\u5e94\u76f8\u540c\u7684\u540d\u79f0 \uff0c\u6bd4\u5982 # -i\u6307\u5b9a\u5217\uff0c-r\u663e\u793a\u7ea7\u522b\uff0c-L\u4e0d\u663e\u793a\u4e16\u7cfb $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus \u83b7\u53d6TaxID\u4e4b\u540e\uff0c\u53ef\u4ee5\u7acb\u5373\u4f20\u7ed9taxonkit\u8fdb\u884c\u540e\u7eed\u64cd\u4f5c\uff0c\u4f46\u8981\u6ce8\u610f\u7528 -i \u6307\u5b9aTaxId\u6240\u5728\u5217\u3002 filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs filter\u53ef\u4ee5\u6309 \u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4 \u8fc7\u6ee4TaxIDs\uff0c\u6ce8\u610f\uff0c\u4e0d\u4ec5\u4ec5\u662f\u7279\u5b9a\u7684Rank\uff0c\u800c\u662f\u4e00\u4e2a \u8303\u56f4 \u3002 \u6bd4\u5982genus\u53ca\u4ee5\u4e0b\u7684\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u7528 -L genus -E genus \uff0c\u7c7b\u4f3c\u4e8e <= genus \u3002 $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA) \u6bd4\u5982\u4eba\u5c5e\u7684\u4f8b\u5b50 $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 
'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample TaxID\u7684\u5206\u9694\u7b26\u53ef\u7528 -s/--separator \u6307\u5b9a\uff0c\u9ed8\u8ba4\u4e3a\" \"\u3002 # \u8ba1\u7b97\u4e24\u4e2a\u7269\u79cd\u7684\u6700\u8fd1\u5171\u540c\u7956\u5148\uff0c\u4ee5\u4e0a\u9762\u5c3c\u5b89\u5fb7\u7279\u4eba\u4e9a\u79cd\uff0863221\uff09\u548cHomo sapiens environmental sample\uff082665953\uff09\u4e3a\u4f8b $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 # \u5176\u5b83\u5206\u9694\u7b26\uff0c\u4e14\u4e0d\u5c0f\u5fc3\u591a\u4e86\u7a7a\u683c $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606 TaxID changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55 NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002 TaxonKit\u53ef\u4ee5\u8ffd\u8e2a\u6240\u6709TaxID\u6bcf\u4e2a\u6708\u7684\u53d8\u5316\uff0c\u8f93\u51fa\u5230csv\u6587\u4ef6\u4e2d\uff0c\u53ef\u4ee5\u901a\u8fc7\u547d\u4ee4\u884c\u5de5\u5177\u8fdb\u884c\u67e5\u8be2\u3002 \u6570\u636e\u548c\u6587\u6863\u5355\u72ec\u6258\u7ba1\u5728 https://github.com/shenwei356/taxid-changelog \u3002 \u9664\u4e86\u7b80\u5355\u7684\u589e\u52a0\u3001\u5220\u9664\u3001\u5408\u5e76\u4e4b\u5916\uff0c\u4f5c\u8005\u5c06TaxID\u6539\u53d8\u505a\u4e86\u7ec6\u5206\u3002\u8f93\u51fa\u683c\u5f0f\u5982\u4e0b # \u5217 \u5907\u6ce8 taxid # taxid version # version / time of archive, e.g, 2019-07-01 change # change, values: # NEW \u65b0\u589e # REUSE_DEL \u524d\u671f\u88ab\u5220\u9664\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # REUSE_MER 
\u524d\u671f\u88ab\u5408\u5e76\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # DELETE \u5220\u9664 # MERGE \u5408\u5e76\u5230\u53e6\u4e00\u4e2aTaxID # ABSORB \u5176\u4ed6TaxID\u5408\u5e76\u5230\u5f53\u524dTaxID # CHANGE_NAME \u540d\u79f0\u6539\u53d8 # CHANGE_RANK \u5206\u7c7b\u5b66\u6c34\u5e73\u6539\u53d8 # CHANGE_LIN_LIN \u8c31\u7cfb\u7684TaxID\u6ca1\u6709\u53d8\u5316\uff0c\u8c31\u7cfb\u6539\u53d8\uff08\u67d0\u4e9bTaxID\u7684\u540d\u79f0\u53d8\u4e86\uff09 # CHANGE_LIN_TAX \u8c31\u7cfb\u7684TaxID\u6539\u53d8 # CHANGE_LIN_LEN \u8c31\u7cfb\u7684\u957f\u5ea6/\u6df1\u5ea6\u53d1\u751f\u53d8\u5316 change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage \u6570\u636e\u6587\u4ef6\u53ef\u4ee5\u5728\u524d\u9762\u7f51\u7ad9\u4e0a\u4e0b\u8f7d\uff0c taxid-changelog.csv.gz \uff0c130M\u5de6\u53f3\uff0c\u89e3\u538b\u540e2.2G\uff0c\u56e0\u4e3a\u662fgzip\u683c\u5f0f\uff0c\u5b8c\u5168\u4e0d\u9700\u8981\u89e3\u538b\u5373\u53ef\u5206\u6790\u3002 \u4e0b\u6587\u4f7f\u7528\u4e86 pigz \u4ee3\u66ff zcat \u548c gzip \u63d0\u9ad8\u89e3\u538b\u901f\u5ea6\u3002 \u4f8b1 superkingdom\u4e5f\u80fd\u6d88\u5931 \uff0c\u6bd4\u5982\u7c7b\u75c5\u6bd2(Viroids)\u57282019\u5e745\u6708\u88ab\u5220\u9664\u4e86\u3002 \u4f5c\u8005\u662f\u5728\u67d0\u4e00\u5929\u65e0\u610f\u4e2d\u53d1\u73b0\u6b64\u4e8b\uff0c\u6240\u4ee5\u51b3\u5b9a\u5228\u6839\u95ee\u5e95\uff0c\u5f00\u53d1\u4e86\u8fd9\u4e2a\u5b50\u547d\u4ee4\u3002 # \u4e0b\u8f7d wget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz # \u5b89\u88c5\u591a\u7ebf\u7a0b\u89e3\u538b\u7f29\u8f6f\u4ef6\u3002\u6216\u8005\u7528gzip\u66ff\u6362\u3002 conda install pigz $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f rank -p superkingdom \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 2 2014-08-01 NEW 
Bacteria superkingdom cellular organisms;Bacteria 131567;2 2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157 2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759 10239 2014-08-01 NEW Viruses superkingdom Viruses 10239 12884 2014-08-01 NEW Viroids superkingdom Viroids 12884 12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884 \u4f8b2 SARS-CoV-2 \u3002\u53ef\u89c1\u65b0\u51a0\u75c5\u6bd2\u57282020\u5e742\u6708\u52a0\u5165\uff0c\u968f\u540e3\u6708\u548c6\u6708\u4efd\u6539\u4e86\u540d\u79f0\uff0c\u8c31\u7cfb\u7b49\u4fe1\u606f\u3002\u67e5\u8be2\u901f\u5ea6\u4e5f\u5f88\u5feb\u3002 # \u672c\u4f8b\u5b50\u53ea\u663e\u793a\u4e86\u90e8\u5206\u5217\u3002 $ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk cut -f version,change,name,rank \\ | csvtk pretty version change name rank 2020-02-01 NEW Wuhan seafood market pneumonia virus species 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank real 0m7.644s user 0m16.749s sys 0m3.985s \u66f4\u591a\u6709\u610f\u601d\u7684\u53d1\u73b0\u8be6\u89c1 taxid-changelog","title":"\u4e2d\u6587\u4ecb\u7ecd"},{"location":"chinese/#taxonkit-ncbi","text":"","title":"TaxonKit: \u5c0f\u5de7\u3001\u9ad8\u6548\u3001\u5b9e\u7528\u7684NCBI\u5206\u7c7b\u5b66\u6570\u636e\u547d\u4ee4\u884c\u5de5\u5177\u96c6"},{"location":"chinese/#ncbi-taxonomy","text":"\u4ece\u4e8b\u751f\u7269\u591a\u6837\u6027\u7684\u7814\u7a76\u8005\u5bf9NCBI Taxonomy\u6570\u636e\u5e93\u4e00\u5b9a\u4e0d\u4f1a\u964c\u751f\uff0c 
\u5b83\u5305\u542b\u4e86NCBI\u6240\u6709\u6838\u9178\u548c\u86cb\u767d\u5e8f\u5217\u6570\u636e\u5e93\u4e2d\u6bcf\u6761\u5e8f\u5217\u5bf9\u5e94\u7684\u7269\u79cd\u540d\u79f0\u4e0e\u5206\u7c7b\u5b66\u4fe1\u606f\u3002 \u5927\u591a\u6570\u751f\u6001\u5b66\u7814\u7a76\u5bf9\u7269\u79cd\u7ec4\u6210\u7684\u63cf\u8ff0\u90fd\u662f\u57fa\u4e8eNCBI Taxonomy\u6570\u636e\u5e93\uff0c \u5f53\u7136\u76ee\u524d\u4e5f\u5f00\u59cb\u4f7f\u7528\u5176\u4ed6\u6570\u636e\u5e93\uff0c\u5982GTDB\u7b49\u3002 NCBI Taxonomy\u6570\u636e\u5e93\u59cb\u4e8e1991\u5e74\uff0c\u4e00\u76f4\u968f\u7740Entrez\u6570\u636e\u5e93\u548c\u5176\u4ed6\u6570\u636e\u5e93\u66f4\u65b0\uff0c 1996\u5e74\u63a8\u51fa\u7f51\u9875\u7248\u3002NCBI Taxonomy\u6570\u636e\u5e93\u5b98\u65b9\u5730\u5740\u4e3a https://www.ncbi.nlm.nih.gov/taxonomy \uff0c \u516c\u5f00\u6570\u636e\u4e0b\u8f7d\u5730\u5740\u4e3a https://ftp.ncbi.nih.gov/pub/taxonomy/ \uff0c \u6570\u636e\u6bcf\u5c0f\u65f6\u66f4\u65b0\uff0c\u6bcf\u4e2a\u6708\u521d\u751f\u6210\u4e00\u4efd\u6570\u636e\u5f52\u6863\u5b58\u4e8e taxdump_archive \u76ee\u5f55\uff0c\u6700\u65e9\u53ef\u8ffd\u6eaf\u52302014\u5e748\u6708\u3002","title":"NCBI Taxonomy \u6570\u636e\u5e93"},{"location":"chinese/#taxonkit","text":"TaxonKit\u662f\u91c7\u7528Go\u8bed\u8a00\u7f16\u5199\u7684\u547d\u4ee4\u884c\u5de5\u5177\uff0c \u63d0\u4f9bLinux, Windows, macOS\u64cd\u4f5c\u7cfb\u7edf\u4e0d\u540c\u67b6\u6784\uff08x86-64/arm64\uff09\u7684\u9759\u6001\u7f16\u8bd1\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u53d1\u5e03\u7684\u538b\u7f29\u5305\u4e0d\u8db33Mb\uff0c\u9664\u4e86Github\u6258\u7ba1\u5916\uff0c\u8fd8\u63d0\u4f9b\u56fd\u5185\u955c\u50cf\u4f9b\u4e0b\u8f7d\uff0c\u540c\u65f6\u8fd8\u652f\u6301conda\u548chomebrew\u5b89\u88c5\u3002 \u7528\u6237\u53ea\u9700\u8981 \u4e0b\u8f7d\u3001\u89e3\u538b\uff0c\u5f00\u7bb1\u5373\u7528\uff0c\u65e0\u9700\u914d\u7f6e \uff0c\u4ec5\u9700\u4e0b\u8f7d\u89e3\u538bNCBI Taxonomy\u6570\u636e\u6587\u4ef6\u89e3\u538b\u5230\u6307\u5b9a\u76ee\u5f55\u5373\u53ef\u3002 
\u6e90\u4ee3\u7801 https://github.com/shenwei356/taxonkit \uff0c \u6587\u6863 http://bioinf.shenwei.me/taxonkit \uff08\u4ecb\u7ecd\u3001\u4f7f\u7528\u8bf4\u660e\u3001\u4f8b\u5b50\u3001\u6559\u7a0b\uff09 \u9009\u62e9\u7cfb\u7edf\u5bf9\u5e94\u7684\u7248\u672c\u4e0b\u8f7d\u6700\u65b0\u7248 https://github.com/shenwei356/taxonkit/releases \uff0c\u89e3\u538b\u540e\u6dfb\u52a0\u73af\u5883\u53d8\u91cf\u5373\u53ef\u4f7f\u7528\u3002\u6216\u53ef\u9009conda\u5b89\u88c5 conda install taxonkit -c bioconda -y # \u8868\u683c\u6570\u636e\u5904\u7406\uff0c\u63a8\u8350\u4f7f\u7528 csvtk \u66f4\u9ad8\u6548 conda install csvtk -c bioconda -y \u6d4b\u8bd5\u6570\u636e\u4e0b\u8f7d\u53ef\u76f4\u63a5 https://github.com/shenwei356/taxonkit \u4e0b\u8f7d\u9879\u76ee\u538b\u7f29\u5305\uff0c\u6216\u4f7f\u7528git clone\u4e0b\u8f7d\u9879\u76ee\u6587\u4ef6\u5939\uff0c\u5176\u4e2d\u7684example\u4e3a\u6d4b\u8bd5\u6570\u636e git clone https://github.com/shenwei356/taxonkit TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c \u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\uff0c \u8f7b\u677e\u6574\u5408\u8fdb\u5206\u6790\u6d41\u7a0b\u4e2d\u3002 \u5b50\u547d\u4ee4 \u529f\u80fd list \u5217\u51fa\u6307\u5b9aTaxId\u4e0b\u6240\u6709\u5b50\u5355\u5143\u7684\u7684TaxID lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\uff08lineage\uff09 reformat \u5c06\u5b8c\u6574\u8c31\u7cfb\u8f6c\u5316\u4e3a\u201c\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u682a\"\u7684\u81ea\u5b9a\u4e49\u683c\u5f0f name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA) taxid-changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55 version 
\u663e\u793a\u7248\u672c\u4fe1\u606f\u3001\u68c0\u6d4b\u65b0\u7248\u672c genautocomplete \u751f\u6210shell\u81ea\u52a8\u8865\u5168\u914d\u7f6e\u811a\u672c \u5907\u6ce8\uff1a \u8f93\u51fa\uff1a \u6240\u6709\u547d\u4ee4\u8f93\u51fa\u4e2d\u5305\u542b\u8f93\u5165\u6570\u636e\u5185\u5bb9\uff0c\u5728\u6b64\u57fa\u7840\u4e0a\u589e\u52a0\u5217\u3002 \u6240\u6709\u547d\u4ee4\u9ed8\u8ba4\u8f93\u51fa\u5230\u6807\u51c6\u8f93\u51fa\uff08stdout\uff09\uff0c\u53ef\u901a\u8fc7\u91cd\u5b9a\u5411\uff08 > \uff09\u5199\u5165\u6587\u4ef6\u3002 \u6216\u901a\u8fc7\u5168\u5c40\u53c2\u6570 -o \u6216 --out-file \u6307\u5b9a\u8f93\u51fa\u6587\u4ef6\uff0c\u4e14\u53ef\u81ea\u52a8\u8bc6\u522b\u8f93\u51fa\u6587\u4ef6\u540e\u7f00\uff08 .gz \uff09\u8f93\u51fagzip\u683c\u5f0f\u3002 \u8f93\u5165\uff1a \u9664\u4e86 list \u4e0e taxid-changelog \u4e4b\u5916\uff0c lineage , reformat , name2taxid , filter \u4e0e lca \u5747\u53ef\u4ece\u6807\u51c6\u8f93\u5165\uff08stdin\uff09\u8bfb\u53d6\u8f93\u5165\u6570\u636e\uff0c\u4e5f\u53ef\u901a\u8fc7\u4f4d\u7f6e\u53c2\u6570\uff08positional arguments\uff09\u8f93\u5165\uff0c\u5373\u547d\u4ee4\u540e\u9762\u4e0d\u5e26 \u4efb\u4f55flag\u7684\u53c2\u6570\uff0c\u5982 taxonkit lineage taxids.txt \u8f93\u5165\u683c\u5f0f\u4e3a\u5355\u5217\uff0c\u6216\u8005\u5236\u8868\u7b26\u5206\u9694\u7684\u683c\u5f0f\uff0c\u8f93\u5165\u6570\u636e\u6240\u5728\u5217\u7528 -i \u6216 --taxid-field \u6307\u5b9a\u3002 TaxonKit\u76f4\u63a5\u89e3\u6790NCBI Taxonomy\u6570\u636e\u6587\u4ef6\uff082\u79d2\u5de6\u53f3\uff09\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\uff0c\u5360\u7528\u5185\u5b58\u5728500Mb-1.5G\u5de6\u53f3\u3002 \u6570\u636e\u4e0b\u8f7d\uff1a # \u6709\u65f6\u4e0b\u8f7d\u5931\u8d25\uff0c\u53ef\u591a\u8bd5\u51e0\u6b21\uff1b\u6216\u5c1d\u8bd5\u6d4f\u89c8\u5668\u4e0b\u8f7d\u6b64\u94fe\u63a5 wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz # 
\u89e3\u538b\u6587\u4ef6\u5b58\u4e8e\u5bb6\u76ee\u5f55\u4e2d.taxonkit/\uff0c\u7a0b\u5e8f\u9ed8\u8ba4\u6570\u636e\u5e93\u9ed8\u8ba4\u76ee\u5f55 mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit","title":"TaxonKit \u4f7f\u7528"},{"location":"chinese/#list-taxidtaxid","text":"taxonkit list \u7528\u4e8e\u5217\u51fa\u6307\u5b9aTaxID\u6240\u5728\u5206\u7c7b\u5b66\u5355\u5143\uff08taxon\uff09\u7684\u5b50\u6811\uff08subtree\uff09\u7684\u6240\u6709taxon\u7684TaxID\uff0c\u53ef\u9009\u663e\u793a\u540d\u79f0\u548c\u5206\u7c7b\u5b66\u6c34\u5e73\u3002 \u6b64\u529f\u80fd\u4e0eNCBI Taxonomy\u7f51\u9875\u7248\u7c7b\u4f3c\u3002 \u5982\uff0c # \u4ee5\u4eba\u5c5e(9605)\u548c\u80a0\u9053\u4e2d\u8457\u540d\u7684Akk\u83cc\u5c5e(239934)\u4e3a\u4f8b $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 ... 
list\u4f7f\u7528\u6700\u5e7f\u6cdb\u7684\u7684\u529f\u80fd\u662f\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\uff08\u6bd4\u5982\u7ec6\u83cc\u3001\u75c5\u6bd2\u3001\u67d0\u4e2a\u5c5e\u7b49\uff09\u4e0b\u6240\u6709\u7684TaxID\uff0c \u7528\u6765\u4eceNCBI nt/nr\u4e2d\u83b7\u53d6\u5bf9\u5e94\u7684\u6838\u9178/\u86cb\u767d\u5e8f\u5217\uff0c\u4ece\u800c\u642d\u5efa\u7279\u5f02\u6027\u7684BLAST\u6570\u636e\u5e93\u3002 \u5b98\u7f51\u63d0\u4f9b\u4e86\u76f8\u5e94\u7684\u8be6\u7ec6\u6b65\u9aa4\uff1a http://bioinf.shenwei.me/taxonkit/tutorial \u3002 # \u6240\u6709\u7ec6\u83cc\u7684TaxID $ taxonkit list --show-rank --show-name --ids 2 > /dev/null","title":"list \u5217\u51fa\u6307\u5b9aTaxId\u6240\u5728\u5b50\u6811\u7684\u6240\u6709TaxID"},{"location":"chinese/#lineage-taxid","text":"\u5206\u7c7b\u5b66\u6570\u636e\u76f8\u5173\u6700\u5e38\u89c1\u7684\u529f\u80fd\u5c31\u662f\u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u3002 TaxonKit\u53ef\u6839\u636e\u8f93\u5165\u6587\u4ef6\u63d0\u4f9b\u7684TaxID\u5217\u8868\u5feb\u901f\u8ba1\u7b97lineage\uff0c\u5e76\u53ef\u9009\u63d0\u4f9b\u540d\u79f0\uff0c\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u4ee5\u53ca\u8c31\u7cfb\u5bf9\u5e94\u7684TaxID\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u968f\u7740Taxonomy\u6570\u636e\u7684\u9891\u7e41\u66f4\u65b0\uff0c\u6709\u7684TaxID\u53ef\u80fd\u88ab\u5220\u9664\u3001\u6216\u5408\u5e76\uff08merge\uff09\u5230\u5176\u5b83TaxID\u4e2d\uff0c TaxonKit\u4f1a\u81ea\u52a8\u8bc6\u522b\uff0c\u5e76\u8fdb\u884c\u63d0\u793a\uff0c\u5bf9\u4e8e\u88ab\u5408\u5e76\u7684TaxID\uff0cTaxonKit\u4f1a\u6309\u65b0TaxID\u8fdb\u884c\u8ba1\u7b97\u3002 # \u4f7f\u7528example\u4e2d\u7684\u6d4b\u8bd5\u6570\u636e $ head taxids.txt 9606 9913 376619 # \u67e5\u627e\u6307\u5b9ataxids\u5217\u8868\u7684\u7269\u79cd\u4fe1\u606f\uff0ctee\u53ef\u8f93\u51fa\u5c4f\u5e55\u5e76\u5199\u5165\u6587\u4ef6 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 
1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular 
organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei \u4e0e\u5176\u5b83\u8f6f\u4ef6\u7684\u6027\u80fd\u76f8\u6bd4\uff0c\u5f53\u67e5\u8be2\u6570\u91cf\u8f83\u5c11\u65f6ETE\u8f83\u5feb\uff0c\u6570\u91cf\u8f83\u591a\u65f6\u5219TaxonKit\u66f4\u5feb\u3002 \u5728\u4e0d\u540c\u6570\u636e\u91cf\u89c4\u6a21\u4e0a TaxonKit\u901f\u5ea6\u4e00\u76f4\u5f88\u7a33\u5b9a\uff0c\u5747\u4e3a2-3\u79d2\uff0c\u65f6\u95f4\u4e3b\u8981\u82b1\u5728\u89e3\u6790Taxonomy\u6570\u636e\u6587\u4ef6\u4e0a\u3002 \u5217\u51falineage\u6bcf\u4e2a\u5206\u7c7b\u5b66\u5355\u5143\u7684TaxId\u548crank\u548c\u540d\u79f0\uff0c\u6bd4\u5982SARS-COV-2\u3002 # lineage\u63d0\u53d6SARS-COV-2\u7684\u4e16\u7cfb $ echo \"2697049\" \\ | taxonkit lineage -t -R \\ | sed \"s/\\t/\\n/g\" 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank","title":"lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb"},{"location":"chinese/#reformat","text":"\u6709\u65f6\u5019\uff0c\u6211\u4eec\u5e76\u4e0d\u9700\u8981\u5b8c\u6574\u7684\u5206\u7c7b\u5b66\u8c31\u7cfb\uff08complete lineage\uff09\uff0c\u56e0\u4e3a\u5f88\u591a\u7ea7\u522b\u65e2\u4e0d\u5e38\u7528\uff0c\u800c\u4e14\u4e0d\u5b8c\u6574\u3002\u901a\u5e38\u53ea\u60f3\u4fdd\u7559\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c
\u4e0d\u662f\u6240\u6709\u7269\u79cd\u90fd\u6709\u5b8c\u6574\u7684\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u6c34\u5e73\uff0c\u7279\u522b\u662f\u75c5\u6bd2\u4ee5\u53ca\u4e00\u4e9b\u73af\u5883\u6837\u54c1 \u3002 TaxonKit\u53ef\u4ee5\u7528\u81ea\u5b9a\u4e49\u5185\u5bb9\u66ff\u4ee3\u7f3a\u5931\u7684\u5206\u7c7b\u5355\u5143\uff0c\u5982\u7528\u201c__\u201d\u66ff\u4ee3\u3002 \u66f4 \u5389\u5bb3 \u6709\u7528\u7684\u662f\uff0c TaxonKit\u8fd8\u53ef\u4ee5\u7528\u66f4\u9ad8\u5c42\u7ea7\u7684\u5206\u7c7b\u5355\u5143\u4fe1\u606f\u6765\u8865\u9f50\u7f3a\u5931\u7684\u5c42\u7ea7 ( -F/--fill-miss-rank )\uff0c\u6bd4\u5982 # \u6ca1\u6709genus\u7684\u75c5\u6bd2 $ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y # -F\u53c2\u6570\u4f1a\u7528family\u4fe1\u606f\u6765\u8865\u9f50genus\u4fe1\u606f $ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y \u8f93\u51fa\u683c\u5f0f\u53ef\u9009\u53ea\u8f93\u51fa\u90e8\u5206\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u8fd8\u652f\u6301\u5236\u8868\u7b26\uff08 \"\\t\" \uff09\uff0c\u518d\u914d\u5408\u4f5c\u8005\u7684\u53e6\u4e00\u4e2a\u5de5\u5177csvtk\uff0c\u53ef\u4ee5\u8f93\u51fa\u6f02\u4eae\u7684\u7ed3\u679c\u3002 \u5176\u5b83\u6709\u7528\u7684\u9009\u9879\uff1a -P/--add-prefix \uff1a\u7ed9\u6bcf\u4e2a\u5206\u7c7b\u5b66\u6c34\u5e73\u6dfb\u52a0\u524d\u7f00\uff0c\u6bd4\u5982 s__species \u3002 -t/--show-lineage-taxids \uff1a\u8f93\u51fa\u5206\u7c7b\u5b66\u5355\u5143\u5bf9\u5e94\u7684TaxID\u3002 -r/--miss-rank-repl : \u66ff\u4ee3\u6ca1\u6709\u5bf9\u5e94rank\u7684taxon\u540d\u79f0 -S/--pseudo-strain : \u5bf9\u4e8e\u4f4e\u4e8especies\u4e14rank\u65e2\u4e0d\u662fsubspecies\u4e5f\u4e0d\u662fstrain\u7684taxid\uff0c\u4f7f\u7528\u6c34\u5e73\u6700\u4f4etaxon\u540d\u79f0\u505a\u4e3a\u83cc\u682a\u540d\u79f0\u3002 
\u4f8b\uff0c $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y # \u4fbf\u4e8e\u5c0f\u5c4f\u5e55\u67e5\u770b\uff0c\u7528csvtk\u8fdb\u884c\u8f6c\u7f6e $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 349741 1327037 kindom k__Bacteria k__Viruses phylum p__Verrucomicrobia p__Uroviricota class c__Verrucomicrobiae c__Caudoviricetes order o__Verrucomicrobiales o__Caudovirales family f__Akkermansiaceae f__Siphoviridae genus g__Akkermansia g__unclassified Siphoviridae genus species s__Akkermansia muciniphila s__Croceibacter phage P2559Y # \u5230\u682a\u6c34\u5e73\uff0c\u4ee5sars-cov-2\u4e3a\u4f8b $ echo -ne \"2697049\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" -F -P -S \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species,strain \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 kindom k__Viruses phylum p__Pisuviricota class c__Pisoniviricetes order o__Nidovirales family f__Coronaviridae genus g__Betacoronavirus species s__Severe acute respiratory syndrome-related coronavirus strain t__Severe acute respiratory syndrome coronavirus 2","title":"reformat 
\u751f\u6210\u6807\u51c6\u5c42\u7ea7\u7269\u79cd\u6ce8\u91ca"},{"location":"chinese/#name2taxid-taxid","text":"\u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID\u975e\u5e38\u5bb9\u6613\u7406\u89e3\uff0c\u552f\u4e00\u8981\u6ce8\u610f\u7684\u662f \u67d0\u4e9bTaxId\u5bf9\u5e94\u76f8\u540c\u7684\u540d\u79f0 \uff0c\u6bd4\u5982 # -i\u6307\u5b9a\u5217\uff0c-r\u663e\u793a\u7ea7\u522b\uff0c-L\u4e0d\u663e\u793a\u4e16\u7cfb $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus \u83b7\u53d6TaxID\u4e4b\u540e\uff0c\u53ef\u4ee5\u7acb\u5373\u4f20\u7ed9taxonkit\u8fdb\u884c\u540e\u7eed\u64cd\u4f5c\uff0c\u4f46\u8981\u6ce8\u610f\u7528 -i \u6307\u5b9aTaxId\u6240\u5728\u5217\u3002","title":"name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID"},{"location":"chinese/#filter-taxids","text":"filter\u53ef\u4ee5\u6309 \u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4 \u8fc7\u6ee4TaxIDs\uff0c\u6ce8\u610f\uff0c\u4e0d\u4ec5\u4ec5\u662f\u7279\u5b9a\u7684Rank\uff0c\u800c\u662f\u4e00\u4e2a \u8303\u56f4 \u3002 \u6bd4\u5982genus\u53ca\u4ee5\u4e0b\u7684\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u7528 -L genus -E genus \uff0c\u7c7b\u4f3c\u4e8e <= genus \u3002 $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835","title":"filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs"},{"location":"chinese/#lca-lca","text":"\u6bd4\u5982\u4eba\u5c5e\u7684\u4f8b\u5b50 $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 
'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample TaxID\u7684\u5206\u9694\u7b26\u53ef\u7528 -s/--separater \u6307\u5b9a\uff0c\u9ed8\u8ba4\u4e3a\" \"\u3002 # \u8ba1\u7b97\u4e24\u4e2a\u7269\u79cd\u7684\u6700\u8fd1\u5171\u540c\u7956\u5148\uff0c\u4ee5\u4e0a\u9762\u5c3c\u5b89\u5fb7\u7279\u4eba\u4e9a\u79cd\u548c\u6d77\u5fb7\u5821\u4eba\u79cd $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 # \u5176\u5b83\u5206\u9694\u7b26\uff0c\u4e14\u4e0d\u5c0f\u5fc3\u591a\u4e86\u7a7a\u683c $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606","title":"lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA)"},{"location":"chinese/#taxid-changelog-taxid","text":"NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002 TaxonKit\u53ef\u4ee5\u8ffd\u8e2a\u6240\u6709TaxID\u6bcf\u4e2a\u6708\u7684\u53d8\u5316\uff0c\u8f93\u51fa\u5230csv\u6587\u4ef6\u4e2d\uff0c\u53ef\u4ee5\u901a\u8fc7\u547d\u4ee4\u884c\u5de5\u5177\u8fdb\u884c\u67e5\u8be2\u3002 \u6570\u636e\u548c\u6587\u6863\u5355\u72ec\u6258\u7ba1\u5728 https://github.com/shenwei356/taxid-changelog \u3002 \u9664\u4e86\u7b80\u5355\u7684\u589e\u52a0\u3001\u5220\u9664\u3001\u5408\u5e76\u4e4b\u5916\uff0c\u4f5c\u8005\u5c06TaxID\u6539\u53d8\u505a\u4e86\u7ec6\u5206\u3002\u8f93\u51fa\u683c\u5f0f\u5982\u4e0b # \u5217 \u5907\u6ce8 taxid # taxid version # version / time of archive, e.g, 2019-07-01 change # change, values: # NEW \u65b0\u589e # REUSE_DEL 
\u524d\u671f\u88ab\u5220\u9664\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # REUSE_MER \u524d\u671f\u88ab\u5408\u5e76\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # DELETE \u5220\u9664 # MERGE \u5408\u5e76\u5230\u53e6\u4e00\u4e2aTaxID # ABSORB \u5176\u4ed6TaxID\u5408\u5e76\u5230\u5f53\u524dTaxID # CHANGE_NAME \u540d\u79f0\u6539\u53d8 # CHANGE_RANK \u5206\u7c7b\u5b66\u6c34\u5e73\u6539\u53d8 # CHANGE_LIN_LIN \u8c31\u7cfb\u7684TaxID\u6ca1\u6709\u53d8\u5316\uff0c\u8c31\u7cfb\u6539\u53d8\uff08\u67d0\u4e9bTaxID\u7684\u540d\u79f0\u53d8\u4e86\uff09 # CHANGE_LIN_TAX \u8c31\u7cfb\u7684TaxID\u6539\u53d8 # CHANGE_LIN_LEN \u8c31\u7cfb\u7684\u957f\u5ea6/\u6df1\u5ea6\u53d1\u751f\u53d8\u5316 change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage \u6570\u636e\u6587\u4ef6\u53ef\u4ee5\u5728\u524d\u9762\u7f51\u7ad9\u4e0a\u4e0b\u8f7d\uff0c taxid-changelog.csv.gz \uff0c130M\u5de6\u53f3\uff0c\u89e3\u538b\u540e2.2G\uff0c\u56e0\u4e3a\u662fgzip\u683c\u5f0f\uff0c\u5b8c\u5168\u4e0d\u9700\u8981\u89e3\u538b\u5373\u53ef\u5206\u6790\u3002 \u4e0b\u6587\u4f7f\u7528\u4e86 pigz \u4ee3\u66ff zcat \u548c gzip \u63d0\u9ad8\u89e3\u538b\u901f\u5ea6\u3002 \u4f8b1 superkingdom\u4e5f\u80fd\u6d88\u5931 \uff0c\u6bd4\u5982\u7c7b\u75c5\u6bd2(Viroids)\u57282019\u5e745\u6708\u88ab\u5220\u9664\u4e86\u3002 \u4f5c\u8005\u662f\u5728\u67d0\u4e00\u5929\u65e0\u610f\u4e2d\u53d1\u73b0\u6b64\u4e8b\uff0c\u6240\u4ee5\u51b3\u5b9a\u5228\u6839\u95ee\u5e95\uff0c\u5f00\u53d1\u4e86\u8fd9\u4e2a\u5b50\u547d\u4ee4\u3002 # \u4e0b\u8f7d wget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz # \u5b89\u88c5\u591a\u7ebf\u7a0b\u89e3\u538b\u7d22\u8f6f\u4ef6\u3002\u6216\u8005\u7528gzip\u66ff\u6362\u3002 conda install pigz $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f rank -p superkingdom \\ | csvtk 
pretty taxid version change change-value name rank lineage lineage-taxids 2 2014-08-01 NEW Bacteria superkingdom cellular organisms;Bacteria 131567;2 2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157 2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759 10239 2014-08-01 NEW Viruses superkingdom Viruses 10239 12884 2014-08-01 NEW Viroids superkingdom Viroids 12884 12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884 \u4f8b2 SARS-CoV-2 \u3002\u53ef\u89c1\u65b0\u51a0\u75c5\u6bd2\u57282020\u5e742\u6708\u52a0\u5165\uff0c\u968f\u540e3\u6708\u548c6\u6708\u4efd\u6539\u4e86\u540d\u79f0\uff0c\u8c31\u7cfb\u7b49\u4fe1\u606f\u3002\u67e5\u8be2\u901f\u5ea6\u4e5f\u5f88\u5feb\u3002 # \u672c\u4f8b\u5b50\u53ea\u663e\u793a\u4e86\u90e8\u5206\u5217\u3002 $ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk cut -f version,change,name,rank \\ | csvtk pretty version change name rank 2020-02-01 NEW Wuhan seafood market pneumonia virus species 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank real 0m7.644s user 0m16.749s sys 0m3.985s \u66f4\u591a\u6709\u610f\u601d\u7684\u53d1\u73b0\u8be6\u89c1 taxid-changelog","title":"TaxID changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55"},{"location":"download/","text":"Download TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page. 
Current Version TaxonKit v0.15.0 taxonkit reformat : For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. #82 - The flag -T/--trim also does not add the prefix for missing ranks lower than the current rank. #82 New flag -s/--miss-rank-repl-suffix to set the suffix for estimated taxon names. #85 Please cite Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006 Links Tips run taxonkit version to check update !!! run taxonkit genautocomplete to update Bash completion !!! OS Arch File, \u4e2d\u56fd\u955c\u50cf Download Count Linux 64-bit taxonkit_linux_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf Linux arm64 taxonkit_linux_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS 64-bit taxonkit_darwin_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS arm64 taxonkit_darwin_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf Windows 64-bit taxonkit_windows_amd64.exe.tar.gz , \u4e2d\u56fd\u955c\u50cf Installation Download Page TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page. Method 1: Download binaries (latest stable/dev version) Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 . 
Method 2: Install via conda (latest stable version) conda install -c bioconda taxonkit Method 3: Install via homebrew (may not be the latest version) brew install brewsci/bio/taxonkit Method 4: Compile from source (latest stable/dev version) Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/ Bash-completion Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found.
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Dataset Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones. Release history TaxonKit v0.14.2 taxonkit filter : fix checking merged/deleted/not-found taxids. #80 taxonkit lca : add a new flag -b/--buffer-size to set the size of the line buffer. #75 fix typos: --separater -> --separator , the former is still available for backward compatibility. taxonkit reformat : output compatible format for TaxIds not found in the database. #79 taxonkit taxid-changelog : support gzip-compressed taxdump files for saving space. #78 TaxonKit v0.14.1 taxonkit reformat : The flag -S/--pseudo-strain does not require -F/--fill-miss-rank now. For taxa of rank >= species, {t} , {S} , and T outputs nothing when using -S/--pseudo-strain . TaxonKit v0.14.0 taxonkit create-taxdump : save taxIds in int32 instead of uint32 , as BLAST and DIAMOND do. #70 taxonkit list : do not skip visited subtrees when some of the given taxids are descendants of others.
#68 taxonkit : when environment variable TAXONKIT_DB is set, explicitly setting --data-dir will override the value of TAXONKIT_DB . TaxonKit v0.13.0 taxonkit reformat : add a new placeholder {K} for rank kingdom . #64 do not panic for invalid TaxIds, e.g., the column name, when using -I/--taxid-field . taxonkit create-taxdump : fix merged.dmp and delnodes.dmp. Thanks to @apcamargo ! gtdb-taxdump/issues/2 . fix bug of handling non-GTDB data when using -A/--field-accession and no rank names given: the colname of the accession column would be treated as one of the ranks, which messed up all the ranks. fix the default option value of --field-accession-re which wrongly remove prefix like Sp_ . #65 taxonkit list : fix warning message of merged taxids. TaxonKit v0.12.0 taxonkit create-taxdump : accepts arbitrary ranks #60 better handle of taxa with same names. many flags changed. TaxonKit v0.11.1 taxonkit create-taxdump : fix bug of missing Class rank, contributed by @apcamargo. The flag --gtdb was not effected. #57 TaxonKit v0.11.0 new command taxonkit create-taxdump : Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV. #56 TaxonKit v0.10.1 taxonkit cami2-filter : fix option --show-rank which did not work in v0.10.0. TaxonKit v0.10.0 new command taxonkit cami2-filter : Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile taxonkit reformat : fix panic for deleted taxid using -F/--fill-miss-rank . #55 TaxonKit v0.9.0 new command taxonkit profile2cami : converting metagenomic profile table to CAMI format TaxonKit v0.8.0 taxonkit reformat : accept input of TaxIds via flag -I/--taxid-field . accept single taxonomy names . show warning message for TaxIds with the same lineage . #42 better flag checking. #40 taxonkit lca : slightly speedup. taxonkit genautocomplete : support bash|zsh|fish/powershell TaxonKit v0.7.2 taxonkit lineage : new flag -R/--show-lineage-ranks for appending ranks of all levels.
reduce memory occupation and slightly speedup. taxonkit filter : flag -E/--equal-to supports multiple values. new flag -n/--save-predictable-norank : do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff. taxonkit reformat : new placeholder {t} for subspecies/strain , {T} for strain . Thanks @wqssf102 for feedback. new flag -S/--pseudo-strain for using the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subspecies\" nor \"strain\". TaxonKit v0.7.1 taxonkit filter : disable unnecessary stdin check when using flag --list-order or --list-ranks . #36 better handling of black list, empty default value: \"no rank\" and \"clade\". And you need use -N/--discard-noranks to explicitly filter out \"no rank\", \"clade\". #37 update help message. Thanks @standage for improving this command! #38 TaxonKit v0.7.0 taxonkit : 2-3X faster taxonomy data loading . new command taxonkit filter : filtering TaxIds by taxonomic rank range . #32 new command taxonkit lca : Computing lowest common ancestor (LCA) for TaxIds. taxonkit reformat : new flag -P/--add-prefix : add prefixes for all ranks , single prefix for a rank is defined by flag --prefix-X , where X may be k , p , c , o , f , s , S . new flag -T/--trim : do not fill missing rank lower than current rank. taxonkit list : do not duplicate root node. TaxonKit v0.6.2 taxonkit reformat -F : fix taxids of abbreviated lineage containing names shared by different taxids. #35 TaxonKit v0.6.1 taxonkit lineage : new flag -n/--show-name for appending scientific name. new flag -L/--no-lineage for hide lineage, this is for fast retrieving names or/and ranks. taxonkit reformat : fix flag -F/--fill-miss-rank . discard order restriction of rank symbols. TaxonKit v0.6.0 taxonkit list : check merged and deleted taxids. fix bug of json output.
#30 taxonkit name2taxid : new flag -s/--sci-name for limiting to searching scientific names. #29 taxonkit version : make checking update optional TaxonKit v0.5.0 taxonkit : requiring delnodes.dmp and merged.dmp. taxonkit lineage : detect deleted and merged taxids now. #19 taxonkit list/name2taxid : add short flag -r for --show-rank , -n for --show-name . TaxonKit v0.4.3 taxonkit taxid-changelog : rewrite logic, fix bug and add more change types TaxonKit v0.4.2 taxonkit taxid-changelog : change output of ABSORB , do not merged into one record for changes in different versions. TaxonKit v0.4.1 taxonkit taxid-changelog : add fields: name and rank . and fix sorting bug. detailed lineage change status TaxonKit v0.4.0 new command: taxonkit taxid-changelog : for creating taxid changelog from dump archive TaxonKit v0.3.0 this version is almost the same as v0.2.5 TaxonKit v0.2.5 add global flag: --line-buffered to disable output buffer. #11 replace global flags --names-file and --nodes-file with --data-dir , also support environment variable TAXONKIT_DB . #17 taxonkit reformat : detects lineages containing unofficial taxon name and won't show panic message. taxonkit name2taxid : supports synonyms names. #9 taxokit lineage : add flag -r/--show-rank to print rank at another new column. TaxonKit v0.2.4 taxonkit reformat : more accurate result when using flag -F/--fill-miss-rank to estimate and fill missing rank with original lineage information supporting escape strings like \\t , \\n , #5 outputting corresponding taxids for reformated lineage. #8 taxonkit lineage : fix bug for taxid 1 #7 add flag -d/--delimiter . TaxonKit v0.2.3 fix bug brought in v0.2.1 TaxonKit v0.2.2 make verbose information optional #4 TaxonKit v0.2.1 taxonkit list : fix bug of no output for leaf nodes of the taxonomic tree. #4 add new command genautocomplete to generate shell autocompletion script! TaxonKit v0.2.0 add command name2taxid to query taxid by taxon scientific name. 
lineage , reformat : changed flags and default operations , check the usage . TaxonKit v0.1.8 taxonkit lineage , add an extra column of lineage in Taxid. #3 . e.g., fix colorful output in windows. TaxonKit v0.1.7 taxonkit reformat : supports reading stdin from output of taxonkit lineage , reformated lineages are appended to input data. TaxonKit v0.1.6 remove flag -f/--formated-rank from taxonkit lineage , using taxonkit reformat can archieve same result. TaxonKit v0.1.5 reorganize code and flags TaxonKit v0.1.4 add flag --fill for taxonkit reformat , which estimates and fills missing rank with original lineage information TaxonKit v0.1.3 add command of taxonkit reformat which reformats full lineage to custom format TaxonKit v0.1.2 add command of taxonkit lineage , users can query lineage of given taxon IDs from file TaxonKit v0.1.1 add feature of taxonkit list , users can choose output in readable JSON format by flag --json so the taxonomy tree could be collapse and uncollapse in modern text editor. TaxonKit v0.1 first release /** * RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS. 
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/ /* var disqus_config = function () { this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable }; */ (function() { // DON'T EDIT BELOW THIS LINE var d = document, s = d.createElement('script'); s.src = '//taxonkit.disqus.com/embed.js'; s.setAttribute('data-timestamp', +new Date()); (d.head || d.body).appendChild(s); })(); Please enable JavaScript to view the comments powered by Disqus.","title":"Download"},{"location":"download/#download","text":"TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.","title":"Download"},{"location":"download/#current-version","text":"TaxonKit v0.15.0 taxonkit reformat : For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. #82 - The flag -T/--trim also does not add the prefix for missing ranks lower than the current rank. #82 New flag -s/--miss-rank-repl-suffix to set the suffix for estimated taxon names. #85","title":"Current Version"},{"location":"download/#please-cite","text":"Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006","title":"Please cite"},{"location":"download/#links","text":"Tips run taxonkit version to check update !!! run taxonkit genautocomplete to update Bash completion !!! 
OS Arch File, \u4e2d\u56fd\u955c\u50cf Download Count Linux 64-bit taxonkit_linux_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf Linux arm64 taxonkit_linux_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS 64-bit taxonkit_darwin_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS arm64 taxonkit_darwin_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf Windows 64-bit taxonkit_windows_amd64.exe.tar.gz , \u4e2d\u56fd\u955c\u50cf","title":"Links"},{"location":"download/#installation","text":"Download Page TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.","title":"Installation"},{"location":"download/#method-1-download-binaries-latest-stabledev-version","text":"Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 .","title":"Method 1: Download binaries (latest stable/dev version)"},{"location":"download/#method-2-install-via-conda-latest-stable-version","text":"conda install -c bioconda taxonkit","title":"Method 2: Install via conda (latest stable version)"},{"location":"download/#method-3-install-via-homebrew-may-not-the-lastest-version","text":"brew install brewsci/bio/taxonkit","title":"Method 3: Install via homebrew (may not the lastest version)"},{"location":"download/#method-4-compile-from-source-latest-stabledev-version","text":"Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u 
github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/","title":"Method 4: Compile from source (latest stable/dev version)"},{"location":"download/#bash-completion","text":"Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish","title":"Bash-completion"},{"location":"download/#dataset","text":"Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . 
All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones.","title":"Dataset"},{"location":"download/#release-history","text":"TaxonKit v0.14.2 taxonkit filter : fix checking merged/deleted/not-found taxids. #80 taxonkit lca : add a new flag -b/--buffer-size to set the size of the line buffer. #75 fix typos: --separater -> --separater , the former is still available for backward compatibility. taxonkit reformat : output compatible format for TaxIds not found in the database. #79 taxonkit taxid-changelog : support gzip-compressed taxdump files for saving space. #78 TaxonKit v0.14.1 taxonkit reformat : The flag -S/--pseudo-strain does not require -F/--fill-miss-rank now. For taxa of rank >= species, {t} , {S} , and T outputs nothing when using -S/--pseudo-strain . TaxonKit v0.14.0 taxonkit create-taxdump : save taxIds in int32 instead of uint32 , as BLAST and DIAMOND do. #70 taxonkit list : do not skip visited subtrees when some of give taxids are descendants of others. #68 taxonkit : when environment variable TAXONKIT_DB is set, explicitly setting --data-dir will override the value of TAXONKIT_DB . TaxonKit v0.13.0 taxonkit reformat : add a new placeholder {K} for rank kingdom . #64 do not panic for invalid TaxIds, e.g., the column name, when using -I--taxid-field . taxonkit create-taxdump : fix merged.dmp and delnodes.dmp. Thanks to @apcamargo ! gtdb-taxdump/issues/2 . fix bug of handling non-GTDB data when using -A/--field-accession and no rank names given: the colname of the accession column would be treated as one of the ranks, which messed up all the ranks. fix the default option value of --field-accession-re which wrongly remove prefix like Sp_ . #65 taxonkit list : fix warning message of merged taxids. 
TaxonKit v0.12.0 taxonkit create-taxdump : accepts arbitrary ranks #60 better handle of taxa with same names. many flags changed. TaxonKit v0.11.1 taxonkit create-taxdump : fix bug of missing Class rank, contributed by @apcamargo. The flag --gtdb was not effected. #57 TaxonKit v0.11.0 new command taxonkit create-taxdump : Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV. #56 TaxonKit v0.10.1 taxonkit cami2-filter : fix option --show-rank which did not work in v0.10.0. TaxonKit v0.10.0 new command taxonkit cami2-filter : Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile taxonkit reformat : fix panic for deleted taxid using -F/--fill-miss-rank . #55 TaxonKit v0.9.0 new command taxonkit profile2cami : converting metagenomic profile table to CAMI format TaxonKit v0.8.0 taxonkit reformat : accept input of TaxIds via flag -I/--taxid-field . accept single taxonomy names . show warning message for TaxIds with the same lineage . #42 better flag checking. #40 taxonkit lca : slightly speedup. taxonkit genautocomplete : support bash|zsh|fish/powershell TaxonKit v0.7.2 taxonkit lineage : new flag -R/--show-lineage-ranks for appending ranks of all levels. reduce memory occupation and slightly speedup. taxonkit filter : flag -E/--equal-to supports multiple values. new flag -n/--save-predictable-norank : do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff. taxonkit reformat : new placeholder {t} for subspecies/strain , {T} for strain . Thanks @wqssf102 for feedback. new flag -S/--pseudo-strain for using the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subpecies\" nor \"strain\". TaxonKit v0.7.1 taxonkit filter : disable unnecessary stdin check when using flag --list-order or --list-ranks . #36 better handling of black list, empty default value: \"no rank\" and \"clade\". 
And you need use -N/--discard-noranks to explicitly filter out \"no rank\", \"clade\". #37 update help message. Thanks @standage for improve this command! #38 TaxonKit v0.7.0 taxonkit : 2-3X faster taxonomy data loading . new command taxonkit filter : filtering TaxIds by taxonomic rank range . #32 new command taxonkit lca : Computing lowest common ancestor (LCA) for TaxIds. taxonkit reformat : new flag -P/--add-prefix : add prefixes for all ranks , single prefix for a rank is defined by flag --prefix-X , where X may be k , p , c , o , f , s , S . new flag -T/--trim : do not fill missing rank lower than current rank. taxonkit list : do not duplicate root node. TaxonKit v0.6.2 taxonkit reformat -F : fix taxids of abbreviated lineage containing names shared by different taxids. #35 TaxonKit v0.6.1 taxonkit lineage : new flag -n/--show-name for appending scientific name. new flag -L/--no-lineage for hide lineage, this is for fast retrieving names or/and ranks. taxonkit reformat : fix flag -F/--fill-miss-rank . discard order restriction of rank symbols. TaxonKit v0.6.0 taxonkit list : check merged and deleted taxids. fix bug of json output. #30 taxonkit name2taxid : new flag -s/--sci-name for limiting to searching scientific names. #29 taxonkit version : make checking update optional TaxonKit v0.5.0 taxonkit : requiring delnodes.dmp and merged.dmp. taxonkit lineage : detect deleted and merged taxids now. #19 taxonkit list/name2taxid : add short flag -r for --show-rank , -n for --show-name . TaxonKit v0.4.3 taxonkit taxid-changelog : rewrite logic, fix bug and add more change types TaxonKit v0.4.2 taxonkit taxid-changelog : change output of ABSORB , do not merged into one record for changes in different versions. TaxonKit v0.4.1 taxonkit taxid-changelog : add fields: name and rank . and fix sorting bug. 
detailed lineage change status TaxonKit v0.4.0 new command: taxonkit taxid-changelog : for creating taxid changelog from dump archive TaxonKit v0.3.0 this version is almost the same as v0.2.5 TaxonKit v0.2.5 add global flag: --line-buffered to disable output buffer. #11 replace global flags --names-file and --nodes-file with --data-dir , also support environment variable TAXONKIT_DB . #17 taxonkit reformat : detects lineages containing unofficial taxon name and won't show panic message. taxonkit name2taxid : supports synonyms names. #9 taxokit lineage : add flag -r/--show-rank to print rank at another new column. TaxonKit v0.2.4 taxonkit reformat : more accurate result when using flag -F/--fill-miss-rank to estimate and fill missing rank with original lineage information supporting escape strings like \\t , \\n , #5 outputting corresponding taxids for reformated lineage. #8 taxonkit lineage : fix bug for taxid 1 #7 add flag -d/--delimiter . TaxonKit v0.2.3 fix bug brought in v0.2.1 TaxonKit v0.2.2 make verbose information optional #4 TaxonKit v0.2.1 taxonkit list : fix bug of no output for leaf nodes of the taxonomic tree. #4 add new command genautocomplete to generate shell autocompletion script! TaxonKit v0.2.0 add command name2taxid to query taxid by taxon scientific name. lineage , reformat : changed flags and default operations , check the usage . TaxonKit v0.1.8 taxonkit lineage , add an extra column of lineage in Taxid. #3 . e.g., fix colorful output in windows. TaxonKit v0.1.7 taxonkit reformat : supports reading stdin from output of taxonkit lineage , reformated lineages are appended to input data. TaxonKit v0.1.6 remove flag -f/--formated-rank from taxonkit lineage , using taxonkit reformat can archieve same result. 
TaxonKit v0.1.5 reorganize code and flags TaxonKit v0.1.4 add flag --fill for taxonkit reformat , which estimates and fills missing rank with original lineage information TaxonKit v0.1.3 add command of taxonkit reformat which reformats full lineage to custom format TaxonKit v0.1.2 add command of taxonkit lineage , users can query lineage of given taxon IDs from file TaxonKit v0.1.1 add feature of taxonkit list , users can choose output in readable JSON format by flag --json so the taxonomy tree could be collapse and uncollapse in modern text editor. TaxonKit v0.1 first release /** * RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS. * LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/ /* var disqus_config = function () { this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable }; */ (function() { // DON'T EDIT BELOW THIS LINE var d = document, s = d.createElement('script'); s.src = '//taxonkit.disqus.com/embed.js'; s.setAttribute('data-timestamp', +new Date()); (d.head || d.body).appendChild(s); })(); Please enable JavaScript to view the comments powered by Disqus.","title":"Release history"},{"location":"tutorial/","text":"Tutorial Table of Contents Formatting lineage Parsing kraken/bracken result Making nr blastdb for specific taxids Summaries of taxonomy data Merging GTDB and NCBI taxonomy Formatting lineage Show lineage detail of a TaxId. The command below works on Windows with help of csvtk . 
$ echo \"2697049\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Example data. $ cat taxids3.txt 376619 349741 239935 314101 11932 1327037 83333 1408252 2605619 2697049 Format to 7-level ranks (\"superkingdom phylum class order family genus species\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related 
coronavirus Format to 8-level ranks (\"superkingdom phylum class order family genus species subspecies/rank\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 -f \"{k};{p};{c};{o};{f};{g};{s};{t}\" 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila; 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B; 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle; 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y; 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli; 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus; Replace missing ranks with Unassigned and output tab-delimited format. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -r \"Unassigned\" -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. 
holarctica LVS 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned 314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned 83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12 1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178 2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned 2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned Fill missing ranks and add prefixes. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -F -P -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. 
holarctica LVS 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835 239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain 314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain 83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12 1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178 2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain 2697049 k__Viruses p__Pisuviricota c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain When these's no nodes of rank \"subspecies\" nor \"strain\", we can switch -S/--pseudo-strain to use the 
node with lowest rank as subspecies/strain name, if which rank is lower than \"species\" . $ cat taxids3.txt \\ | taxonkit lineage -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | cut -f 1,2,9,10 \\ | csvtk add-header -t -n \"taxid,rank,species,strain\" \\ | csvtk pretty -t taxid rank species strain ------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------ 376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS 349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain 314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain 83333 strain Escherichia coli Escherichia coli K-12 1408252 subspecies Escherichia coli Escherichia coli R178 2605619 no rank Escherichia coli Escherichia coli O16:H48 2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2 List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with \"no rank\". But when filtering with -L/--lower-than , you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff . 
$ time taxonkit list --ids 1 \\ | taxonkit filter -L species -E species -R -N -n \\ | taxonkit lineage -n -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -Ht -l -f 1,3,2,1,4-11 \\ | csvtk add-header -t -n \"taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain\" \\ | pigz -c > result.tsv.gz real 0m25.167s user 2m14.809s sys 0m7.197s $ pigz -cd result.tsv.gz \\ | csvtk grep -t -f taxid -p 2697049 \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 rank no rank name Severe acute respiratory syndrome coronavirus 2 lineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 kingdom Viruses phylum Pisuviricota class Pisoniviricetes order Nidovirales family Coronaviridae genus Betacoronavirus species Severe acute respiratory syndrome-related coronavirus strain Severe acute respiratory syndrome coronavirus 2 Parsing kraken/bracken result Example Data SRS014459-Stool.fasta.gz Run Kraken2 and Bracken KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf THREADS=16 CLASSIFICATION_LVL=S THRESHOLD=10 READ_LEN=100 SAMPLE=SRS014459-Stool.fasta.gz BRACKEN_OUTPUT_FILE=$SAMPLE kraken2 --db ${KRAKEN_DB} --threads ${THREADS} -report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \\ -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken Orignial format $ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport 100.00 9491 0 R 1 root 99.85 9477 0 R1 131567 cellular organisms 99.85 9477 0 D 2 Bacteria 66.08 6271 0 D1 1783270 FCB group 66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group 66.08 6271 0 P 976 Bacteroidetes 66.08 6271 0 C 200643 Bacteroidia 66.08 6271 0 O 171549 Bacteroidales 34.45 
3270 0 F 815 Bacteroidaceae 34.45 3270 0 G 816 Bacteroides 10.43 990 990 S 246787 Bacteroides cellulosilyticus 7.98 757 757 S 28116 Bacteroides ovatus 3.10 293 0 G1 2646097 unclassified Bacteroides 1.06 100 100 S 2755405 Bacteroides sp. CACC 737 0.49 46 46 S 2650157 Bacteroides sp. HF-5287 Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py ) $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(\\|[kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"\\|[kpcofgs]__\\|\" -r \"|\" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ > SRS014459-Stool.fasta.gz_bracken_species.kreport.format $ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format k__Bacteria 99.85 k__Bacteria|p__Bacteroidetes 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. 
HF-5287 0.49 Converting to Qiime format $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}; {p}; {c}; {o}; {f}; {g}; {s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(; [kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"; [kpcofgs]__; \" -r \"; \" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ | head -n 10 k__Bacteria 99.85 k__Bacteria; p__Bacteroidetes 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49 Save taxon proportion and taxid, and get lineage, name and rank. 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit lineage -i 2 -n -r \\ | csvtk cut -Ht -f 1,2,5,4,3 \\ | head -n 10 \\ | csvtk pretty -Ht 100.00 1 no rank root root 99.85 131567 no rank cellular organisms cellular organisms 99.85 2 superkingdom Bacteria cellular organisms;Bacteria 66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group 66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group 66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes 66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia 66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales 34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae 34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides Only save species or lower level and get lineage in format of \"superkingdom phylum class order family genus species\". 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit filter -N -E species -L species -i 2 \\ | taxonkit lineage -i 2 -n -r \\ | taxonkit reformat -i 3 -f \"{k};{p};{c};{o};{f};{g};{s}\" \\ | csvtk cut -Ht -f 1,2,5,4,6 \\ | csvtk add-header -t -n abundance,taxid,rank,name,lineage \\ | head -n 10 \\ | csvtk pretty -t abundance taxid rank name lineage --------- ------- ------- ---------------------------- -------------------------------------------------------------------------------------------------------- 10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus 7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus 1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737 0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287 0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1 0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10 0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141 0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. 
CBA7301 5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis Making nr blastdb for specific taxids Attention: BLAST+ 2.8.1 was released with new databases, which allow you to limit your search by taxonomy using information built into the BLAST databases. So you no longer need to build a blastdb for specific taxids. Changes: 2018-09-13 rewritten 2018-12-22 providing faster method for step 3.1 2019-01-07 add note of new blastdb version 2020-10-14 update steps for a huge number of accessions belonging to a high-level taxon like bacteria. Data: pre-formatted blastdb (09/10/2018) prot.accession2taxid.gz (09/07/2018) (optional, but recommended) Hardware in this tutorial CPU: AMD 8-cores/16-threads 3.7 GHz RAM: 64GB DISK: Taxonomy files stored on NVMe SSD; blastdb files stored on 7200rpm HDD Tools: blast+ pigz (recommended, faster than gzip) taxonkit seqkit (recommended), version >= 0.14.0 rush (optional, for parallelizing sequence filtering) Steps: Listing all taxids below $id using taxonkit. id=6656 # 6656 is the phylum Arthropoda # echo 6656 | taxonkit lineage | taxonkit reformat # 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;; # 2 bacteria # 2157 archaea # 4751 fungi # 10239 virus # time: 2s taxonkit list --ids $id --indent \"\" > $id.taxid.txt # taxonkit list --ids 2,4751,10239 --indent \"\" > microbe.taxid.txt wc -l $id.taxid.txt # 518373 6656.taxid.txt Retrieving target accessions. There are two options: From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr.
# time: 4min pigz -dc prot.accession2taxid.gz \\ | csvtk grep -t -f taxid -P $id.taxid.txt \\ | csvtk cut -t -f accession.version,taxid \\ | sed 1d \\ > $id.acc2taxid.txt cut -f 1 $id.acc2taxid.txt > $id.acc.txt wc -l $id.acc.txt # 8174609 6656.acc.txt From pre-formatted nr blastdb # time: 40min blastdbcmd -db nr -entry all -outfmt \"%a %T\" | pigz -c > nr.acc2taxid.txt.gz pigz -dc nr.acc2taxid.txt.gz | wc -l # 555220892 # time: 3min pigz -dc nr.acc2taxid.txt.gz \\ | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \\ | cut -d ' ' -f 1 \\ > $id.acc.txt wc -l $id.acc.txt # 6928021 6656.acc.txt Retrieving FASTA sequences from pre-formatted blastdb. There are two options: From nr.fa exported from pre-formatted blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from the NCBI FTP site, in which the FASTA headers are not well formatted. # 1. exporting nr.fa from pre-formatted blastdb # time: 117min (run only once) blastdbcmd -db nr -dbtype prot -entry all -outfmt \"%f\" -out - | pigz -c > nr.fa.gz # ===================================================================== # 2. filtering sequences belonging to $id # --------------------------------------------------------------------- # method 1) (for cases where $id.acc.txt is not very large) # time: 80min # perl one-liner is used to unfold records having multiple accessions time cat <(echo) <(pigz -dc nr.fa.gz) \\ | perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' \\ | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz # --------------------------------------------------------------------- # method 2) (**faster**) # 33min (run only once) # (1). split nr.fa.gz. # Note: I have 16 cpus. $ time seqkit split2 -p 15 nr.fa.gz # (2).
parallelize unfolding $ cat _unfold_blastdb_fa.sh #!/bin/sh perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' # 10 min time ls nr.fa.gz.split/nr.part_*.fa.gz \\ | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\\.(.+)$} ' # (3). merge result cat nr.$id.part*.fa.gz > nr.$id.fa.gz rm nr.$id.part*.fa.gz # --------------------------------------------------------------------- # method 3) (for huge $id.acc.txt file, e.g., bacteria) # (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me). split -d -l 300000000 $id.acc.txt $id.acc.txt.part_ # (2). filter time ls $id.acc.txt.part_* \\ | rush -j 1 --immediate-output -v id=$id \\ 'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz ' # (3). merge cat nr.$id.part*.fa.gz > nr.$id.fa.gz # clean rm nr.$id.part*.fa.gz rm $id.acc.txt.part_* # (4). optionally adding taxid, you may edit replacement (-r) below # split time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_ ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz i=0 for f in $id.acc2taxid.txt.part_* ; do echo $f time pigz -cd nr.$id.with-taxid.part$i.fa.gz \\ | seqkit replace -k $f -p \"^([^\\-]+?) \" -r \"{kv}-\\$1 \" -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz; /bin/rm nr.$id.with-taxid.part$i.fa.gz i=$(($i+1)); done mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz # ===================================================================== # 3.
counting sequences # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 902M 9\u6708 13 01:42 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' -c # 6928017 # Here 6928017 ~= 6928021 ($id.acc.txt) Directly from pre-formatted blastdb # time: 5h20min blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz # counting sequences # # Note that the headers of the FASTA output by blastdbcmd are \"folded\" # for accessions from different species with the same sequence, so the # number may be smaller than $(wc -l $id.acc.txt). pigz -dc nr.$id.fa.gz | grep '^>' -c # 1577383 # counting accessions # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 2.1G 9\u6708 13 03:38 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\\n>/g' | grep '^>' -c # 288415413 makeblastdb pigz -dc nr.$id.fa.gz > nr.$id.fa # time: 3min ($nr.$id.fa from step 3 option 1) # # building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error: # # BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1 # makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id # rm nr.$id.fa blastp (optional) # blastdb nr.$id is built from sequences in step 3 option 1 # blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast # real 0m20.866s # $ cat t4.fa.blast | grep Query= -A 10 # Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a # # Length=35 Score E # Sequences producing significant alignments: (Bits) Value # 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17 # A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17 # ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15 # D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15 # ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15 Summaries of taxonomy data You can change the TaxId of interest. Rank counts of common categories.
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \\ | rush -D ' ' -T b \\ 'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \\ | sed 1d \\ | taxonkit filter -i 2 -E genus -L genus \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ > stats.{}.tsv ' $ csvtk -t join --outer-join stats.*.tsv \\ | csvtk add-header -t -n \"rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )\" \\ | csvtk csv2md -t Similar data on NCBI Taxonomy rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae species 12482 460940 1349648 156908 957297 191026 strain 354 40643 3486 2352 33 50 genus 205 4112 90882 6844 64148 16202 isolate 7 503 809 76 17 3 species group 2 77 251 22 214 5 serotype 218 serogroup 136 subsection 21 21 subspecies 632 24523 158 17043 7212 forma specialis 521 220 179 33 1 species subgroup 23 101 101 biotype 7 10 morph 12 3 4 5 section 437 37 2 398 genotype 12 12 series 9 5 4 varietas 25 8499 1100 2 7188 forma 4 560 185 6 315 subgenus 1 1558 10 1414 112 pathogroup 5 subvariety 5 5 Count of all ranks $ time taxonkit list --ids 1 \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ | csvtk pretty -H -t species 1879659 no rank 222743 genus 96625 strain 44483 subspecies 25174 family 9492 varietas 8524 subfamily 3050 tribe 2213 order 1660 subgenus 1618 isolate 1319 serotype 1216 clade 886 superfamily 865 forma specialis 741 forma 564 subtribe 508 section 437 class 429 suborder 372 species group 330 phylum 272 subclass 156 serogroup 138 infraorder 130 species subgroup 124 superorder 55 subphylum 33 parvorder 26 subsection 21 genotype 20 infraclass 18 biotype 17 morph 12 kingdom 11 series 9 superclass 6 cohort 5 pathogroup 5 subvariety 5 superkingdom 4 subcohort 3 subkingdom 1 superphylum 1 real 0m3.663s user 0m15.897s sys 0m1.010s Ranks of taxa at or below species. 
$ taxonkit list --ids 1 \\ | taxonkit filter --lower-than species --equal-to species \\ | taxonkit lineage -L -r \\ | csvtk freq -Ht -nr -f 2 \\ | csvtk add-header -t -n rank,count \\ | csvtk pretty -t rank count --------------- ------- species 1880044 no rank 222756 strain 44483 subspecies 25171 varietas 8524 isolate 1319 serotype 1216 clade 885 forma specialis 741 forma 564 serogroup 138 genotype 20 biotype 17 morph 12 pathogroup 5 subvariety 5 Merging GTDB and NCBI taxonomy Sometimes ( 1 ) one needs to build a database that combines bacteria and archaea from GTDB with viruses from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat , and then create taxdump files from them with taxonkit create-taxdump . Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump . taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent \"\" \\ | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \\ | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \\ --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ -o gtdb.tsv Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Taxa below species whose rank is \"no rank\" are treated as taxa of strain rank ( --pseudo-strain , taxonkit v0.14.1 needed). # taxid of Viruses: 10239 taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent \"\" \\ | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \\ | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \\ --pseudo-strain --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ -o ncbi-viral.tsv Creating taxdump from lineages above.
(awk '{print $_\"\\t\"}' gtdb.tsv; cat ncbi-viral.tsv) \\ | taxonkit create-taxdump \\ --field-accession 1 \\ -R \"superkingdom,phylum,class,order,family,genus,species,strain\" \\ -O taxdump # we use --field-accession 1 to output the mapping file between old taxids and new ones. $ grep 2697049 taxdump/taxid.map # SARS-COV-2 2697049 21630522 Some tests: # SARS-COV-2 in NCBI taxonomy $ echo 2697049 \\ | taxonkit lineage -t --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 $ echo \"Severe acute respiratory syndrome coronavirus 2\" | taxonkit name2taxid --data-dir taxdump/ Severe acute respiratory syndrome coronavirus 2 216305222 $ echo 216305222 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 1287770734 superkingdom Viruses 1506901452 phylum Pisuviricota 1091693597 class Pisoniviricetes 37745009 order Nidovirales 738421640 family Coronaviridae 906833049 genus Betacoronavirus 1015862491 species Severe acute respiratory syndrome-related coronavirus 216305222 strain Severe acute respiratory syndrome coronavirus 2 $ echo \"Escherichia coli\" | taxonkit name2taxid --data-dir taxdump/ Escherichia coli 1945799576 $ echo 1945799576 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | 
taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 609216830 superkingdom Bacteria 1641076285 phylum Proteobacteria 329474883 class Gammaproteobacteria 1012954932 order Enterobacterales 87250111 family Enterobacteriaceae 1187493883 genus Escherichia 1945799576 species Escherichia coli","title":"Tutorial"},{"location":"tutorial/#tutorial","text":"","title":"Tutorial"},{"location":"tutorial/#table-of-contents","text":"Formatting lineage Parsing kraken/bracken result Making nr blastdb for specific taxids Summaries of taxonomy data Merging GTDB and NCBI taxonomy","title":"Table of Contents"},{"location":"tutorial/#formatting-lineage","text":"Show lineage detail of a TaxId. The command below works on Windows with help of csvtk . 
$ echo \"2697049\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Example data. $ cat taxids3.txt 376619 349741 239935 314101 11932 1327037 83333 1408252 2605619 2697049 Format to 7-level ranks (\"superkingdom phylum class order family genus species\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related 
coronavirus Format to 8-level ranks (\"superkingdom phylum class order family genus species subspecies/rank\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 -f \"{k};{p};{c};{o};{f};{g};{s};{t}\" 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila; 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B; 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle; 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y; 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli; 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus; Replace missing ranks with Unassigned and output tab-delimited format. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -r \"Unassigned\" -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. 
holarctica LVS 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned 314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned 83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12 1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178 2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned 2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned Fill missing ranks and add prefixes. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -F -P -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. 
holarctica LVS 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835 239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain 314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain 83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12 1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178 2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain 2697049 k__Viruses p__Pisuviricota c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain When there are no nodes of rank \"subspecies\" or \"strain\", we can switch on -S/--pseudo-strain to use the 
node with the lowest rank as the subspecies/strain name, provided its rank is lower than \"species\". $ cat taxids3.txt \\ | taxonkit lineage -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | cut -f 1,2,9,10 \\ | csvtk add-header -t -n \"taxid,rank,species,strain\" \\ | csvtk pretty -t taxid rank species strain ------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------ 376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS 349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain 314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain 83333 strain Escherichia coli Escherichia coli K-12 1408252 subspecies Escherichia coli Escherichia coli R178 2605619 no rank Escherichia coli Escherichia coli O16:H48 2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2 List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with \"no rank\". When filtering with -L/--lower-than , you can use -n/--save-predictable-norank to keep some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff. 
$ time taxonkit list --ids 1 \\ | taxonkit filter -L species -E species -R -N -n \\ | taxonkit lineage -n -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -Ht -l -f 1,3,2,1,4-11 \\ | csvtk add-header -t -n \"taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain\" \\ | pigz -c > result.tsv.gz real 0m25.167s user 2m14.809s sys 0m7.197s $ pigz -cd result.tsv.gz \\ | csvtk grep -t -f taxid -p 2697049 \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 rank no rank name Severe acute respiratory syndrome coronavirus 2 lineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 kingdom Viruses phylum Pisuviricota class Pisoniviricetes order Nidovirales family Coronaviridae genus Betacoronavirus species Severe acute respiratory syndrome-related coronavirus strain Severe acute respiratory syndrome coronavirus 2","title":"Formatting lineage"},{"location":"tutorial/#parsing-krakenbracken-result","text":"Example Data SRS014459-Stool.fasta.gz Run Kraken2 and Bracken KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf THREADS=16 CLASSIFICATION_LVL=S THRESHOLD=10 READ_LEN=100 SAMPLE=SRS014459-Stool.fasta.gz BRACKEN_OUTPUT_FILE=$SAMPLE kraken2 --db ${KRAKEN_DB} --threads ${THREADS} -report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \\ -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken Original format $ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport 100.00 9491 0 R 1 root 99.85 9477 0 R1 131567 cellular organisms 99.85 9477 0 D 2 Bacteria 66.08 6271 0 D1 1783270 FCB group 66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group 66.08 6271 0 P 976 Bacteroidetes 66.08 6271 0 C 
200643 Bacteroidia 66.08 6271 0 O 171549 Bacteroidales 34.45 3270 0 F 815 Bacteroidaceae 34.45 3270 0 G 816 Bacteroides 10.43 990 990 S 246787 Bacteroides cellulosilyticus 7.98 757 757 S 28116 Bacteroides ovatus 3.10 293 0 G1 2646097 unclassified Bacteroides 1.06 100 100 S 2755405 Bacteroides sp. CACC 737 0.49 46 46 S 2650157 Bacteroides sp. HF-5287 Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py ) $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(\\|[kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"\\|[kpcofgs]__\\|\" -r \"|\" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ > SRS014459-Stool.fasta.gz_bracken_species.kreport.format $ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format k__Bacteria 99.85 k__Bacteria|p__Bacteroidetes 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. 
HF-5287 0.49 Converting to Qiime format $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}; {p}; {c}; {o}; {f}; {g}; {s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(; [kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"; [kpcofgs]__; \" -r \"; \" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ | head -n 10 k__Bacteria 99.85 k__Bacteria; p__Bacteroidetes 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49 Save taxon proportion and taxid, and get lineage, name and rank. 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit lineage -i 2 -n -r \\ | csvtk cut -Ht -f 1,2,5,4,3 \\ | head -n 10 \\ | csvtk pretty -Ht 100.00 1 no rank root root 99.85 131567 no rank cellular organisms cellular organisms 99.85 2 superkingdom Bacteria cellular organisms;Bacteria 66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group 66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group 66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes 66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia 66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales 34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae 34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides Only save species or lower level and get lineage in format of \"superkingdom phylum class order family genus species\". 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit filter -N -E species -L species -i 2 \\ | taxonkit lineage -i 2 -n -r \\ | taxonkit reformat -i 3 -f \"{k};{p};{c};{o};{f};{g};{s}\" \\ | csvtk cut -Ht -f 1,2,5,4,6 \\ | csvtk add-header -t -n abundance,taxid,rank,name,lineage \\ | head -n 10 \\ | csvtk pretty -t abundance taxid rank name lineage --------- ------- ------- ---------------------------- -------------------------------------------------------------------------------------------------------- 10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus 7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus 1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737 0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287 0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1 0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10 0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141 0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. 
CBA7301 5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis","title":"Parsing kraken/bracken result"},{"location":"tutorial/#making-nr-blastdb-for-specific-taxids","text":"Attention: BLAST+ 2.8.1 was released with new databases , which allow you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build a blastdb for specific taxids now. Changes: 2018-09-13 rewritten 2018-12-22 providing a faster method for step 3.1 2019-01-07 adding a note on the new blastdb version 2020-10-14 updating steps for huge numbers of accessions belonging to high-level taxa such as bacteria. Data: pre-formatted blastdb (09/10/2018) prot.accession2taxid.gz (09/07/2018) (optional, but recommended) Hardware in this tutorial CPU: AMD 8-cores/16-threads 3.7GHz RAM: 64GB DISK: Taxonomy files stored on NVMe SSD blastdb files stored on 7200rpm HDD Tools: blast+ pigz (recommended, faster than gzip) taxonkit seqkit (recommended), version >= 0.14.0 rush (optional, for parallelizing sequence filtering) Steps: Listing all taxids below $id using taxonkit. id=6656 # 6656 is the phylum Arthropoda # echo 6656 | taxonkit lineage | taxonkit reformat # 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;; # 2 bacteria # 2157 archaea # 4751 fungi # 10239 virus # time: 2s taxonkit list --ids $id --indent \"\" > $id.taxid.txt # taxonkit list --ids 2,4751,10239 --indent \"\" > microbe.taxid.txt wc -l $id.taxid.txt # 518373 6656.taxid.txt Retrieving target accessions. There are two options: From prot.accession2taxid.gz ( faster, recommended ). Note that some accessions are not in nr . 
# time: 4min pigz -dc prot.accession2taxid.gz \\ | csvtk grep -t -f taxid -P $id.taxid.txt \\ | csvtk cut -t -f accession.version,taxid \\ | sed 1d \\ > $id.acc2taxid.txt cut -f 1 $id.acc2taxid.txt > $id.acc.txt wc -l $id.acc.txt # 8174609 6656.acc.txt From pre-formatted nr blastdb # time: 40min blastdbcmd -db nr -entry all -outfmt \"%a %T\" | pigz -c > nr.acc2taxid.txt.gz pigz -dc nr.acc2taxid.txt.gz | wc -l # 555220892 # time: 3min pigz -dc nr.acc2taxid.txt.gz \\ | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \\ | cut -d ' ' -f 1 \\ > $id.acc.txt wc -l $id.acc.txt # 6928021 6656.acc.txt Retrieving FASTA sequences from pre-formatted blastdb. There are two options: From nr.fa exported from pre-formatted blastdb ( faster, smaller output file, recommended ). DO NOT directly download nr.gz from ncbi ftp , in which the FASTA headers are not well formatted. # 1. exporting nr.fa from pre-formatted blastdb # time: 117min (run only once) blastdbcmd -db nr -dbtype prot -entry all -outfmt \"%f\" -out - | pigz -c > nr.fa.gz # ===================================================================== # 2. filtering sequences belonging to $taxid # --------------------------------------------------------------------- # method 1) (for cases where $id.acc.txt is not very large) # time: 80min # perl one-liner is used to unfold records having multiple accessions time cat <(echo) <(pigz -dc nr.fa.gz) \\ | perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' \\ | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz # --------------------------------------------------------------------- # method 2) (**faster**) # 33min (run only once) # (1). split nr.fa.gz. # Note: I have 16 CPUs. $ time seqkit split2 -p 15 nr.fa.gz # (2). 
parallelize unfolding $ cat _unfold_blastdb_fa.sh #!/bin/sh perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' # 10 min time ls nr.fa.gz.split/nr.part_*.fa.gz \\ | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\\.(.+)$} ' # (3). merge result cat nr.$id.part*.fa.gz > nr.$id.fa.gz rm nr.$id.part*.fa.gz # --------------------------------------------------------------------- # method 3) (for huge $id.acc.txt file, e.g., bacteria) # (1). split ${id}.acc.txt into several parts. chunk size depends on the number of lines and RAM (64G for me). split -d -l 300000000 $id.acc.txt $id.acc.txt.part_ # (2). filter time ls $id.acc.txt.part_* \\ | rush -j 1 --immediate-output -v id=$id \\ 'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz ' # (3). merge cat nr.$id.part*.fa.gz > nr.$id.fa.gz # clean rm nr.$id.part*.fa.gz rm $id.acc.txt.part_* # (4). optionally adding taxid, you may edit replacement (-r) below # split time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_ ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz i=0 for f in $id.acc2taxid.txt.part_* ; do echo $f time pigz -cd nr.$id.with-taxid.part$i.fa.gz \\ | seqkit replace -k $f -p \"^([^\\-]+?) \" -r \"{kv}-\\$1 \" -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz; /bin/rm nr.$id.with-taxid.part$i.fa.gz i=$(($i+1)); done mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz # ===================================================================== # 3. 
counting sequences # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 902M 9\u6708 13 01:42 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' -c # 6928017 # Here 6928017 ~= 6928021 ($id.acc.txt) Directly from pre-formatted blastdb # time: 5h20min blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz # counting sequences # # Note that the headers of the FASTA output by blastdbcmd are \"folded\" # for accessions from different species with identical sequences, so the # number may be smaller than $(wc -l $id.acc.txt). pigz -dc nr.$id.fa.gz | grep '^>' -c # 1577383 # counting accessions # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 2.1G 9\u6708 13 03:38 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\\n>/g' | grep '^>' -c # 288415413 makeblastdb pigz -dc nr.$id.fa.gz > nr.$id.fa # time: 3min (nr.$id.fa from step 3 option 1) # # building nr.$id.fa from step 3 option 2 with -parse_seqids would produce an error: # # BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1 # makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id # rm nr.$id.fa blastp (optional) # blastdb nr.$id is built from sequences in step 3 option 1 # blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast # real 0m20.866s # $ cat t4.fa.blast | grep Query= -A 10 # Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a # # Length=35 Score E # Sequences producing significant alignments: (Bits) Value # 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17 # A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17 # ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15 # D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15 # ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15","title":"Making nr blastdb for specific taxids"},{"location":"tutorial/#summaries-of-taxonomy-data","text":"You can change the TaxId of interest. 
Rank counts of common categories. $ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \\ | rush -D ' ' -T b \\ 'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \\ | sed 1d \\ | taxonkit filter -i 2 -E genus -L genus \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ > stats.{}.tsv ' $ csvtk -t join --outer-join stats.*.tsv \\ | csvtk add-header -t -n \"rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )\" \\ | csvtk csv2md -t Similar data on NCBI Taxonomy rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae species 12482 460940 1349648 156908 957297 191026 strain 354 40643 3486 2352 33 50 genus 205 4112 90882 6844 64148 16202 isolate 7 503 809 76 17 3 species group 2 77 251 22 214 5 serotype 218 serogroup 136 subsection 21 21 subspecies 632 24523 158 17043 7212 forma specialis 521 220 179 33 1 species subgroup 23 101 101 biotype 7 10 morph 12 3 4 5 section 437 37 2 398 genotype 12 12 series 9 5 4 varietas 25 8499 1100 2 7188 forma 4 560 185 6 315 subgenus 1 1558 10 1414 112 pathogroup 5 subvariety 5 5 Count of all ranks $ time taxonkit list --ids 1 \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ | csvtk pretty -H -t species 1879659 no rank 222743 genus 96625 strain 44483 subspecies 25174 family 9492 varietas 8524 subfamily 3050 tribe 2213 order 1660 subgenus 1618 isolate 1319 serotype 1216 clade 886 superfamily 865 forma specialis 741 forma 564 subtribe 508 section 437 class 429 suborder 372 species group 330 phylum 272 subclass 156 serogroup 138 infraorder 130 species subgroup 124 superorder 55 subphylum 33 parvorder 26 subsection 21 genotype 20 infraclass 18 biotype 17 morph 12 kingdom 11 series 9 superclass 6 cohort 5 pathogroup 5 subvariety 5 superkingdom 4 subcohort 3 subkingdom 1 superphylum 1 real 0m3.663s user 0m15.897s sys 0m1.010s Ranks of taxa at or below species. 
$ taxonkit list --ids 1 \\ | taxonkit filter --lower-than species --equal-to species \\ | taxonkit lineage -L -r \\ | csvtk freq -Ht -nr -f 2 \\ | csvtk add-header -t -n rank,count \\ | csvtk pretty -t rank count --------------- ------- species 1880044 no rank 222756 strain 44483 subspecies 25171 varietas 8524 isolate 1319 serotype 1216 clade 885 forma specialis 741 forma 564 serogroup 138 genotype 20 biotype 17 morph 12 pathogroup 5 subvariety 5","title":"Summaries of taxonomy data"},{"location":"tutorial/#merging-gtdb-and-ncbi-taxonomy","text":"Sometimes ( 1 ) one needs to build a database that combines bacteria and archaea from GTDB with viruses from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat , and then create taxdump files from them with taxonkit create-taxdump . Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump . taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent \"\" \\ | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \\ | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \\ --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ -o gtdb.tsv Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Taxa below the species level whose rank is \"no rank\" are treated as taxa of strain rank ( --pseudo-strain , taxonkit v0.14.1 needed). # taxid of Viruses: 10239 taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent \"\" \\ | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \\ | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \\ --pseudo-strain --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ -o ncbi-viral.tsv Creating taxdump from lineages above. 
(awk '{print $_\"\\t\"}' gtdb.tsv; cat ncbi-viral.tsv) \\ | taxonkit create-taxdump \\ --field-accession 1 \\ -R \"superkingdom,phylum,class,order,family,genus,species,strain\" \\ -O taxdump # we use --field-accession 1 to output the mapping file between old taxids and new ones. $ grep 2697049 taxdump/taxid.map # SARS-CoV-2 2697049 216305222 Some tests: # SARS-CoV-2 in NCBI taxonomy $ echo 2697049 \\ | taxonkit lineage -t --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 $ echo \"Severe acute respiratory syndrome coronavirus 2\" | taxonkit name2taxid --data-dir taxdump/ Severe acute respiratory syndrome coronavirus 2 216305222 $ echo 216305222 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 1287770734 superkingdom Viruses 1506901452 phylum Pisuviricota 1091693597 class Pisoniviricetes 37745009 order Nidovirales 738421640 family Coronaviridae 906833049 genus Betacoronavirus 1015862491 species Severe acute respiratory syndrome-related coronavirus 216305222 strain Severe acute respiratory syndrome coronavirus 2 $ echo \"Escherichia coli\" | taxonkit name2taxid --data-dir taxdump/ Escherichia coli 1945799576 $ echo 1945799576 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | 
taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 609216830 superkingdom Bacteria 1641076285 phylum Proteobacteria 329474883 class Gammaproteobacteria 1012954932 order Enterobacterales 87250111 family Enterobacteriaceae 1187493883 genus Escherichia 1945799576 species Escherichia coli","title":"Merging GTDB and NCBI taxonomy"},{"location":"usage/","text":"Usage and Examples Table of Contents Usage and Examples Before use taxonkit list lineage reformat name2taxid filter lca taxid-changelog profile2cami cami-filter create-taxdump genautocomplete Before use Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . 
All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones. taxonkit TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit Version: 0.14.2 Author: Wei Shen Source code: https://github.com/shenwei356/taxonkit Documents : https://bioinf.shenwei.me/taxonkit Citation : https://www.sciencedirect.com/science/article/pii/S1673852721000837 Dataset: Please download and uncompress \"taxdump.tar.gz\": ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and copy \"names.dmp\", \"nodes.dmp\", \"delnodes.dmp\" and \"merged.dmp\" to data directory: \"/home/shenwei/.taxonkit\" or some other directory, and later you can refer to using flag --data-dir, or environment variable TAXONKIT_DB. When environment variable TAXONKIT_DB is set, explicitly setting --data-dir will overide the value of TAXONKIT_DB. 
Usage: taxonkit [command] Available Commands: cami-filter Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV filter Filter TaxIds by taxonomic rank range genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell) lca Compute lowest common ancestor (LCA) for TaxIds lineage Query taxonomic lineage of given TaxIds list List taxonomic subtrees of given TaxIds name2taxid Convert scientific names to TaxIds profile2cami Convert metagenomic profile table to CAMI format reformat Reformat lineage in canonical ranks taxid-changelog Create TaxId changelog from dump archives version print version information and check for update Flags: --data-dir string directory containing nodes.dmp and names.dmp (default \"/home/shenwei/.taxonkit\") -h, --help help for taxonkit --line-buffered use line buffering on output, i.e., immediately writing to stdin/file for every line of output -o, --out-file string out file (\"-\" for stdout, suffix .gz for gzipped out) (default \"-\") -j, --threads int number of CPUs. 4 is enough (default 4) --verbose print verbose information list Usage List taxonomic subtrees of given TaxIds Attentions: 1. When multiple taxids are given, the output may contain duplicated records if some taxids are descendants of others. Examples: $ taxonkit list --ids 9606 -n -r --indent \" \" 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' $ taxonkit list --ids 9606 --indent \"\" 9606 63221 741158 Usage: taxonkit list [flags] Flags: -h, --help help for list -i, --ids string TaxId(s), multiple values should be separated by comma -I, --indent string indent (default \" \") -J, --json output in JSON format. 
you can save the result in file with suffix \".json\" and open with modern text editor -n, --show-name output scientific name -r, --show-rank output rank Examples Default usage. $ taxonkit list --ids 9605,239934 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 2608915 1131336 ... Removing indent. The list could be used to extract sequences from BLAST database with blastdbcmd (see tutorial ) $ taxonkit list --ids 9605,239934 --indent \"\" 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 ... Performance: Time and memory usage for whole taxon tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --ids 1 --indent \"\" --verbose > t0.txt 21:05:01.782 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp 21:05:01.816 [INFO] 61023 merged nodes parsed 21:05:01.889 [INFO] 437929 delnodes parsed 21:05:03.178 [INFO] 2303979 names parsed elapsed time: 3.290s peak rss: 742.77 MB Adding names $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. 
CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 1574264 [species] Akkermansia sp. KLE1797 ... Performance: Time and memory usage for whole taxonomy tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --show-rank --show-name --ids 1 > t1.txt elapsed time: 5.341s peak rss: 1.04 GB Output in JSON format, you can easily collapse and uncollapse taxonomy tree in modern text editor. $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 --json { \"9605 [genus] Homo\": { \"9606 [species] Homo sapiens\": { \"63221 [subspecies] Homo sapiens neanderthalensis\": { }, \"741158 [subspecies] Homo sapiens subsp. 'Denisova'\": { } }, \"1425170 [species] Homo heidelbergensis\": { } }, \"239934 [genus] Akkermansia\": { \"239935 [species] Akkermansia muciniphila\": { \"349741 [no rank] Akkermansia muciniphila ATCC BAA-835\": { } }, \"512293 [no rank] environmental samples\": { \"512294 [species] uncultured Akkermansia sp.\": { }, \"1131822 [species] uncultured Akkermansia sp. SMG25\": { }, \"1262691 [species] Akkermansia sp. CAG:344\": { }, \"1263034 [species] Akkermansia muciniphila CAG:154\": { } }, \"1679444 [species] Akkermansia glycaniphila\": { }, \"2608915 [no rank] unclassified Akkermansia\": { \"1131336 [species] Akkermansia sp. KLE1605\": { }, \"1574264 [species] Akkermansia sp. KLE1797\": { }, \"1574265 [species] Akkermansia sp. KLE1798\": { }, \"1638783 [species] Akkermansia sp. UNK.MGS-1\": { }, \"1755639 [species] Akkermansia sp. MC_55\": { } } } } Snapshot of taxonomy (taxid 1) in kate: lineage Usage Query taxonomic lineage of given TaxIds Input: - List of TaxIds, one TaxId per line. - Or tab-delimited format, please specify TaxId field with flag -i/--taxid-field (default 1). - Supporting (gzipped) file or STDIN. Output: 1. 
Input line data. 2. (Optional) Status code (-c/--show-status-code), values: - \"-1\" for queries not found in whole database. - \"0\" for deleted TaxIds, provided by \"delnodes.dmp\". - New TaxIds for merged TaxIds, provided by \"merged.dmp\". - Taxids for these found in \"nodes.dmp\". 3. Lineage, delimiter can be changed with flag -d/--delimiter. 4. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) 5. (Optional) Name (-n/--show-name) 6. (Optional) Rank (-r/--show-rank) Filter out invalid and deleted taxids, and replace merged taxids with new ones: # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. $ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Usage: taxonkit lineage [flags] Flags: -d, --delimiter string field delimiter in lineage (default \";\") -h, --help help for lineage -L, --no-lineage do not show lineage, when user just want names or/and ranks -R, --show-lineage-ranks appending ranks of all levels -t, --show-lineage-taxids appending lineage consisting of taxids -n, --show-name appending scientific name -r, --show-rank appending rank of taxids -c, --show-status-code show status code before lineage -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) Examples Full lineage: # note that 123124124 is a fake taxid, 3 was deleted, 92489,1458427 were merged $ cat taxids.txt 9606 9913 376619 349741 239935 314101 11932 1327037 123124124 3 92489 1458427 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # wrapped table with csvtk pretty (>v0.26.0) $ taxonkit lineage taxids.txt | csvtk pretty -Ht -x ';' -W 70 -S bold \u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513 \u2503 9606 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 
Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates; \u2503 \u2503 \u2503 Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae; \u2503 \u2503 \u2503 Homo;Homo sapiens \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 9913 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla; \u2503 \u2503 \u2503 Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 376619 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Thiotrichales;Francisellaceae;Francisella;Francisella tularensis; \u2503 \u2503 \u2503 Francisella tularensis subsp. 
holarctica; \u2503 \u2503 \u2503 Francisella tularensis subsp. holarctica LVS \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 349741 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 239935 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila \u2503 
\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 314101 \u2503 cellular organisms;Bacteria;environmental samples; \u2503 \u2503 \u2503 uncultured murine large bowel bacterium BAC 54B \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 11932 \u2503 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes; \u2503 \u2503 \u2503 Ortervirales;Retroviridae;unclassified Retroviridae; \u2503 \u2503 \u2503 Intracisternal A-particles;Mouse Intracisternal A-particle \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1327037 \u2503 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes; 
\u2503 \u2503 \u2503 Caudovirales;Siphoviridae;unclassified Siphoviridae; \u2503 \u2503 \u2503 Croceibacter phage P2559Y \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 92489 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1458427 \u2503 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria; \u2503 \u2503 \u2503 Burkholderiales;Comamonadaceae;Serpentinomonas; \u2503 \u2503 \u2503 Serpentinomonas raichei \u2503 
\u2517\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u253b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u251b Speed. $ time echo 9606 | taxonkit lineage 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens real 0m1.190s user 0m2.365s sys 0m0.170s # all TaxIds $ time taxonkit list --ids 1 --indent \"\" | taxonkit lineage > t real 0m4.249s user 0m16.418s sys 0m1.221s Checking deleted or merged taxids $ taxonkit lineage --show-status-code taxids.txt | tee lineage.withcode.txt # valid $ cat lineage.withcode.txt | awk '$2 > 0' | cut -f 1,2 9606 9606 9913 9913 376619 376619 349741 349741 239935 239935 314101 314101 11932 11932 1327037 1327037 92489 796334 1458427 1458425 # merged $ cat lineage.withcode.txt | awk '$2 > 0 && $2 != $1' | cut -f 1,2 92489 796334 1458427 1458425 # deleted $ cat lineage.withcode.txt | awk '$2 == 0' | cut -f 1 3 # invalid $ cat lineage.withcode.txt | awk '$2 < 0' | cut -f 1 123124124 Filter out invalid and deleted taxids, and replace merged taxids with new ones , you may install csvtk . # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. 
$ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Only show name and rank. $ taxonkit lineage -r -n -L taxids.txt \\ | csvtk pretty -H -t 9606 Homo sapiens species 9913 Bos taurus species 376619 Francisella tularensis subsp. holarctica LVS strain 349741 Akkermansia muciniphila ATCC BAA-835 strain 239935 Akkermansia muciniphila species 314101 uncultured murine large bowel bacterium BAC 54B species 11932 Mouse Intracisternal A-particle species 1327037 Croceibacter phage P2559Y species 123124124 3 92489 Erwinia oleae species 1458427 Serpentinomonas raichei species Show lineage consisting of taxids: $ taxonkit lineage -t taxids.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314146;9443;376913;314293;9526;314295;9604;207598;9605;9606 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314145;91561;9845;35500;9895;27592;9903;9913 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 131567;2;1224;1236;72273;34064;262;263;119857;376619 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 131567;2;48479;314101 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;2559587;2732397;2732409;2732514;2169561;11632;35276;11749;11932 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 10239;2731341;2731360;2731618;2731619;28883;10699;196894;1327037 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 131567;2;1224;1236;91347;1903409;551;796334 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei 131567;2;1224;28216;80840;80864;2490452;1458425 or read taxids from STDIN: $ cat taxids.txt | taxonkit lineage And ranks of all nodes: $ echo 2697049 \\ | taxonkit lineage -t -R \\ | csvtk transpose -Ht 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 
superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank Another way to show lineage detail of a TaxId $ echo 2697049 \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 reformat Usage Reformat lineage in canonical ranks Input: - List of TaxIds or lineages, one record per line. The lineage can be a complete lineage or only one taxonomy name. - Or tab-delimited format. Plese specify the lineage field with flag -i/--lineage-field (default 2). Or specify the TaxId field with flag -I/--taxid-field (default 0), which overrides -i/--lineage-field. - Supporting (gzipped) file or STDIN. Output: 1. Input line data. 2. Reformated lineage. 3. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) Ambiguous names: - Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result Output format can be formated by flag --format, available placeholders: {k}: superkingdom {K}: kingdom {p}: phylum {c}: class {o}: order {f}: family {g}: genus {s}: species {t}: subspecies/strain {S}: subspecies {T}: strain When these're no nodes of rank \"subspecies\" nor \"strain\", you can switch on -S/--pseudo-strain to use the node with lowest rank as subspecies/strain name, if which rank is lower than \"species\". This flag affects {t}, {S}, {T}. 
Output format can contains some escape charactors like \"\\t\". Usage: taxonkit reformat [flags] Flags: -P, --add-prefix add prefixes for all ranks, single prefix for a rank is defined by flag --prefix-X -d, --delimiter string field delimiter in input lineage (default \";\") -F, --fill-miss-rank fill missing rank with lineage information of the next higher rank -f, --format string output format, placeholders of rank are needed (default \"{k};{p};{c};{o};{f};{g};{s}\") -h, --help help for reformat -i, --lineage-field int field index of lineage. data should be tab-separated (default 2) -r, --miss-rank-repl string replacement string for missing rank -p, --miss-rank-repl-prefix string prefix for estimated taxon level (default \"unclassified \") -s, --miss-rank-repl-suffix string suffix for estimated taxon names. \"rank\" for rank name, \"\" for no suffix (default \"rank\") -R, --miss-taxid-repl string replacement string for missing taxid -a, --output-ambiguous-result output one of the ambigous result --prefix-K string prefix for kingdom, used along with flag -P/--add-prefix (default \"K__\") --prefix-S string prefix for subspecies, used along with flag -P/--add-prefix (default \"S__\") --prefix-T string prefix for strain, used along with flag -P/--add-prefix (default \"T__\") --prefix-c string prefix for class, used along with flag -P/--add-prefix (default \"c__\") --prefix-f string prefix for family, used along with flag -P/--add-prefix (default \"f__\") --prefix-g string prefix for genus, used along with flag -P/--add-prefix (default \"g__\") --prefix-k string prefix for superkingdom, used along with flag -P/--add-prefix (default \"k__\") --prefix-o string prefix for order, used along with flag -P/--add-prefix (default \"o__\") --prefix-p string prefix for phylum, used along with flag -P/--add-prefix (default \"p__\") --prefix-s string prefix for species, used along with flag -P/--add-prefix (default \"s__\") --prefix-t string prefix for subspecies/strain, used along 
with flag -P/--add-prefix (default \"t__\") -S, --pseudo-strain use the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subpecies\" nor \"strain\". It affects {t}, {S}, {T}. This flag needs flag -F -t, --show-lineage-taxids show corresponding taxids of reformated lineage -I, --taxid-field int field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field -T, --trim do not fill or add prefix for missing rank lower than current rank Examples: For version > 0.8.0, reformat accept input of TaxIds via flag -I/--taxid-field . $ echo 239935 | taxonkit reformat -I 1 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo 349741 | taxonkit reformat -I 1 -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}|{t}\" -F -t 349741 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila|Akkermansia muciniphila ATCC BAA-835 2|74201|203494|48461|1647988|239934|239935|349741 Example lineage (produced by: taxonkit lineage taxids.txt | awk '$2!=\"\"' > lineage.txt ). 
$ cat lineage.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Default output format ( 
\"{k};{p};{c};{o};{f};{g};{s}\" ). # reformated lineages are appended to the input data $ taxonkit reformat lineage.txt ... 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila ... $ $ taxonkit reformat lineage.txt | tee lineage.txt.reformat $ cut -f 1,3 lineage.txt.reformat 9606 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens 9913 Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 92489 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # aligned $ cat lineage.txt \\ | taxonkit reformat \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- --------------- ------------------- ------------------ --------------- -------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo 
sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Croceibacter phage P2559Y 92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei And subspecies/strain ( {t} ), subspecies ( {S} ), and strain ( {T} ) are also available. 
# default operation $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- --------------------- --------------------- --------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 # fill missing ranks # see example below for -F/--fill-miss-rank # $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --fill-miss-rank \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------------ ----------------------------------------------------------------------------- ------------------------------------------------------------------------- 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain unclassified Akkermansia muciniphila subspecies unclassified Akkermansia muciniphila strain 83333 strain Escherichia coli K-12 Escherichia coli K-12 unclassified Escherichia coli subspecies Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 unclassified Escherichia coli R178 strain 
2697049 no rank Severe acute respiratory syndrome coronavirus 2 unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain unclassified Severe acute respiratory syndrome-related coronavirus subspecies unclassified Severe acute respiratory syndrome-related coronavirus strain 2605619 no rank Escherichia coli O16:H48 unclassified Escherichia coli subspecies/strain unclassified Escherichia coli subspecies unclassified Escherichia coli strain When these's no nodes of rank \"subspecies\" nor \"strain\", you can switch -S/--pseudo-strain to use the node with lowest rank as subspecies/strain name, if which rank is lower than \"species\" . Recommend using v0.14.1 or later versions. $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --pseudo-strain \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Add prefix ( -P/--add-prefix ). 
$ cat lineage.txt \\ | taxonkit reformat -P \\ | csvtk -H -t cut -f 1,3 9606 k__Eukaryota;p__Chordata;c__Mammalia;o__Primates;f__Hominidae;g__Homo;s__Homo sapiens 9913 k__Eukaryota;p__Chordata;c__Mammalia;o__Artiodactyla;f__Bovidae;g__Bos;s__Bos taurus 376619 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Francisellaceae;g__Francisella;s__Francisella tularensis 349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 239935 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 314101 k__Bacteria;p__;c__;o__;f__;g__;s__uncultured murine large bowel bacterium BAC 54B 11932 k__Viruses;p__Artverviricota;c__Revtraviricetes;o__Ortervirales;f__Retroviridae;g__Intracisternal A-particles;s__Mouse Intracisternal A-particle 1327037 k__Viruses;p__Uroviricota;c__Caudoviricetes;o__Caudovirales;f__Siphoviridae;g__;s__Croceibacter phage P2559Y 92489 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Erwinia;s__Erwinia oleae 1458427 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__Serpentinomonas;s__Serpentinomonas raichei Show corresponding taxids of reformated lineage (flag -t/--show-lineage-taxids ) $ cat lineage.txt \\ | taxonkit reformat -t \\ | csvtk -H -t cut -f 1,4 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- ------ ------- ------- ------- ------- ------- ------- 9606 2759 7711 40674 9443 9604 9605 9606 9913 2759 7711 40674 91561 9895 9903 9913 376619 2 1224 1236 72273 34064 262 263 349741 2 74201 203494 48461 1647988 239934 239935 239935 2 74201 203494 48461 1647988 239934 239935 314101 2 314101 11932 10239 2732409 2732514 2169561 11632 11749 11932 
1327037 10239 2731618 2731619 28883 10699 1327037 92489 2 1224 1236 91347 1903409 551 796334 1458427 2 1224 28216 80840 80864 2490452 1458425 Use custom symbols for unclassfied ranks ( -r/--miss-rank-repl ) $ taxonkit reformat lineage.txt -r \"__\" | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;__;__;__;__;__;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;__;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei $ taxonkit reformat lineage.txt -r Unassigned | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Unassigned;Unassigned;Unassigned;Unassigned;Unassigned;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 
Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;Unassigned;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Estimate and fill missing rank with original lineage information ( -F, --fill-miss-rank , very useful for formatting input data for LEfSe ). You can change the prefix \"unclassified\" using flag -p/--miss-rank-repl-prefix . $ cat lineage.txt \\ | taxonkit reformat -F \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- ---------------------------- --------------------------- --------------------------- ---------------------------- ------------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria unclassified Bacteria phylum unclassified Bacteria class unclassified Bacteria order unclassified Bacteria family unclassified Bacteria genus uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae unclassified Siphoviridae genus Croceibacter phage P2559Y 92489 Bacteria Proteobacteria 
Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei Do not add prefix or suffix for estimated nodes: $ echo 314101 | taxonkit reformat -I 1 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B $ echo 314101 | taxonkit reformat -I 1 -F -p \"\" -s \"\" 314101 Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;uncultured murine large bowel bacterium BAC 54B Only some ranks. $ cat lineage.txt \\ | taxonkit reformat -F -f \"{s};{p}\"\\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,species,phylum \\ | csvtk pretty -t taxid species phylum ------- ----------------------------------------------- ---------------------------- 9606 Homo sapiens Chordata 9913 Bos taurus Chordata 376619 Francisella tularensis Proteobacteria 349741 Akkermansia muciniphila Verrucomicrobia 239935 Akkermansia muciniphila Verrucomicrobia 314101 uncultured murine large bowel bacterium BAC 54B unclassified Bacteria phylum 11932 Mouse Intracisternal A-particle Artverviricota 1327037 Croceibacter phage P2559Y Uroviricota 92489 Erwinia oleae Proteobacteria 1458427 Serpentinomonas raichei Proteobacteria For some taxids which rank is higher than the lowest rank in -f/--format , use -T/--trim to avoid fill missing rank lower than current rank . 
$ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;unclassified Bacteria species 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;unclassified Akkermansia species 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Support tab in format string $ echo 9606 \\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{S}\" \\ | csvtk cut -t -f -2 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens List seven-level lineage for all TaxIds. 
# replace empty taxon with \"Unassigned\" $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned | gzip -c > all.lineage.tsv.gz # tab-delimited seven-levels $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # 8-level $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # Fill and trim $ memusg -t -s ' taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | gzip -c > all.lineage.tsv.gz ' elapsed time: 19.930s peak rss: 6.25 GB From taxid to 7-ranks lineage: $ cat taxids.txt | taxonkit lineage | taxonkit reformat # for taxonkit v0.8.0 or later versions $ cat taxids.txt | taxonkit reformat -I 1 Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result. see #42 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. 
But you can use -a/--output-ambiguous-result to return one possible result 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t -a 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530 name2taxid Usage Convert scientific names to TaxIds Attention: 1. Some TaxIds share the same scientific names, e.g, Drosophila. These input lines are duplicated with multiple TaxIds. $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus Usage: taxonkit name2taxid [flags] Flags: -h, --help help for name2taxid -i, --name-field int field index of name. 
data should be tab-separated (default 1) -s, --sci-name only searching scientific names -r, --show-rank show rank Examples Example data $ cat names.txt Homo sapiens Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y Default. # taxonkit name2taxid names.txt $ cat names.txt | taxonkit name2taxid | csvtk pretty -H -t Homo sapiens 9606 Akkermansia muciniphila ATCC BAA-835 349741 Akkermansia muciniphila 239935 Mouse Intracisternal A-particle 11932 Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 Croceibacter phage P2559Y 1327037 Show rank. $ cat names.txt | taxonkit name2taxid --show-rank | csvtk pretty -H -t Homo sapiens 9606 species Akkermansia muciniphila ATCC BAA-835 349741 strain Akkermansia muciniphila 239935 species Mouse Intracisternal A-particle 11932 species Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 species Croceibacter phage P2559Y 1327037 species From name to lineage. 
$ cat names.txt | taxonkit name2taxid | taxonkit lineage --taxid-field 2 Homo sapiens 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Akkermansia muciniphila ATCC BAA-835 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Mouse Intracisternal A-particle 11932 Viruses;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Some TaxIds share the same scientific names , e.g, Drosophila. $ echo Drosophila \\ | taxonkit name2taxid \\ | taxonkit lineage -i 2 -r \\ | taxonkit reformat -i 3 \\ | csvtk cut -H -t -f 1,2,4,5 \\ | csvtk pretty -H -t Drosophila 7215 genus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 32281 subgenus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 2081351 genus Eukaryota;Basidiomycota;Agaricomycetes;Agaricales;Psathyrellaceae;Drosophila; filter Usage Filter TaxIds by taxonomic rank range Attentions: 1. Flag -L/--lower-than and -H/--higher-than are exclusive, and can be used along with -E/--equal-to which values can be different. 2. 
A list of pre-ordered ranks is in ~/.taxonkit/ranks.txt; you can supply your own list via -r/--rank-file, whose format is specified below. 3. All ranks in taxonomy database should be defined in rank file. 4. Ranks can be removed with black list via -B/--black-list. 5. TaxIDs with no rank are kept by default!!! They can be optionally discarded by -N/--discard-noranks. 6. [Recommended] When filtering with -L/--lower-than, you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff. Rank file: 1. Blank lines or lines starting with \"#\" are ignored. 2. Ranks are in descending order and case is ignored. 3. Ranks with same order should be in one line separated with comma (\",\", no space). 4. Ranks without order should be assigned a prefix symbol \"!\" for each rank. Usage: taxonkit filter [flags] Flags: -B, --black-list strings black list of ranks to discard, e.g., '-B \"no rank\" -B \"clade\"' -N, --discard-noranks discard all ranks without order, type \"taxonkit filter --help\" for details -R, --discard-root discard root taxid, defined by --root-taxid -E, --equal-to strings output TaxIds with rank equal to some ranks, multiple values can be separated with comma \",\" (e.g., -E \"genus,species\"), or give multiple times (e.g., -E genus -E species) -h, --help help for filter -H, --higher-than string output TaxIds with rank higher than a rank, exclusive with --lower-than --list-order list user defined ranks in order, from \"$HOME/.taxonkit/ranks.txt\" --list-ranks list ordered ranks in taxonomy database, sorted in user defined order -L, --lower-than string output TaxIds with rank lower than a rank, exclusive with --higher-than -r, --rank-file string user-defined ordered taxonomic ranks, type \"taxonkit filter --help\" for details --root-taxid uint32 root taxid (default 1) -n, --save-predictable-norank do not discard some special ranks without order when using -L, where rank of the closest 
higher node is still lower than rank cutoff -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1) Examples Example data $ echo 349741 | taxonkit lineage -t | cut -f 3 | sed 's/;/\\n/g' > taxids2.txt $ cat taxids2.txt 131567 2 1783257 74201 203494 48461 1647988 239934 239935 349741 $ cat taxids2.txt | taxonkit lineage -r | csvtk -Ht cut -f 1,3,2 | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom cellular organisms;Bacteria 1783257 clade cellular organisms;Bacteria;PVC group 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae 48461 order cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales 1647988 family cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae 239934 genus cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 349741 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Equal to certain rank(s) ( -E/--equal-to ) $ cat taxids2.txt \\ | taxonkit filter -E Phylum -E Class \\ | taxonkit lineage -r \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae Lower than a rank ( -L/--lower-than ) $ cat taxids2.txt \\ | taxonkit filter -L genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Higher than a rank ( 
-H/--higher-than ) $ cat taxids2.txt \\ | taxonkit filter -H phylum \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 2 superkingdom Bacteria TaxIDs with no rank are kept by default!!! \"no rank\" and \"clade\" have no rank and can be filtered out via -N/--discard-noranks . Further ranks can be removed with a black list via -B/--black-list . # 562 is the TaxId of Escherichia coli $ taxonkit list --ids 562 \\ | taxonkit filter -L species \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t strain 2950 no rank 149 serotype 141 serogroup 95 isolate 1 subspecies 1 $ taxonkit list --ids 562 \\ | taxonkit filter -L species -N -B strain \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t serotype 141 serogroup 95 isolate 1 subspecies 1 Combination of -L/-H with -E . $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Special cases of \"no rank\" . ( -n/--save-predictable-norank ). When filtering with -L/--lower-than , you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff. 
$ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom Bacteria 1224 phylum Proteobacteria 1236 class Gammaproteobacteria 91347 order Enterobacterales 543 family Enterobacteriaceae 561 genus Escherichia 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 10239 superkingdom Viruses 2731341 clade Duplodnaviria 2731360 clade Heunggongvirae 2731618 phylum Uroviricota 2731619 class Caudoviricetes 28883 order Caudovirales 10699 family Siphoviridae 196894 no rank unclassified Siphoviridae 1327037 species Croceibacter phage P2559Y # save taxids $ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | tee taxids4.txt 131567 2 1224 1236 91347 543 561 562 2605619 10239 2731341 2731360 2731618 2731619 28883 10699 196894 1327037 Now, filter nodes of rank <= species. $ cat taxids4.txt \\ | taxonkit filter -L species -E species -N -n \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 1327037 species Croceibacter phage P2559Y Note that 2605619 (no rank) is saved because its parent node 562 is <= species. lca Usage Compute lowest common ancestor (LCA) for TaxIds Attention: 1. This command computes LCA TaxId for a list of TaxIds in a field (\"-i/--taxids-field\") of tab-delimited file or STDIN. 2. TaxIDs should have the same separator (\"-s/--separator\"), single character separator is preferred. 3. Empty lines or lines without valid TaxIds in the field are omitted. 4. If some TaxIds are not found in database, it returns 0. 
Examples: $ echo 239934, 239935, 349741 | taxonkit lca -s \", \" 239934, 239935, 349741 239934 $ time echo 239934 239935 349741 9606 | taxonkit lca 239934 239935 349741 9606 131567 Usage: taxonkit lca [flags] Flags: -b, --buffer-size string size of line buffer, supported unit: K, M, G. You need to increase the value when \"bufio.Scanner: token too long\" error occurred (default \"1M\") -h, --help help for lca --separater string separator for TaxIds. This flag is the same as --separator. (default \" \") -s, --separator string separator for TaxIds (default \" \") -D, --skip-deleted skip deleted TaxIds and compute with left ones -U, --skip-unfound skip unfound TaxIds and compute with left ones -i, --taxids-field int field index of TaxIds. Input data should be tab-separated (default 1) Examples: Example data $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample Simple one $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 Custom field ( -i/--taxids-field ) and separator ( -s/--separator ). $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606 Merged TaxIds. # merged $ echo 92487 92488 92489 | taxonkit lca 10:08:26.578 [WARN] taxid 92489 was merged into 796334 92487 92488 92489 1236 Deleted TaxIds: you can omit these and continue computing with the remaining ones using ( -D/--skip-deleted ). 
$ echo 1 2 3 | taxonkit lca 10:30:17.678 [WARN] taxid 3 not found 1 2 3 0 $ time echo 1 2 3 | taxonkit lca -D 10:29:31.828 [WARN] taxid 3 was deleted 1 2 3 1 TaxIDs not found in database: you can omit these and continue computing with the remaining ones using ( -U/--skip-unfound ). $ echo 61021 61022 11111111 | taxonkit lca 10:31:44.929 [WARN] taxid 11111111 not found 61021 61022 11111111 0 $ echo 61021 61022 11111111 | taxonkit lca -U 10:32:02.772 [WARN] taxid 11111111 not found 61021 61022 11111111 2628496 taxid-changelog Usage Create TaxId changelog from dump archives Steps: # dependencies: # rush - https://github.com/shenwei356/rush/ mkdir -p archive; cd archive; # --------- download --------- # option 1 # for fast network connection wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp*.zip # option 2 # for slow network connection url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/ wget $url -O - -o /dev/null \\ | grep taxdmp | perl -ne '/(taxdmp_.+?.zip)/; print \"$1\\n\";' \\ | rush -j 2 -v url=$url 'axel -n 5 {url}/{}' \\ --immediate-output -c -C download.rush # --------- unzip --------- ls taxdmp*.zip | rush -j 1 'unzip {} names.dmp nodes.dmp merged.dmp delnodes.dmp -d {@_(.+)\\.}' # optionally compress .dmp files with pigz, for saving disk space fd .dmp$ | rush -j 4 'pigz {}' # --------- create log --------- cd .. 
taxonkit taxid-changelog -i archive -o taxid-changelog.csv.gz --verbose Output format (CSV): # fields comments taxid # taxid version # version / time of archive, e.g., 2019-07-01 change # change, values: # NEW newly added # REUSE_DEL deleted taxids being reused # REUSE_MER merged taxids being reused # DELETE deleted # MERGE merged into another taxid # ABSORB other taxids merged into this one # CHANGE_NAME scientific name changed # CHANGE_RANK rank changed # CHANGE_LIN_LIN lineage taxids remain but lineage changed # CHANGE_LIN_TAX lineage taxids changed # CHANGE_LIN_LEN lineage length changed change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage # you can use csvtk to investigate them. e.g., csvtk grep -f taxid -p 1390515 taxid-changelog.csv.gz Usage: taxonkit taxid-changelog [flags] Flags: -i, --archive string directory containing uncompressed dumped archives -h, --help help for taxid-changelog Details Example 1 ( E. coli with taxid 562 ) $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 562 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 562 2014-08-01 NEW Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2014-08-01 ABSORB 662101;662104 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2015-11-01 ABSORB 1637691 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2016-10-01 CHANGE_LIN_LIN Escherichia coli species 
cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2018-06-01 ABSORB 469598 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 # merged taxids $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 662101,662104,1637691,469598 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 469598 2014-08-01 NEW Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2016-10-01 CHANGE_LIN_LIN Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2018-06-01 MERGE 562 Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 662101 2014-08-01 MERGE 562 662104 2014-08-01 MERGE 562 1637691 2015-04-01 DELETE 1637691 2015-05-01 REUSE_DEL Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 1637691 2015-11-01 MERGE 562 Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 Example 2 (SARS-CoV-2). 
$ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 2697049 2020-02-01 NEW Wuhan seafood market pneumonia virus species Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;unclassified Betacoronavirus;Wuhan seafood market pneumonia virus 10239;2559587;76804;2499399;11118;2501931;694002;696098;2697049 2697049 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 real 0m7.644s user 0m16.749s sys 0m3.985s Example 3 (All subspecies and strain in Akkermansia muciniphila 239935) # species in Akkermansia $ taxonkit list --show-rank --show-name --indent \" \" --ids 239935 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 # check them all $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -P <(taxonkit list --indent \"\" --ids 239935) \\ | csvtk pretty lineage-taxids taxid version change change-value name rank lineage lineage-taxids 239935 2014-08-01 NEW Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;203557;239934;239935 239935 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 
131567;2;51290;74201;203494;48461;1647988;239934;239935 239935 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 239935 2016-05-01 ABSORB 1834199 Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 349741 2014-08-01 NEW Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;203557;239934;239935;349741 349741 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;1647988;239934;239935;349741 349741 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 349741 2020-07-01 CHANGE_RANK Akkermansia muciniphila ATCC BAA-835 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 More create-taxdump Usage Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and 
ICTV Input format: 0. For GTDB taxonomy file, just use --gtdb. We use the numeric assembly accession as the taxon at subspecies rank (without the prefix GCA_ or GCF_, and without the version number). 1. The input file should be tab-delimited, at least one column is needed. 2. Ranks can be given either via the first row or the flag --rank-names. 3. The column containing the genome/assembly accession is recommended to generate TaxId mapping file (taxid.map, id -> taxid). -A/--field-accession, field containing genome/assembly accession --field-accession-re, regular expression to extract the accession Note that multiple TaxIds pointing to the same accession are listed as comma-separated integers. Attentions: 1. Names should be distinct in taxa of different ranks. But for lineages missing some taxon nodes, using names of parent nodes is allowed: GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155 It can also detect duplicate names with different ranks, e.g., the Class and Genus have the same name B47-G6, and the Order and Family between them have different names. In this case, we reassign a new TaxId by increasing the TaxId until it is distinct. GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585 2. Taxa from different parents may have the same name. We will assign different TaxIds to them. E.g., in ICTV, many viruses from different species have the same names. In practice, we set the \"Virus name(s)\" as a subspecies rank and also specify it as the accession. Species Virus name(s) Jerseyvirus SETP3 Salmonella phage SETP7 Jerseyvirus SETP7 Salmonella phage SETP7 3. 
The generated TaxIds are not consecutive numbers. However, some tools like MMseqs2 require consecutive TaxIds; you can use the script below for conversion: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py Usage: taxonkit create-taxdump [flags] Flags: -A, --field-accession int field index of assembly accession (genome ID), for outputting taxid.map --field-accession-re string regular expression to extract assembly accession (default \"^\\\\w\\\\w_(.+)$\") --force overwrite existed output directory --gtdb input files are GTDB taxonomy file --gtdb-re-subs string regular expression to extract assembly accession as the subspecies (default \"^\\\\w\\\\w_GC[AF]_(.+)\\\\.\\\\d+$\") -h, --help help for create-taxdump --line-chunk-size int number of lines to process for each thread, and 4 threads is fast enough. (default 5000) --null strings null value of taxa (default [,NULL,NA]) -x, --old-taxdump-dir string taxdump directory of the previous version, for generating merged.dmp and delnodes.dmp -O, --out-dir string output directory -R, --rank-names strings names of all ranks, leave it empty to use the first row of input as rank names Examples: GTDB. See more: https://github.com/shenwei356/gtdb-taxdump $ taxonkit create-taxdump --gtdb ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz --out-dir taxdump 16:42:35.213 [INFO] 317542 records saved to taxdump/taxid.map 16:42:35.460 [INFO] 401815 records saved to taxdump/nodes.dmp 16:42:35.611 [INFO] 401815 records saved to taxdump/names.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/merged.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/delnodes.dmp ICTV. See more: https://github.com/shenwei356/ictv-taxdump MGV . Only Order, Family, and Genus information is available. 
$ cat mgv_contig_info.tsv \\ | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \\ | sed 1d \\ > mgv.tsv $ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species 23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map 23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp 23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp 23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp 23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp $ head -n 5 mgv/taxid.map MGV-GENOME-0364295 677052301 MGV-GENOME-0364296 677052301 MGV-GENOME-0364303 1414406025 MGV-GENOME-0364311 1849074420 MGV-GENOME-0364312 2074846424 $ echo 677052301 | taxonkit lineage --data-dir mgv/ 677052301 Caudovirales;crAss-phage;OTU-61123 $ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P 677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123 $ grep MGV-GENOME-0364295 mgv.tsv Caudovirales crAss-phage NULL OTU-61123 MGV-GENOME-0364295 Custom lineages with the first row as rank names and treating one column as accession. 
$ csvtk pretty -t example/taxonomy.tsv id superkingdom phylum class order family genus species --------------- ------------ -------------- ------------------- ---------------- ------------------ -------------- -------------------------- GCF_001027105.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus GCF_001096185.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus pneumoniae GCF_001544255.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecium GCF_002949675.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella dysenteriae GCF_002950215.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella flexneri GCF_006742205.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis GCF_000006945.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella Salmonella enterica GCF_000017205.1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa GCF_003697165.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli GCF_009759685.1 Bacteria Proteobacteria Gammaproteobacteria Moraxellales Moraxellaceae Acinetobacter Acinetobacter baumannii GCF_000148585.2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus mitis GCF_000392875.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis GCF_000742135.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Klebsiella Klebsiella pneumonia # the first column as accession $ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump 16:31:31.828 [INFO] I will use the first row of input as rank names 16:31:31.843 [INFO] 13 records saved 
to example/taxdump/taxid.map 16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp 16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/merged.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp $ export TAXONKIT_DB=example/taxdump $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species 3809813362 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species 4145431389 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species 1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species 1920251658 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species 3843752343 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species 72054943 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species 1678121664 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species 524994882 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species 2695851945 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species 3958205156 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species 4093283224 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species 
$ head -n 3 example/taxdump/taxid.map GCF_001027105.1 1569132721 GCF_001096185.1 2983929374 GCF_001544255.1 4145431389 Custom lineages with the first row as rank names (pure lineage data) $ csvtk cut -t -f 2- example/taxonomy.tsv | head -n 2 | csvtk pretty -t superkingdom phylum class order family genus species ------------ ---------- ------- ---------- ----------------- -------------- --------------------- Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus $ csvtk cut -t -f 2- example/taxonomy.tsv \\ | taxonkit create-taxdump -O example/taxdump2 16:53:08.604 [INFO] I will use the first row of input as rank names 16:53:08.614 [INFO] 39 records saved to example/taxdump2/nodes.dmp 16:53:08.614 [INFO] 39 records saved to example/taxdump2/names.dmp 16:53:08.614 [INFO] 0 records saved to example/taxdump2/merged.dmp 16:53:08.615 [INFO] 0 records saved to example/taxdump2/delnodes.dmp $ export TAXONKIT_DB=example/taxdump2 $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species genautocomplete Usage Generate shell autocompletion script Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Usage: taxonkit genautocomplete [flags] Flags: --file string autocompletion file (default \"/home/shenwei/.bash_completion.d/taxonkit.sh\") -h, --help help for genautocomplete --type string autocompletion type (currently only bash supported) (default \"bash\") profile2cami Usage Convert metagenomic profile table to CAMI format Input format: 1. The input file should be tab-delimited 2. At least two columns needed: a) TaxId of taxon at species or lower rank. b) Abundance (could be percentage, automatically detected or use -p/--percentage). Attentions: 1. Some TaxIds may be merged to another ones in current taxonomy version, the abundances will be summed up. 2. Some TaxIds may be deleted in current taxonomy version, the abundances can be optionally recomputed with the flag -R/--recompute-abd. Usage: taxonkit profile2cami [flags] Flags: -a, --abundance-field int field index of abundance. input data should be tab-separated (default 2) -h, --help help for profile2cami -0, --keep-zero keep taxons with abundance of zero -p, --percentage abundance is in percentage -R, --recompute-abd recompute abundance if some TaxIds are deleted in current taxonomy version -s, --sample-id string sample ID in result file -r, --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) -t, --taxonomy-id string taxonomy ID in result file Examples Test data, note that 2824115 is merged to 483329 and 1657696 is deleted in current taxonomy version. $ cat example/abundance.tsv 2824115 0.2 merged to 483329 483329 0.2 absord 2824115 239935 0.5 no change 1657696 0.1 deleted Example: $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv 13:17:40.552 [WARN] taxid is deleted in current taxonomy version: 1657696 13:17:40.552 [WARN] you may recomputed abundance with the flag -R/--recompute-abd @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 50.000000000000000 2759 superkingdom 2759 Eukaryota 40.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 50.000000000000000 6656 phylum 2759|6656 Eukaryota|Arthropoda 40.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 50.000000000000000 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 40.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 50.000000000000000 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 40.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 50.000000000000000 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 40.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 50.000000000000000 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 40.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 
Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 50.000000000000000 483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 40.000000000000000 Recompute (normalize) the abundance $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd 13:19:23.647 [WARN] taxid is deleted in current taxonomy version: 1657696 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 55.555555555555557 2759 superkingdom 2759 Eukaryota 44.444444444444450 74201 phylum 2|74201 Bacteria|Verrucomicrobia 55.555555555555557 6656 phylum 2759|6656 Eukaryota|Arthropoda 44.444444444444450 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 55.555555555555557 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 44.444444444444450 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 55.555555555555557 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 44.444444444444450 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 55.555555555555557 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 44.444444444444450 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 55.555555555555557 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 44.444444444444450 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 55.555555555555557 483329 species 
2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 44.444444444444450 See https://github.com/shenwei356/sun2021-cami-profiles cami-filter Usage Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile Input format: The CAMI (Taxonomic) Profiling Output Format - https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd - One file with multiple samples is also supported. How to: - No extra taxonomy data is needed, so the original taxonomic information is used and not changed. - A mini taxonomic tree is built from records with abundance greater than zero, and only leaves are retained for later use. The rank of leaves may be \"strain\", \"species\", or \"no rank\". - Relative abundances (in percentage) are recomputed for all leaves (reference genomes). - A new taxonomic tree is built from these leaves, and abundances are cumulatively added up from leaves to the root. Examples: 1. Remove Archaea, Bacteria, and Eukaryota, only keep Viruses: taxonkit cami-filter -t 2,2157,2759 test.profile -o test.filter.profile 2. 
Remove Viruses: taxonkit cami-filter -t 10239 test.profile -o test.filter.profile Usage: taxonkit cami-filter [flags] Flags: --field-percentage int field index of PERCENTAGE (default 5) --field-rank int field index of taxid (default 2) --field-taxid int field index of taxid (default 1) --field-taxpath int field index of TAXPATH (default 3) --field-taxpathsn int field index of TAXPATHSN (default 4) -h, --help help for cami-filter --leaf-ranks strings only consider leaves at these ranks (default [species,strain,no rank]) --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) --taxid-sep string separator of taxid in TAXPATH and TAXPATHSN (default \"|\") -t, --taxids strings the parent taxid(s) to filter out -f, --taxids-file strings file(s) for the parent taxid(s) to filter out, one taxid per line Examples: Remove Eukaryota taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd \\ | taxonkit cami-filter -t 2759 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 100.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 100.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 100.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 100.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 100.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 
100.000000000000000","title":"Usage"},{"location":"usage/#usage-and-examples","text":"Table of Contents Usage and Examples Before use taxonkit list lineage reformat name2taxid filter lca taxid-changelog profile2cami cami-filter create-taxdump genautocomplete","title":"Usage and Examples"},{"location":"usage/#before-use","text":"Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . 
All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress them, and overwrite the old ones.","title":"Before use"},{"location":"usage/#taxonkit","text":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit Version: 0.14.2 Author: Wei Shen Source code: https://github.com/shenwei356/taxonkit Documents : https://bioinf.shenwei.me/taxonkit Citation : https://www.sciencedirect.com/science/article/pii/S1673852721000837 Dataset: Please download and uncompress \"taxdump.tar.gz\": ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and copy \"names.dmp\", \"nodes.dmp\", \"delnodes.dmp\" and \"merged.dmp\" to data directory: \"/home/shenwei/.taxonkit\" or some other directory, and later you can refer to using flag --data-dir, or environment variable TAXONKIT_DB. When environment variable TAXONKIT_DB is set, explicitly setting --data-dir will override the value of TAXONKIT_DB. 
Usage: taxonkit [command] Available Commands: cami-filter Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV filter Filter TaxIds by taxonomic rank range genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell) lca Compute lowest common ancestor (LCA) for TaxIds lineage Query taxonomic lineage of given TaxIds list List taxonomic subtrees of given TaxIds name2taxid Convert scientific names to TaxIds profile2cami Convert metagenomic profile table to CAMI format reformat Reformat lineage in canonical ranks taxid-changelog Create TaxId changelog from dump archives version print version information and check for update Flags: --data-dir string directory containing nodes.dmp and names.dmp (default \"/home/shenwei/.taxonkit\") -h, --help help for taxonkit --line-buffered use line buffering on output, i.e., immediately writing to stdout/file for every line of output -o, --out-file string out file (\"-\" for stdout, suffix .gz for gzipped out) (default \"-\") -j, --threads int number of CPUs. 4 is enough (default 4) --verbose print verbose information","title":"taxonkit"},{"location":"usage/#list","text":"Usage List taxonomic subtrees of given TaxIds Attention: 1. When multiple taxids are given, the output may contain duplicated records if some taxids are descendants of others. Examples: $ taxonkit list --ids 9606 -n -r --indent \" \" 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' $ taxonkit list --ids 9606 --indent \"\" 9606 63221 741158 Usage: taxonkit list [flags] Flags: -h, --help help for list -i, --ids string TaxId(s), multiple values should be separated by comma -I, --indent string indent (default \" \") -J, --json output in JSON format. 
you can save the result in file with suffix \".json\" and open with modern text editor -n, --show-name output scientific name -r, --show-rank output rank Examples Default usage. $ taxonkit list --ids 9605,239934 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 2608915 1131336 ... Removing indent. The list could be used to extract sequences from BLAST database with blastdbcmd (see tutorial ) $ taxonkit list --ids 9605,239934 --indent \"\" 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 ... Performance: Time and memory usage for whole taxon tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --ids 1 --indent \"\" --verbose > t0.txt 21:05:01.782 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp 21:05:01.816 [INFO] 61023 merged nodes parsed 21:05:01.889 [INFO] 437929 delnodes parsed 21:05:03.178 [INFO] 2303979 names parsed elapsed time: 3.290s peak rss: 742.77 MB Adding names $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. 
CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 1574264 [species] Akkermansia sp. KLE1797 ... Performance: Time and memory usage for whole taxonomy tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --show-rank --show-name --ids 1 > t1.txt elapsed time: 5.341s peak rss: 1.04 GB Output in JSON format, you can easily collapse and uncollapse taxonomy tree in modern text editor. $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 --json { \"9605 [genus] Homo\": { \"9606 [species] Homo sapiens\": { \"63221 [subspecies] Homo sapiens neanderthalensis\": { }, \"741158 [subspecies] Homo sapiens subsp. 'Denisova'\": { } }, \"1425170 [species] Homo heidelbergensis\": { } }, \"239934 [genus] Akkermansia\": { \"239935 [species] Akkermansia muciniphila\": { \"349741 [no rank] Akkermansia muciniphila ATCC BAA-835\": { } }, \"512293 [no rank] environmental samples\": { \"512294 [species] uncultured Akkermansia sp.\": { }, \"1131822 [species] uncultured Akkermansia sp. SMG25\": { }, \"1262691 [species] Akkermansia sp. CAG:344\": { }, \"1263034 [species] Akkermansia muciniphila CAG:154\": { } }, \"1679444 [species] Akkermansia glycaniphila\": { }, \"2608915 [no rank] unclassified Akkermansia\": { \"1131336 [species] Akkermansia sp. KLE1605\": { }, \"1574264 [species] Akkermansia sp. KLE1797\": { }, \"1574265 [species] Akkermansia sp. KLE1798\": { }, \"1638783 [species] Akkermansia sp. UNK.MGS-1\": { }, \"1755639 [species] Akkermansia sp. MC_55\": { } } } } Snapshot of taxonomy (taxid 1) in kate:","title":"list"},{"location":"usage/#lineage","text":"Usage Query taxonomic lineage of given TaxIds Input: - List of TaxIds, one TaxId per line. - Or tab-delimited format, please specify TaxId field with flag -i/--taxid-field (default 1). 
- Supporting (gzipped) file or STDIN. Output: 1. Input line data. 2. (Optional) Status code (-c/--show-status-code), values: - \"-1\" for queries not found in whole database. - \"0\" for deleted TaxIds, provided by \"delnodes.dmp\". - New TaxIds for merged TaxIds, provided by \"merged.dmp\". - Taxids for these found in \"nodes.dmp\". 3. Lineage, delimiter can be changed with flag -d/--delimiter. 4. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) 5. (Optional) Name (-n/--show-name) 6. (Optional) Rank (-r/--show-rank) Filter out invalid and deleted taxids, and replace merged taxids with new ones: # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. $ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Usage: taxonkit lineage [flags] Flags: -d, --delimiter string field delimiter in lineage (default \";\") -h, --help help for lineage -L, --no-lineage do not show lineage, when user just want names or/and ranks -R, --show-lineage-ranks appending ranks of all levels -t, --show-lineage-taxids appending lineage consisting of taxids -n, --show-name appending scientific name -r, --show-rank appending rank of taxids -c, --show-status-code show status code before lineage -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) Examples Full lineage: # note that 123124124 is a fake taxid, 3 was deleted, 92489,1458427 were merged $ cat taxids.txt 9606 9913 376619 349741 239935 314101 11932 1327037 123124124 3 92489 1458427 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # wrapped table with csvtk pretty (>v0.26.0) $ taxonkit lineage taxids.txt | csvtk pretty -Ht -x ';' -W 70 -S bold \u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513 \u2503 9606 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 
Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates; \u2503 \u2503 \u2503 Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae; \u2503 \u2503 \u2503 Homo;Homo sapiens \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 9913 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla; \u2503 \u2503 \u2503 Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 376619 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Thiotrichales;Francisellaceae;Francisella;Francisella tularensis; \u2503 \u2503 \u2503 Francisella tularensis subsp. 
holarctica; \u2503 \u2503 \u2503 Francisella tularensis subsp. holarctica LVS \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 349741 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 239935 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila \u2503 
\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 314101 \u2503 cellular organisms;Bacteria;environmental samples; \u2503 \u2503 \u2503 uncultured murine large bowel bacterium BAC 54B \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 11932 \u2503 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes; \u2503 \u2503 \u2503 Ortervirales;Retroviridae;unclassified Retroviridae; \u2503 \u2503 \u2503 Intracisternal A-particles;Mouse Intracisternal A-particle \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1327037 \u2503 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes; 
\u2503 \u2503 \u2503 Caudovirales;Siphoviridae;unclassified Siphoviridae; \u2503 \u2503 \u2503 Croceibacter phage P2559Y \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 92489 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1458427 \u2503 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria; \u2503 \u2503 \u2503 Burkholderiales;Comamonadaceae;Serpentinomonas; \u2503 \u2503 \u2503 Serpentinomonas raichei \u2503 
\u2517\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u253b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u251b Speed. $ time echo 9606 | taxonkit lineage 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens real 0m1.190s user 0m2.365s sys 0m0.170s # all TaxIds $ time taxonkit list --ids 1 --indent \"\" | taxonkit lineage > t real 0m4.249s user 0m16.418s sys 0m1.221s Checking deleted or merged taxids $ taxonkit lineage --show-status-code taxids.txt | tee lineage.withcode.txt # valid $ cat lineage.withcode.txt | awk '$2 > 0' | cut -f 1,2 9606 9606 9913 9913 376619 376619 349741 349741 239935 239935 314101 314101 11932 11932 1327037 1327037 92489 796334 1458427 1458425 # merged $ cat lineage.withcode.txt | awk '$2 > 0 && $2 != $1' | cut -f 1,2 92489 796334 1458427 1458425 # deleted $ cat lineage.withcode.txt | awk '$2 == 0' | cut -f 1 3 # invalid $ cat lineage.withcode.txt | awk '$2 < 0' | cut -f 1 123124124 Filter out invalid and deleted taxids, and replace merged taxids with new ones , you may install csvtk . # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. 
$ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Only show name and rank. $ taxonkit lineage -r -n -L taxids.txt \\ | csvtk pretty -H -t 9606 Homo sapiens species 9913 Bos taurus species 376619 Francisella tularensis subsp. holarctica LVS strain 349741 Akkermansia muciniphila ATCC BAA-835 strain 239935 Akkermansia muciniphila species 314101 uncultured murine large bowel bacterium BAC 54B species 11932 Mouse Intracisternal A-particle species 1327037 Croceibacter phage P2559Y species 123124124 3 92489 Erwinia oleae species 1458427 Serpentinomonas raichei species Show lineage consisting of taxids: $ taxonkit lineage -t taxids.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314146;9443;376913;314293;9526;314295;9604;207598;9605;9606 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314145;91561;9845;35500;9895;27592;9903;9913 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 131567;2;1224;1236;72273;34064;262;263;119857;376619 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 131567;2;48479;314101 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;2559587;2732397;2732409;2732514;2169561;11632;35276;11749;11932 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 10239;2731341;2731360;2731618;2731619;28883;10699;196894;1327037 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 131567;2;1224;1236;91347;1903409;551;796334 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei 131567;2;1224;28216;80840;80864;2490452;1458425 or read taxids from STDIN: $ cat taxids.txt | taxonkit lineage And ranks of all nodes: $ echo 2697049 \\ | taxonkit lineage -t -R \\ | csvtk transpose -Ht 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 
superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank Another way to show lineage detail of a TaxId $ echo 2697049 \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2","title":"lineage"},{"location":"usage/#reformat","text":"Usage Reformat lineage in canonical ranks Input: - List of TaxIds or lineages, one record per line. The lineage can be a complete lineage or only one taxonomy name. - Or tab-delimited format. Please specify the lineage field with flag -i/--lineage-field (default 2). Or specify the TaxId field with flag -I/--taxid-field (default 0), which overrides -i/--lineage-field. - Supporting (gzipped) file or STDIN. Output: 1. Input line data. 2. Reformatted lineage. 3. (Optional) TaxIds of taxons in the lineage (-t/--show-lineage-taxids) Ambiguous names: - Some TaxIds have the same complete lineage; an empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result Output format can be set with the flag --format, available placeholders: {k}: superkingdom {K}: kingdom {p}: phylum {c}: class {o}: order {f}: family {g}: genus {s}: species {t}: subspecies/strain {S}: subspecies {T}: strain When there are no nodes of rank \"subspecies\" or \"strain\", you can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, if its rank is lower than \"species\". 
This flag affects {t}, {S}, {T}. The output format can contain escape characters like \"\\t\". Usage: taxonkit reformat [flags] Flags: -P, --add-prefix add prefixes for all ranks, single prefix for a rank is defined by flag --prefix-X -d, --delimiter string field delimiter in input lineage (default \";\") -F, --fill-miss-rank fill missing rank with lineage information of the next higher rank -f, --format string output format, placeholders of rank are needed (default \"{k};{p};{c};{o};{f};{g};{s}\") -h, --help help for reformat -i, --lineage-field int field index of lineage. data should be tab-separated (default 2) -r, --miss-rank-repl string replacement string for missing rank -p, --miss-rank-repl-prefix string prefix for estimated taxon level (default \"unclassified \") -s, --miss-rank-repl-suffix string suffix for estimated taxon names. \"rank\" for rank name, \"\" for no suffix (default \"rank\") -R, --miss-taxid-repl string replacement string for missing taxid -a, --output-ambiguous-result output one of the ambiguous results --prefix-K string prefix for kingdom, used along with flag -P/--add-prefix (default \"K__\") --prefix-S string prefix for subspecies, used along with flag -P/--add-prefix (default \"S__\") --prefix-T string prefix for strain, used along with flag -P/--add-prefix (default \"T__\") --prefix-c string prefix for class, used along with flag -P/--add-prefix (default \"c__\") --prefix-f string prefix for family, used along with flag -P/--add-prefix (default \"f__\") --prefix-g string prefix for genus, used along with flag -P/--add-prefix (default \"g__\") --prefix-k string prefix for superkingdom, used along with flag -P/--add-prefix (default \"k__\") --prefix-o string prefix for order, used along with flag -P/--add-prefix (default \"o__\") --prefix-p string prefix for phylum, used along with flag -P/--add-prefix (default \"p__\") --prefix-s string prefix for species, used along with flag -P/--add-prefix (default \"s__\") --prefix-t string prefix 
for subspecies/strain, used along with flag -P/--add-prefix (default \"t__\") -S, --pseudo-strain use the node with lowest rank as strain name, only if its rank is lower than \"species\" and not \"subspecies\" nor \"strain\". It affects {t}, {S}, {T}. This flag needs flag -F -t, --show-lineage-taxids show corresponding taxids of reformatted lineage -I, --taxid-field int field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field -T, --trim do not fill or add prefix for missing rank lower than current rank Examples: For v0.8.0 or later versions, reformat accepts TaxIds as input via the flag -I/--taxid-field . $ echo 239935 | taxonkit reformat -I 1 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo 349741 | taxonkit reformat -I 1 -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}|{t}\" -F -t 349741 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila|Akkermansia muciniphila ATCC BAA-835 2|74201|203494|48461|1647988|239934|239935|349741 Example lineage (produced by: taxonkit lineage taxids.txt | awk '$2!=\"\"' > lineage.txt ). 
$ cat lineage.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Default output format ( 
\"{k};{p};{c};{o};{f};{g};{s}\" ). # reformated lineages are appended to the input data $ taxonkit reformat lineage.txt ... 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila ... $ $ taxonkit reformat lineage.txt | tee lineage.txt.reformat $ cut -f 1,3 lineage.txt.reformat 9606 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens 9913 Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 92489 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # aligned $ cat lineage.txt \\ | taxonkit reformat \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- --------------- ------------------- ------------------ --------------- -------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo 
sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Croceibacter phage P2559Y 92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei And subspecies/strain ( {t} ), subspecies ( {S} ), and strain ( {T} ) are also available. 
# default operation $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- --------------------- --------------------- --------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 # fill missing ranks # see example below for -F/--fill-miss-rank # $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --fill-miss-rank \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------------ ----------------------------------------------------------------------------- ------------------------------------------------------------------------- 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain unclassified Akkermansia muciniphila subspecies unclassified Akkermansia muciniphila strain 83333 strain Escherichia coli K-12 Escherichia coli K-12 unclassified Escherichia coli subspecies Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 unclassified Escherichia coli R178 strain 
2697049 no rank Severe acute respiratory syndrome coronavirus 2 unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain unclassified Severe acute respiratory syndrome-related coronavirus subspecies unclassified Severe acute respiratory syndrome-related coronavirus strain 2605619 no rank Escherichia coli O16:H48 unclassified Escherichia coli subspecies/strain unclassified Escherichia coli subspecies unclassified Escherichia coli strain When there are no nodes of rank \"subspecies\" or \"strain\", you can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, if its rank is lower than \"species\" . Using v0.14.1 or a later version is recommended. $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --pseudo-strain \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Add prefix ( -P/--add-prefix ). 
$ cat lineage.txt \\ | taxonkit reformat -P \\ | csvtk -H -t cut -f 1,3 9606 k__Eukaryota;p__Chordata;c__Mammalia;o__Primates;f__Hominidae;g__Homo;s__Homo sapiens 9913 k__Eukaryota;p__Chordata;c__Mammalia;o__Artiodactyla;f__Bovidae;g__Bos;s__Bos taurus 376619 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Francisellaceae;g__Francisella;s__Francisella tularensis 349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 239935 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 314101 k__Bacteria;p__;c__;o__;f__;g__;s__uncultured murine large bowel bacterium BAC 54B 11932 k__Viruses;p__Artverviricota;c__Revtraviricetes;o__Ortervirales;f__Retroviridae;g__Intracisternal A-particles;s__Mouse Intracisternal A-particle 1327037 k__Viruses;p__Uroviricota;c__Caudoviricetes;o__Caudovirales;f__Siphoviridae;g__;s__Croceibacter phage P2559Y 92489 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Erwinia;s__Erwinia oleae 1458427 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__Serpentinomonas;s__Serpentinomonas raichei Show corresponding taxids of reformated lineage (flag -t/--show-lineage-taxids ) $ cat lineage.txt \\ | taxonkit reformat -t \\ | csvtk -H -t cut -f 1,4 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- ------ ------- ------- ------- ------- ------- ------- 9606 2759 7711 40674 9443 9604 9605 9606 9913 2759 7711 40674 91561 9895 9903 9913 376619 2 1224 1236 72273 34064 262 263 349741 2 74201 203494 48461 1647988 239934 239935 239935 2 74201 203494 48461 1647988 239934 239935 314101 2 314101 11932 10239 2732409 2732514 2169561 11632 11749 11932 
1327037 10239 2731618 2731619 28883 10699 1327037 92489 2 1224 1236 91347 1903409 551 796334 1458427 2 1224 28216 80840 80864 2490452 1458425 Use custom symbols for unclassfied ranks ( -r/--miss-rank-repl ) $ taxonkit reformat lineage.txt -r \"__\" | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;__;__;__;__;__;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;__;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei $ taxonkit reformat lineage.txt -r Unassigned | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Unassigned;Unassigned;Unassigned;Unassigned;Unassigned;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 
Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;Unassigned;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Estimate and fill missing rank with original lineage information ( -F, --fill-miss-rank , very useful for formatting input data for LEfSe ). You can change the prefix \"unclassified\" using flag -p/--miss-rank-repl-prefix . $ cat lineage.txt \\ | taxonkit reformat -F \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- ---------------------------- --------------------------- --------------------------- ---------------------------- ------------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria unclassified Bacteria phylum unclassified Bacteria class unclassified Bacteria order unclassified Bacteria family unclassified Bacteria genus uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae unclassified Siphoviridae genus Croceibacter phage P2559Y 92489 Bacteria Proteobacteria 
Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei Do not add prefix or suffix for estimated nodes: $ echo 314101 | taxonkit reformat -I 1 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B $ echo 314101 | taxonkit reformat -I 1 -F -p \"\" -s \"\" 314101 Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;uncultured murine large bowel bacterium BAC 54B Only some ranks. $ cat lineage.txt \\ | taxonkit reformat -F -f \"{s};{p}\"\\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,species,phylum \\ | csvtk pretty -t taxid species phylum ------- ----------------------------------------------- ---------------------------- 9606 Homo sapiens Chordata 9913 Bos taurus Chordata 376619 Francisella tularensis Proteobacteria 349741 Akkermansia muciniphila Verrucomicrobia 239935 Akkermansia muciniphila Verrucomicrobia 314101 uncultured murine large bowel bacterium BAC 54B unclassified Bacteria phylum 11932 Mouse Intracisternal A-particle Artverviricota 1327037 Croceibacter phage P2559Y Uroviricota 92489 Erwinia oleae Proteobacteria 1458427 Serpentinomonas raichei Proteobacteria For taxids whose rank is higher than the lowest rank in -f/--format , use -T/--trim to avoid filling missing ranks lower than the current rank . 
$ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;unclassified Bacteria species 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;unclassified Akkermansia species 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Support tab in format string $ echo 9606 \\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{S}\" \\ | csvtk cut -t -f -2 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens List seven-level lineage for all TaxIds. 
# replace empty taxon with \"Unassigned\" $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned | gzip -c > all.lineage.tsv.gz # tab-delimited seven-levels $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # 8-level $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # Fill and trim $ memusg -t -s ' taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | gzip -c > all.lineage.tsv.gz ' elapsed time: 19.930s peak rss: 6.25 GB From taxid to 7-ranks lineage: $ cat taxids.txt | taxonkit lineage | taxonkit reformat # for taxonkit v0.8.0 or later versions $ cat taxids.txt | taxonkit reformat -I 1 Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result. see #42 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. 
But you can use -a/--output-ambiguous-result to return one possible result 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t -a 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530","title":"reformat"},{"location":"usage/#name2taxid","text":"Usage Convert scientific names to TaxIds Attention: 1. Some TaxIds share the same scientific names, e.g, Drosophila. These input lines are duplicated with multiple TaxIds. $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus Usage: taxonkit name2taxid [flags] Flags: -h, --help help for name2taxid -i, --name-field int field index of name. 
data should be tab-separated (default 1) -s, --sci-name only searching scientific names -r, --show-rank show rank Examples Example data $ cat names.txt Homo sapiens Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y Default. # taxonkit name2taxid names.txt $ cat names.txt | taxonkit name2taxid | csvtk pretty -H -t Homo sapiens 9606 Akkermansia muciniphila ATCC BAA-835 349741 Akkermansia muciniphila 239935 Mouse Intracisternal A-particle 11932 Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 Croceibacter phage P2559Y 1327037 Show rank. $ cat names.txt | taxonkit name2taxid --show-rank | csvtk pretty -H -t Homo sapiens 9606 species Akkermansia muciniphila ATCC BAA-835 349741 strain Akkermansia muciniphila 239935 species Mouse Intracisternal A-particle 11932 species Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 species Croceibacter phage P2559Y 1327037 species From name to lineage. 
$ cat names.txt | taxonkit name2taxid | taxonkit lineage --taxid-field 2 Homo sapiens 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Akkermansia muciniphila ATCC BAA-835 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Mouse Intracisternal A-particle 11932 Viruses;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Some TaxIds share the same scientific names , e.g, Drosophila. $ echo Drosophila \\ | taxonkit name2taxid \\ | taxonkit lineage -i 2 -r \\ | taxonkit reformat -i 3 \\ | csvtk cut -H -t -f 1,2,4,5 \\ | csvtk pretty -H -t Drosophila 7215 genus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 32281 subgenus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 2081351 genus Eukaryota;Basidiomycota;Agaricomycetes;Agaricales;Psathyrellaceae;Drosophila;","title":"name2taxid"},{"location":"usage/#filter","text":"Usage Filter TaxIds by taxonomic rank range Attentions: 1. 
Flags -L/--lower-than and -H/--higher-than are mutually exclusive; either can be used along with -E/--equal-to, whose values can be different. 2. A list of pre-ordered ranks is in ~/.taxonkit/ranks.txt, you can use your list by -r/--rank-file, the format specification is below. 3. All ranks in taxonomy database should be defined in rank file. 4. Ranks can be removed with black list via -B/--black-list. 5. TaxIDs with no rank are kept by default!!! They can be optionally discarded by -N/--discard-noranks. 6. [Recommended] When filtering with -L/--lower-than, you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff. Rank file: 1. Blank lines or lines starting with \"#\" are ignored. 2. Ranks are in descending order and case is ignored. 3. Ranks with same order should be in one line separated with comma (\",\", no space). 4. Ranks without order should be assigned a prefix symbol \"!\" for each rank. Usage: taxonkit filter [flags] Flags: -B, --black-list strings black list of ranks to discard, e.g., '-B \"no rank\" -B \"clade\"' -N, --discard-noranks discard all ranks without order, type \"taxonkit filter --help\" for details -R, --discard-root discard root taxid, defined by --root-taxid -E, --equal-to strings output TaxIds with rank equal to some ranks, multiple values can be separated with comma \",\" (e.g., -E \"genus,species\"), or give multiple times (e.g., -E genus -E species) -h, --help help for filter -H, --higher-than string output TaxIds with rank higher than a rank, exclusive with --lower-than --list-order list user defined ranks in order, from \"$HOME/.taxonkit/ranks.txt\" --list-ranks list ordered ranks in taxonomy database, sorted in user defined order -L, --lower-than string output TaxIds with rank lower than a rank, exclusive with --higher-than -r, --rank-file string user-defined ordered taxonomic ranks, type \"taxonkit filter --help\" for details --root-taxid uint32 root taxid 
(default 1) -n, --save-predictable-norank do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1) Examples Example data $ echo 349741 | taxonkit lineage -t | cut -f 3 | sed 's/;/\\n/g' > taxids2.txt $ cat taxids2.txt 131567 2 1783257 74201 203494 48461 1647988 239934 239935 349741 $ cat taxids2.txt | taxonkit lineage -r | csvtk -Ht cut -f 1,3,2 | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom cellular organisms;Bacteria 1783257 clade cellular organisms;Bacteria;PVC group 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae 48461 order cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales 1647988 family cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae 239934 genus cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 349741 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Equal to certain rank(s) ( -E/--equal-to ) $ cat taxids2.txt \\ | taxonkit filter -E Phylum -E Class \\ | taxonkit lineage -r \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae Lower than a rank ( -L/--lower-than ) $ cat taxids2.txt \\ | taxonkit filter -L genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | 
csvtk pretty -H -t 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Higher than a rank ( -H/--higher-than ) $ cat taxids2.txt \\ | taxonkit filter -H phylum \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 2 superkingdom Bacteria TaxIDs with no rank are kept by default!!! \"no rank\" and \"clade\" have no rank and can be filtered out via -N/--discard-noranks . Further ranks can be removed with a blacklist via -B/--black-list . # 562 is the TaxId of Escherichia coli $ taxonkit list --ids 562 \\ | taxonkit filter -L species \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t strain 2950 no rank 149 serotype 141 serogroup 95 isolate 1 subspecies 1 $ taxonkit list --ids 562 \\ | taxonkit filter -L species -N -B strain \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t serotype 141 serogroup 95 isolate 1 subspecies 1 Combining -L/-H with -E . $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Special cases of \"no rank\" . ( -n/--save-predictable-norank ). When filtering with -L/--lower-than , you can use -n/--save-predictable-norank to save some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff. 
$ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom Bacteria 1224 phylum Proteobacteria 1236 class Gammaproteobacteria 91347 order Enterobacterales 543 family Enterobacteriaceae 561 genus Escherichia 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 10239 superkingdom Viruses 2731341 clade Duplodnaviria 2731360 clade Heunggongvirae 2731618 phylum Uroviricota 2731619 class Caudoviricetes 28883 order Caudovirales 10699 family Siphoviridae 196894 no rank unclassified Siphoviridae 1327037 species Croceibacter phage P2559Y # save taxids $ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | tee taxids4.txt 131567 2 1224 1236 91347 543 561 562 2605619 10239 2731341 2731360 2731618 2731619 28883 10699 196894 1327037 Now, filter nodes of rank <= species. $ cat taxids4.txt \\ | taxonkit filter -L species -E species -N -n \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 1327037 species Croceibacter phage P2559Y Note that 2605619 (no rank) is saved because its parent node 562 is <= species.","title":"filter"},{"location":"usage/#lca","text":"Usage Compute lowest common ancestor (LCA) for TaxIds Attention: 1. This command computes LCA TaxId for a list of TaxIds in a field (\"-i/--taxids-field) of tab-delimited file or STDIN. 2. TaxIDs should have the same separator (\"-s/--separator\"), single charactor separator is prefered. 3. Empty lines or lines without valid TaxIds in the field are omitted. 4. If some TaxIds are not found in database, it returns 0. 
Examples: $ echo 239934, 239935, 349741 | taxonkit lca -s \", \" 239934, 239935, 349741 239934 $ time echo 239934 239935 349741 9606 | taxonkit lca 239934 239935 349741 9606 131567 Usage: taxonkit lca [flags] Flags: -b, --buffer-size string size of line buffer, supported unit: K, M, G. You need to increase the value when \"bufio.Scanner: token too long\" error occured (default \"1M\") -h, --help help for lca --separater string separater for TaxIds. This flag is same to --separator. (default \" \") -s, --separator string separator for TaxIds (default \" \") -D, --skip-deleted skip deleted TaxIds and compute with left ones -U, --skip-unfound skip unfound TaxIds and compute with left ones -i, --taxids-field int field index of TaxIds. Input data should be tab-separated (default 1) Examples: Example data $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample Simple one $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 Custom field ( -i/--taxids-field ) and separator ( -s/--separator ). $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606 Merged TaxIds. # merged $ echo 92487 92488 92489 | taxonkit lca 10:08:26.578 [WARN] taxid 92489 was merged into 796334 92487 92488 92489 1236 Deleted TaxIds: you can omit these and continue computing with the remaining ones via ( -D/--skip-deleted ). 
$ echo 1 2 3 | taxonkit lca 10:30:17.678 [WARN] taxid 3 not found 1 2 3 0 $ time echo 1 2 3 | taxonkit lca -D 10:29:31.828 [WARN] taxid 3 was deleted 1 2 3 1 TaxIDs not found in the database: you can omit these and continue computing with the remaining ones via ( -U/--skip-unfound ). $ echo 61021 61022 11111111 | taxonkit lca 10:31:44.929 [WARN] taxid 11111111 not found 61021 61022 11111111 0 $ echo 61021 61022 11111111 | taxonkit lca -U 10:32:02.772 [WARN] taxid 11111111 not found 61021 61022 11111111 2628496","title":"lca"},{"location":"usage/#taxid-changelog","text":"Usage Create TaxId changelog from dump archives Steps: # dependencies: # rush - https://github.com/shenwei356/rush/ mkdir -p archive; cd archive; # --------- download --------- # option 1 # for fast network connection wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp*.zip # option 2 # for slow network connection url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/ wget $url -O - -o /dev/null \\ | grep taxdmp | perl -ne '/(taxdmp_.+?.zip)/; print \"$1\\n\";' \\ | rush -j 2 -v url=$url 'axel -n 5 {url}/{}' \\ --immediate-output -c -C download.rush # --------- unzip --------- ls taxdmp*.zip | rush -j 1 'unzip {} names.dmp nodes.dmp merged.dmp delnodes.dmp -d {@_(.+)\\.}' # optionally compress .dmp files with pigz, for saving disk space fd .dmp$ | rush -j 4 'pigz {}' # --------- create log --------- cd .. 
taxonkit taxid-changelog -i archive -o taxid-changelog.csv.gz --verbose Output format (CSV): # fields comments taxid # taxid version # version / time of archive, e.g, 2019-07-01 change # change, values: # NEW newly added # REUSE_DEL deleted taxids being reused # REUSE_MER merged taxids being reused # DELETE deleted # MERGE merged into another taxid # ABSORB other taxids merged into this one # CHANGE_NAME scientific name changed # CHANGE_RANK rank changed # CHANGE_LIN_LIN lineage taxids remain but lineage remain # CHANGE_LIN_TAX lineage taxids changed # CHANGE_LIN_LEN lineage length changed change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage # you can use csvtk to investigate them. e.g., csvtk grep -f taxid -p 1390515 taxid-changelog.csv.gz Usage: taxonkit taxid-changelog [flags] Flags: -i, --archive string directory containing uncompressed dumped archives -h, --help help for taxid-changelog Details Example 1 ( E.coli with taxid 562 ) $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 562 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 562 2014-08-01 NEW Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2014-08-01 ABSORB 662101;662104 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2015-11-01 ABSORB 1637691 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2016-10-01 CHANGE_LIN_LIN Escherichia coli species 
cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2018-06-01 ABSORB 469598 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 # merged taxids $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 662101,662104,1637691,469598 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 469598 2014-08-01 NEW Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2016-10-01 CHANGE_LIN_LIN Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2018-06-01 MERGE 562 Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 662101 2014-08-01 MERGE 562 662104 2014-08-01 MERGE 562 1637691 2015-04-01 DELETE 1637691 2015-05-01 REUSE_DEL Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 1637691 2015-11-01 MERGE 562 Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 Example 2 (SARS-CoV-2). 
$ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 2697049 2020-02-01 NEW Wuhan seafood market pneumonia virus species Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;unclassified Betacoronavirus;Wuhan seafood market pneumonia virus 10239;2559587;76804;2499399;11118;2501931;694002;696098;2697049 2697049 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 real 0m7.644s user 0m16.749s sys 0m3.985s Example 3 (All subspecies and strain in Akkermansia muciniphila 239935) # species in Akkermansia $ taxonkit list --show-rank --show-name --indent \" \" --ids 239935 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 # check them all $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -P <(taxonkit list --indent \"\" --ids 239935) \\ | csvtk pretty lineage-taxids taxid version change change-value name rank lineage lineage-taxids 239935 2014-08-01 NEW Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;203557;239934;239935 239935 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 
131567;2;51290;74201;203494;48461;1647988;239934;239935 239935 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 239935 2016-05-01 ABSORB 1834199 Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 349741 2014-08-01 NEW Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;203557;239934;239935;349741 349741 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;1647988;239934;239935;349741 349741 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 349741 2020-07-01 CHANGE_RANK Akkermansia muciniphila ATCC BAA-835 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 More","title":"taxid-changelog"},{"location":"usage/#create-taxdump","text":"Usage Create 
NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Input format: 0. For GTDB taxonomy file, just use --gtdb. We use the numeric assembly accession as the taxon at subspecies rank. (without the prefix GCA_ and GCF_, and version number). 1. The input file should be tab-delimited, at least one column is needed. 2. Ranks can be given either via the first row or the flag --rank-names. 3. The column containing the genome/assembly accession is recommended to generate TaxId mapping file (taxid.map, id -> taxid). -A/--field-accession, field contaning genome/assembly accession --field-accession-re, regular expression to extract the accession Note that mutiple TaxIds pointing to the same accession are listed as comma-seperated integers. Attentions: 1. Names should be distinct in taxa of different ranks. But for these missing some taxon nodes, using names of parent nodes is allowed: GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155 It can also detect duplicate names with different ranks, e.g., the Class and Genus have the same name B47-G6, and the Order and Family between them have different names. In this case, we reassign a new TaxId by increasing the TaxId until it being distinct. GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585 2. Taxa from different parents may have the same name. We will assign different TaxIds to them. E.g., in ICTV, many viruses from different species have the same names. In practice, we set the \"Virus names(s)\" as a subspecies rank and also specify it as the accession. Species Virus name(s) Jerseyvirus SETP3 Salmonella phage SETP7 Jerseyvirus SETP7 Salmonella phage SETP7 3. 
The generated TaxIds are not consecutive numbers, however some tools like MMSeqs2 required this, you can use the script below for convertion: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py Usage: taxonkit create-taxdump [flags] Flags: -A, --field-accession int field index of assembly accession (genome ID), for outputting taxid.map --field-accession-re string regular expression to extract assembly accession (default \"^\\\\w\\\\w_(.+)$\") --force overwrite existed output directory --gtdb input files are GTDB taxonomy file --gtdb-re-subs string regular expression to extract assembly accession as the subspecies (default \"^\\\\w\\\\w_GC[AF]_(.+)\\\\.\\\\d+$\") -h, --help help for create-taxdump --line-chunk-size int number of lines to process for each thread, and 4 threads is fast enough. (default 5000) --null strings null value of taxa (default [,NULL,NA]) -x, --old-taxdump-dir string taxdump directory of the previous version, for generating merged.dmp and delnodes.dmp -O, --out-dir string output directory -R, --rank-names strings names of all ranks, leave it empty to use the first row of input as rank names Examples: GTDB. See more: https://github.com/shenwei356/gtdb-taxdump $ taxonkit create-taxdump --gtdb ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz --out-dir taxdump 16:42:35.213 [INFO] 317542 records saved to taxdump/taxid.map 16:42:35.460 [INFO] 401815 records saved to taxdump/nodes.dmp 16:42:35.611 [INFO] 401815 records saved to taxdump/names.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/merged.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/delnodes.dmp ICTV, See more: https://github.com/shenwei356/ictv-taxdump MGV . Only Order, Family, Genus information are available. 
$ cat mgv_contig_info.tsv \\ | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \\ | sed 1d \\ > mgv.tsv $ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species 23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map 23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp 23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp 23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp 23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp $ head -n 5 mgv/taxid.map MGV-GENOME-0364295 677052301 MGV-GENOME-0364296 677052301 MGV-GENOME-0364303 1414406025 MGV-GENOME-0364311 1849074420 MGV-GENOME-0364312 2074846424 $ echo 677052301 | taxonkit lineage --data-dir mgv/ 677052301 Caudovirales;crAss-phage;OTU-61123 $ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P 677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123 $ grep MGV-GENOME-0364295 mgv.tsv Caudovirales crAss-phage NULL OTU-61123 MGV-GENOME-0364295 Custom lineages with the first row as rank names and treating one column as accession. 
$ csvtk pretty -t example/taxonomy.tsv id superkingdom phylum class order family genus species --------------- ------------ -------------- ------------------- ---------------- ------------------ -------------- -------------------------- GCF_001027105.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus GCF_001096185.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus pneumoniae GCF_001544255.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecium GCF_002949675.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella dysenteriae GCF_002950215.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella flexneri GCF_006742205.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis GCF_000006945.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella Salmonella enterica GCF_000017205.1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa GCF_003697165.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli GCF_009759685.1 Bacteria Proteobacteria Gammaproteobacteria Moraxellales Moraxellaceae Acinetobacter Acinetobacter baumannii GCF_000148585.2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus mitis GCF_000392875.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis GCF_000742135.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Klebsiella Klebsiella pneumonia # the first column as accession $ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump 16:31:31.828 [INFO] I will use the first row of input as rank names 16:31:31.843 [INFO] 13 records saved 
to example/taxdump/taxid.map 16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp 16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/merged.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp $ export TAXONKIT_DB=example/taxdump $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species 3809813362 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species 4145431389 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species 1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species 1920251658 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species 3843752343 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species 72054943 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species 1678121664 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species 524994882 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species 2695851945 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species 3958205156 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species 4093283224 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species 
$ head -n 3 example/taxdump/taxid.map GCF_001027105.1 1569132721 GCF_001096185.1 2983929374 GCF_001544255.1 4145431389 Custom lineages with the first row as rank names (pure lineage data) $ csvtk cut -t -f 2- example/taxonomy.tsv | head -n 2 | csvtk pretty -t superkingdom phylum class order family genus species ------------ ---------- ------- ---------- ----------------- -------------- --------------------- Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus $ csvtk cut -t -f 2- example/taxonomy.tsv \\ | taxonkit create-taxdump -O example/taxdump2 16:53:08.604 [INFO] I will use the first row of input as rank names 16:53:08.614 [INFO] 39 records saved to example/taxdump2/nodes.dmp 16:53:08.614 [INFO] 39 records saved to example/taxdump2/names.dmp 16:53:08.614 [INFO] 0 records saved to example/taxdump2/merged.dmp 16:53:08.615 [INFO] 0 records saved to example/taxdump2/delnodes.dmp $ export TAXONKIT_DB=example/taxdump2 $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species","title":"create-taxdump"},{"location":"usage/#genautocomplete","text":"Usage Generate shell autocompletion script Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Usage: taxonkit genautocomplete [flags] Flags: --file string autocompletion file (default \"/home/shenwei/.bash_completion.d/taxonkit.sh\") -h, --help help for genautocomplete --type string autocompletion type (currently only bash supported) (default \"bash\")","title":"genautocomplete"},{"location":"usage/#profile2cami","text":"Usage Convert metagenomic profile table to CAMI format Input format: 1. The input file should be tab-delimited 2. At least two columns needed: a) TaxId of taxon at species or lower rank. b) Abundance (could be percentage, automatically detected or use -p/--percentage). Attentions: 1. Some TaxIds may be merged to another ones in current taxonomy version, the abundances will be summed up. 2. Some TaxIds may be deleted in current taxonomy version, the abundances can be optionally recomputed with the flag -R/--recompute-abd. Usage: taxonkit profile2cami [flags] Flags: -a, --abundance-field int field index of abundance. input data should be tab-separated (default 2) -h, --help help for profile2cami -0, --keep-zero keep taxons with abundance of zero -p, --percentage abundance is in percentage -R, --recompute-abd recompute abundance if some TaxIds are deleted in current taxonomy version -s, --sample-id string sample ID in result file -r, --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) -t, --taxonomy-id string taxonomy ID in result file Examples Test data, note that 2824115 is merged to 483329 and 1657696 is deleted in current taxonomy version. $ cat example/abundance.tsv 2824115 0.2 merged to 483329 483329 0.2 absord 2824115 239935 0.5 no change 1657696 0.1 deleted Example: $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv 13:17:40.552 [WARN] taxid is deleted in current taxonomy version: 1657696 13:17:40.552 [WARN] you may recomputed abundance with the flag -R/--recompute-abd @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 50.000000000000000 2759 superkingdom 2759 Eukaryota 40.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 50.000000000000000 6656 phylum 2759|6656 Eukaryota|Arthropoda 40.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 50.000000000000000 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 40.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 50.000000000000000 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 40.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 50.000000000000000 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 40.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 50.000000000000000 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 40.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 
Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 50.000000000000000 483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 40.000000000000000 Recompute (normalize) the abundance $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd 13:19:23.647 [WARN] taxid is deleted in current taxonomy version: 1657696 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 55.555555555555557 2759 superkingdom 2759 Eukaryota 44.444444444444450 74201 phylum 2|74201 Bacteria|Verrucomicrobia 55.555555555555557 6656 phylum 2759|6656 Eukaryota|Arthropoda 44.444444444444450 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 55.555555555555557 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 44.444444444444450 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 55.555555555555557 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 44.444444444444450 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 55.555555555555557 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 44.444444444444450 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 55.555555555555557 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 44.444444444444450 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 55.555555555555557 483329 species 
2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 44.444444444444450 See https://github.com/shenwei356/sun2021-cami-profiles","title":"profile2cami"},{"location":"usage/#cami-filter","text":"Usage Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile Input format: The CAMI (Taxonomic) Profiling Output Format - https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd - One file with mutiple samples is also supported. How to: - No extra taxonomy data needed, so the original taxonomic information are used and not changed. - A mini taxonomic tree is built from records with abundance greater than zero, and only leaves are retained for later use. The rank of leaves may be \"strain\", \"species\", or \"no rank\". - Relative abundances (in percentage) are recomputed for all leaves (reference genome). - A new taxonomic tree is built from these leaves, and abundances are cumulatively added up from leaves to the root. Examples: 1. Remove Archaea, Bacteria, and EukaryoteS, only keep Viruses: taxonkit cami-filter -t 2,2157,2759 test.profile -o test.filter.profile 2. 
Remove Viruses: taxonkit cami-filter -t 10239 test.profile -o test.filter.profile Usage: taxonkit cami-filter [flags] Flags: --field-percentage int field index of PERCENTAGE (default 5) --field-rank int field index of taxid (default 2) --field-taxid int field index of taxid (default 1) --field-taxpath int field index of TAXPATH (default 3) --field-taxpathsn int field index of TAXPATHSN (default 4) -h, --help help for cami-filter --leaf-ranks strings only consider leaves at these ranks (default [species,strain,no rank]) --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) --taxid-sep string separator of taxid in TAXPATH and TAXPATHSN (default \"|\") -t, --taxids strings the parent taxid(s) to filter out -f, --taxids-file strings file(s) for the parent taxid(s) to filter out, one taxid per line Examples: Remove Eukaryota taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd \\ | taxonkit cami-filter -t 2759 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 100.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 100.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 100.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 100.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 100.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 
100.000000000000000 /** * RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS. * LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/ /* var disqus_config = function () { this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable }; */ (function() { // DON'T EDIT BELOW THIS LINE var d = document, s = d.createElement('script'); s.src = '//taxonkit.disqus.com/embed.js'; s.setAttribute('data-timestamp', +new Date()); (d.head || d.body).appendChild(s); })(); Please enable JavaScript to view the comments powered by Disqus.","title":"cami-filter"},{"location":"bench/","text":"Benchmark Benchmark 1: Getting lineage Data set NCBI taxonomy , version 2021-01-21 TaxIDs. Root node 1 is removed. And These data should be updated along with NCBI taxonomy dataset. Seven sizes of TaxIds are sampled from nodes.dmp . 
# shuffle all taxids cut -f 1 nodes.dmp | grep -w -v 1 | shuf > ids.txt # extract n taxids for testing for n in 1 10 100 1000 2000 4000 6000 8000 10000 20000 40000 60000 80000 100000; do head -n $n ids.txt > taxids.n$n.txt done Software Loading database from local database: ETE, version: 3.1.2 Directly parsing dump files: taxopy, version: 0.5.0 TaxonKit, version: 0.7.2 Environment OS: Linux 5.4.89-1-MANJARO CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7GHz RAM: 64GB DDR4 3000MHz SSD: Samsung 970EVO 500G NVMe SSD Installation and Configurations ETE sudo pip3 install ete3 # create database # http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#upgrading-the-local-database from ete3 import NCBITaxa ncbi = NCBITaxa() ncbi.update_taxonomy_database() TaxonKit mkdir -p $HOME/.taxonkit mkdir -p $HOME/bin/ # data wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz -C $HOME/.taxonkit # binary wget https://github.com/shenwei356/taxonkit/releases/download/v0.7.2/taxonkit_linux_amd64.tar.gz tar -zxvf taxonkit_linux_amd64.tar.gz -C $HOME/bin/ taxopy sudo pip3 install -U taxopy # taxoopy identical dump files copied from taxonkit mkdir -p ~/.taxopy cp ~/.taxonkit/{nodes.dmp,names.dmp} ~/.taxopy Scripts and Commands Scripts/Command as listed below. Python scripts were written following to the official documents, and parallelized querying were not used, including TaxonKit . ETE get_lineage.ete.py < $infile > $outfile taxopy get_lineage.taxopy.py < $infile > $outfile taxonkit taxonkit lineage --threads 1 --delimiter \"; \" < $infile > $outfile A Python script memusg was used to computate running time and peak memory usage of a process. A Perl scripts run.pl is used to automatically running tests and generate data for plotting. 
Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" time perl run.pl -n 3 run_benchmark.sh -o bench.get_lineage.tsv Checking result: $ md5sum taxids.n*.lineage # clear $ rm *.lineage *.out Plotting benchmark result. R libraries dplyr , ggplot2 , scales , ggthemes , ggrepel are needed. # reformat dataset # tools: https://github.com/shenwei356/csvtk/ for f in taxids.n*.txt; do wc -l $f; done \\ | sort -k 1,1n \\ | awk '{ print($2\"\\t\"$1) }' \\ > dataset_rename.tsv cat bench.get_lineage.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.get_lineage.reformat.tsv ./plot2.R -i bench.get_lineage.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Tools\" Result Benchmark 2: TaxonKit multi-threaded scalability Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ time perl run.pl -n 3 run_benchmark_taxonkit.sh -o bench.taxonkit.tsv $ rm *.lineage *.out Plotting benchmark result. cat bench.taxonkit.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.taxonkit.reformat.tsv ./plot_threads2.R -i bench.taxonkit.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Threads\" Result /** * RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS. 
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/ /* var disqus_config = function () { this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable }; */ (function() { // DON'T EDIT BELOW THIS LINE var d = document, s = d.createElement('script'); s.src = '//taxonkit.disqus.com/embed.js'; s.setAttribute('data-timestamp', +new Date()); (d.head || d.body).appendChild(s); })(); Please enable JavaScript to view the comments powered by Disqus.","title":"Benchmark"},{"location":"bench/#benchmark","text":"","title":"Benchmark"},{"location":"bench/#benchmark-1-getting-lineage","text":"","title":"Benchmark 1: Getting lineage"},{"location":"bench/#data-set","text":"NCBI taxonomy , version 2021-01-21 TaxIDs. Root node 1 is removed. And These data should be updated along with NCBI taxonomy dataset. Seven sizes of TaxIds are sampled from nodes.dmp . 
# shuffle all taxids cut -f 1 nodes.dmp | grep -w -v 1 | shuf > ids.txt # extract n taxids for testing for n in 1 10 100 1000 2000 4000 6000 8000 10000 20000 40000 60000 80000 100000; do head -n $n ids.txt > taxids.n$n.txt done","title":"Data set"},{"location":"bench/#software","text":"Loading database from local database: ETE, version: 3.1.2 Directly parsing dump files: taxopy, version: 0.5.0 TaxonKit, version: 0.7.2","title":"Software"},{"location":"bench/#environment","text":"OS: Linux 5.4.89-1-MANJARO CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7GHz RAM: 64GB DDR4 3000MHz SSD: Samsung 970EVO 500G NVMe SSD","title":"Environment"},{"location":"bench/#installation-and-configurations","text":"ETE sudo pip3 install ete3 # create database # http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#upgrading-the-local-database from ete3 import NCBITaxa ncbi = NCBITaxa() ncbi.update_taxonomy_database() TaxonKit mkdir -p $HOME/.taxonkit mkdir -p $HOME/bin/ # data wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz -C $HOME/.taxonkit # binary wget https://github.com/shenwei356/taxonkit/releases/download/v0.7.2/taxonkit_linux_amd64.tar.gz tar -zxvf taxonkit_linux_amd64.tar.gz -C $HOME/bin/ taxopy sudo pip3 install -U taxopy # taxoopy identical dump files copied from taxonkit mkdir -p ~/.taxopy cp ~/.taxonkit/{nodes.dmp,names.dmp} ~/.taxopy","title":"Installation and Configurations"},{"location":"bench/#scripts-and-commands","text":"Scripts/Command as listed below. Python scripts were written following to the official documents, and parallelized querying were not used, including TaxonKit . ETE get_lineage.ete.py < $infile > $outfile taxopy get_lineage.taxopy.py < $infile > $outfile taxonkit taxonkit lineage --threads 1 --delimiter \"; \" < $infile > $outfile A Python script memusg was used to computate running time and peak memory usage of a process. 
A Perl scripts run.pl is used to automatically running tests and generate data for plotting. Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" time perl run.pl -n 3 run_benchmark.sh -o bench.get_lineage.tsv Checking result: $ md5sum taxids.n*.lineage # clear $ rm *.lineage *.out Plotting benchmark result. R libraries dplyr , ggplot2 , scales , ggthemes , ggrepel are needed. # reformat dataset # tools: https://github.com/shenwei356/csvtk/ for f in taxids.n*.txt; do wc -l $f; done \\ | sort -k 1,1n \\ | awk '{ print($2\"\\t\"$1) }' \\ > dataset_rename.tsv cat bench.get_lineage.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.get_lineage.reformat.tsv ./plot2.R -i bench.get_lineage.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Tools\" Result","title":"Scripts and Commands"},{"location":"bench/#benchmark-2-taxonkit-multi-threaded-scalability","text":"Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ time perl run.pl -n 3 run_benchmark_taxonkit.sh -o bench.taxonkit.tsv $ rm *.lineage *.out Plotting benchmark result. cat bench.taxonkit.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.taxonkit.reformat.tsv ./plot_threads2.R -i bench.taxonkit.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Threads\" Result /** * RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS. 
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/ /* var disqus_config = function () { this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable }; */ (function() { // DON'T EDIT BELOW THIS LINE var d = document, s = d.createElement('script'); s.src = '//taxonkit.disqus.com/embed.js'; s.setAttribute('data-timestamp', +new Date()); (d.head || d.body).appendChild(s); })(); Please enable JavaScript to view the comments powered by Disqus.","title":"Benchmark 2: TaxonKit multi-threaded scalability"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit Documents: https://bioinf.shenwei.me/taxonkit ( Usage&Examples , Tutorial , \u4e2d\u6587\u4ecb\u7ecd ) Source code: https://github.com/shenwei356/taxonkit Latest version: Please cite : https://doi.org/10.1016/j.jgg.2021.03.006 pytaxonkit , Python bindings for TaxonKit. Related projects: Taxid-Changelog : Tracking all changes of TaxIds, including deletion, new adding, merge, reuse, and rank/name changes. GTDB taxdump : GTDB taxonomy taxdump files with trackable TaxIds. ICTV taxdump : NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV) Table of Contents Features Subcommands Benchmark Dataset Installation Command-line completion Citation Contact License Features Easy to install ( download ) Statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64) Light weight and out-of-the-box, no dependencies, no compilation, no configuration No database building, just download NCBI taxonomy data and uncompress to $HOME/.taxonkit Easy to use ( usages and examples ) Supporting bash-completion Fast (see benchmark ), multiple-CPUs supported, most operations cost 2-10s. Detailed usages and examples Supporting STDIN and (gzipped) input/output file, easily integrated in pipe Versatile commands Usage and examples Featured command: tracking monthly changelog of all TaxIds Featured command: reformating lineage into format of seven-level (\"superkingdom/kingdom, phylum, class, order, family, genus, species\" Featured command: filtering taxiDs by a rank range , e.g., at or below genus rank. 
Featured command: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Subcommands Subcommand Function list List taxonomic subtrees (TaxIds) bellow given TaxIds lineage Query taxonomic lineage of given TaxIds reformat Reformat lineage in canonical ranks name2taxid Convert scientific names to TaxIds filter Filter TaxIds by taxonomic rank range lca Compute lowest common ancestor (LCA) for TaxIds taxid-changelog Create TaxId changelog from dump archives profile2cami * Convert metagenomic profile table to CAMI format cami-filter * Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump * Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Note: * New commands since the publication. Benchmark Getting complete lineage for given TaxIds Versions: ETE=3.1.2, taxopy=0.5.0 ( faster since 0.6.0 ), TaxonKit=0.7.2. Dataset Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones. Installation Go to Download Page for more download options and changelogs. TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page. Method 1: Download binaries (latest stable/dev version) Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. 
And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For Windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 . Method 2: Install via conda (latest stable version) conda install -c bioconda taxonkit Method 3: Install via homebrew (out of date) brew install brewsci/bio/taxonkit Method 4: Compile from source (latest stable/dev version) Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/ Bash-completion Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Citation If you use TaxonKit in your work, please cite: Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006 Contact Create an issue to report bugs, propose new functions or ask for help. License MIT License Starchart","title":"Home"},{"location":"#taxonkit-a-practical-and-efficient-ncbi-taxonomy-toolkit","text":"Documents: https://bioinf.shenwei.me/taxonkit ( Usage&Examples , Tutorial , \u4e2d\u6587\u4ecb\u7ecd ) Source code: https://github.com/shenwei356/taxonkit Latest version: Please cite : https://doi.org/10.1016/j.jgg.2021.03.006 pytaxonkit , Python bindings for TaxonKit. Related projects: Taxid-Changelog : Tracking all changes of TaxIds, including deletion, new adding, merge, reuse, and rank/name changes. GTDB taxdump : GTDB taxonomy taxdump files with trackable TaxIds. 
ICTV taxdump : NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV)","title":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit"},{"location":"#table-of-contents","text":"Features Subcommands Benchmark Dataset Installation Command-line completion Citation Contact License","title":"Table of Contents"},{"location":"#features","text":"Easy to install ( download ) Statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64) Light weight and out-of-the-box, no dependencies, no compilation, no configuration No database building, just download NCBI taxonomy data and uncompress to $HOME/.taxonkit Easy to use ( usages and examples ) Supporting bash-completion Fast (see benchmark ), multiple-CPUs supported, most operations cost 2-10s. Detailed usages and examples Supporting STDIN and (gzipped) input/output file, easily integrated in pipe Versatile commands Usage and examples Featured command: tracking monthly changelog of all TaxIds Featured command: reformating lineage into format of seven-level (\"superkingdom/kingdom, phylum, class, order, family, genus, species\" Featured command: filtering taxiDs by a rank range , e.g., at or below genus rank. 
Featured command: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV","title":"Features"},{"location":"#subcommands","text":"Subcommand Function list List taxonomic subtrees (TaxIds) bellow given TaxIds lineage Query taxonomic lineage of given TaxIds reformat Reformat lineage in canonical ranks name2taxid Convert scientific names to TaxIds filter Filter TaxIds by taxonomic rank range lca Compute lowest common ancestor (LCA) for TaxIds taxid-changelog Create TaxId changelog from dump archives profile2cami * Convert metagenomic profile table to CAMI format cami-filter * Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump * Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Note: * New commands since the publication.","title":"Subcommands"},{"location":"#benchmark","text":"Getting complete lineage for given TaxIds Versions: ETE=3.1.2, taxopy=0.5.0 ( faster since 0.6.0 ), TaxonKit=0.7.2.","title":"Benchmark"},{"location":"#dataset","text":"Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones.","title":"Dataset"},{"location":"#installation","text":"Go to Download Page for more download options and changelogs. 
TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.","title":"Installation"},{"location":"#method-1-download-binaries-latest-stabledev-version","text":"Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For Windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 .","title":"Method 1: Download binaries (latest stable/dev version)"},{"location":"#method-2-install-via-conda-latest-stable-version","text":"conda install -c bioconda taxonkit","title":"Method 2: Install via conda (latest stable version)"},{"location":"#method-3-install-via-homebrew-out-of-date","text":"brew install brewsci/bio/taxonkit","title":"Method 3: Install via homebrew (out of date)"},{"location":"#method-4-compile-from-source-latest-stabledev-version","text":"Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/","title":"Method 4: Compile from source (latest stable/dev 
version)"},{"location":"#bash-completion","text":"Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish","title":"Bash-completion"},{"location":"#citation","text":"If you use TaxonKit in your work, please cite: Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006","title":"Citation"},{"location":"#contact","text":"Create an issue to report bugs, propose new functions or ask for help.","title":"Contact"},{"location":"#license","text":"MIT License","title":"License"},{"location":"#starchart","text":"","title":"Starchart"},{"location":"bioinf/","text":"","title":"Bioinf"},{"location":"chinese-dev/","text":"\u73b0\u6709\u5de5\u5177\u6bd4\u8f83 \u60f3\u8981\u4eceNCBI\u83b7\u53d6\u751f\u7269\u7684\u8c31\u7cfb\u4fe1\u606f\uff0c\u53ef\u4ee5\u5728 NCBI Taxonomy\u7f51\u7ad9\u4e0a\u7528TaxID\u6216\u8005\u540d\u79f0\u67e5\u8be2\u3002 \u6bd4\u5982\u53ef\u4ee5\u7528 Homo sapiens \u6216 9606 \u641c\u7d22\u201c\u4eba\u201d\u7684\u5206\u7c7b\u5b66\u4fe1\u606f\uff0c\u4ee5\u53ca\u5bc6\u7801\u5b50\u8868\uff0cEntrez\u8bb0\u5f55\u7edf\u8ba1\u7b49\u3002 \u540c\u65f6\u4e5f\u53ef\u4ee5\u901a\u8fc7NCBI\u7684\u5b98\u65b9\u5de5\u5177\u5305 E-utilities ( ftp )\u3002 $ esearch -db taxonomy -query \"txid9606 [Organism]\" \\ | efetch -format xml \\ | xtract -pattern Lineage -element Lineage 
\u6b64\u5916\u4e5f\u6709\u4e00\u4e9b\u5de5\u5177\u63d0\u4f9b\u7c7b\u4f3c\u7684\u529f\u80fd\uff0c\u90e8\u5206\u8f6f\u4ef6\uff1a \u5de5\u5177 \u7f16\u7a0b\u8bed\u8a00 \u6570\u636e\u83b7\u53d6\u65b9\u5f0f \u4f7f\u7528\u65b9\u5f0f \u5907\u6ce8 E-utilities shell/Perl/C++ \u8fdc\u7a0bWeb\u8c03\u7528 \u547d\u4ee4\u884c \u5b98\u65b9\u7a0b\u5e8f\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd BioPython Python \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c \u5305\u88c5entrez\u63a5\u53e3\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd ETE Toolkit Python \u672c\u5730\u6570\u636e\u5e93 \u811a\u672c/\u547d\u4ee4\u884c Taxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd Taxize R \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c ropensci\uff1b\u652f\u6301\u591a\u79cd\u6570\u636e\u5e93\uff1b\u529f\u80fd\u8f83\u4e30\u5bcc Taxopy Python \u672c\u5730\u6570\u636e\u6587\u4ef6 \u811a\u672c/\u547d\u4ee4\u884c \u4ec5\u57fa\u672c\u529f\u80fd \u9009\u62e9\u5de5\u5177\u4e00\u822c\u8003\u8651\u51e0\u4e2a\u65b9\u9762\uff1a \u662f\u5426\u6ee1\u8db3\u529f\u80fd\u9700\u6c42\u3002\u5927\u591a\u5de5\u5177\u4ec5\u6709\u57fa\u672c\u7684\u67e5\u8be2\u3001\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u7684\u529f\u80fd\uff0c\u90fd\u6ca1\u6cd5\u5c06\u5b8c\u6574\u8c31\u7cfb\u683c\u5f0f\u5316\u4e3a\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u7684\u683c\u5f0f\uff1b \u8f6f\u4ef6\u5b89\u88c5\u4fbf\u5229\u6027\u3002\u4e0a\u8ff0\u5de5\u5177\u90fd\u4e0d\u9700\u8981\u624b\u52a8\u7f16\u8bd1\u5b89\u88c5\uff0c\u9664\u4e86E-utilities\u7684\u90e8\u5206\u7ec4\u4ef6\u9700\u8981\u624b\u52a8\u5b8c\u6210\uff0c\u5176\u5b83\u57fa\u672c\u90fd\u80fd\u7528\u5bf9\u5e94\u7f16\u7a0b\u8bed\u8a00\u7684\u5305\u7ba1\u7406\u5de5\u5177\u5b89\u88c5\uff1b 
\u914d\u7f6e\u4fbf\u5229\u6027\u3002\u90e8\u5206\u5efa\u7acb\u672c\u5730\u6570\u636e\u5e93\u7684\u8f6f\u4ef6\u5219\u9700\u8981\u5148\u6784\u5efa\u6570\u636e\u5e93\uff0c\u4e0d\u8fc7\u57fa\u672c\u90fd\u662f\u5d4c\u5165\u5f0f\u7684sqlite\uff0c\u6bd4\u8f83\u7b80\u5355\u5feb\u6377\uff0c\u7a7a\u95f4\u5360\u7528\u4e5f\u80fd\u63a5\u53d7\uff1b \u4f7f\u7528\u4fbf\u5229\u6027\u3002\u63d0\u4f9b\u547d\u4ee4\u884c\u63a5\u53e3\u7684\u5de5\u5177\u5b9e\u7528\u8f83\u4e3a\u4fbf\u6377\uff0c\u4e5f\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\uff1b \u800c\u4ec5\u63d0\u4f9b\u5305/\u5e93\u7684\u5de5\u5177\uff0c\u9700\u8981\u4f7f\u7528\u8005\u5728\u8bed\u8a00\u7ec8\u7aef\u6216\u7f16\u5199\u811a\u672c\u8fdb\u884c\u8c03\u7528\uff0c\u7075\u6d3b\u4f46\u9700\u8981\u4e00\u5b9a\u7f16\u7a0b\u57fa\u7840\u3002 \u8ba1\u7b97\u6548\u7387\u3002\u901a\u8fc7\u7f51\u7edc\u8c03\u7528\u7684\u8f6f\u4ef6\u53d7\u7f51\u7edc\u72b6\u6001\u5f71\u54cd\u5927\uff0c\u4e14\u5728\u5927\u6279\u91cf\u8c03\u7528\u7684\u65f6\u5019\u901f\u5ea6\u8f83\u6162\uff1b\u5b9e\u7528\u672c\u5730\u6570\u636e\u5e93\u5219\u8f83\u4e3a\u9ad8\u6548\u3002 \u6700\u521d\u6211\u60f3\u8981\u7684\u529f\u80fd\u53ea\u662f\u6839\u636e\u83b7\u53d6\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u683c\u5f0f\u7684\u8c31\u7cfb\uff0c\u53d1\u73b0\u6ca1\u6709\u73b0\u6210\u5de5\u5177\uff0c\u800c\u540e\u53c8\u6709\u65b0\u7684\u9700\u6c42\u65e0\u6cd5\u6ee1\u8db3\uff0c\u5373\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\u6240\u6709\u7684TaxID\u3002 \u6545\u5f00\u59cb\u7f16\u5199\u5de5\u5177\u6765\u5b9e\u73b0\uff0c\u5e76\u9010\u6b65\u6269\u5c55\u5176\u529f\u80fd\u3002 \u5176\u5b9e\u6700\u7b80\u5355\u7684\u65b9\u6cd5\u5c31\u662f\u81ea\u5df1\u4e0b\u8f7d\u6570\u636e\u6587\u4ef6\u8fdb\u884c\u89e3\u6790\u3002 NCBI Taxonomy \u6570\u636e\u6587\u4ef6 NCBI Taxonomy\u6570\u636e\u5e93\u5c06\u6240\u6709\u751f\u7269\u7684 \u5206\u7c7b\u5b66\u5173\u7cfb \u7ec4\u7ec7\u4e3a\u4e00\u68f5\u201c\u6709\u6839\u6811\u201d\uff08rooted tree\uff09, \u4e0e\u8fdb\u5316\u6811\uff08Phylogenetic 
tree\uff09\u4e0d\u540c: \u8fdb\u5316\u6811\u662f\u6309 \u8fdb\u5316\u5173\u7cfb \u201d\u7ec4\u7ec7\uff0c\u4e14\u53ef\u4ee5\u4e3a\u201c\u65e0\u6839\u6811\u201d(unrooted tree)\u3002 NCBI Taxonomy\u516c\u5f00\u6570\u636e\u683c\u5f0f\u6709\u4e24\u79cd\uff0c\u65e7\u7684\u540d\u79f0\u4e3a taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea650Mb\uff0c\u5185\u542b\u4ee5\u4e0b\u6587\u4ef6\u3002 nodes.dmp # [\u5f53\u524d\u7248\u672c] \u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, parent tax_id, rank names.dmp # [\u5f53\u524d\u7248\u672c] \u540d\u79f0\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, name_txt merged.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5408\u5e76\u7684\u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a old_tax_id, new_tax_id delnodes.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5220\u9664\u7684nodes\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id citations.dmp # \u5f15\u7528\u4fe1\u606f division.dmp # division\u4fe1\u606f gencode.dmp # \u9057\u4f20\u7f16\u7801\u4fe1\u606f gc.prt # \u9057\u4f20\u7f16\u7801\u8868 readme.txt # \u8bf4\u660e\u6587\u6863 \u5176\u4e2d\u6700\u4e3b\u8981\u7684\u662f\u524d4\u4e2a\u6587\u4ef6\uff1a nodes.dmp \u4e3b\u8981\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709\u5206\u7c7b\u5b66\u5355\u5143\u8282\u70b9\uff08taxon\uff09 \u7684\u552f\u4e00\u6807\u8bc6\u7b26\uff08taxonomic identifier, \u7b80\u79f0TaxId, taxid, tax_id)\uff0c \u5206\u7c7b\u5b66\u6c34\u5e73(rank\uff09\uff0c\u53ca\u5176\u7236\u8282\u70b9\u7684TaxID\u3002 names.dmp \u4e3b\u8981\u5305\u542b\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709TaxID\u53ca\u5176\u7edf\u4e00\u79d1\u5b66\u540d\u79f0\uff08scientific name\uff09\u548c\u522b\u540d\u3002 merged.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5408\u5e76\u7684TaxID\u4e0e\u5408\u5e76\u5230\u7684\u65b0TaxID\u3002 delnodes.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5220\u9664\u7684TaxID\u3002 
\u57282018\u5e742\u6708\u7684\u65f6\u5019\uff0c \u63a8\u51fa\u4e86\u65b0\u7684\u683c\u5f0f \uff0c \u989d\u5916\u5305\u542b\u4e86\u8c31\u7cfb\uff08lineage\uff09\uff0c\u7c7b\u578b\uff08type\uff09\u548c\u5bbf\u4e3b\uff08host\uff09\u4fe1\u606f\u3002 \u6587\u4ef6\u540d\u79f0\u4e3a new_taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea6110Mb\u3002 \u76f8\u5bf9\u65e7\u7248\uff0c\u65b0\u7248\u672c\u6587\u4ef6\u6570\u91cf\u548c\u5185\u5bb9\u66f4\u591a\uff0c\u4e3b\u8981\u662f\u56e0\u4e3a\u589e\u52a0\u4e86lineage\u548c\u7c7b\u578b\u4fe1\u606f\u3002 \u4e8b\u5b9e\u4e0alineage\u662f\u53ef\u4ee5\u4ece nodes.dmp \u548c names.dmp \u8ba1\u7b97\u800c\u6765\u3002 \u65b0\u7248\u683c\u5f0f\u6240\u542b\u6587\u4ef6\u5982\u4e0b\uff1a nodes.dmp names.dmp merged.dmp delnodes.dmp fullnamelineage.dmp TaxIDlineage.dmp rankedlineage.dmp host.dmp typeoftype.dmp typematerial.dmp citations.dmp division.dmp gencode.dmp readme.txt NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002 TaxonKit \u5f00\u53d1\u601d\u8def \u5927\u5bb6\u5e94\u8be5\u90fd\u6709\u5b89\u88c5\u751f\u7269\u4fe1\u606f\u8f6f\u4ef6\u7684\u75db\u82e6\u56de\u5fc6\uff0c\u5728conda\u51fa\u73b0\u4e4b\u524d\uff0c\u5f88\u591a\u8f6f\u4ef6\u90fd\u9700\u8981\u624b\u52a8\u5b89\u88c5\u4f9d\u8d56\u3001\u518d\u7f16\u8bd1\u5b89\u88c5\u3002 \u4e0d\u540c\u64cd\u4f5c\u7cfb\u7edf\uff0c\u64cd\u4f5c\u7cfb\u7edf\u7248\u672c\uff0c\u7f16\u8bd1\u5668\u7248\u672c\u7ed9\u8f6f\u4ef6\u5b89\u88c5\u5e26\u6765\u4e86\u5de8\u5927\u7684\u56f0\u96be\u3002 \u5982\u679c\u5f00\u53d1\u8005\u6ca1\u6ce8\u610f\u8f6f\u4ef6\u7684\u8de8\u5e73\u53f0\u3001\u53ef\u79fb\u690d\u6027\u66f4\u662f\u5982\u6b64\u3002 
\u597d\u7684\u8f6f\u4ef6\u4e00\u5b9a\u8981\u8003\u8651\u4ee5\u4e0b\u51e0\u4e2a\u65b9\u9762\uff1a \u5b89\u88c5\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u5b89\u88c5\u6b65\u9aa4\uff0c\u751a\u81f3\u4e00\u952e/\u4e00\u6761\u547d\u4ee4\u5b89\u88c5\u3002 \u51cf\u5c11\u5bf9\u5916\u90e8\u8f6f\u4ef6/\u5305\u7684\u4f9d\u8d56\u3002 \u5bf9\u591a\u5e73\u53f0\uff08windows/linux\uff09\u7684\u517c\u5bb9\u6027\u3002 \u5c3d\u91cf\u63d0\u4f9b\u7f16\u8bd1\u597d\u7684 \u9759\u6001\u94fe\u63a5\u53ef\u6267\u884c\u7a0b\u5e8f\uff08Statically linked executable binaries\uff09\u3002 \u914d\u7f6e\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u914d\u7f6e\uff0c\u81ea\u52a8\u5316\u914d\u7f6e\uff0c\u751a\u81f3\u96f6\u914d\u7f6e\u3002 \u4f7f\u7528\u4fbf\u5229\u6027\u3002 \u4e30\u5bcc\u7684\u6587\u6863\uff1a\u5b89\u88c5\uff0c\u4f7f\u7528\uff0c\u4f8b\u5b50\u3002 \u8f6f\u4ef6\u7ed3\u6784\u5408\u7406\uff0c\u6a21\u5757\u5316\u3002 \u53cb\u597d\u7684\u62a5\u9519\u4fe1\u606f\uff0c\u6307\u51fa\u8be6\u7ec6\u7684\u9519\u8bef\u539f\u56e0\uff0c\u800c\u4e0d\u662f\u53ea\u62a5segmentation fault\uff0c\u6216\u6254\u51fa\u4e00\u5806\u9519\u8bef\u4fe1\u606f\u3002 \u4e30\u5bcc\u7684\u547d\u4ee4\u884c\u53c2\u6570\uff0c\u6ee1\u8db3\u4e0d\u540c\u529f\u80fd\u9700\u6c42\u3002 \u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4ece\u800c\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\u3002 \u53ef\u9009\u652f\u6301shell\u8865\u5168\uff0c\u4fbf\u4e8e\u5feb\u901f\u8c03\u7528\u5b50\u547d\u4ee4\u548c\u53c2\u6570\u3002 \u8ba1\u7b97\u6548\u7387\u3002 \u5c3d\u53ef\u80fd\u5360\u7528\u4f4e\u5185\u5b58\u3001\u4f4e\u5b58\u50a8\u3002 \u5c3d\u91cf\u51cf\u5c11\u8ba1\u7b97\u65f6\u95f4\uff0c\u5145\u5206\u5229\u7528\u591aCPU\u3002 \u6301\u7eed\u7684\u652f\u6301\u3002 \u6839\u636e\u7528\u6237\u9700\u6c42\u4fee\u590dbug\u3001\u589e\u52a0\u65b0\u529f\u80fd\u3002 \u5b9a\u671f\u66f4\u65b0\u53d1\u5e03\u65b0\u7248\u672c\u3002 
\u5728\u5b9e\u73b0TaxonKit\u7684\u65f6\u5019\uff0c\u6211\u5df2\u7ecf\u5f00\u59cb\u7f16\u5199seqkit\u548ccsvtk\u8f6f\u4ef6\uff0c\u6709\u4e86\u4e00\u5b9a\u7684\u7ecf\u9a8c\uff0c\u4e5f\u57fa\u672c\u80fd\u8fbe\u5230\u4e0a\u8ff0\u6240\u6709\u8981\u6c42\u3002 TaxonKit\u4f7f\u7528Go\u8bed\u8a00\u7f16\u5199\uff0c\u8fd9\u6837\u53ef\u4ee5\u8f7b\u677e\u7f16\u8bd1\u51fa\u652f\u6301Linux, Windows, macOS\u7b49\u64cd\u4f5c\u7cfb\u7edf\u7684\u4e0d\u540c\u67b6\u6784\uff08x86/arm\uff09\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u7531\u4e8eGo\u662f\u7f16\u8bd1\u578b\u8bed\u8a00\uff0c\u5728\u8fd0\u884c\u6548\u7387\u4e0a\u4e5f\u6709\u4fdd\u8bc1\u3002 \u81f3\u4e8e\u914d\u7f6e\u3001\u4f7f\u7528\u7b49\u4fbf\u5229\u6027\u5219\u4f9d\u8d56\u4e8e\u5f00\u53d1\u8005\u3002 \u5206\u7c7b\u5b66\u6570\u636e\u4f7f\u7528NCBI taxonomy\u7684\u516c\u5f00\u6570\u636e\u3002 \u6570\u636e\u8bbf\u95ee\u65b9\u5f0f\u7684\u9009\u62e9\uff1a\u901a\u8fc7\u7f51\u7edc\u8bbf\u95ee\u5b98\u65b9Web\u63a5\u53e3\u7684\u65b9\u5f0f\u592a\u6162\uff0c\u53ea\u8003\u8651\u672c\u5730\u8bbf\u95ee\u3002 \u672c\u5730\u8bbf\u95ee\u6709\u51e0\u79cd\u65b9\u5f0f\uff1a \u76f4\u63a5\u8bbf\u95ee\u6570\u636e\u5e93\uff1a\u53c8\u5206\u5d4c\u5165\u5f0f\u6570\u636e\u5e93\u5982SQLite\uff0c\u7b2c\u4e09\u65b9\u6570\u636e\u5e93\u5982MySQL\u3002\u540e\u8005\u4e0d\u8003\u8651\uff0c\u914d\u7f6e\u592a\u9ebb\u70e6\u3002 Client-Server\u6a21\u5f0f\uff1a Web\u63a5\u53e3\uff1a\u670d\u52a1\u7aef\u542f\u52a8\u5b88\u62a4\u8fdb\u7a0b\uff0c\u957f\u671f\u4fdd\u6301\u6570\u636e\u5e93\u8fde\u63a5\uff0c\u5bf9\u5916\u63d0\u4f9bWeb\uff08RESTful\uff09\u63a5\u53e3\uff0c \u5ba2\u6237\u7aef\u672c\u5730\u6216\u8fdc\u7a0b\u8c03\u7528\u3002\u5148\u524d\u5df2\u7ecf\u5f00\u53d1\u4e86\u4e00\u4e2a\u539f\u578b\uff08https://github.com/shenwei356/gtaxon\uff09\uff0c \u4f46\u901a\u8fc7RESTful\u63a5\u53e3\uff08HTTP\uff09\u5927\u6279\u91cf\u8c03\u7528\uff0c\u8bbf\u95ee\u901f\u5ea6\u8f83\u6162\u3002 
Socket\u63a5\u53e3\uff1a\u4e0eWeb\u63a5\u53e3\u7c7b\u4f3c\uff0c\u56e0\u4e3a\u6ca1\u6709\u4f7f\u7528http\u534f\u8bae\uff0c\u901f\u5ea6\u5e94\u8be5\u4f1a\u9ad8\u4e00\u4e9b\u3002\u4f46\u6ca1\u6709\u5c1d\u8bd5\u3002 \u6700\u540e\u6d4b\u8bd5\u53d1\u73b0\uff0c\u76f4\u63a5\u89e3\u6790\u6570\u636e\u6587\u4ef6\u7684\u901f\u5ea6\u4e5f\u5f88\u5feb\uff0c5\u79d2\u5de6\u53f3\uff08\u5b58\u50a8\u4e3aNVMe SSD\uff09\uff0c\u5b8c\u5168\u6ee1\u8db3\u8981\u6c42\u3002 \u5b8c\u5168\u4e0d\u7528\u642d\u5efa\u6570\u636e\u5e93\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\u3002 \u8fd1\u65e5\u53c8\u8fdb\u4e00\u6b65\u4f18\u5316\u52302\u79d2\u5de6\u53f3\uff0c\u975e\u5e38\u5feb\u901f\u3002\u5185\u5b58\u4e5f\u5728500Mb-1.5G\u5de6\u53f3\uff0c\u5b8c\u5168\u53ef\u4ee5\u63a5\u53d7\u3002 TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c\u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\u3002 \u5c40\u9650\u6027 \u5206\u7c7b\u5b66\u6570\u636e\u5e93\u6709\u5f88\u591a\uff0cTaxonKit\u76ee\u524d\u53ea\u652f\u6301\u5e94\u7528\u6700\u5e7f\u6cdb\u7684NCBI Taxonomy\u3002 \u5bf9\u4e8eGTDB Taxonomy\uff0c\u53ef\u4ee5\u901a\u8fc7\u73b0\u6709\u5de5\u5177\uff0c\u5982 gtdb_to_taxdump \uff0c \u5c06\u5176\u6570\u636e\u8f6c\u6362\u4e3aNCBI taxdump\u6587\u4ef6\u3002","title":"\u5f00\u53d1\u7b14\u8bb0"},{"location":"chinese-dev/#_1","text":"\u60f3\u8981\u4eceNCBI\u83b7\u53d6\u751f\u7269\u7684\u8c31\u7cfb\u4fe1\u606f\uff0c\u53ef\u4ee5\u5728 NCBI Taxonomy\u7f51\u7ad9\u4e0a\u7528TaxID\u6216\u8005\u540d\u79f0\u67e5\u8be2\u3002 \u6bd4\u5982\u53ef\u4ee5\u7528 Homo sapiens \u6216 9606 \u641c\u7d22\u201c\u4eba\u201d\u7684\u5206\u7c7b\u5b66\u4fe1\u606f\uff0c\u4ee5\u53ca\u5bc6\u7801\u5b50\u8868\uff0cEntrez\u8bb0\u5f55\u7edf\u8ba1\u7b49\u3002 
\u540c\u65f6\u4e5f\u53ef\u4ee5\u901a\u8fc7NCBI\u7684\u5b98\u65b9\u5de5\u5177\u5305 E-utilities ( ftp )\u3002 $ esearch -db taxonomy -query \"txid9606 [Organism]\" \\ | efetch -format xml \\ | xtract -pattern Lineage -element Lineage \u6b64\u5916\u4e5f\u6709\u4e00\u4e9b\u5de5\u5177\u63d0\u4f9b\u7c7b\u4f3c\u7684\u529f\u80fd\uff0c\u90e8\u5206\u8f6f\u4ef6\uff1a \u5de5\u5177 \u7f16\u7a0b\u8bed\u8a00 \u6570\u636e\u83b7\u53d6\u65b9\u5f0f \u4f7f\u7528\u65b9\u5f0f \u5907\u6ce8 E-utilities shell/Perl/C++ \u8fdc\u7a0bWeb\u8c03\u7528 \u547d\u4ee4\u884c \u5b98\u65b9\u7a0b\u5e8f\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd BioPython Python \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c \u5305\u88c5entrez\u63a5\u53e3\uff0cTaxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd ETE Toolkit Python \u672c\u5730\u6570\u636e\u5e93 \u811a\u672c/\u547d\u4ee4\u884c Taxonomy\u64cd\u4f5c\u4ec5\u4e3a\u5176\u90e8\u5206\u529f\u80fd Taxize R \u8fdc\u7a0bWeb\u8c03\u7528 \u811a\u672c ropensci\uff1b\u652f\u6301\u591a\u79cd\u6570\u636e\u5e93\uff1b\u529f\u80fd\u8f83\u4e30\u5bcc Taxopy Python \u672c\u5730\u6570\u636e\u6587\u4ef6 \u811a\u672c/\u547d\u4ee4\u884c \u4ec5\u57fa\u672c\u529f\u80fd \u9009\u62e9\u5de5\u5177\u4e00\u822c\u8003\u8651\u51e0\u4e2a\u65b9\u9762\uff1a \u662f\u5426\u6ee1\u8db3\u529f\u80fd\u9700\u6c42\u3002\u5927\u591a\u5de5\u5177\u4ec5\u6709\u57fa\u672c\u7684\u67e5\u8be2\u3001\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u7684\u529f\u80fd\uff0c\u90fd\u6ca1\u6cd5\u5c06\u5b8c\u6574\u8c31\u7cfb\u683c\u5f0f\u5316\u4e3a\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u7684\u683c\u5f0f\uff1b \u8f6f\u4ef6\u5b89\u88c5\u4fbf\u5229\u6027\u3002\u4e0a\u8ff0\u5de5\u5177\u90fd\u4e0d\u9700\u8981\u624b\u52a8\u7f16\u8bd1\u5b89\u88c5\uff0c\u9664\u4e86E-utilities\u7684\u90e8\u5206\u7ec4\u4ef6\u9700\u8981\u624b\u52a8\u5b8c\u6210\uff0c\u5176\u5b83\u57fa\u672c\u90fd\u80fd\u7528\u5bf9\u5e94\u7f16\u7a0b\u8bed\u8a00\u7684\u5305\u7ba1\u7406\u5de5\u5177\u5b89\u88c5\uff1b 
\u914d\u7f6e\u4fbf\u5229\u6027\u3002\u90e8\u5206\u5efa\u7acb\u672c\u5730\u6570\u636e\u5e93\u7684\u8f6f\u4ef6\u5219\u9700\u8981\u5148\u6784\u5efa\u6570\u636e\u5e93\uff0c\u4e0d\u8fc7\u57fa\u672c\u90fd\u662f\u5d4c\u5165\u5f0f\u7684sqlite\uff0c\u6bd4\u8f83\u7b80\u5355\u5feb\u6377\uff0c\u7a7a\u95f4\u5360\u7528\u4e5f\u80fd\u63a5\u53d7\uff1b \u4f7f\u7528\u4fbf\u5229\u6027\u3002\u63d0\u4f9b\u547d\u4ee4\u884c\u63a5\u53e3\u7684\u5de5\u5177\u5b9e\u7528\u8f83\u4e3a\u4fbf\u6377\uff0c\u4e5f\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\uff1b \u800c\u4ec5\u63d0\u4f9b\u5305/\u5e93\u7684\u5de5\u5177\uff0c\u9700\u8981\u4f7f\u7528\u8005\u5728\u8bed\u8a00\u7ec8\u7aef\u6216\u7f16\u5199\u811a\u672c\u8fdb\u884c\u8c03\u7528\uff0c\u7075\u6d3b\u4f46\u9700\u8981\u4e00\u5b9a\u7f16\u7a0b\u57fa\u7840\u3002 \u8ba1\u7b97\u6548\u7387\u3002\u901a\u8fc7\u7f51\u7edc\u8c03\u7528\u7684\u8f6f\u4ef6\u53d7\u7f51\u7edc\u72b6\u6001\u5f71\u54cd\u5927\uff0c\u4e14\u5728\u5927\u6279\u91cf\u8c03\u7528\u7684\u65f6\u5019\u901f\u5ea6\u8f83\u6162\uff1b\u5b9e\u7528\u672c\u5730\u6570\u636e\u5e93\u5219\u8f83\u4e3a\u9ad8\u6548\u3002 \u6700\u521d\u6211\u60f3\u8981\u7684\u529f\u80fd\u53ea\u662f\u6839\u636e\u83b7\u53d6\"\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\"\u683c\u5f0f\u7684\u8c31\u7cfb\uff0c\u53d1\u73b0\u6ca1\u6709\u73b0\u6210\u5de5\u5177\uff0c\u800c\u540e\u53c8\u6709\u65b0\u7684\u9700\u6c42\u65e0\u6cd5\u6ee1\u8db3\uff0c\u5373\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\u6240\u6709\u7684TaxID\u3002 \u6545\u5f00\u59cb\u7f16\u5199\u5de5\u5177\u6765\u5b9e\u73b0\uff0c\u5e76\u9010\u6b65\u6269\u5c55\u5176\u529f\u80fd\u3002 \u5176\u5b9e\u6700\u7b80\u5355\u7684\u65b9\u6cd5\u5c31\u662f\u81ea\u5df1\u4e0b\u8f7d\u6570\u636e\u6587\u4ef6\u8fdb\u884c\u89e3\u6790\u3002","title":"\u73b0\u6709\u5de5\u5177\u6bd4\u8f83"},{"location":"chinese-dev/#ncbi-taxonomy","text":"NCBI Taxonomy\u6570\u636e\u5e93\u5c06\u6240\u6709\u751f\u7269\u7684 \u5206\u7c7b\u5b66\u5173\u7cfb 
\u7ec4\u7ec7\u4e3a\u4e00\u68f5\u201c\u6709\u6839\u6811\u201d\uff08rooted tree\uff09, \u4e0e\u8fdb\u5316\u6811\uff08Phylogenetic tree\uff09\u4e0d\u540c: \u8fdb\u5316\u6811\u662f\u6309 \u8fdb\u5316\u5173\u7cfb \u201d\u7ec4\u7ec7\uff0c\u4e14\u53ef\u4ee5\u4e3a\u201c\u65e0\u6839\u6811\u201d(unrooted tree)\u3002 NCBI Taxonomy\u516c\u5f00\u6570\u636e\u683c\u5f0f\u6709\u4e24\u79cd\uff0c\u65e7\u7684\u540d\u79f0\u4e3a taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea650Mb\uff0c\u5185\u542b\u4ee5\u4e0b\u6587\u4ef6\u3002 nodes.dmp # [\u5f53\u524d\u7248\u672c] \u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, parent tax_id, rank names.dmp # [\u5f53\u524d\u7248\u672c] \u540d\u79f0\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id, name_txt merged.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5408\u5e76\u7684\u8282\u70b9\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a old_tax_id, new_tax_id delnodes.dmp # [\u76ee\u524d\u4e3a\u6b62] \u88ab\u5220\u9664\u7684nodes\u4fe1\u606f # \u91cd\u8981\u5185\u5bb9\uff1a tax_id citations.dmp # \u5f15\u7528\u4fe1\u606f division.dmp # division\u4fe1\u606f gencode.dmp # \u9057\u4f20\u7f16\u7801\u4fe1\u606f gc.prt # \u9057\u4f20\u7f16\u7801\u8868 readme.txt # \u8bf4\u660e\u6587\u6863 \u5176\u4e2d\u6700\u4e3b\u8981\u7684\u662f\u524d4\u4e2a\u6587\u4ef6\uff1a nodes.dmp \u4e3b\u8981\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709\u5206\u7c7b\u5b66\u5355\u5143\u8282\u70b9\uff08taxon\uff09 \u7684\u552f\u4e00\u6807\u8bc6\u7b26\uff08taxonomic identifier, \u7b80\u79f0TaxId, taxid, tax_id)\uff0c \u5206\u7c7b\u5b66\u6c34\u5e73(rank\uff09\uff0c\u53ca\u5176\u7236\u8282\u70b9\u7684TaxID\u3002 names.dmp \u4e3b\u8981\u5305\u542b\u5305\u542b\u5f53\u524d\u7248\u672c\u7684\u6240\u6709TaxID\u53ca\u5176\u7edf\u4e00\u79d1\u5b66\u540d\u79f0\uff08scientific name\uff09\u548c\u522b\u540d\u3002 merged.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5408\u5e76\u7684TaxID\u4e0e\u5408\u5e76\u5230\u7684\u65b0TaxID\u3002 
delnodes.dmp \u5305\u542b\u4e86\u5230\u5f53\u524d\u7248\u672c\u4e3a\u6b62\uff0c\u6240\u6709\u88ab\u5220\u9664\u7684TaxID\u3002 \u57282018\u5e742\u6708\u7684\u65f6\u5019\uff0c \u63a8\u51fa\u4e86\u65b0\u7684\u683c\u5f0f \uff0c \u989d\u5916\u5305\u542b\u4e86\u8c31\u7cfb\uff08lineage\uff09\uff0c\u7c7b\u578b\uff08type\uff09\u548c\u5bbf\u4e3b\uff08host\uff09\u4fe1\u606f\u3002 \u6587\u4ef6\u540d\u79f0\u4e3a new_taxdump.tar.gz \uff0c\u6587\u4ef6\u5927\u5c0f\u7ea6110Mb\u3002 \u76f8\u5bf9\u65e7\u7248\uff0c\u65b0\u7248\u672c\u6587\u4ef6\u6570\u91cf\u548c\u5185\u5bb9\u66f4\u591a\uff0c\u4e3b\u8981\u662f\u56e0\u4e3a\u589e\u52a0\u4e86lineage\u548c\u7c7b\u578b\u4fe1\u606f\u3002 \u4e8b\u5b9e\u4e0alineage\u662f\u53ef\u4ee5\u4ece nodes.dmp \u548c names.dmp \u8ba1\u7b97\u800c\u6765\u3002 \u65b0\u7248\u683c\u5f0f\u6240\u542b\u6587\u4ef6\u5982\u4e0b\uff1a nodes.dmp names.dmp merged.dmp delnodes.dmp fullnamelineage.dmp TaxIDlineage.dmp rankedlineage.dmp host.dmp typeoftype.dmp typematerial.dmp citations.dmp division.dmp gencode.dmp readme.txt NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002","title":"NCBI Taxonomy \u6570\u636e\u6587\u4ef6"},{"location":"chinese-dev/#taxonkit","text":"\u5927\u5bb6\u5e94\u8be5\u90fd\u6709\u5b89\u88c5\u751f\u7269\u4fe1\u606f\u8f6f\u4ef6\u7684\u75db\u82e6\u56de\u5fc6\uff0c\u5728conda\u51fa\u73b0\u4e4b\u524d\uff0c\u5f88\u591a\u8f6f\u4ef6\u90fd\u9700\u8981\u624b\u52a8\u5b89\u88c5\u4f9d\u8d56\u3001\u518d\u7f16\u8bd1\u5b89\u88c5\u3002 \u4e0d\u540c\u64cd\u4f5c\u7cfb\u7edf\uff0c\u64cd\u4f5c\u7cfb\u7edf\u7248\u672c\uff0c\u7f16\u8bd1\u5668\u7248\u672c\u7ed9\u8f6f\u4ef6\u5b89\u88c5\u5e26\u6765\u4e86\u5de8\u5927\u7684\u56f0\u96be\u3002 
\u5982\u679c\u5f00\u53d1\u8005\u6ca1\u6ce8\u610f\u8f6f\u4ef6\u7684\u8de8\u5e73\u53f0\u3001\u53ef\u79fb\u690d\u6027\u66f4\u662f\u5982\u6b64\u3002 \u597d\u7684\u8f6f\u4ef6\u4e00\u5b9a\u8981\u8003\u8651\u4ee5\u4e0b\u51e0\u4e2a\u65b9\u9762\uff1a \u5b89\u88c5\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u5b89\u88c5\u6b65\u9aa4\uff0c\u751a\u81f3\u4e00\u952e/\u4e00\u6761\u547d\u4ee4\u5b89\u88c5\u3002 \u51cf\u5c11\u5bf9\u5916\u90e8\u8f6f\u4ef6/\u5305\u7684\u4f9d\u8d56\u3002 \u5bf9\u591a\u5e73\u53f0\uff08windows/linux\uff09\u7684\u517c\u5bb9\u6027\u3002 \u5c3d\u91cf\u63d0\u4f9b\u7f16\u8bd1\u597d\u7684 \u9759\u6001\u94fe\u63a5\u53ef\u6267\u884c\u7a0b\u5e8f\uff08Statically linked executable binaries\uff09\u3002 \u914d\u7f6e\u4fbf\u5229\u6027\u3002 \u5c3d\u53ef\u80fd\u7b80\u5316\u914d\u7f6e\uff0c\u81ea\u52a8\u5316\u914d\u7f6e\uff0c\u751a\u81f3\u96f6\u914d\u7f6e\u3002 \u4f7f\u7528\u4fbf\u5229\u6027\u3002 \u4e30\u5bcc\u7684\u6587\u6863\uff1a\u5b89\u88c5\uff0c\u4f7f\u7528\uff0c\u4f8b\u5b50\u3002 \u8f6f\u4ef6\u7ed3\u6784\u5408\u7406\uff0c\u6a21\u5757\u5316\u3002 \u53cb\u597d\u7684\u62a5\u9519\u4fe1\u606f\uff0c\u6307\u51fa\u8be6\u7ec6\u7684\u9519\u8bef\u539f\u56e0\uff0c\u800c\u4e0d\u662f\u53ea\u62a5segmentation fault\uff0c\u6216\u6254\u51fa\u4e00\u5806\u9519\u8bef\u4fe1\u606f\u3002 \u4e30\u5bcc\u7684\u547d\u4ee4\u884c\u53c2\u6570\uff0c\u6ee1\u8db3\u4e0d\u540c\u529f\u80fd\u9700\u6c42\u3002 \u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4ece\u800c\u4fbf\u4e8e\u6574\u5408\u5230\u5206\u6790\u6d41\u7a0b\u3002 \u53ef\u9009\u652f\u6301shell\u8865\u5168\uff0c\u4fbf\u4e8e\u5feb\u901f\u8c03\u7528\u5b50\u547d\u4ee4\u548c\u53c2\u6570\u3002 \u8ba1\u7b97\u6548\u7387\u3002 \u5c3d\u53ef\u80fd\u5360\u7528\u4f4e\u5185\u5b58\u3001\u4f4e\u5b58\u50a8\u3002 \u5c3d\u91cf\u51cf\u5c11\u8ba1\u7b97\u65f6\u95f4\uff0c\u5145\u5206\u5229\u7528\u591aCPU\u3002 \u6301\u7eed\u7684\u652f\u6301\u3002 \u6839\u636e\u7528\u6237\u9700\u6c42\u4fee\u590dbug\u3001\u589e\u52a0\u65b0\u529f\u80fd\u3002 
\u5b9a\u671f\u66f4\u65b0\u53d1\u5e03\u65b0\u7248\u672c\u3002 \u5728\u5b9e\u73b0TaxonKit\u7684\u65f6\u5019\uff0c\u6211\u5df2\u7ecf\u5f00\u59cb\u7f16\u5199seqkit\u548ccsvtk\u8f6f\u4ef6\uff0c\u6709\u4e86\u4e00\u5b9a\u7684\u7ecf\u9a8c\uff0c\u4e5f\u57fa\u672c\u80fd\u8fbe\u5230\u4e0a\u8ff0\u6240\u6709\u8981\u6c42\u3002 TaxonKit\u4f7f\u7528Go\u8bed\u8a00\u7f16\u5199\uff0c\u8fd9\u6837\u53ef\u4ee5\u8f7b\u677e\u7f16\u8bd1\u51fa\u652f\u6301Linux, Windows, macOS\u7b49\u64cd\u4f5c\u7cfb\u7edf\u7684\u4e0d\u540c\u67b6\u6784\uff08x86/arm\uff09\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u7531\u4e8eGo\u662f\u7f16\u8bd1\u578b\u8bed\u8a00\uff0c\u5728\u8fd0\u884c\u6548\u7387\u4e0a\u4e5f\u6709\u4fdd\u8bc1\u3002 \u81f3\u4e8e\u914d\u7f6e\u3001\u4f7f\u7528\u7b49\u4fbf\u5229\u6027\u5219\u4f9d\u8d56\u4e8e\u5f00\u53d1\u8005\u3002 \u5206\u7c7b\u5b66\u6570\u636e\u4f7f\u7528NCBI taxonomy\u7684\u516c\u5f00\u6570\u636e\u3002 \u6570\u636e\u8bbf\u95ee\u65b9\u5f0f\u7684\u9009\u62e9\uff1a\u901a\u8fc7\u7f51\u7edc\u8bbf\u95ee\u5b98\u65b9Web\u63a5\u53e3\u7684\u65b9\u5f0f\u592a\u6162\uff0c\u53ea\u8003\u8651\u672c\u5730\u8bbf\u95ee\u3002 \u672c\u5730\u8bbf\u95ee\u6709\u51e0\u79cd\u65b9\u5f0f\uff1a \u76f4\u63a5\u8bbf\u95ee\u6570\u636e\u5e93\uff1a\u53c8\u5206\u5d4c\u5165\u5f0f\u6570\u636e\u5e93\u5982SQLite\uff0c\u7b2c\u4e09\u65b9\u6570\u636e\u5e93\u5982MySQL\u3002\u540e\u8005\u4e0d\u8003\u8651\uff0c\u914d\u7f6e\u592a\u9ebb\u70e6\u3002 Client-Server\u6a21\u5f0f\uff1a Web\u63a5\u53e3\uff1a\u670d\u52a1\u7aef\u542f\u52a8\u5b88\u62a4\u8fdb\u7a0b\uff0c\u957f\u671f\u4fdd\u6301\u6570\u636e\u5e93\u8fde\u63a5\uff0c\u5bf9\u5916\u63d0\u4f9bWeb\uff08RESTful\uff09\u63a5\u53e3\uff0c \u5ba2\u6237\u7aef\u672c\u5730\u6216\u8fdc\u7a0b\u8c03\u7528\u3002\u5148\u524d\u5df2\u7ecf\u5f00\u53d1\u4e86\u4e00\u4e2a\u539f\u578b\uff08https://github.com/shenwei356/gtaxon\uff09\uff0c \u4f46\u901a\u8fc7RESTful\u63a5\u53e3\uff08HTTP\uff09\u5927\u6279\u91cf\u8c03\u7528\uff0c\u8bbf\u95ee\u901f\u5ea6\u8f83\u6162\u3002 
Socket\u63a5\u53e3\uff1a\u4e0eWeb\u63a5\u53e3\u7c7b\u4f3c\uff0c\u56e0\u4e3a\u6ca1\u6709\u4f7f\u7528http\u534f\u8bae\uff0c\u901f\u5ea6\u5e94\u8be5\u4f1a\u9ad8\u4e00\u4e9b\u3002\u4f46\u6ca1\u6709\u5c1d\u8bd5\u3002 \u6700\u540e\u6d4b\u8bd5\u53d1\u73b0\uff0c\u76f4\u63a5\u89e3\u6790\u6570\u636e\u6587\u4ef6\u7684\u901f\u5ea6\u4e5f\u5f88\u5feb\uff0c5\u79d2\u5de6\u53f3\uff08\u5b58\u50a8\u4e3aNVMe SSD\uff09\uff0c\u5b8c\u5168\u6ee1\u8db3\u8981\u6c42\u3002 \u5b8c\u5168\u4e0d\u7528\u642d\u5efa\u6570\u636e\u5e93\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\u3002 \u8fd1\u65e5\u53c8\u8fdb\u4e00\u6b65\u4f18\u5316\u52302\u79d2\u5de6\u53f3\uff0c\u975e\u5e38\u5feb\u901f\u3002\u5185\u5b58\u4e5f\u5728500Mb-1.5G\u5de6\u53f3\uff0c\u5b8c\u5168\u53ef\u4ee5\u63a5\u53d7\u3002 TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c\u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\u3002","title":"TaxonKit \u5f00\u53d1\u601d\u8def"},{"location":"chinese-dev/#_2","text":"\u5206\u7c7b\u5b66\u6570\u636e\u5e93\u6709\u5f88\u591a\uff0cTaxonKit\u76ee\u524d\u53ea\u652f\u6301\u5e94\u7528\u6700\u5e7f\u6cdb\u7684NCBI Taxonomy\u3002 \u5bf9\u4e8eGTDB Taxonomy\uff0c\u53ef\u4ee5\u901a\u8fc7\u73b0\u6709\u5de5\u5177\uff0c\u5982 gtdb_to_taxdump \uff0c \u5c06\u5176\u6570\u636e\u8f6c\u6362\u4e3aNCBI taxdump\u6587\u4ef6\u3002","title":"\u5c40\u9650\u6027"},{"location":"chinese/","text":"TaxonKit: \u5c0f\u5de7\u3001\u9ad8\u6548\u3001\u5b9e\u7528\u7684NCBI\u5206\u7c7b\u5b66\u6570\u636e\u547d\u4ee4\u884c\u5de5\u5177\u96c6 NCBI Taxonomy \u6570\u636e\u5e93 \u4ece\u4e8b\u751f\u7269\u591a\u6837\u6027\u7684\u7814\u7a76\u8005\u5bf9NCBI Taxonomy\u6570\u636e\u5e93\u4e00\u5b9a\u4e0d\u4f1a\u964c\u751f\uff0c 
\u5b83\u5305\u542b\u4e86NCBI\u6240\u6709\u6838\u9178\u548c\u86cb\u767d\u5e8f\u5217\u6570\u636e\u5e93\u4e2d\u6bcf\u6761\u5e8f\u5217\u5bf9\u5e94\u7684\u7269\u79cd\u540d\u79f0\u4e0e\u5206\u7c7b\u5b66\u4fe1\u606f\u3002 \u5927\u591a\u6570\u751f\u6001\u5b66\u7814\u7a76\u5bf9\u7269\u79cd\u7ec4\u6210\u7684\u63cf\u8ff0\u90fd\u662f\u57fa\u4e8eNCBI Taxonomy\u6570\u636e\u5e93\uff0c \u5f53\u7136\u76ee\u524d\u4e5f\u5f00\u59cb\u4f7f\u7528\u5176\u4ed6\u6570\u636e\u5e93\uff0c\u5982GTDB\u7b49\u3002 NCBI Taxonomy\u6570\u636e\u5e93\u59cb\u4e8e1991\u5e74\uff0c\u4e00\u76f4\u968f\u7740Entrez\u6570\u636e\u5e93\u548c\u5176\u4ed6\u6570\u636e\u5e93\u66f4\u65b0\uff0c 1996\u5e74\u63a8\u51fa\u7f51\u9875\u7248\u3002NCBI Taxonomy\u6570\u636e\u5e93\u5b98\u65b9\u5730\u5740\u4e3a https://www.ncbi.nlm.nih.gov/taxonomy \uff0c \u516c\u5f00\u6570\u636e\u4e0b\u8f7d\u5730\u5740\u4e3a https://ftp.ncbi.nih.gov/pub/taxonomy/ \uff0c \u6570\u636e\u6bcf\u5c0f\u65f6\u66f4\u65b0\uff0c\u6bcf\u4e2a\u6708\u521d\u751f\u6210\u4e00\u4efd\u6570\u636e\u5f52\u6863\u5b58\u4e8e taxdump_archive \u76ee\u5f55\uff0c\u6700\u65e9\u53ef\u8ffd\u6eaf\u52302014\u5e748\u6708\u3002 TaxonKit \u4f7f\u7528 TaxonKit\u662f\u91c7\u7528Go\u8bed\u8a00\u7f16\u5199\u7684\u547d\u4ee4\u884c\u5de5\u5177\uff0c \u63d0\u4f9bLinux, Windows, macOS\u64cd\u4f5c\u7cfb\u7edf\u4e0d\u540c\u67b6\u6784\uff08x86-64/arm64\uff09\u7684\u9759\u6001\u7f16\u8bd1\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u53d1\u5e03\u7684\u538b\u7f29\u5305\u4e0d\u8db33Mb\uff0c\u9664\u4e86Github\u6258\u7ba1\u5916\uff0c\u8fd8\u63d0\u4f9b\u56fd\u5185\u955c\u50cf\u4f9b\u4e0b\u8f7d\uff0c\u540c\u65f6\u8fd8\u652f\u6301conda\u548chomebrew\u5b89\u88c5\u3002 \u7528\u6237\u53ea\u9700\u8981 \u4e0b\u8f7d\u3001\u89e3\u538b\uff0c\u5f00\u7bb1\u5373\u7528\uff0c\u65e0\u9700\u914d\u7f6e \uff0c\u4ec5\u9700\u4e0b\u8f7d\u89e3\u538bNCBI Taxonomy\u6570\u636e\u6587\u4ef6\u89e3\u538b\u5230\u6307\u5b9a\u76ee\u5f55\u5373\u53ef\u3002 \u6e90\u4ee3\u7801 https://github.com/shenwei356/taxonkit 
\uff0c \u6587\u6863 http://bioinf.shenwei.me/taxonkit \uff08\u4ecb\u7ecd\u3001\u4f7f\u7528\u8bf4\u660e\u3001\u4f8b\u5b50\u3001\u6559\u7a0b\uff09 \u9009\u62e9\u7cfb\u7edf\u5bf9\u5e94\u7684\u7248\u672c\u4e0b\u8f7d\u6700\u65b0\u7248 https://github.com/shenwei356/taxonkit/releases \uff0c\u89e3\u538b\u540e\u6dfb\u52a0\u73af\u5883\u53d8\u91cf\u5373\u53ef\u4f7f\u7528\u3002\u6216\u53ef\u9009conda\u5b89\u88c5 conda install taxonkit -c bioconda -y # \u8868\u683c\u6570\u636e\u5904\u7406\uff0c\u63a8\u8350\u4f7f\u7528 csvtk \u66f4\u9ad8\u6548 conda install csvtk -c bioconda -y \u6d4b\u8bd5\u6570\u636e\u4e0b\u8f7d\u53ef\u76f4\u63a5 https://github.com/shenwei356/taxonkit \u4e0b\u8f7d\u9879\u76ee\u538b\u7f29\u5305\uff0c\u6216\u4f7f\u7528git clone\u4e0b\u8f7d\u9879\u76ee\u6587\u4ef6\u5939\uff0c\u5176\u4e2d\u7684example\u4e3a\u6d4b\u8bd5\u6570\u636e git clone https://github.com/shenwei356/taxonkit TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c \u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\uff0c \u8f7b\u677e\u6574\u5408\u8fdb\u5206\u6790\u6d41\u7a0b\u4e2d\u3002 \u5b50\u547d\u4ee4 \u529f\u80fd list \u5217\u51fa\u6307\u5b9aTaxId\u4e0b\u6240\u6709\u5b50\u5355\u5143\u7684\u7684TaxID lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\uff08lineage\uff09 reformat \u5c06\u5b8c\u6574\u8c31\u7cfb\u8f6c\u5316\u4e3a\u201c\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u682a\"\u7684\u81ea\u5b9a\u4e49\u683c\u5f0f name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA) taxid-changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55 version \u663e\u793a\u7248\u672c\u4fe1\u606f\u3001\u68c0\u6d4b\u65b0\u7248\u672c 
genautocomplete \u751f\u6210shell\u81ea\u52a8\u8865\u5168\u914d\u7f6e\u811a\u672c \u5907\u6ce8\uff1a \u8f93\u51fa\uff1a \u6240\u6709\u547d\u4ee4\u8f93\u51fa\u4e2d\u5305\u542b\u8f93\u5165\u6570\u636e\u5185\u5bb9\uff0c\u5728\u6b64\u57fa\u7840\u4e0a\u589e\u52a0\u5217\u3002 \u6240\u6709\u547d\u4ee4\u9ed8\u8ba4\u8f93\u51fa\u5230\u6807\u51c6\u8f93\u51fa\uff08stdout\uff09\uff0c\u53ef\u901a\u8fc7\u91cd\u5b9a\u5411\uff08 > \uff09\u5199\u5165\u6587\u4ef6\u3002 \u6216\u901a\u8fc7\u5168\u5c40\u53c2\u6570 -o \u6216 --out-file \u6307\u5b9a\u8f93\u51fa\u6587\u4ef6\uff0c\u4e14\u53ef\u81ea\u52a8\u8bc6\u522b\u8f93\u51fa\u6587\u4ef6\u540e\u7f00\uff08 .gz \uff09\u8f93\u51fagzip\u683c\u5f0f\u3002 \u8f93\u5165\uff1a \u9664\u4e86 list \u4e0e taxid-changelog \u4e4b\u5916\uff0c lineage , reformat , name2taxid , filter \u4e0e lca \u5747\u53ef\u4ece\u6807\u51c6\u8f93\u5165\uff08stdin\uff09\u8bfb\u53d6\u8f93\u5165\u6570\u636e\uff0c\u4e5f\u53ef\u901a\u8fc7\u4f4d\u7f6e\u53c2\u6570\uff08positional arguments\uff09\u8f93\u5165\uff0c\u5373\u547d\u4ee4\u540e\u9762\u4e0d\u5e26 \u4efb\u4f55flag\u7684\u53c2\u6570\uff0c\u5982 taxonkit lineage taxids.txt \u8f93\u5165\u683c\u5f0f\u4e3a\u5355\u5217\uff0c\u6216\u8005\u5236\u8868\u7b26\u5206\u9694\u7684\u683c\u5f0f\uff0c\u8f93\u5165\u6570\u636e\u6240\u5728\u5217\u7528 -i \u6216 --taxid-field \u6307\u5b9a\u3002 TaxonKit\u76f4\u63a5\u89e3\u6790NCBI Taxonomy\u6570\u636e\u6587\u4ef6\uff082\u79d2\u5de6\u53f3\uff09\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\uff0c\u5360\u7528\u5185\u5b58\u5728500Mb-1.5G\u5de6\u53f3\u3002 \u6570\u636e\u4e0b\u8f7d\uff1a # \u6709\u65f6\u4e0b\u8f7d\u5931\u8d25\uff0c\u53ef\u591a\u8bd5\u51e0\u6b21\uff1b\u6216\u5c1d\u8bd5\u6d4f\u89c8\u5668\u4e0b\u8f7d\u6b64\u94fe\u63a5 wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz # \u89e3\u538b\u6587\u4ef6\u5b58\u4e8e\u5bb6\u76ee\u5f55\u4e2d.taxonkit/\uff0c\u7a0b\u5e8f\u9ed8\u8ba4\u6570\u636e\u5e93\u9ed8\u8ba4\u76ee\u5f55 
mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit list \u5217\u51fa\u6307\u5b9aTaxId\u6240\u5728\u5b50\u6811\u7684\u6240\u6709TaxID taxonkit list \u7528\u4e8e\u5217\u51fa\u6307\u5b9aTaxID\u6240\u5728\u5206\u7c7b\u5b66\u5355\u5143\uff08taxon\uff09\u7684\u5b50\u6811\uff08subtree\uff09\u7684\u6240\u6709taxon\u7684TaxID\uff0c\u53ef\u9009\u663e\u793a\u540d\u79f0\u548c\u5206\u7c7b\u5b66\u6c34\u5e73\u3002 \u6b64\u529f\u80fd\u4e0eNCBI Taxonomy\u7f51\u9875\u7248\u7c7b\u4f3c\u3002 \u5982\uff0c # \u4ee5\u4eba\u5c5e(9605)\u548c\u80a0\u9053\u4e2d\u8457\u540d\u7684Akk\u83cc\u5c5e(239934)\u4e3a\u4f8b $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 ... 
list\u4f7f\u7528\u6700\u5e7f\u6cdb\u7684\u7684\u529f\u80fd\u662f\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\uff08\u6bd4\u5982\u7ec6\u83cc\u3001\u75c5\u6bd2\u3001\u67d0\u4e2a\u5c5e\u7b49\uff09\u4e0b\u6240\u6709\u7684TaxID\uff0c \u7528\u6765\u4eceNCBI nt/nr\u4e2d\u83b7\u53d6\u5bf9\u5e94\u7684\u6838\u9178/\u86cb\u767d\u5e8f\u5217\uff0c\u4ece\u800c\u642d\u5efa\u7279\u5f02\u6027\u7684BLAST\u6570\u636e\u5e93\u3002 \u5b98\u7f51\u63d0\u4f9b\u4e86\u76f8\u5e94\u7684\u8be6\u7ec6\u6b65\u9aa4\uff1a http://bioinf.shenwei.me/taxonkit/tutorial \u3002 # \u6240\u6709\u7ec6\u83cc\u7684TaxID $ taxonkit list --show-rank --show-name --ids 2 > /dev/null lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb \u5206\u7c7b\u5b66\u6570\u636e\u76f8\u5173\u6700\u5e38\u89c1\u7684\u529f\u80fd\u5c31\u662f\u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u3002 TaxonKit\u53ef\u6839\u636e\u8f93\u5165\u6587\u4ef6\u63d0\u4f9b\u7684TaxID\u5217\u8868\u5feb\u901f\u8ba1\u7b97lineage\uff0c\u5e76\u53ef\u9009\u63d0\u4f9b\u540d\u79f0\uff0c\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u4ee5\u53ca\u8c31\u7cfb\u5bf9\u5e94\u7684TaxID\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u968f\u7740Taxonomy\u6570\u636e\u7684\u9891\u7e41\u66f4\u65b0\uff0c\u6709\u7684TaxID\u53ef\u80fd\u88ab\u5220\u9664\u3001\u6216\u5408\u5e76\uff08merge\uff09\u5230\u5176\u5b83TaxID\u4e2d\uff0c TaxonKit\u4f1a\u81ea\u52a8\u8bc6\u522b\uff0c\u5e76\u8fdb\u884c\u63d0\u793a\uff0c\u5bf9\u4e8e\u88ab\u5408\u5e76\u7684TaxID\uff0cTaxonKit\u4f1a\u6309\u65b0TaxID\u8fdb\u884c\u8ba1\u7b97\u3002 # \u4f7f\u7528example\u4e2d\u7684\u6d4b\u8bd5\u6570\u636e $ head taxids.txt 9606 9913 376619 # \u67e5\u627e\u6307\u5b9ataxids\u5217\u8868\u7684\u7269\u79cd\u4fe1\u606f\uff0ctee\u53ef\u8f93\u51fa\u5c4f\u5e55\u5e76\u5199\u5165\u6587\u4ef6 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 
was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas 
raichei \u4e0e\u5176\u5b83\u8f6f\u4ef6\u7684\u6027\u80fd\u76f8\u6bd4\uff0c\u5f53\u67e5\u8be2\u6570\u91cf\u8f83\u5c11\u65f6ETE\u8f83\u5feb\uff0c\u6570\u91cf\u8f83\u591a\u65f6\u5219TaxonKit\u66f4\u5feb\u3002 \u5728\u4e0d\u540c\u6570\u636e\u91cf\u89c4\u6a21\u4e0a TaxonKit\u901f\u5ea6\u4e00\u76f4\u5f88\u7a33\u5b9a\uff0c\u5747\u4e3a2-3\u79d2\uff0c\u65f6\u95f4\u4e3b\u8981\u82b1\u5728\u89e3\u6790Taxonomy\u6570\u636e\u6587\u4ef6\u4e0a\u3002 \u5217\u51falineage\u6bcf\u4e2a\u5206\u7c7b\u5b66\u5355\u5143\u7684TaxId\u548crank\u548c\u540d\u79f0\uff0c\u6bd4\u5982SARS-COV-2\u3002 # lineage\u63d0\u53d6SARS-COV-2\u7684\u4e16\u7cfb $ echo \"2697049\" \\ | taxonkit lineage -t -R \\ | sed \"s/\\t/\\n/g\" 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank reformat \u751f\u6210\u6807\u51c6\u5c42\u7ea7\u7269\u79cd\u6ce8\u91ca \u6709\u65f6\u5019\uff0c\u6211\u4eec\u5e76\u4e0d\u9700\u8981\u5b8c\u6574\u7684\u5206\u7c7b\u5b66\u8c31\u7cfb\uff08complete lineage\uff09\uff0c\u56e0\u4e3a\u5f88\u591a\u7ea7\u522b\u65e2\u4e0d\u5e38\u7528\uff0c\u800c\u4e14\u4e0d\u5b8c\u6574\u3002\u901a\u5e38\u53ea\u60f3\u4fdd\u7559\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c \u4e0d\u662f\u6240\u6709\u7269\u79cd\u90fd\u6709\u5b8c\u6574\u7684\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u6c34\u5e73\uff0c\u7279\u522b\u662f\u75c5\u6bd2\u4ee5\u53ca\u4e00\u4e9b\u73af\u5883\u6837\u54c1 \u3002 
TaxonKit\u53ef\u4ee5\u7528\u81ea\u5b9a\u4e49\u5185\u5bb9\u66ff\u4ee3\u7f3a\u5931\u7684\u5206\u7c7b\u5355\u5143\uff0c\u5982\u7528\u201c__\u201d\u66ff\u4ee3\u3002 \u66f4\u6709\u7528\u7684\u662f\uff0c TaxonKit\u8fd8\u53ef\u4ee5\u7528\u66f4\u9ad8\u5c42\u7ea7\u7684\u5206\u7c7b\u5355\u5143\u4fe1\u606f\u6765\u8865\u9f50\u7f3a\u5931\u7684\u5c42\u7ea7 ( -F/--fill-miss-rank )\uff0c\u6bd4\u5982 # \u6ca1\u6709genus\u7684\u75c5\u6bd2 $ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y # -F\u53c2\u6570\u4f1a\u7528family\u4fe1\u606f\u6765\u8865\u9f50genus\u4fe1\u606f $ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y \u8f93\u51fa\u683c\u5f0f\u53ef\u9009\u53ea\u8f93\u51fa\u90e8\u5206\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u8fd8\u652f\u6301\u5236\u8868\u7b26\uff08 \"\\t\" \uff09\uff0c\u518d\u914d\u5408\u4f5c\u8005\u7684\u53e6\u4e00\u4e2a\u5de5\u5177csvtk\uff0c\u53ef\u4ee5\u8f93\u51fa\u6f02\u4eae\u7684\u7ed3\u679c\u3002 \u5176\u5b83\u6709\u7528\u7684\u9009\u9879\uff1a -P/--add-prefix \uff1a\u7ed9\u6bcf\u4e2a\u5206\u7c7b\u5b66\u6c34\u5e73\u6dfb\u52a0\u524d\u7f00\uff0c\u6bd4\u5982 s__species \u3002 -t/--show-lineage-taxids \uff1a\u8f93\u51fa\u5206\u7c7b\u5b66\u5355\u5143\u5bf9\u5e94\u7684TaxID\u3002 -r/--miss-rank-repl : \u66ff\u4ee3\u6ca1\u6709\u5bf9\u5e94rank\u7684taxon\u540d\u79f0 -S/--pseudo-strain : \u5bf9\u4e8e\u4f4e\u4e8especies\u4e14rank\u65e2\u4e0d\u662fsubspecies\u4e5f\u4e0d\u662fstrain\u7684taxid\uff0c\u4f7f\u7528\u6c34\u5e73\u6700\u4f4etaxon\u540d\u79f0\u4f5c\u4e3a\u83cc\u682a\u540d\u79f0\u3002 \u4f8b\uff0c $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n 
taxid,kingdom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kingdom phylum class order family genus species 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y # \u4fbf\u4e8e\u5c0f\u5c4f\u5e55\u67e5\u770b\uff0c\u7528csvtk\u8fdb\u884c\u8f6c\u7f6e $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 349741 1327037 kingdom k__Bacteria k__Viruses phylum p__Verrucomicrobia p__Uroviricota class c__Verrucomicrobiae c__Caudoviricetes order o__Verrucomicrobiales o__Caudovirales family f__Akkermansiaceae f__Siphoviridae genus g__Akkermansia g__unclassified Siphoviridae genus species s__Akkermansia muciniphila s__Croceibacter phage P2559Y # \u5230\u682a\u6c34\u5e73\uff0c\u4ee5sars-cov-2\u4e3a\u4f8b $ echo -ne \"2697049\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" -F -P -S \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species,strain \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 kingdom k__Viruses phylum p__Pisuviricota class c__Pisoniviricetes order o__Nidovirales family f__Coronaviridae genus g__Betacoronavirus species s__Severe acute respiratory syndrome-related coronavirus strain t__Severe acute respiratory syndrome coronavirus 2 name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID\u975e\u5e38\u5bb9\u6613\u7406\u89e3\uff0c\u552f\u4e00\u8981\u6ce8\u610f\u7684\u662f 
\u67d0\u4e9bTaxId\u5bf9\u5e94\u76f8\u540c\u7684\u540d\u79f0 \uff0c\u6bd4\u5982 # -i\u6307\u5b9a\u5217\uff0c-r\u663e\u793a\u7ea7\u522b\uff0c-L\u4e0d\u663e\u793a\u4e16\u7cfb $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus \u83b7\u53d6TaxID\u4e4b\u540e\uff0c\u53ef\u4ee5\u7acb\u5373\u4f20\u7ed9taxonkit\u8fdb\u884c\u540e\u7eed\u64cd\u4f5c\uff0c\u4f46\u8981\u6ce8\u610f\u7528 -i \u6307\u5b9aTaxId\u6240\u5728\u5217\u3002 filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs filter\u53ef\u4ee5\u6309 \u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4 \u8fc7\u6ee4TaxIDs\uff0c\u6ce8\u610f\uff0c\u4e0d\u4ec5\u4ec5\u662f\u7279\u5b9a\u7684Rank\uff0c\u800c\u662f\u4e00\u4e2a \u8303\u56f4 \u3002 \u6bd4\u5982genus\u53ca\u4ee5\u4e0b\u7684\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u7528 -L genus -E genus \uff0c\u7c7b\u4f3c\u4e8e <= genus \u3002 $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA) \u6bd4\u5982\u4eba\u5c5e\u7684\u4f8b\u5b50 $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 
'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample TaxID\u7684\u5206\u9694\u7b26\u53ef\u7528 -s/--separater \u6307\u5b9a\uff0c\u9ed8\u8ba4\u4e3a\" \"\u3002 # \u8ba1\u7b97\u4e24\u4e2a\u7269\u79cd\u7684\u6700\u8fd1\u5171\u540c\u7956\u5148\uff0c\u4ee5\u4e0a\u9762\u5c3c\u5b89\u5fb7\u7279\u4eba\u4e9a\u79cd\u548c\u6d77\u5fb7\u5821\u4eba\u79cd $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 # \u5176\u5b83\u5206\u9694\u7b26\uff0c\u4e14\u4e0d\u5c0f\u5fc3\u591a\u4e86\u7a7a\u683c $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606 TaxID changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55 NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002 TaxonKit\u53ef\u4ee5\u8ffd\u8e2a\u6240\u6709TaxID\u6bcf\u4e2a\u6708\u7684\u53d8\u5316\uff0c\u8f93\u51fa\u5230csv\u6587\u4ef6\u4e2d\uff0c\u53ef\u4ee5\u901a\u8fc7\u547d\u4ee4\u884c\u5de5\u5177\u8fdb\u884c\u67e5\u8be2\u3002 \u6570\u636e\u548c\u6587\u6863\u5355\u72ec\u6258\u7ba1\u5728 https://github.com/shenwei356/taxid-changelog \u3002 \u9664\u4e86\u7b80\u5355\u7684\u589e\u52a0\u3001\u5220\u9664\u3001\u5408\u5e76\u4e4b\u5916\uff0c\u4f5c\u8005\u5c06TaxID\u6539\u53d8\u505a\u4e86\u7ec6\u5206\u3002\u8f93\u51fa\u683c\u5f0f\u5982\u4e0b # \u5217 \u5907\u6ce8 taxid # taxid version # version / time of archive, e.g, 2019-07-01 change # change, values: # NEW \u65b0\u589e # REUSE_DEL \u524d\u671f\u88ab\u5220\u9664\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # REUSE_MER 
\u524d\u671f\u88ab\u5408\u5e76\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # DELETE \u5220\u9664 # MERGE \u5408\u5e76\u5230\u53e6\u4e00\u4e2aTaxID # ABSORB \u5176\u4ed6TaxID\u5408\u5e76\u5230\u5f53\u524dTaxID # CHANGE_NAME \u540d\u79f0\u6539\u53d8 # CHANGE_RANK \u5206\u7c7b\u5b66\u6c34\u5e73\u6539\u53d8 # CHANGE_LIN_LIN \u8c31\u7cfb\u7684TaxID\u6ca1\u6709\u53d8\u5316\uff0c\u8c31\u7cfb\u6539\u53d8\uff08\u67d0\u4e9bTaxID\u7684\u540d\u79f0\u53d8\u4e86\uff09 # CHANGE_LIN_TAX \u8c31\u7cfb\u7684TaxID\u6539\u53d8 # CHANGE_LIN_LEN \u8c31\u7cfb\u7684\u957f\u5ea6/\u6df1\u5ea6\u53d1\u751f\u53d8\u5316 change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage \u6570\u636e\u6587\u4ef6\u53ef\u4ee5\u5728\u524d\u9762\u7f51\u7ad9\u4e0a\u4e0b\u8f7d\uff0c taxid-changelog.csv.gz \uff0c130M\u5de6\u53f3\uff0c\u89e3\u538b\u540e2.2G\uff0c\u56e0\u4e3a\u662fgzip\u683c\u5f0f\uff0c\u5b8c\u5168\u4e0d\u9700\u8981\u89e3\u538b\u5373\u53ef\u5206\u6790\u3002 \u4e0b\u6587\u4f7f\u7528\u4e86 pigz \u4ee3\u66ff zcat \u548c gzip \u63d0\u9ad8\u89e3\u538b\u901f\u5ea6\u3002 \u4f8b1 superkingdom\u4e5f\u80fd\u6d88\u5931 \uff0c\u6bd4\u5982\u7c7b\u75c5\u6bd2(Viroids)\u57282019\u5e745\u6708\u88ab\u5220\u9664\u4e86\u3002 \u4f5c\u8005\u662f\u5728\u67d0\u4e00\u5929\u65e0\u610f\u4e2d\u53d1\u73b0\u6b64\u4e8b\uff0c\u6240\u4ee5\u51b3\u5b9a\u5228\u6839\u95ee\u5e95\uff0c\u5f00\u53d1\u4e86\u8fd9\u4e2a\u5b50\u547d\u4ee4\u3002 # \u4e0b\u8f7d wget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz # \u5b89\u88c5\u591a\u7ebf\u7a0b\u89e3\u538b\u7d22\u8f6f\u4ef6\u3002\u6216\u8005\u7528gzip\u66ff\u6362\u3002 conda install pigz $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f rank -p superkingdom \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 2 2014-08-01 NEW 
Bacteria superkingdom cellular organisms;Bacteria 131567;2 2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157 2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759 10239 2014-08-01 NEW Viruses superkingdom Viruses 10239 12884 2014-08-01 NEW Viroids superkingdom Viroids 12884 12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884 \u4f8b2 SARS-CoV-2 \u3002\u53ef\u89c1\u65b0\u51a0\u75c5\u6bd2\u57282020\u5e742\u6708\u52a0\u5165\uff0c\u968f\u540e3\u6708\u548c6\u6708\u4efd\u6539\u4e86\u540d\u79f0\uff0c\u8c31\u7cfb\u7b49\u4fe1\u606f\u3002\u67e5\u8be2\u901f\u5ea6\u4e5f\u5f88\u5feb\u3002 # \u672c\u4f8b\u5b50\u53ea\u663e\u793a\u4e86\u90e8\u5206\u5217\u3002 $ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk cut -f version,change,name,rank \\ | csvtk pretty version change name rank 2020-02-01 NEW Wuhan seafood market pneumonia virus species 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank real 0m7.644s user 0m16.749s sys 0m3.985s \u66f4\u591a\u6709\u610f\u601d\u7684\u53d1\u73b0\u8be6\u89c1 taxid-changelog","title":"\u4e2d\u6587\u4ecb\u7ecd"},{"location":"chinese/#taxonkit-ncbi","text":"","title":"TaxonKit: \u5c0f\u5de7\u3001\u9ad8\u6548\u3001\u5b9e\u7528\u7684NCBI\u5206\u7c7b\u5b66\u6570\u636e\u547d\u4ee4\u884c\u5de5\u5177\u96c6"},{"location":"chinese/#ncbi-taxonomy","text":"\u4ece\u4e8b\u751f\u7269\u591a\u6837\u6027\u7684\u7814\u7a76\u8005\u5bf9NCBI Taxonomy\u6570\u636e\u5e93\u4e00\u5b9a\u4e0d\u4f1a\u964c\u751f\uff0c 
\u5b83\u5305\u542b\u4e86NCBI\u6240\u6709\u6838\u9178\u548c\u86cb\u767d\u5e8f\u5217\u6570\u636e\u5e93\u4e2d\u6bcf\u6761\u5e8f\u5217\u5bf9\u5e94\u7684\u7269\u79cd\u540d\u79f0\u4e0e\u5206\u7c7b\u5b66\u4fe1\u606f\u3002 \u5927\u591a\u6570\u751f\u6001\u5b66\u7814\u7a76\u5bf9\u7269\u79cd\u7ec4\u6210\u7684\u63cf\u8ff0\u90fd\u662f\u57fa\u4e8eNCBI Taxonomy\u6570\u636e\u5e93\uff0c \u5f53\u7136\u76ee\u524d\u4e5f\u5f00\u59cb\u4f7f\u7528\u5176\u4ed6\u6570\u636e\u5e93\uff0c\u5982GTDB\u7b49\u3002 NCBI Taxonomy\u6570\u636e\u5e93\u59cb\u4e8e1991\u5e74\uff0c\u4e00\u76f4\u968f\u7740Entrez\u6570\u636e\u5e93\u548c\u5176\u4ed6\u6570\u636e\u5e93\u66f4\u65b0\uff0c 1996\u5e74\u63a8\u51fa\u7f51\u9875\u7248\u3002NCBI Taxonomy\u6570\u636e\u5e93\u5b98\u65b9\u5730\u5740\u4e3a https://www.ncbi.nlm.nih.gov/taxonomy \uff0c \u516c\u5f00\u6570\u636e\u4e0b\u8f7d\u5730\u5740\u4e3a https://ftp.ncbi.nih.gov/pub/taxonomy/ \uff0c \u6570\u636e\u6bcf\u5c0f\u65f6\u66f4\u65b0\uff0c\u6bcf\u4e2a\u6708\u521d\u751f\u6210\u4e00\u4efd\u6570\u636e\u5f52\u6863\u5b58\u4e8e taxdump_archive \u76ee\u5f55\uff0c\u6700\u65e9\u53ef\u8ffd\u6eaf\u52302014\u5e748\u6708\u3002","title":"NCBI Taxonomy \u6570\u636e\u5e93"},{"location":"chinese/#taxonkit","text":"TaxonKit\u662f\u91c7\u7528Go\u8bed\u8a00\u7f16\u5199\u7684\u547d\u4ee4\u884c\u5de5\u5177\uff0c \u63d0\u4f9bLinux, Windows, macOS\u64cd\u4f5c\u7cfb\u7edf\u4e0d\u540c\u67b6\u6784\uff08x86-64/arm64\uff09\u7684\u9759\u6001\u7f16\u8bd1\u7684\u53ef\u6267\u884c\u4e8c\u8fdb\u5236\u6587\u4ef6\u3002 \u53d1\u5e03\u7684\u538b\u7f29\u5305\u4e0d\u8db33Mb\uff0c\u9664\u4e86Github\u6258\u7ba1\u5916\uff0c\u8fd8\u63d0\u4f9b\u56fd\u5185\u955c\u50cf\u4f9b\u4e0b\u8f7d\uff0c\u540c\u65f6\u8fd8\u652f\u6301conda\u548chomebrew\u5b89\u88c5\u3002 \u7528\u6237\u53ea\u9700\u8981 \u4e0b\u8f7d\u3001\u89e3\u538b\uff0c\u5f00\u7bb1\u5373\u7528\uff0c\u65e0\u9700\u914d\u7f6e \uff0c\u4ec5\u9700\u4e0b\u8f7d\u89e3\u538bNCBI Taxonomy\u6570\u636e\u6587\u4ef6\u89e3\u538b\u5230\u6307\u5b9a\u76ee\u5f55\u5373\u53ef\u3002 
\u6e90\u4ee3\u7801 https://github.com/shenwei356/taxonkit \uff0c \u6587\u6863 http://bioinf.shenwei.me/taxonkit \uff08\u4ecb\u7ecd\u3001\u4f7f\u7528\u8bf4\u660e\u3001\u4f8b\u5b50\u3001\u6559\u7a0b\uff09 \u9009\u62e9\u7cfb\u7edf\u5bf9\u5e94\u7684\u7248\u672c\u4e0b\u8f7d\u6700\u65b0\u7248 https://github.com/shenwei356/taxonkit/releases \uff0c\u89e3\u538b\u540e\u6dfb\u52a0\u73af\u5883\u53d8\u91cf\u5373\u53ef\u4f7f\u7528\u3002\u6216\u53ef\u9009conda\u5b89\u88c5 conda install taxonkit -c bioconda -y # \u8868\u683c\u6570\u636e\u5904\u7406\uff0c\u63a8\u8350\u4f7f\u7528 csvtk \u66f4\u9ad8\u6548 conda install csvtk -c bioconda -y \u6d4b\u8bd5\u6570\u636e\u4e0b\u8f7d\u53ef\u76f4\u63a5 https://github.com/shenwei356/taxonkit \u4e0b\u8f7d\u9879\u76ee\u538b\u7f29\u5305\uff0c\u6216\u4f7f\u7528git clone\u4e0b\u8f7d\u9879\u76ee\u6587\u4ef6\u5939\uff0c\u5176\u4e2d\u7684example\u4e3a\u6d4b\u8bd5\u6570\u636e git clone https://github.com/shenwei356/taxonkit TaxonKit\u4e3a\u547d\u4ee4\u884c\u5de5\u5177\uff0c\u91c7\u7528\u5b50\u547d\u4ee4\u7684\u65b9\u5f0f\u6765\u6267\u884c\u4e0d\u540c\u529f\u80fd\uff0c \u5927\u591a\u6570\u5b50\u547d\u4ee4\u652f\u6301\u6807\u51c6\u8f93\u5165/\u8f93\u51fa\uff0c\u4fbf\u4e8e\u4f7f\u7528\u547d\u4ee4\u884c\u7ba1\u9053\u8fdb\u884c\u6d41\u6c34\u4f5c\u4e1a\uff0c \u8f7b\u677e\u6574\u5408\u8fdb\u5206\u6790\u6d41\u7a0b\u4e2d\u3002 \u5b50\u547d\u4ee4 \u529f\u80fd list \u5217\u51fa\u6307\u5b9aTaxId\u4e0b\u6240\u6709\u5b50\u5355\u5143\u7684TaxID lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\uff08lineage\uff09 reformat \u5c06\u5b8c\u6574\u8c31\u7cfb\u8f6c\u5316\u4e3a\u201c\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u682a\u201d\u7684\u81ea\u5b9a\u4e49\u683c\u5f0f name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA) taxid-changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55 version 
\u663e\u793a\u7248\u672c\u4fe1\u606f\u3001\u68c0\u6d4b\u65b0\u7248\u672c genautocomplete \u751f\u6210shell\u81ea\u52a8\u8865\u5168\u914d\u7f6e\u811a\u672c \u5907\u6ce8\uff1a \u8f93\u51fa\uff1a \u6240\u6709\u547d\u4ee4\u8f93\u51fa\u4e2d\u5305\u542b\u8f93\u5165\u6570\u636e\u5185\u5bb9\uff0c\u5728\u6b64\u57fa\u7840\u4e0a\u589e\u52a0\u5217\u3002 \u6240\u6709\u547d\u4ee4\u9ed8\u8ba4\u8f93\u51fa\u5230\u6807\u51c6\u8f93\u51fa\uff08stdout\uff09\uff0c\u53ef\u901a\u8fc7\u91cd\u5b9a\u5411\uff08 > \uff09\u5199\u5165\u6587\u4ef6\u3002 \u6216\u901a\u8fc7\u5168\u5c40\u53c2\u6570 -o \u6216 --out-file \u6307\u5b9a\u8f93\u51fa\u6587\u4ef6\uff0c\u4e14\u53ef\u81ea\u52a8\u8bc6\u522b\u8f93\u51fa\u6587\u4ef6\u540e\u7f00\uff08 .gz \uff09\u8f93\u51fagzip\u683c\u5f0f\u3002 \u8f93\u5165\uff1a \u9664\u4e86 list \u4e0e taxid-changelog \u4e4b\u5916\uff0c lineage , reformat , name2taxid , filter \u4e0e lca \u5747\u53ef\u4ece\u6807\u51c6\u8f93\u5165\uff08stdin\uff09\u8bfb\u53d6\u8f93\u5165\u6570\u636e\uff0c\u4e5f\u53ef\u901a\u8fc7\u4f4d\u7f6e\u53c2\u6570\uff08positional arguments\uff09\u8f93\u5165\uff0c\u5373\u547d\u4ee4\u540e\u9762\u4e0d\u5e26 \u4efb\u4f55flag\u7684\u53c2\u6570\uff0c\u5982 taxonkit lineage taxids.txt \u8f93\u5165\u683c\u5f0f\u4e3a\u5355\u5217\uff0c\u6216\u8005\u5236\u8868\u7b26\u5206\u9694\u7684\u683c\u5f0f\uff0c\u8f93\u5165\u6570\u636e\u6240\u5728\u5217\u7528 -i \u6216 --taxid-field \u6307\u5b9a\u3002 TaxonKit\u76f4\u63a5\u89e3\u6790NCBI Taxonomy\u6570\u636e\u6587\u4ef6\uff082\u79d2\u5de6\u53f3\uff09\uff0c\u914d\u7f6e\u66f4\u5bb9\u6613\uff0c\u4e5f\u4fbf\u4e8e\u66f4\u65b0\u6570\u636e\uff0c\u5360\u7528\u5185\u5b58\u5728500Mb-1.5G\u5de6\u53f3\u3002 \u6570\u636e\u4e0b\u8f7d\uff1a # \u6709\u65f6\u4e0b\u8f7d\u5931\u8d25\uff0c\u53ef\u591a\u8bd5\u51e0\u6b21\uff1b\u6216\u5c1d\u8bd5\u6d4f\u89c8\u5668\u4e0b\u8f7d\u6b64\u94fe\u63a5 wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz # 
\u89e3\u538b\u6587\u4ef6\u5b58\u4e8e\u5bb6\u76ee\u5f55\u4e2d.taxonkit/\uff0c\u7a0b\u5e8f\u9ed8\u8ba4\u6570\u636e\u5e93\u9ed8\u8ba4\u76ee\u5f55 mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit","title":"TaxonKit \u4f7f\u7528"},{"location":"chinese/#list-taxidtaxid","text":"taxonkit list \u7528\u4e8e\u5217\u51fa\u6307\u5b9aTaxID\u6240\u5728\u5206\u7c7b\u5b66\u5355\u5143\uff08taxon\uff09\u7684\u5b50\u6811\uff08subtree\uff09\u7684\u6240\u6709taxon\u7684TaxID\uff0c\u53ef\u9009\u663e\u793a\u540d\u79f0\u548c\u5206\u7c7b\u5b66\u6c34\u5e73\u3002 \u6b64\u529f\u80fd\u4e0eNCBI Taxonomy\u7f51\u9875\u7248\u7c7b\u4f3c\u3002 \u5982\uff0c # \u4ee5\u4eba\u5c5e(9605)\u548c\u80a0\u9053\u4e2d\u8457\u540d\u7684Akk\u83cc\u5c5e(239934)\u4e3a\u4f8b $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 ... 
list\u4f7f\u7528\u6700\u5e7f\u6cdb\u7684\u529f\u80fd\u662f\u83b7\u53d6\u67d0\u4e2a\u7c7b\u522b\uff08\u6bd4\u5982\u7ec6\u83cc\u3001\u75c5\u6bd2\u3001\u67d0\u4e2a\u5c5e\u7b49\uff09\u4e0b\u6240\u6709\u7684TaxID\uff0c \u7528\u6765\u4eceNCBI nt/nr\u4e2d\u83b7\u53d6\u5bf9\u5e94\u7684\u6838\u9178/\u86cb\u767d\u5e8f\u5217\uff0c\u4ece\u800c\u642d\u5efa\u7279\u5f02\u6027\u7684BLAST\u6570\u636e\u5e93\u3002 \u5b98\u7f51\u63d0\u4f9b\u4e86\u76f8\u5e94\u7684\u8be6\u7ec6\u6b65\u9aa4\uff1a http://bioinf.shenwei.me/taxonkit/tutorial \u3002 # \u6240\u6709\u7ec6\u83cc\u7684TaxID $ taxonkit list --show-rank --show-name --ids 2 > /dev/null","title":"list \u5217\u51fa\u6307\u5b9aTaxId\u6240\u5728\u5b50\u6811\u7684\u6240\u6709TaxID"},{"location":"chinese/#lineage-taxid","text":"\u5206\u7c7b\u5b66\u6570\u636e\u76f8\u5173\u6700\u5e38\u89c1\u7684\u529f\u80fd\u5c31\u662f\u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb\u3002 TaxonKit\u53ef\u6839\u636e\u8f93\u5165\u6587\u4ef6\u63d0\u4f9b\u7684TaxID\u5217\u8868\u5feb\u901f\u8ba1\u7b97lineage\uff0c\u5e76\u53ef\u9009\u63d0\u4f9b\u540d\u79f0\uff0c\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u4ee5\u53ca\u8c31\u7cfb\u5bf9\u5e94\u7684TaxID\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u968f\u7740Taxonomy\u6570\u636e\u7684\u9891\u7e41\u66f4\u65b0\uff0c\u6709\u7684TaxID\u53ef\u80fd\u88ab\u5220\u9664\u3001\u6216\u5408\u5e76\uff08merge\uff09\u5230\u5176\u5b83TaxID\u4e2d\uff0c TaxonKit\u4f1a\u81ea\u52a8\u8bc6\u522b\uff0c\u5e76\u8fdb\u884c\u63d0\u793a\uff0c\u5bf9\u4e8e\u88ab\u5408\u5e76\u7684TaxID\uff0cTaxonKit\u4f1a\u6309\u65b0TaxID\u8fdb\u884c\u8ba1\u7b97\u3002 # \u4f7f\u7528example\u4e2d\u7684\u6d4b\u8bd5\u6570\u636e $ head taxids.txt 9606 9913 376619 # \u67e5\u627e\u6307\u5b9ataxids\u5217\u8868\u7684\u7269\u79cd\u4fe1\u606f\uff0ctee\u53ef\u8f93\u51fa\u5c4f\u5e55\u5e76\u5199\u5165\u6587\u4ef6 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 
1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular 
organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei \u4e0e\u5176\u5b83\u8f6f\u4ef6\u7684\u6027\u80fd\u76f8\u6bd4\uff0c\u5f53\u67e5\u8be2\u6570\u91cf\u8f83\u5c11\u65f6ETE\u8f83\u5feb\uff0c\u6570\u91cf\u8f83\u591a\u65f6\u5219TaxonKit\u66f4\u5feb\u3002 \u5728\u4e0d\u540c\u6570\u636e\u91cf\u89c4\u6a21\u4e0a TaxonKit\u901f\u5ea6\u4e00\u76f4\u5f88\u7a33\u5b9a\uff0c\u5747\u4e3a2-3\u79d2\uff0c\u65f6\u95f4\u4e3b\u8981\u82b1\u5728\u89e3\u6790Taxonomy\u6570\u636e\u6587\u4ef6\u4e0a\u3002 \u5217\u51falineage\u6bcf\u4e2a\u5206\u7c7b\u5b66\u5355\u5143\u7684TaxId\u548crank\u548c\u540d\u79f0\uff0c\u6bd4\u5982SARS-COV-2\u3002 # lineage\u63d0\u53d6SARS-COV-2\u7684\u4e16\u7cfb $ echo \"2697049\" \\ | taxonkit lineage -t -R \\ | sed \"s/\\t/\\n/g\" 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank","title":"lineage \u6839\u636eTaxID\u83b7\u53d6\u5b8c\u6574\u8c31\u7cfb"},{"location":"chinese/#reformat","text":"\u6709\u65f6\u5019\uff0c\u6211\u4eec\u5e76\u4e0d\u9700\u8981\u5b8c\u6574\u7684\u5206\u7c7b\u5b66\u8c31\u7cfb\uff08complete lineage\uff09\uff0c\u56e0\u4e3a\u5f88\u591a\u7ea7\u522b\u65e2\u4e0d\u5e38\u7528\uff0c\u800c\u4e14\u4e0d\u5b8c\u6574\u3002\u901a\u5e38\u53ea\u60f3\u4fdd\u7559\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u3002 \u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c 
\u4e0d\u662f\u6240\u6709\u7269\u79cd\u90fd\u6709\u5b8c\u6574\u7684\u754c\u95e8\u7eb2\u76ee\u79d1\u5c5e\u79cd\u6c34\u5e73\uff0c\u7279\u522b\u662f\u75c5\u6bd2\u4ee5\u53ca\u4e00\u4e9b\u73af\u5883\u6837\u54c1 \u3002 TaxonKit\u53ef\u4ee5\u7528\u81ea\u5b9a\u4e49\u5185\u5bb9\u66ff\u4ee3\u7f3a\u5931\u7684\u5206\u7c7b\u5355\u5143\uff0c\u5982\u7528\u201c__\u201d\u66ff\u4ee3\u3002 \u66f4\u6709\u7528\u7684\u662f\uff0c TaxonKit\u8fd8\u53ef\u4ee5\u7528\u66f4\u9ad8\u5c42\u7ea7\u7684\u5206\u7c7b\u5355\u5143\u4fe1\u606f\u6765\u8865\u9f50\u7f3a\u5931\u7684\u5c42\u7ea7 ( -F/--fill-miss-rank )\uff0c\u6bd4\u5982 # \u6ca1\u6709genus\u7684\u75c5\u6bd2 $ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y # -F\u53c2\u6570\u4f1a\u7528family\u4fe1\u606f\u6765\u8865\u9f50genus\u4fe1\u606f $ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y \u8f93\u51fa\u683c\u5f0f\u53ef\u9009\u53ea\u8f93\u51fa\u90e8\u5206\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u8fd8\u652f\u6301\u5236\u8868\u7b26\uff08 \"\\t\" \uff09\uff0c\u518d\u914d\u5408\u4f5c\u8005\u7684\u53e6\u4e00\u4e2a\u5de5\u5177csvtk\uff0c\u53ef\u4ee5\u8f93\u51fa\u6f02\u4eae\u7684\u7ed3\u679c\u3002 \u5176\u5b83\u6709\u7528\u7684\u9009\u9879\uff1a -P/--add-prefix \uff1a\u7ed9\u6bcf\u4e2a\u5206\u7c7b\u5b66\u6c34\u5e73\u6dfb\u52a0\u524d\u7f00\uff0c\u6bd4\u5982 s__species \u3002 -t/--show-lineage-taxids \uff1a\u8f93\u51fa\u5206\u7c7b\u5b66\u5355\u5143\u5bf9\u5e94\u7684TaxID\u3002 -r/--miss-rank-repl : \u66ff\u4ee3\u6ca1\u6709\u5bf9\u5e94rank\u7684taxon\u540d\u79f0 -S/--pseudo-strain : \u5bf9\u4e8e\u4f4e\u4e8especies\u4e14rank\u65e2\u4e0d\u662fsubspecies\u4e5f\u4e0d\u662fstrain\u7684taxid\uff0c\u4f7f\u7528\u6c34\u5e73\u6700\u4f4etaxon\u540d\u79f0\u4f5c\u4e3a\u83cc\u682a\u540d\u79f0\u3002 
\u4f8b\uff0c $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kingdom phylum class order family genus species 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y # \u4fbf\u4e8e\u5c0f\u5c4f\u5e55\u67e5\u770b\uff0c\u7528csvtk\u8fdb\u884c\u8f6c\u7f6e $ echo -ne \"349741\\n1327037\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 349741 1327037 kingdom k__Bacteria k__Viruses phylum p__Verrucomicrobia p__Uroviricota class c__Verrucomicrobiae c__Caudoviricetes order o__Verrucomicrobiales o__Caudovirales family f__Akkermansiaceae f__Siphoviridae genus g__Akkermansia g__unclassified Siphoviridae genus species s__Akkermansia muciniphila s__Croceibacter phage P2559Y # \u5230\u682a\u6c34\u5e73\uff0c\u4ee5sars-cov-2\u4e3a\u4f8b $ echo -ne \"2697049\"\\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" -F -P -S \\ | csvtk cut -t -f -2 \\ | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species,strain \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 kingdom k__Viruses phylum p__Pisuviricota class c__Pisoniviricetes order o__Nidovirales family f__Coronaviridae genus g__Betacoronavirus species s__Severe acute respiratory syndrome-related coronavirus strain t__Severe acute respiratory syndrome coronavirus 2","title":"reformat 
\u751f\u6210\u6807\u51c6\u5c42\u7ea7\u7269\u79cd\u6ce8\u91ca"},{"location":"chinese/#name2taxid-taxid","text":"\u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID\u975e\u5e38\u5bb9\u6613\u7406\u89e3\uff0c\u552f\u4e00\u8981\u6ce8\u610f\u7684\u662f \u67d0\u4e9bTaxId\u5bf9\u5e94\u76f8\u540c\u7684\u540d\u79f0 \uff0c\u6bd4\u5982 # -i\u6307\u5b9a\u5217\uff0c-r\u663e\u793a\u7ea7\u522b\uff0c-L\u4e0d\u663e\u793a\u4e16\u7cfb $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus \u83b7\u53d6TaxID\u4e4b\u540e\uff0c\u53ef\u4ee5\u7acb\u5373\u4f20\u7ed9taxonkit\u8fdb\u884c\u540e\u7eed\u64cd\u4f5c\uff0c\u4f46\u8981\u6ce8\u610f\u7528 -i \u6307\u5b9aTaxId\u6240\u5728\u5217\u3002","title":"name2taxid \u5c06\u5206\u7c7b\u5355\u5143\u540d\u79f0\u8f6c\u5316\u4e3aTaxID"},{"location":"chinese/#filter-taxids","text":"filter\u53ef\u4ee5\u6309 \u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4 \u8fc7\u6ee4TaxIDs\uff0c\u6ce8\u610f\uff0c\u4e0d\u4ec5\u4ec5\u662f\u7279\u5b9a\u7684Rank\uff0c\u800c\u662f\u4e00\u4e2a \u8303\u56f4 \u3002 \u6bd4\u5982genus\u53ca\u4ee5\u4e0b\u7684\u5206\u7c7b\u5b66\u6c34\u5e73\uff0c\u7528 -L genus -E genus \uff0c\u7c7b\u4f3c\u4e8e <= genus \u3002 $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835","title":"filter \u6309\u5206\u7c7b\u5b66\u6c34\u5e73\u8303\u56f4\u8fc7\u6ee4TaxIDs"},{"location":"chinese/#lca-lca","text":"\u6bd4\u5982\u4eba\u5c5e\u7684\u4f8b\u5b50 $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 
'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample TaxID\u7684\u5206\u9694\u7b26\u53ef\u7528 -s/--separater \u6307\u5b9a\uff0c\u9ed8\u8ba4\u4e3a\" \"\u3002 # \u8ba1\u7b97\u4e24\u4e2a\u7269\u79cd\u7684\u6700\u8fd1\u5171\u540c\u7956\u5148\uff0c\u4ee5\u4e0a\u9762\u5c3c\u5b89\u5fb7\u7279\u4eba\u4e9a\u79cd\u548c\u6d77\u5fb7\u5821\u4eba\u79cd $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 # \u5176\u5b83\u5206\u9694\u7b26\uff0c\u4e14\u4e0d\u5c0f\u5fc3\u591a\u4e86\u7a7a\u683c $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606","title":"lca \u8ba1\u7b97\u6700\u4f4e\u516c\u5171\u7956\u5148(LCA)"},{"location":"chinese/#taxid-changelog-taxid","text":"NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/ \u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002 TaxonKit\u53ef\u4ee5\u8ffd\u8e2a\u6240\u6709TaxID\u6bcf\u4e2a\u6708\u7684\u53d8\u5316\uff0c\u8f93\u51fa\u5230csv\u6587\u4ef6\u4e2d\uff0c\u53ef\u4ee5\u901a\u8fc7\u547d\u4ee4\u884c\u5de5\u5177\u8fdb\u884c\u67e5\u8be2\u3002 \u6570\u636e\u548c\u6587\u6863\u5355\u72ec\u6258\u7ba1\u5728 https://github.com/shenwei356/taxid-changelog \u3002 \u9664\u4e86\u7b80\u5355\u7684\u589e\u52a0\u3001\u5220\u9664\u3001\u5408\u5e76\u4e4b\u5916\uff0c\u4f5c\u8005\u5c06TaxID\u6539\u53d8\u505a\u4e86\u7ec6\u5206\u3002\u8f93\u51fa\u683c\u5f0f\u5982\u4e0b # \u5217 \u5907\u6ce8 taxid # taxid version # version / time of archive, e.g, 2019-07-01 change # change, values: # NEW \u65b0\u589e # REUSE_DEL 
\u524d\u671f\u88ab\u5220\u9664\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # REUSE_MER \u524d\u671f\u88ab\u5408\u5e76\uff0c\u73b0\u5728\u53c8\u91cd\u65b0\u52a0\u5165 # DELETE \u5220\u9664 # MERGE \u5408\u5e76\u5230\u53e6\u4e00\u4e2aTaxID # ABSORB \u5176\u4ed6TaxID\u5408\u5e76\u5230\u5f53\u524dTaxID # CHANGE_NAME \u540d\u79f0\u6539\u53d8 # CHANGE_RANK \u5206\u7c7b\u5b66\u6c34\u5e73\u6539\u53d8 # CHANGE_LIN_LIN \u8c31\u7cfb\u7684TaxID\u6ca1\u6709\u53d8\u5316\uff0c\u8c31\u7cfb\u6539\u53d8\uff08\u67d0\u4e9bTaxID\u7684\u540d\u79f0\u53d8\u4e86\uff09 # CHANGE_LIN_TAX \u8c31\u7cfb\u7684TaxID\u6539\u53d8 # CHANGE_LIN_LEN \u8c31\u7cfb\u7684\u957f\u5ea6/\u6df1\u5ea6\u53d1\u751f\u53d8\u5316 change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage \u6570\u636e\u6587\u4ef6\u53ef\u4ee5\u5728\u524d\u9762\u7f51\u7ad9\u4e0a\u4e0b\u8f7d\uff0c taxid-changelog.csv.gz \uff0c130M\u5de6\u53f3\uff0c\u89e3\u538b\u540e2.2G\uff0c\u56e0\u4e3a\u662fgzip\u683c\u5f0f\uff0c\u5b8c\u5168\u4e0d\u9700\u8981\u89e3\u538b\u5373\u53ef\u5206\u6790\u3002 \u4e0b\u6587\u4f7f\u7528\u4e86 pigz \u4ee3\u66ff zcat \u548c gzip \u63d0\u9ad8\u89e3\u538b\u901f\u5ea6\u3002 \u4f8b1 superkingdom\u4e5f\u80fd\u6d88\u5931 \uff0c\u6bd4\u5982\u7c7b\u75c5\u6bd2(Viroids)\u57282019\u5e745\u6708\u88ab\u5220\u9664\u4e86\u3002 \u4f5c\u8005\u662f\u5728\u67d0\u4e00\u5929\u65e0\u610f\u4e2d\u53d1\u73b0\u6b64\u4e8b\uff0c\u6240\u4ee5\u51b3\u5b9a\u5228\u6839\u95ee\u5e95\uff0c\u5f00\u53d1\u4e86\u8fd9\u4e2a\u5b50\u547d\u4ee4\u3002 # \u4e0b\u8f7d wget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz # \u5b89\u88c5\u591a\u7ebf\u7a0b\u89e3\u538b\u7d22\u8f6f\u4ef6\u3002\u6216\u8005\u7528gzip\u66ff\u6362\u3002 conda install pigz $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f rank -p superkingdom \\ | csvtk 
pretty taxid version change change-value name rank lineage lineage-taxids 2 2014-08-01 NEW Bacteria superkingdom cellular organisms;Bacteria 131567;2 2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157 2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759 10239 2014-08-01 NEW Viruses superkingdom Viruses 10239 12884 2014-08-01 NEW Viroids superkingdom Viroids 12884 12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884 \u4f8b2 SARS-CoV-2 \u3002\u53ef\u89c1\u65b0\u51a0\u75c5\u6bd2\u57282020\u5e742\u6708\u52a0\u5165\uff0c\u968f\u540e3\u6708\u548c6\u6708\u4efd\u6539\u4e86\u540d\u79f0\uff0c\u8c31\u7cfb\u7b49\u4fe1\u606f\u3002\u67e5\u8be2\u901f\u5ea6\u4e5f\u5f88\u5feb\u3002 # \u672c\u4f8b\u5b50\u53ea\u663e\u793a\u4e86\u90e8\u5206\u5217\u3002 $ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk cut -f version,change,name,rank \\ | csvtk pretty version change name rank 2020-02-01 NEW Wuhan seafood market pneumonia virus species 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank real 0m7.644s user 0m16.749s sys 0m3.985s \u66f4\u591a\u6709\u610f\u601d\u7684\u53d1\u73b0\u8be6\u89c1 taxid-changelog","title":"TaxID changelog \u8ffd\u8e2aTaxID\u53d8\u66f4\u8bb0\u5f55"},{"location":"download/","text":"Download TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page. 
Current Version TaxonKit v0.15.0 taxonkit reformat : For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. #82 The flag -T/--trim also does not add the prefix for missing ranks lower than the current rank. #82 New flag -s/--miss-rank-repl-suffix to set the suffix for estimated taxon names. #85 Please cite Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006 Links Tips run taxonkit version to check update !!! run taxonkit genautocomplete to update Bash completion !!! OS Arch File, \u4e2d\u56fd\u955c\u50cf Download Count Linux 64-bit taxonkit_linux_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf Linux arm64 taxonkit_linux_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS 64-bit taxonkit_darwin_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS arm64 taxonkit_darwin_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf Windows 64-bit taxonkit_windows_amd64.exe.tar.gz , \u4e2d\u56fd\u955c\u50cf Installation Download Page TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page. Method 1: Download binaries (latest stable/dev version) Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 . 
Method 2: Install via conda (latest stable version) conda install -c bioconda taxonkit Method 3: Install via homebrew (may not the lastest version) brew install brewsci/bio/taxonkit Method 4: Compile from source (latest stable/dev version) Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/ Bash-completion Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Dataset Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress them, and overwrite the old ones. Release history TaxonKit v0.14.2 taxonkit filter : fix checking merged/deleted/not-found taxids. #80 taxonkit lca : add a new flag -b/--buffer-size to set the size of the line buffer. #75 fix typos: --separater -> --separator , the former is still available for backward compatibility. taxonkit reformat : output compatible format for TaxIds not found in the database. #79 taxonkit taxid-changelog : support gzip-compressed taxdump files for saving space. #78 TaxonKit v0.14.1 taxonkit reformat : The flag -S/--pseudo-strain does not require -F/--fill-miss-rank now. For taxa of rank >= species, {t} , {S} , and T outputs nothing when using -S/--pseudo-strain . TaxonKit v0.14.0 taxonkit create-taxdump : save taxIds in int32 instead of uint32 , as BLAST and DIAMOND do. #70 taxonkit list : do not skip visited subtrees when some of the given taxids are descendants of others. 
#68 taxonkit : when environment variable TAXONKIT_DB is set, explicitly setting --data-dir will override the value of TAXONKIT_DB . TaxonKit v0.13.0 taxonkit reformat : add a new placeholder {K} for rank kingdom . #64 do not panic for invalid TaxIds, e.g., the column name, when using -I--taxid-field . taxonkit create-taxdump : fix merged.dmp and delnodes.dmp. Thanks to @apcamargo ! gtdb-taxdump/issues/2 . fix bug of handling non-GTDB data when using -A/--field-accession and no rank names given: the colname of the accession column would be treated as one of the ranks, which messed up all the ranks. fix the default option value of --field-accession-re which wrongly remove prefix like Sp_ . #65 taxonkit list : fix warning message of merged taxids. TaxonKit v0.12.0 taxonkit create-taxdump : accepts arbitrary ranks #60 better handle of taxa with same names. many flags changed. TaxonKit v0.11.1 taxonkit create-taxdump : fix bug of missing Class rank, contributed by @apcamargo. The flag --gtdb was not effected. #57 TaxonKit v0.11.0 new command taxonkit create-taxdump : Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV. #56 TaxonKit v0.10.1 taxonkit cami2-filter : fix option --show-rank which did not work in v0.10.0. TaxonKit v0.10.0 new command taxonkit cami2-filter : Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile taxonkit reformat : fix panic for deleted taxid using -F/--fill-miss-rank . #55 TaxonKit v0.9.0 new command taxonkit profile2cami : converting metagenomic profile table to CAMI format TaxonKit v0.8.0 taxonkit reformat : accept input of TaxIds via flag -I/--taxid-field . accept single taxonomy names . show warning message for TaxIds with the same lineage . #42 better flag checking. #40 taxonkit lca : slightly speedup. taxonkit genautocomplete : support bash|zsh|fish/powershell TaxonKit v0.7.2 taxonkit lineage : new flag -R/--show-lineage-ranks for appending ranks of all levels. 
reduce memory occupation and slightly speedup. taxonkit filter : flag -E/--equal-to supports multiple values. new flag -n/--save-predictable-norank : do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff. taxonkit reformat : new placeholder {t} for subspecies/strain , {T} for strain . Thanks @wqssf102 for feedback. new flag -S/--pseudo-strain for using the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subpecies\" nor \"strain\". TaxonKit v0.7.1 taxonkit filter : disable unnecessary stdin check when using flag --list-order or --list-ranks . #36 better handling of black list, empty default value: \"no rank\" and \"clade\". And you need use -N/--discard-noranks to explicitly filter out \"no rank\", \"clade\". #37 update help message. Thanks @standage for improve this command! #38 TaxonKit v0.7.0 taxonkit : 2-3X faster taxonomy data loading . new command taxonkit filter : filtering TaxIds by taxonomic rank range . #32 new command taxonkit lca : Computing lowest common ancestor (LCA) for TaxIds. taxonkit reformat : new flag -P/--add-prefix : add prefixes for all ranks , single prefix for a rank is defined by flag --prefix-X , where X may be k , p , c , o , f , s , S . new flag -T/--trim : do not fill missing rank lower than current rank. taxonkit list : do not duplicate root node. TaxonKit v0.6.2 taxonkit reformat -F : fix taxids of abbreviated lineage containing names shared by different taxids. #35 TaxonKit v0.6.1 taxonkit lineage : new flag -n/--show-name for appending scientific name. new flag -L/--no-lineage for hide lineage, this is for fast retrieving names or/and ranks. taxonkit reformat : fix flag -F/--fill-miss-rank . discard order restriction of rank symbols. TaxonKit v0.6.0 taxonkit list : check merged and deleted taxids. fix bug of json output. 
#30 taxonkit name2taxid : new flag -s/--sci-name for limiting to searching scientific names. #29 taxonkit version : make checking update optional TaxonKit v0.5.0 taxonkit : requiring delnodes.dmp and merged.dmp. taxonkit lineage : detect deleted and merged taxids now. #19 taxonkit list/name2taxid : add short flag -r for --show-rank , -n for --show-name . TaxonKit v0.4.3 taxonkit taxid-changelog : rewrite logic, fix bug and add more change types TaxonKit v0.4.2 taxonkit taxid-changelog : change output of ABSORB , do not merged into one record for changes in different versions. TaxonKit v0.4.1 taxonkit taxid-changelog : add fields: name and rank . and fix sorting bug. detailed lineage change status TaxonKit v0.4.0 new command: taxonkit taxid-changelog : for creating taxid changelog from dump archive TaxonKit v0.3.0 this version is almost the same as v0.2.5 TaxonKit v0.2.5 add global flag: --line-buffered to disable output buffer. #11 replace global flags --names-file and --nodes-file with --data-dir , also support environment variable TAXONKIT_DB . #17 taxonkit reformat : detects lineages containing unofficial taxon name and won't show panic message. taxonkit name2taxid : supports synonyms names. #9 taxokit lineage : add flag -r/--show-rank to print rank at another new column. TaxonKit v0.2.4 taxonkit reformat : more accurate result when using flag -F/--fill-miss-rank to estimate and fill missing rank with original lineage information supporting escape strings like \\t , \\n , #5 outputting corresponding taxids for reformated lineage. #8 taxonkit lineage : fix bug for taxid 1 #7 add flag -d/--delimiter . TaxonKit v0.2.3 fix bug brought in v0.2.1 TaxonKit v0.2.2 make verbose information optional #4 TaxonKit v0.2.1 taxonkit list : fix bug of no output for leaf nodes of the taxonomic tree. #4 add new command genautocomplete to generate shell autocompletion script! TaxonKit v0.2.0 add command name2taxid to query taxid by taxon scientific name. 
lineage , reformat : changed flags and default operations , check the usage . TaxonKit v0.1.8 taxonkit lineage , add an extra column of lineage in Taxid. #3 . e.g., fix colorful output in windows. TaxonKit v0.1.7 taxonkit reformat : supports reading stdin from output of taxonkit lineage , reformated lineages are appended to input data. TaxonKit v0.1.6 remove flag -f/--formated-rank from taxonkit lineage , using taxonkit reformat can archieve same result. TaxonKit v0.1.5 reorganize code and flags TaxonKit v0.1.4 add flag --fill for taxonkit reformat , which estimates and fills missing rank with original lineage information TaxonKit v0.1.3 add command of taxonkit reformat which reformats full lineage to custom format TaxonKit v0.1.2 add command of taxonkit lineage , users can query lineage of given taxon IDs from file TaxonKit v0.1.1 add feature of taxonkit list , users can choose output in readable JSON format by flag --json so the taxonomy tree could be collapse and uncollapse in modern text editor. TaxonKit v0.1 first release","title":"Download"},{"location":"download/#download","text":"TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.","title":"Download"},{"location":"download/#current-version","text":"TaxonKit v0.15.0 taxonkit reformat : For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. 
#82 The flag -T/--trim also does not add the prefix for missing ranks lower than the current rank. #82 New flag -s/--miss-rank-repl-suffix to set the suffix for estimated taxon names. #85","title":"Current Version"},{"location":"download/#please-cite","text":"Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006","title":"Please cite"},{"location":"download/#links","text":"Tips run taxonkit version to check update !!! run taxonkit genautocomplete to update Bash completion !!! 
OS Arch File, \u4e2d\u56fd\u955c\u50cf Download Count Linux 64-bit taxonkit_linux_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf Linux arm64 taxonkit_linux_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS 64-bit taxonkit_darwin_amd64.tar.gz , \u4e2d\u56fd\u955c\u50cf macOS arm64 taxonkit_darwin_arm64.tar.gz , \u4e2d\u56fd\u955c\u50cf Windows 64-bit taxonkit_windows_amd64.exe.tar.gz , \u4e2d\u56fd\u955c\u50cf","title":"Links"},{"location":"download/#installation","text":"Download Page TaxonKit is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.","title":"Installation"},{"location":"download/#method-1-download-binaries-latest-stabledev-version","text":"Just download compressed executable file of your operating system, and uncompress it with tar -zxvf *.tar.gz command or other tools. And then: For Linux-like systems If you have root privilege simply copy it to /usr/local/bin : sudo cp taxonkit /usr/local/bin/ Or copy to anywhere in the environment variable PATH : mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/ For windows , just copy taxonkit.exe to C:\\WINDOWS\\system32 .","title":"Method 1: Download binaries (latest stable/dev version)"},{"location":"download/#method-2-install-via-conda-latest-stable-version","text":"conda install -c bioconda taxonkit","title":"Method 2: Install via conda (latest stable version)"},{"location":"download/#method-3-install-via-homebrew-may-not-the-lastest-version","text":"brew install brewsci/bio/taxonkit","title":"Method 3: Install via homebrew (may not the lastest version)"},{"location":"download/#method-4-compile-from-source-latest-stabledev-version","text":"Install go wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/ # or # echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile TaxonKit # ------------- the latest stable version ------------- go get -v -u 
github.com/shenwei356/taxonkit/taxonkit # The executable binary file is located in: # ~/go/bin/taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/taxonkit $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/taxonkit cd taxonkit/taxonkit/ go build # The executable binary file is located in: # ./taxonkit # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./taxonkit $HOME/bin/","title":"Method 4: Compile from source (latest stable/dev version)"},{"location":"download/#bash-completion","text":"Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish","title":"Bash-completion"},{"location":"download/#dataset","text":"Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . 
All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress them, and overwrite the old ones.","title":"Dataset"},{"location":"download/#release-history","text":"TaxonKit v0.14.2 taxonkit filter : fix checking merged/deleted/not-found taxids. #80 taxonkit lca : add a new flag -b/--buffer-size to set the size of the line buffer. #75 fix typos: --separater -> --separator , the former is still available for backward compatibility. taxonkit reformat : output compatible format for TaxIds not found in the database. #79 taxonkit taxid-changelog : support gzip-compressed taxdump files for saving space. #78 TaxonKit v0.14.1 taxonkit reformat : The flag -S/--pseudo-strain does not require -F/--fill-miss-rank now. For taxa of rank >= species, {t} , {S} , and T outputs nothing when using -S/--pseudo-strain . TaxonKit v0.14.0 taxonkit create-taxdump : save taxIds in int32 instead of uint32 , as BLAST and DIAMOND do. #70 taxonkit list : do not skip visited subtrees when some of the given taxids are descendants of others. #68 taxonkit : when environment variable TAXONKIT_DB is set, explicitly setting --data-dir will override the value of TAXONKIT_DB . TaxonKit v0.13.0 taxonkit reformat : add a new placeholder {K} for rank kingdom . #64 do not panic for invalid TaxIds, e.g., the column name, when using -I--taxid-field . taxonkit create-taxdump : fix merged.dmp and delnodes.dmp. Thanks to @apcamargo ! gtdb-taxdump/issues/2 . fix bug of handling non-GTDB data when using -A/--field-accession and no rank names given: the colname of the accession column would be treated as one of the ranks, which messed up all the ranks. fix the default option value of --field-accession-re which wrongly remove prefix like Sp_ . #65 taxonkit list : fix warning message of merged taxids. 
TaxonKit v0.12.0 taxonkit create-taxdump : accepts arbitrary ranks #60 better handle of taxa with same names. many flags changed. TaxonKit v0.11.1 taxonkit create-taxdump : fix bug of missing Class rank, contributed by @apcamargo. The flag --gtdb was not effected. #57 TaxonKit v0.11.0 new command taxonkit create-taxdump : Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV. #56 TaxonKit v0.10.1 taxonkit cami2-filter : fix option --show-rank which did not work in v0.10.0. TaxonKit v0.10.0 new command taxonkit cami2-filter : Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile taxonkit reformat : fix panic for deleted taxid using -F/--fill-miss-rank . #55 TaxonKit v0.9.0 new command taxonkit profile2cami : converting metagenomic profile table to CAMI format TaxonKit v0.8.0 taxonkit reformat : accept input of TaxIds via flag -I/--taxid-field . accept single taxonomy names . show warning message for TaxIds with the same lineage . #42 better flag checking. #40 taxonkit lca : slightly speedup. taxonkit genautocomplete : support bash|zsh|fish/powershell TaxonKit v0.7.2 taxonkit lineage : new flag -R/--show-lineage-ranks for appending ranks of all levels. reduce memory occupation and slightly speedup. taxonkit filter : flag -E/--equal-to supports multiple values. new flag -n/--save-predictable-norank : do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff. taxonkit reformat : new placeholder {t} for subspecies/strain , {T} for strain . Thanks @wqssf102 for feedback. new flag -S/--pseudo-strain for using the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subpecies\" nor \"strain\". TaxonKit v0.7.1 taxonkit filter : disable unnecessary stdin check when using flag --list-order or --list-ranks . #36 better handling of black list, empty default value: \"no rank\" and \"clade\". 
And you need use -N/--discard-noranks to explicitly filter out \"no rank\", \"clade\". #37 update help message. Thanks @standage for improve this command! #38 TaxonKit v0.7.0 taxonkit : 2-3X faster taxonomy data loading . new command taxonkit filter : filtering TaxIds by taxonomic rank range . #32 new command taxonkit lca : Computing lowest common ancestor (LCA) for TaxIds. taxonkit reformat : new flag -P/--add-prefix : add prefixes for all ranks , single prefix for a rank is defined by flag --prefix-X , where X may be k , p , c , o , f , s , S . new flag -T/--trim : do not fill missing rank lower than current rank. taxonkit list : do not duplicate root node. TaxonKit v0.6.2 taxonkit reformat -F : fix taxids of abbreviated lineage containing names shared by different taxids. #35 TaxonKit v0.6.1 taxonkit lineage : new flag -n/--show-name for appending scientific name. new flag -L/--no-lineage for hide lineage, this is for fast retrieving names or/and ranks. taxonkit reformat : fix flag -F/--fill-miss-rank . discard order restriction of rank symbols. TaxonKit v0.6.0 taxonkit list : check merged and deleted taxids. fix bug of json output. #30 taxonkit name2taxid : new flag -s/--sci-name for limiting to searching scientific names. #29 taxonkit version : make checking update optional TaxonKit v0.5.0 taxonkit : requiring delnodes.dmp and merged.dmp. taxonkit lineage : detect deleted and merged taxids now. #19 taxonkit list/name2taxid : add short flag -r for --show-rank , -n for --show-name . TaxonKit v0.4.3 taxonkit taxid-changelog : rewrite logic, fix bug and add more change types TaxonKit v0.4.2 taxonkit taxid-changelog : change output of ABSORB , do not merged into one record for changes in different versions. TaxonKit v0.4.1 taxonkit taxid-changelog : add fields: name and rank . and fix sorting bug. 
detailed lineage change status TaxonKit v0.4.0 new command: taxonkit taxid-changelog : for creating taxid changelog from dump archive TaxonKit v0.3.0 this version is almost the same as v0.2.5 TaxonKit v0.2.5 add global flag: --line-buffered to disable output buffer. #11 replace global flags --names-file and --nodes-file with --data-dir , also support environment variable TAXONKIT_DB . #17 taxonkit reformat : detects lineages containing unofficial taxon name and won't show panic message. taxonkit name2taxid : supports synonyms names. #9 taxokit lineage : add flag -r/--show-rank to print rank at another new column. TaxonKit v0.2.4 taxonkit reformat : more accurate result when using flag -F/--fill-miss-rank to estimate and fill missing rank with original lineage information supporting escape strings like \\t , \\n , #5 outputting corresponding taxids for reformated lineage. #8 taxonkit lineage : fix bug for taxid 1 #7 add flag -d/--delimiter . TaxonKit v0.2.3 fix bug brought in v0.2.1 TaxonKit v0.2.2 make verbose information optional #4 TaxonKit v0.2.1 taxonkit list : fix bug of no output for leaf nodes of the taxonomic tree. #4 add new command genautocomplete to generate shell autocompletion script! TaxonKit v0.2.0 add command name2taxid to query taxid by taxon scientific name. lineage , reformat : changed flags and default operations , check the usage . TaxonKit v0.1.8 taxonkit lineage , add an extra column of lineage in Taxid. #3 . e.g., fix colorful output in windows. TaxonKit v0.1.7 taxonkit reformat : supports reading stdin from output of taxonkit lineage , reformated lineages are appended to input data. TaxonKit v0.1.6 remove flag -f/--formated-rank from taxonkit lineage , using taxonkit reformat can archieve same result. 
TaxonKit v0.1.5 reorganize code and flags TaxonKit v0.1.4 add flag --fill for taxonkit reformat , which estimates and fills missing rank with original lineage information TaxonKit v0.1.3 add command of taxonkit reformat which reformats full lineage to custom format TaxonKit v0.1.2 add command of taxonkit lineage , users can query lineage of given taxon IDs from file TaxonKit v0.1.1 add feature of taxonkit list , users can choose output in readable JSON format by flag --json so the taxonomy tree can be collapsed and expanded in modern text editors. TaxonKit v0.1 first release","title":"Release history"},{"location":"tutorial/","text":"Tutorial Table of Contents Formatting lineage Parsing kraken/bracken result Making nr blastdb for specific taxids Summaries of taxonomy data Merging GTDB and NCBI taxonomy Formatting lineage Show lineage detail of a TaxId. The command below works on Windows with help of csvtk . 
$ echo \"2697049\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Example data. $ cat taxids3.txt 376619 349741 239935 314101 11932 1327037 83333 1408252 2605619 2697049 Format to 7-level ranks (\"superkingdom phylum class order family genus species\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related 
coronavirus Format to 8-level ranks (\"superkingdom phylum class order family genus species subspecies/rank\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 -f \"{k};{p};{c};{o};{f};{g};{s};{t}\" 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila; 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B; 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle; 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y; 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli; 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus; Replace missing ranks with Unassigned and output tab-delimited format. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -r \"Unassigned\" -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. 
holarctica LVS 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned 314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned 83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12 1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178 2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned 2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned Fill missing ranks and add prefixes. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -F -P -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. 
holarctica LVS 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835 239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain 314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain 83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12 1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178 2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain 2697049 k__Viruses p__Pisuviricota c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain When there are no nodes of rank \"subspecies\" or \"strain\", we can switch on -S/--pseudo-strain to use the 
node with lowest rank as subspecies/strain name, if which rank is lower than \"species\" . $ cat taxids3.txt \\ | taxonkit lineage -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | cut -f 1,2,9,10 \\ | csvtk add-header -t -n \"taxid,rank,species,strain\" \\ | csvtk pretty -t taxid rank species strain ------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------ 376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS 349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain 314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain 83333 strain Escherichia coli Escherichia coli K-12 1408252 subspecies Escherichia coli Escherichia coli R178 2605619 no rank Escherichia coli Escherichia coli O16:H48 2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2 List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with \"no rank\". But when filtering with -L/--lower-than , you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff . 
$ time taxonkit list --ids 1 \\ | taxonkit filter -L species -E species -R -N -n \\ | taxonkit lineage -n -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -Ht -l -f 1,3,2,1,4-11 \\ | csvtk add-header -t -n \"taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain\" \\ | pigz -c > result.tsv.gz real 0m25.167s user 2m14.809s sys 0m7.197s $ pigz -cd result.tsv.gz \\ | csvtk grep -t -f taxid -p 2697049 \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 rank no rank name Severe acute respiratory syndrome coronavirus 2 lineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 kingdom Viruses phylum Pisuviricota class Pisoniviricetes order Nidovirales family Coronaviridae genus Betacoronavirus species Severe acute respiratory syndrome-related coronavirus strain Severe acute respiratory syndrome coronavirus 2 Parsing kraken/bracken result Example Data SRS014459-Stool.fasta.gz Run Kraken2 and Bracken KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf THREADS=16 CLASSIFICATION_LVL=S THRESHOLD=10 READ_LEN=100 SAMPLE=SRS014459-Stool.fasta.gz BRACKEN_OUTPUT_FILE=$SAMPLE kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \\ -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken Original format $ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport 100.00 9491 0 R 1 root 99.85 9477 0 R1 131567 cellular organisms 99.85 9477 0 D 2 Bacteria 66.08 6271 0 D1 1783270 FCB group 66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group 66.08 6271 0 P 976 Bacteroidetes 66.08 6271 0 C 200643 Bacteroidia 66.08 6271 0 O 171549 Bacteroidales 34.45 
3270 0 F 815 Bacteroidaceae 34.45 3270 0 G 816 Bacteroides 10.43 990 990 S 246787 Bacteroides cellulosilyticus 7.98 757 757 S 28116 Bacteroides ovatus 3.10 293 0 G1 2646097 unclassified Bacteroides 1.06 100 100 S 2755405 Bacteroides sp. CACC 737 0.49 46 46 S 2650157 Bacteroides sp. HF-5287 Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py ) $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(\\|[kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"\\|[kpcofgs]__\\|\" -r \"|\" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ > SRS014459-Stool.fasta.gz_bracken_species.kreport.format $ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format k__Bacteria 99.85 k__Bacteria|p__Bacteroidetes 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. 
HF-5287 0.49 Converting to Qiime format $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}; {p}; {c}; {o}; {f}; {g}; {s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(; [kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"; [kpcofgs]__; \" -r \"; \" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ | head -n 10 k__Bacteria 99.85 k__Bacteria; p__Bacteroidetes 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49 Save taxon proportion and taxid, and get lineage, name and rank. 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit lineage -i 2 -n -r \\ | csvtk cut -Ht -f 1,2,5,4,3 \\ | head -n 10 \\ | csvtk pretty -Ht 100.00 1 no rank root root 99.85 131567 no rank cellular organisms cellular organisms 99.85 2 superkingdom Bacteria cellular organisms;Bacteria 66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group 66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group 66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes 66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia 66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales 34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae 34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides Only save species or lower level and get lineage in format of \"superkingdom phylum class order family genus species\". 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit filter -N -E species -L species -i 2 \\ | taxonkit lineage -i 2 -n -r \\ | taxonkit reformat -i 3 -f \"{k};{p};{c};{o};{f};{g};{s}\" \\ | csvtk cut -Ht -f 1,2,5,4,6 \\ | csvtk add-header -t -n abundance,taxid,rank,name,lineage \\ | head -n 10 \\ | csvtk pretty -t abundance taxid rank name lineage --------- ------- ------- ---------------------------- -------------------------------------------------------------------------------------------------------- 10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus 7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus 1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737 0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287 0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1 0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10 0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141 0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. 
CBA7301 5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis Making nr blastdb for specific taxids Attention: BLAST+ 2.8.1 was released with new databases , which allow you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build blastdb for specific taxids now. Changes: 2018-09-13 rewritten 2018-12-22 providing faster method for step 3.1 2019-01-07 add note of new blastdb version 2020-10-14 update steps for a huge number of accessions belonging to a high taxon level like bacteria. Data: pre-formatted blastdb (09/10/2018) prot.accession2taxid.gz (09/07/2018) (optional, but recommended) Hardware in this tutorial CPU: AMD 8-cores/16-threads 3.7GHz RAM: 64GB DISK: Taxonomy files stored on NVMe SSD blastdb files stored on 7200rpm HDD Tools: blast+ pigz (recommended, faster than gzip) taxonkit seqkit (recommended), version >= 0.14.0 rush (optional, for parallelizing sequence filtering) Steps: Listing all taxids below $id using taxonkit. id=6656 # 6656 is the phylum Arthropoda # echo 6656 | taxonkit lineage | taxonkit reformat # 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;; # 2 bacteria # 2157 archaea # 4751 fungi # 10239 virus # time: 2s taxonkit list --ids $id --indent \"\" > $id.taxid.txt # taxonkit list --ids 2,4751,10239 --indent \"\" > microbe.taxid.txt wc -l $id.taxid.txt # 518373 6656.taxid.txt Retrieving target accessions. There are two options: From prot.accession2taxid.gz ( faster, recommended ). Note that some accessions are not in nr . 
# time: 4min pigz -dc prot.accession2taxid.gz \\ | csvtk grep -t -f taxid -P $id.taxid.txt \\ | csvtk cut -t -f accession.version,taxid \\ | sed 1d \\ > $id.acc2taxid.txt cut -f 1 $id.acc2taxid.txt > $id.acc.txt wc -l $id.acc.txt # 8174609 6656.acc.txt From pre-formatted nr blastdb # time: 40min blastdbcmd -db nr -entry all -outfmt \"%a %T\" | pigz -c > nr.acc2taxid.txt.gz pigz -dc nr.acc2taxid.txt.gz | wc -l # 555220892 # time: 3min pigz -dc nr.acc2taxid.txt.gz \\ | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \\ | cut -d ' ' -f 1 \\ > $id.acc.txt wc -l $id.acc.txt # 6928021 6656.acc.txt Retrieving FASTA sequences from pre-formatted blastdb. There are two options: From nr.fa exported from pre-formatted blastdb ( faster, smaller output file, recommended ). DO NOT directly download nr.gz from ncbi ftp , in which the FASTA headers are not well formatted. # 1. exporting nr.fa from pre-formatted blastdb # time: 117min (run only once) blastdbcmd -db nr -dbtype prot -entry all -outfmt \"%f\" -out - | pigz -c > nr.fa.gz # ===================================================================== # 2. filtering sequences belonging to $taxid # --------------------------------------------------------------------- # method 1) (for cases where $id.acc.txt is not very large) # time: 80min # perl one-liner is used to unfold records having multiple accessions time cat <(echo) <(pigz -dc nr.fa.gz) \\ | perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' \\ | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz # --------------------------------------------------------------------- # method 2) (**faster**) # 33min (run only once) # (1). split nr.fa.gz. # Note: I have 16 cpus. $ time seqkit split2 -p 15 nr.fa.gz # (2). 
parallelize unfolding $ cat _unfold_blastdb_fa.sh #!/bin/sh perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' # 10 min time ls nr.fa.gz.split/nr.part_*.fa.gz \\ | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\\.(.+)$} ' # (3). merge result cat nr.$id.part*.fa.gz > nr.$id.fa.gz rm nr.$id.part*.fa.gz # --------------------------------------------------------------------- # method 3) (for huge $id.acc.txt file, e.g., bacteria) # (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me). split -d -l 300000000 $id.acc.txt $id.acc.txt.part_ # (2). filter time ls $id.acc.txt.part_* \\ | rush -j 1 --immediate-output -v id=$id \\ 'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz ' # (3). merge cat nr.$id.part*.fa.gz > nr.$id.fa.gz # clean rm nr.$id.part*.fa.gz rm $id.acc.txt.part_* # (4). optionally adding taxid, you may edit replacement (-r) below # split time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_ ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz i=0 for f in $id.acc2taxid.txt.part_* ; do echo $f time pigz -cd nr.$id.with-taxid.part$i.fa.gz \\ | seqkit replace -k $f -p \"^([^\\-]+?) \" -r \"{kv}-\\$1 \" -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz; /bin/rm nr.$id.with-taxid.part$i.fa.gz i=$(($i+1)); done mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz # ===================================================================== # 3. 
counting sequences # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 902M 9\u6708 13 01:42 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' -c # 6928017 # Here 6928017 ~= 6928021 ($id.acc.txt) Directly from pre-formatted blastdb # time: 5h20min blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz # counting sequences # # Note that the headers of FASTA records output by blastdbcmd are \"folded\" # for accessions from different species with same sequences, so the # number may be smaller than $(wc -l $id.acc.txt). pigz -dc nr.$id.fa.gz | grep '^>' -c # 1577383 # counting accessions # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 2.1G 9\u6708 13 03:38 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\\n>/g' | grep '^>' -c # 288415413 makeblastdb pigz -dc nr.$id.fa.gz > nr.$id.fa # time: 3min ($nr.$id.fa from step 3 option 1) # # building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error: # # BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1 # makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id # rm nr.$id.fa blastp (optional) # blastdb nr.$id is built from sequences in step 3 option 1 # blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast # real 0m20.866s # $ cat t4.fa.blast | grep Query= -A 10 # Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a # # Length=35 Score E # Sequences producing significant alignments: (Bits) Value # 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17 # A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17 # ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15 # D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15 # ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15 Summaries of taxonomy data You can change the TaxId of interest. Rank counts of common categories. 
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \\ | rush -D ' ' -T b \\ 'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \\ | sed 1d \\ | taxonkit filter -i 2 -E genus -L genus \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ > stats.{}.tsv ' $ csvtk -t join --outer-join stats.*.tsv \\ | csvtk add-header -t -n \"rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )\" \\ | csvtk csv2md -t Similar data on NCBI Taxonomy rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae species 12482 460940 1349648 156908 957297 191026 strain 354 40643 3486 2352 33 50 genus 205 4112 90882 6844 64148 16202 isolate 7 503 809 76 17 3 species group 2 77 251 22 214 5 serotype 218 serogroup 136 subsection 21 21 subspecies 632 24523 158 17043 7212 forma specialis 521 220 179 33 1 species subgroup 23 101 101 biotype 7 10 morph 12 3 4 5 section 437 37 2 398 genotype 12 12 series 9 5 4 varietas 25 8499 1100 2 7188 forma 4 560 185 6 315 subgenus 1 1558 10 1414 112 pathogroup 5 subvariety 5 5 Count of all ranks $ time taxonkit list --ids 1 \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ | csvtk pretty -H -t species 1879659 no rank 222743 genus 96625 strain 44483 subspecies 25174 family 9492 varietas 8524 subfamily 3050 tribe 2213 order 1660 subgenus 1618 isolate 1319 serotype 1216 clade 886 superfamily 865 forma specialis 741 forma 564 subtribe 508 section 437 class 429 suborder 372 species group 330 phylum 272 subclass 156 serogroup 138 infraorder 130 species subgroup 124 superorder 55 subphylum 33 parvorder 26 subsection 21 genotype 20 infraclass 18 biotype 17 morph 12 kingdom 11 series 9 superclass 6 cohort 5 pathogroup 5 subvariety 5 superkingdom 4 subcohort 3 subkingdom 1 superphylum 1 real 0m3.663s user 0m15.897s sys 0m1.010s Ranks of taxa at or below species. 
$ taxonkit list --ids 1 \\ | taxonkit filter --lower-than species --equal-to species \\ | taxonkit lineage -L -r \\ | csvtk freq -Ht -nr -f 2 \\ | csvtk add-header -t -n rank,count \\ | csvtk pretty -t rank count --------------- ------- species 1880044 no rank 222756 strain 44483 subspecies 25171 varietas 8524 isolate 1319 serotype 1216 clade 885 forma specialis 741 forma 564 serogroup 138 genotype 20 biotype 17 morph 12 pathogroup 5 subvariety 5 Merging GTDB and NCBI taxonomy Sometimes ( 1 ) one needs to build a database including bacteria and archaea (from GTDB) and viruses (from NCBI). The idea is to export lineages from both GTDB and NCBI using taxonkit reformat , and then create taxdump files from them with taxonkit create-taxdump . Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump . taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent \"\" \\ | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \\ | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \\ --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ -o gtdb.tsv Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. For taxa whose rank is \"no rank\" below the species, we treat them as taxa of strain rank ( --pseudo-strain , taxonkit v0.14.1 needed). # taxid of Viruses: 10239 taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent \"\" \\ | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \\ | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \\ --pseudo-strain --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ -o ncbi-viral.tsv Creating taxdump from lineages above. 
(awk '{print $_\"\\t\"}' gtdb.tsv; cat ncbi-viral.tsv) \\ | taxonkit create-taxdump \\ --field-accession 1 \\ -R \"superkingdom,phylum,class,order,family,genus,species,strain\" \\ -O taxdump # we use --field-accession 1 to output the mapping file between old taxids and new ones. $ grep 2697049 taxdump/taxid.map # SARS-COV-2 2697049 21630522 Some tests: # SARS-COV-2 in NCBI taxonomy $ echo 2697049 \\ | taxonkit lineage -t --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 $ echo \"Severe acute respiratory syndrome coronavirus 2\" | taxonkit name2taxid --data-dir taxdump/ Severe acute respiratory syndrome coronavirus 2 216305222 $ echo 216305222 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 1287770734 superkingdom Viruses 1506901452 phylum Pisuviricota 1091693597 class Pisoniviricetes 37745009 order Nidovirales 738421640 family Coronaviridae 906833049 genus Betacoronavirus 1015862491 species Severe acute respiratory syndrome-related coronavirus 216305222 strain Severe acute respiratory syndrome coronavirus 2 $ echo \"Escherichia coli\" | taxonkit name2taxid --data-dir taxdump/ Escherichia coli 1945799576 $ echo 1945799576 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | 
taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 609216830 superkingdom Bacteria 1641076285 phylum Proteobacteria 329474883 class Gammaproteobacteria 1012954932 order Enterobacterales 87250111 family Enterobacteriaceae 1187493883 genus Escherichia 1945799576 species Escherichia coli","title":"Tutorial"},{"location":"tutorial/#tutorial","text":"","title":"Tutorial"},{"location":"tutorial/#table-of-contents","text":"Formatting lineage Parsing kraken/bracken result Making nr blastdb for specific taxids Summaries of taxonomy data Merging GTDB and NCBI taxonomy","title":"Table of Contents"},{"location":"tutorial/#formatting-lineage","text":"Show lineage detail of a TaxId. The command below works on Windows with help of csvtk . 
$ echo \"2697049\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Example data. $ cat taxids3.txt 376619 349741 239935 314101 11932 1327037 83333 1408252 2605619 2697049 Format to 7-level ranks (\"superkingdom phylum class order family genus species\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related 
coronavirus Format to 8-level ranks (\"superkingdom phylum class order family genus species subspecies/rank\"). $ cat taxids3.txt \\ | taxonkit reformat -I 1 -f \"{k};{p};{c};{o};{f};{g};{s};{t}\" 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila; 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B; 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle; 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y; 83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12 1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178 2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli; 2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus; Replace missing ranks with Unassigned and output tab-delimited format. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -r \"Unassigned\" -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. 
holarctica LVS 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned 314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned 83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12 1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178 2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned 2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned Fill missing ranks and add prefixes. $ cat taxids3.txt \\ | taxonkit reformat -I 1 -F -P -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk pretty -H -t 376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. 
holarctica LVS 349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835 239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain 314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain 83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12 1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178 2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain 2697049 k__Viruses p__Pisuviricota c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain When there are no nodes of rank \"subspecies\" or \"strain\", we can switch on -S/--pseudo-strain to use the
node with the lowest rank as the subspecies/strain name, provided its rank is lower than \"species\". $ cat taxids3.txt \\ | taxonkit lineage -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | cut -f 1,2,9,10 \\ | csvtk add-header -t -n \"taxid,rank,species,strain\" \\ | csvtk pretty -t taxid rank species strain ------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------ 376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS 349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain 314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain 11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain 1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain 83333 strain Escherichia coli Escherichia coli K-12 1408252 subspecies Escherichia coli Escherichia coli R178 2605619 no rank Escherichia coli Escherichia coli O16:H48 2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2 List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with \"no rank\". But when filtering with -L/--lower-than , you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff .
$ time taxonkit list --ids 1 \\ | taxonkit filter -L species -E species -R -N -n \\ | taxonkit lineage -n -r -L \\ | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -Ht -l -f 1,3,2,1,4-11 \\ | csvtk add-header -t -n \"taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain\" \\ | pigz -c > result.tsv.gz real 0m25.167s user 2m14.809s sys 0m7.197s $ pigz -cd result.tsv.gz \\ | csvtk grep -t -f taxid -p 2697049 \\ | csvtk transpose -t \\ | csvtk pretty -H -t taxid 2697049 rank no rank name Severe acute respiratory syndrome coronavirus 2 lineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 kingdom Viruses phylum Pisuviricota class Pisoniviricetes order Nidovirales family Coronaviridae genus Betacoronavirus species Severe acute respiratory syndrome-related coronavirus strain Severe acute respiratory syndrome coronavirus 2","title":"Formatting lineage"},{"location":"tutorial/#parsing-krakenbracken-result","text":"Example Data SRS014459-Stool.fasta.gz Run Kraken2 and Bracken KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf THREADS=16 CLASSIFICATION_LVL=S THRESHOLD=10 READ_LEN=100 SAMPLE=SRS014459-Stool.fasta.gz BRACKEN_OUTPUT_FILE=$SAMPLE kraken2 --db ${KRAKEN_DB} --threads ${THREADS} -report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \\ -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken Original format $ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport 100.00 9491 0 R 1 root 99.85 9477 0 R1 131567 cellular organisms 99.85 9477 0 D 2 Bacteria 66.08 6271 0 D1 1783270 FCB group 66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group 66.08 6271 0 P 976 Bacteroidetes 66.08 6271 0 C
200643 Bacteroidia 66.08 6271 0 O 171549 Bacteroidales 34.45 3270 0 F 815 Bacteroidaceae 34.45 3270 0 G 816 Bacteroides 10.43 990 990 S 246787 Bacteroides cellulosilyticus 7.98 757 757 S 28116 Bacteroides ovatus 3.10 293 0 G1 2646097 unclassified Bacteroides 1.06 100 100 S 2755405 Bacteroides sp. CACC 737 0.49 46 46 S 2650157 Bacteroides sp. HF-5287 Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py ) $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(\\|[kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"\\|[kpcofgs]__\\|\" -r \"|\" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ > SRS014459-Stool.fasta.gz_bracken_species.kreport.format $ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format k__Bacteria 99.85 k__Bacteria|p__Bacteroidetes 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. 
HF-5287 0.49 Converting to Qiime format $ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 5,1 \\ | taxonkit lineage \\ | taxonkit reformat -i 3 -P -f \"{k}; {p}; {c}; {o}; {f}; {g}; {s}\" \\ | csvtk cut -Ht -f 4,2 \\ | csvtk replace -Ht -p \"(; [kpcofgs]__)+$\" \\ | csvtk replace -Ht -p \"; [kpcofgs]__; \" -r \"; \" \\ | csvtk uniq -Ht \\ | csvtk grep -Ht -p k__ -v \\ | head -n 10 k__Bacteria 99.85 k__Bacteria; p__Bacteroidetes 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49 Save taxon proportion and taxid, and get lineage, name and rank. 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit lineage -i 2 -n -r \\ | csvtk cut -Ht -f 1,2,5,4,3 \\ | head -n 10 \\ | csvtk pretty -Ht 100.00 1 no rank root root 99.85 131567 no rank cellular organisms cellular organisms 99.85 2 superkingdom Bacteria cellular organisms;Bacteria 66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group 66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group 66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes 66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia 66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales 34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae 34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides Only save species or lower level and get lineage in format of \"superkingdom phylum class order family genus species\". 
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\ | csvtk cut -Ht -f 1,5 \\ | taxonkit filter -N -E species -L species -i 2 \\ | taxonkit lineage -i 2 -n -r \\ | taxonkit reformat -i 3 -f \"{k};{p};{c};{o};{f};{g};{s}\" \\ | csvtk cut -Ht -f 1,2,5,4,6 \\ | csvtk add-header -t -n abundance,taxid,rank,name,lineage \\ | head -n 10 \\ | csvtk pretty -t abundance taxid rank name lineage --------- ------- ------- ---------------------------- -------------------------------------------------------------------------------------------------------- 10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus 7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus 1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737 0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287 0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1 0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10 0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141 0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. 
CBA7301 5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis","title":"Parsing kraken/bracken result"},{"location":"tutorial/#making-nr-blastdb-for-specific-taxids","text":"Attention: BLAST+ 2.8.1 is released with new databases , which allows you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build blastdb for specific taxids now. Changes: 2018-09-13 rewritten 2018-12-22 providing faster method for step 3.1 2019-01-07 add note of new blastdb version 2020-10-14 update steps for huge number of accessions belong to high taxon level like bacteria. Data: pre-formated blastdb (09/10/2018) prot.accession2taxid.gz (09/07/2018) (optional, but recommended) Hardware in this tutorial CPU: AMD 8-cores/16-threads 3.7Ghz RAM: 64GB DISK: Taxonomy files stores in NVMe SSD blastdb files stores in 7200rpm HDD Tools: blast+ pigz (recommended, faster than gzip) taxonkit seqkit (recommended), version >= 0.14.0 rush (optional, for parallizing filtering sequence) Steps: Listing all taxids below $id using taxonkit. id=6656 # 6656 is the phylum Arthropoda # echo 6656 | taxonkit lineage | taxonkit reformat # 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;; # 2 bacteria # 2157 archaea # 4751 fungi # 10239 virus # time: 2s taxonkit list --ids $id --indent \"\" > $id.taxid.txt # taxonkit list --ids 2,4751,10239 --indent \"\" > microbe.taxid.txt wc -l $id.taxid.txt # 518373 6656.taxid.txt Retrieving target accessions. There are two options: From prot.accession2taxid.gz ( faster, recommended ). Note that some accessions are not in nr . 
# time: 4min pigz -dc prot.accession2taxid.gz \\ | csvtk grep -t -f taxid -P $id.taxid.txt \\ | csvtk cut -t -f accession.version,taxid \\ | sed 1d \\ > $id.acc2taxid.txt cut -f 1 $id.acc2taxid.txt > $id.acc.txt wc -l $id.acc.txt # 8174609 6656.acc.txt From pre-formated nr blastdb # time: 40min blastdbcmd -db nr -entry all -outfmt \"%a %T\" | pigz -c > nr.acc2taxid.txt.gz pigz -dc nr.acc2taxid.txt.gz | wc -l # 555220892 # time: 3min pigz -dc nr.acc2taxid.txt.gz \\ | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \\ | cut -d ' ' -f 1 \\ > $id.acc.txt wc -l $id.acc.txt # 6928021 6656.acc.txt Retrieving FASTA sequences from pre-formated blastdb. There are two options: From nr.fa exported from pre-formated blastdb ( faster, smaller output file, recommended ). DO NOT directly download nr.gz from ncbi ftp , in which the FASTA headers are not well formated. # 1. exporting nr.fa from pre-formated blastdb # time: 117min (run only once) blastdbcmd -db nr -dbtype prot -entry all -outfmt \"%f\" -out - | pigz -c > nr.fa.gz # ===================================================================== # 2. filtering sequence belong to $taxid # --------------------------------------------------------------------- # methond 1) (for cases where $id.acc.txt is not very huge) # time: 80min # perl one-liner is used to unfold records having mulitple accessions time cat <(echo) <(pigz -dc nr.fa.gz) \\ | perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' \\ | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz # --------------------------------------------------------------------- # method 2) (**faster**) # 33min (run only once) # (1). split nr.fa.gz. # Note: I have 16 cpus. $ time seqkit split2 -p 15 nr.fa.gz # (2). 
parallize unfolding $ cat _unfold_blastdb_fa.sh #!/bin/sh perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' # 10 min time ls nr.fa.gz.split/nr.part_*.fa.gz \\ | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\\.(.+)$} ' # (3). merge result cat nr.$id.part*.fa.gz > nr.$id.fa.gz rm nr.$id.part*.fa.gz # --------------------------------------------------------------------- # method 3) (for huge $id.acc.txt file, e.g., bacteria) # (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me). split -d -l 300000000 $id.acc.txt $id.acc.txt.part_ # (2). filter time ls $id.acc.txt.part_* \\ | rush -j 1 --immediate-output -v id=$id \\ 'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \\ | ./_unfold_blastdb_fa.sh \\ | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz ' # (3). merge cat nr.$id.part*.fa.gz > nr.$id.fa.gz # clean rm nr.$id.part*.fa.gz rm $id.acc.txt.part_ # (4). optionally adding taxid, you may edit replacement (-r) below # split time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_ ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz i=0 for f in $id.acc2taxid.txt.part_* ; do echo $f time pigz -cd nr.$id.with-taxid.part$i.fa.gz \\ | seqkit replace -k $f -p \"^([^\\-]+?) \" -r \"{kv}-\\$1 \" -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz; /bin/rm nr.$id.with-taxid.part$i.fa.gz i=$(($i+1)); done mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz # ===================================================================== # 3. 
counting sequences # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 902M 9\u6708 13 01:42 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' -c # 6928017 # Here 6928017 ~= 6928021 ($id.acc.txt) Directly from pre-formated blastdb # time: 5h20min blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz # counting sequences # # Note that the headers of outputed fasta by blastdbcmd are \"folded\" # for accessions from different species with same sequences, so the # number may be small than $(wc -l $id.acc.txt). pigz -dc nr.$id.fa.gz | grep '^>' -c # 1577383 # counting accessions # # ls -lh nr.$id.fa.gz # -rw-r--r-- 1 shenwei shenwei 2.1G 9\u6708 13 03:38 nr.6656.fa.gz # pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\\n>/g' | grep '^>' -c # 288415413 makeblastdb pigz -dc nr.$id.fa.gz > nr.$id.fa # time: 3min ($nr.$id.fa from step 3 option 1) # # building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error: # # BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1 # makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id # rm nr.$id.fa blastp (optional) # blastdb nr.$id is built from sequences in step 3 option 1 # blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast # real 0m20.866s # $ cat t4.fa.blast | grep Query= -A 10 # Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a # # Length=35 Score E # Sequences producing significant alignments: (Bits) Value # 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17 # A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17 # ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15 # D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15 # ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15","title":"Making nr blastdb for specific taxids"},{"location":"tutorial/#summaries-of-taxonomy-data","text":"You can change the TaxId of interest. 
Rank counts of common categories. $ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \\ | rush -D ' ' -T b \\ 'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \\ | sed 1d \\ | taxonkit filter -i 2 -E genus -L genus \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ > stats.{}.tsv ' $ csvtk -t join --outer-join stats.*.tsv \\ | csvtk add-header -t -n \"rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )\" \\ | csvtk csv2md -t Similar data on NCBI Taxonomy rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae species 12482 460940 1349648 156908 957297 191026 strain 354 40643 3486 2352 33 50 genus 205 4112 90882 6844 64148 16202 isolate 7 503 809 76 17 3 species group 2 77 251 22 214 5 serotype 218 serogroup 136 subsection 21 21 subspecies 632 24523 158 17043 7212 forma specialis 521 220 179 33 1 species subgroup 23 101 101 biotype 7 10 morph 12 3 4 5 section 437 37 2 398 genotype 12 12 series 9 5 4 varietas 25 8499 1100 2 7188 forma 4 560 185 6 315 subgenus 1 1558 10 1414 112 pathogroup 5 subvariety 5 5 Count of all ranks $ time taxonkit list --ids 1 \\ | taxonkit lineage -L -r \\ | csvtk freq -H -t -f 2 -nr \\ | csvtk pretty -H -t species 1879659 no rank 222743 genus 96625 strain 44483 subspecies 25174 family 9492 varietas 8524 subfamily 3050 tribe 2213 order 1660 subgenus 1618 isolate 1319 serotype 1216 clade 886 superfamily 865 forma specialis 741 forma 564 subtribe 508 section 437 class 429 suborder 372 species group 330 phylum 272 subclass 156 serogroup 138 infraorder 130 species subgroup 124 superorder 55 subphylum 33 parvorder 26 subsection 21 genotype 20 infraclass 18 biotype 17 morph 12 kingdom 11 series 9 superclass 6 cohort 5 pathogroup 5 subvariety 5 superkingdom 4 subcohort 3 subkingdom 1 superphylum 1 real 0m3.663s user 0m15.897s sys 0m1.010s Ranks of taxa at or below species. 
$ taxonkit list --ids 1 \\ | taxonkit filter --lower-than species --equal-to species \\ | taxonkit lineage -L -r \\ | csvtk freq -Ht -nr -f 2 \\ | csvtk add-header -t -n rank,count \\ | csvtk pretty -t rank count --------------- ------- species 1880044 no rank 222756 strain 44483 subspecies 25171 varietas 8524 isolate 1319 serotype 1216 clade 885 forma specialis 741 forma 564 serogroup 138 genotype 20 biotype 17 morph 12 pathogroup 5 subvariety 5","title":"Summaries of taxonomy data"},{"location":"tutorial/#merging-gtdb-and-ncbi-taxonomy","text":"Sometimes ( 1 ) one needs to build a database including bacteria and archaea (from GTDB) and viral database from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat , and then create taxdump files from them with taxonkit create-taxdump . Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump . taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent \"\" \\ | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \\ | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \\ --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ -o gtdb.tsv Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. For taxa whose rank is \"no rank\" below the species, we treat them as taxa of strain rank ( --pseudo-strain , taxonkit v0.14.1 needed). # taxid of Viruses: 10239 taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent \"\" \\ | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \\ | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \\ --pseudo-strain --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ -o ncbi-viral.tsv Creating taxdump from lineages above.
(awk '{print $_\"\\t\"}' gtdb.tsv; cat ncbi-viral.tsv) \\ | taxonkit create-taxdump \\ --field-accession 1 \\ -R \"superkingdom,phylum,class,order,family,genus,species,strain\" \\ -O taxdump # we use --field-accession 1 to output the mapping file between old taxids and new ones. $ grep 2697049 taxdump/taxid.map # SARS-COV-2 2697049 21630522 Some tests: # SARS-COV-2 in NCBI taxonomy $ echo 2697049 \\ | taxonkit lineage -t --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 $ echo \"Severe acute respiratory syndrome coronavirus 2\" | taxonkit name2taxid --data-dir taxdump/ Severe acute respiratory syndrome coronavirus 2 216305222 $ echo 216305222 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 1287770734 superkingdom Viruses 1506901452 phylum Pisuviricota 1091693597 class Pisoniviricetes 37745009 order Nidovirales 738421640 family Coronaviridae 906833049 genus Betacoronavirus 1015862491 species Severe acute respiratory syndrome-related coronavirus 216305222 strain Severe acute respiratory syndrome coronavirus 2 $ echo \"Escherichia coli\" | taxonkit name2taxid --data-dir taxdump/ Escherichia coli 1945799576 $ echo 1945799576 \\ | taxonkit lineage -t --data-dir taxdump/ \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | 
taxonkit lineage -r -n -L --data-dir taxdump/ \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -Ht 609216830 superkingdom Bacteria 1641076285 phylum Proteobacteria 329474883 class Gammaproteobacteria 1012954932 order Enterobacterales 87250111 family Enterobacteriaceae 1187493883 genus Escherichia 1945799576 species Escherichia coli","title":"Merging GTDB and NCBI taxonomy"},{"location":"usage/","text":"Usage and Examples Table of Contents Usage and Examples Before use taxonkit list lineage reformat name2taxid filter lca taxid-changelog profile2cami cami-filter create-taxdump genautocomplete Before use Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB .
All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress them, and overwrite the old ones. taxonkit TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit Version: 0.14.2 Author: Wei Shen Source code: https://github.com/shenwei356/taxonkit Documents : https://bioinf.shenwei.me/taxonkit Citation : https://www.sciencedirect.com/science/article/pii/S1673852721000837 Dataset: Please download and uncompress \"taxdump.tar.gz\": ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and copy \"names.dmp\", \"nodes.dmp\", \"delnodes.dmp\" and \"merged.dmp\" to data directory: \"/home/shenwei/.taxonkit\" or some other directory, and later you can refer to using flag --data-dir, or environment variable TAXONKIT_DB. When environment variable TAXONKIT_DB is set, explicitly setting --data-dir will overide the value of TAXONKIT_DB. 
Usage: taxonkit [command] Available Commands: cami-filter Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV filter Filter TaxIds by taxonomic rank range genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell) lca Compute lowest common ancestor (LCA) for TaxIds lineage Query taxonomic lineage of given TaxIds list List taxonomic subtrees of given TaxIds name2taxid Convert scientific names to TaxIds profile2cami Convert metagenomic profile table to CAMI format reformat Reformat lineage in canonical ranks taxid-changelog Create TaxId changelog from dump archives version print version information and check for update Flags: --data-dir string directory containing nodes.dmp and names.dmp (default \"/home/shenwei/.taxonkit\") -h, --help help for taxonkit --line-buffered use line buffering on output, i.e., immediately writing to stdin/file for every line of output -o, --out-file string out file (\"-\" for stdout, suffix .gz for gzipped out) (default \"-\") -j, --threads int number of CPUs. 4 is enough (default 4) --verbose print verbose information list Usage List taxonomic subtrees of given TaxIds Attentions: 1. When multiple taxids are given, the output may contain duplicated records if some taxids are descendants of others. Examples: $ taxonkit list --ids 9606 -n -r --indent \" \" 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' $ taxonkit list --ids 9606 --indent \"\" 9606 63221 741158 Usage: taxonkit list [flags] Flags: -h, --help help for list -i, --ids string TaxId(s), multiple values should be separated by comma -I, --indent string indent (default \" \") -J, --json output in JSON format. 
you can save the result in file with suffix \".json\" and open with modern text editor -n, --show-name output scientific name -r, --show-rank output rank Examples Default usage. $ taxonkit list --ids 9605,239934 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 2608915 1131336 ... Removing indent. The list could be used to extract sequences from BLAST database with blastdbcmd (see tutorial ) $ taxonkit list --ids 9605,239934 --indent \"\" 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 ... Performance: Time and memory usage for whole taxon tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --ids 1 --indent \"\" --verbose > t0.txt 21:05:01.782 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp 21:05:01.816 [INFO] 61023 merged nodes parsed 21:05:01.889 [INFO] 437929 delnodes parsed 21:05:03.178 [INFO] 2303979 names parsed elapsed time: 3.290s peak rss: 742.77 MB Adding names $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. 
CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 1574264 [species] Akkermansia sp. KLE1797 ... Performance: Time and memory usage for whole taxonomy tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --show-rank --show-name --ids 1 > t1.txt elapsed time: 5.341s peak rss: 1.04 GB Output in JSON format, you can easily collapse and uncollapse taxonomy tree in modern text editor. $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 --json { \"9605 [genus] Homo\": { \"9606 [species] Homo sapiens\": { \"63221 [subspecies] Homo sapiens neanderthalensis\": { }, \"741158 [subspecies] Homo sapiens subsp. 'Denisova'\": { } }, \"1425170 [species] Homo heidelbergensis\": { } }, \"239934 [genus] Akkermansia\": { \"239935 [species] Akkermansia muciniphila\": { \"349741 [no rank] Akkermansia muciniphila ATCC BAA-835\": { } }, \"512293 [no rank] environmental samples\": { \"512294 [species] uncultured Akkermansia sp.\": { }, \"1131822 [species] uncultured Akkermansia sp. SMG25\": { }, \"1262691 [species] Akkermansia sp. CAG:344\": { }, \"1263034 [species] Akkermansia muciniphila CAG:154\": { } }, \"1679444 [species] Akkermansia glycaniphila\": { }, \"2608915 [no rank] unclassified Akkermansia\": { \"1131336 [species] Akkermansia sp. KLE1605\": { }, \"1574264 [species] Akkermansia sp. KLE1797\": { }, \"1574265 [species] Akkermansia sp. KLE1798\": { }, \"1638783 [species] Akkermansia sp. UNK.MGS-1\": { }, \"1755639 [species] Akkermansia sp. MC_55\": { } } } } Snapshot of taxonomy (taxid 1) in kate: lineage Usage Query taxonomic lineage of given TaxIds Input: - List of TaxIds, one TaxId per line. - Or tab-delimited format, please specify TaxId field with flag -i/--taxid-field (default 1). - Supporting (gzipped) file or STDIN. Output: 1. 
Input line data. 2. (Optional) Status code (-c/--show-status-code), values: - \"-1\" for queries not found in whole database. - \"0\" for deleted TaxIds, provided by \"delnodes.dmp\". - New TaxIds for merged TaxIds, provided by \"merged.dmp\". - Taxids for these found in \"nodes.dmp\". 3. Lineage, delimiter can be changed with flag -d/--delimiter. 4. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) 5. (Optional) Name (-n/--show-name) 6. (Optional) Rank (-r/--show-rank) Filter out invalid and deleted taxids, and replace merged taxids with new ones: # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. $ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Usage: taxonkit lineage [flags] Flags: -d, --delimiter string field delimiter in lineage (default \";\") -h, --help help for lineage -L, --no-lineage do not show lineage, when user just want names or/and ranks -R, --show-lineage-ranks appending ranks of all levels -t, --show-lineage-taxids appending lineage consisting of taxids -n, --show-name appending scientific name -r, --show-rank appending rank of taxids -c, --show-status-code show status code before lineage -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) Examples Full lineage: # note that 123124124 is a fake taxid, 3 was deleted, 92489,1458427 were merged $ cat taxids.txt 9606 9913 376619 349741 239935 314101 11932 1327037 123124124 3 92489 1458427 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # wrapped table with csvtk pretty (>v0.26.0) $ taxonkit lineage taxids.txt | csvtk pretty -Ht -x ';' -W 70 -S bold \u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513 \u2503 9606 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 
Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates; \u2503 \u2503 \u2503 Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae; \u2503 \u2503 \u2503 Homo;Homo sapiens \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 9913 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla; \u2503 \u2503 \u2503 Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 376619 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Thiotrichales;Francisellaceae;Francisella;Francisella tularensis; \u2503 \u2503 \u2503 Francisella tularensis subsp. 
holarctica; \u2503 \u2503 \u2503 Francisella tularensis subsp. holarctica LVS \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 349741 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 239935 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila \u2503 
\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 314101 \u2503 cellular organisms;Bacteria;environmental samples; \u2503 \u2503 \u2503 uncultured murine large bowel bacterium BAC 54B \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 11932 \u2503 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes; \u2503 \u2503 \u2503 Ortervirales;Retroviridae;unclassified Retroviridae; \u2503 \u2503 \u2503 Intracisternal A-particles;Mouse Intracisternal A-particle \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1327037 \u2503 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes; 
\u2503 \u2503 \u2503 Caudovirales;Siphoviridae;unclassified Siphoviridae; \u2503 \u2503 \u2503 Croceibacter phage P2559Y \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 92489 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1458427 \u2503 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria; \u2503 \u2503 \u2503 Burkholderiales;Comamonadaceae;Serpentinomonas; \u2503 \u2503 \u2503 Serpentinomonas raichei \u2503 
\u2517\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u253b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u251b Speed. $ time echo 9606 | taxonkit lineage 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens real 0m1.190s user 0m2.365s sys 0m0.170s # all TaxIds $ time taxonkit list --ids 1 --indent \"\" | taxonkit lineage > t real 0m4.249s user 0m16.418s sys 0m1.221s Checking deleted or merged taxids $ taxonkit lineage --show-status-code taxids.txt | tee lineage.withcode.txt # valid $ cat lineage.withcode.txt | awk '$2 > 0' | cut -f 1,2 9606 9606 9913 9913 376619 376619 349741 349741 239935 239935 314101 314101 11932 11932 1327037 1327037 92489 796334 1458427 1458425 # merged $ cat lineage.withcode.txt | awk '$2 > 0 && $2 != $1' | cut -f 1,2 92489 796334 1458427 1458425 # deleted $ cat lineage.withcode.txt | awk '$2 == 0' | cut -f 1 3 # invalid $ cat lineage.withcode.txt | awk '$2 < 0' | cut -f 1 123124124 Filter out invalid and deleted taxids, and replace merged taxids with new ones , you may install csvtk . # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. 
$ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Only show name and rank. $ taxonkit lineage -r -n -L taxids.txt \\ | csvtk pretty -H -t 9606 Homo sapiens species 9913 Bos taurus species 376619 Francisella tularensis subsp. holarctica LVS strain 349741 Akkermansia muciniphila ATCC BAA-835 strain 239935 Akkermansia muciniphila species 314101 uncultured murine large bowel bacterium BAC 54B species 11932 Mouse Intracisternal A-particle species 1327037 Croceibacter phage P2559Y species 123124124 3 92489 Erwinia oleae species 1458427 Serpentinomonas raichei species Show lineage consisting of taxids: $ taxonkit lineage -t taxids.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314146;9443;376913;314293;9526;314295;9604;207598;9605;9606 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314145;91561;9845;35500;9895;27592;9903;9913 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 131567;2;1224;1236;72273;34064;262;263;119857;376619 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 131567;2;48479;314101 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;2559587;2732397;2732409;2732514;2169561;11632;35276;11749;11932 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 10239;2731341;2731360;2731618;2731619;28883;10699;196894;1327037 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 131567;2;1224;1236;91347;1903409;551;796334 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei 131567;2;1224;28216;80840;80864;2490452;1458425 or read taxids from STDIN: $ cat taxids.txt | taxonkit lineage And ranks of all nodes: $ echo 2697049 \\ | taxonkit lineage -t -R \\ | csvtk transpose -Ht 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 
superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank Another way to show lineage detail of a TaxId $ echo 2697049 \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2 reformat Usage Reformat lineage in canonical ranks Input: - List of TaxIds or lineages, one record per line. The lineage can be a complete lineage or only one taxonomy name. - Or tab-delimited format. Plese specify the lineage field with flag -i/--lineage-field (default 2). Or specify the TaxId field with flag -I/--taxid-field (default 0), which overrides -i/--lineage-field. - Supporting (gzipped) file or STDIN. Output: 1. Input line data. 2. Reformated lineage. 3. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) Ambiguous names: - Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result Output format can be formated by flag --format, available placeholders: {k}: superkingdom {K}: kingdom {p}: phylum {c}: class {o}: order {f}: family {g}: genus {s}: species {t}: subspecies/strain {S}: subspecies {T}: strain When these're no nodes of rank \"subspecies\" nor \"strain\", you can switch on -S/--pseudo-strain to use the node with lowest rank as subspecies/strain name, if which rank is lower than \"species\". This flag affects {t}, {S}, {T}. 
Output format can contains some escape charactors like \"\\t\". Usage: taxonkit reformat [flags] Flags: -P, --add-prefix add prefixes for all ranks, single prefix for a rank is defined by flag --prefix-X -d, --delimiter string field delimiter in input lineage (default \";\") -F, --fill-miss-rank fill missing rank with lineage information of the next higher rank -f, --format string output format, placeholders of rank are needed (default \"{k};{p};{c};{o};{f};{g};{s}\") -h, --help help for reformat -i, --lineage-field int field index of lineage. data should be tab-separated (default 2) -r, --miss-rank-repl string replacement string for missing rank -p, --miss-rank-repl-prefix string prefix for estimated taxon level (default \"unclassified \") -s, --miss-rank-repl-suffix string suffix for estimated taxon names. \"rank\" for rank name, \"\" for no suffix (default \"rank\") -R, --miss-taxid-repl string replacement string for missing taxid -a, --output-ambiguous-result output one of the ambigous result --prefix-K string prefix for kingdom, used along with flag -P/--add-prefix (default \"K__\") --prefix-S string prefix for subspecies, used along with flag -P/--add-prefix (default \"S__\") --prefix-T string prefix for strain, used along with flag -P/--add-prefix (default \"T__\") --prefix-c string prefix for class, used along with flag -P/--add-prefix (default \"c__\") --prefix-f string prefix for family, used along with flag -P/--add-prefix (default \"f__\") --prefix-g string prefix for genus, used along with flag -P/--add-prefix (default \"g__\") --prefix-k string prefix for superkingdom, used along with flag -P/--add-prefix (default \"k__\") --prefix-o string prefix for order, used along with flag -P/--add-prefix (default \"o__\") --prefix-p string prefix for phylum, used along with flag -P/--add-prefix (default \"p__\") --prefix-s string prefix for species, used along with flag -P/--add-prefix (default \"s__\") --prefix-t string prefix for subspecies/strain, used along 
with flag -P/--add-prefix (default \"t__\") -S, --pseudo-strain use the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subpecies\" nor \"strain\". It affects {t}, {S}, {T}. This flag needs flag -F -t, --show-lineage-taxids show corresponding taxids of reformated lineage -I, --taxid-field int field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field -T, --trim do not fill or add prefix for missing rank lower than current rank Examples: For version > 0.8.0, reformat accepts input of TaxIds via flag -I/--taxid-field . $ echo 239935 | taxonkit reformat -I 1 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo 349741 | taxonkit reformat -I 1 -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}|{t}\" -F -t 349741 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila|Akkermansia muciniphila ATCC BAA-835 2|74201|203494|48461|1647988|239934|239935|349741 Example lineage (produced by: taxonkit lineage taxids.txt | awk '$2!=\"\"' > lineage.txt ). 
$ cat lineage.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Default output format ( 
\"{k};{p};{c};{o};{f};{g};{s}\" ). # reformated lineages are appended to the input data $ taxonkit reformat lineage.txt ... 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila ... $ $ taxonkit reformat lineage.txt | tee lineage.txt.reformat $ cut -f 1,3 lineage.txt.reformat 9606 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens 9913 Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 92489 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # aligned $ cat lineage.txt \\ | taxonkit reformat \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- --------------- ------------------- ------------------ --------------- -------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo 
sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Croceibacter phage P2559Y 92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei And subspecies/strain ( {t} ), subspecies ( {S} ), and strain ( {T} ) are also available. 
# default operation $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- --------------------- --------------------- --------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 # fill missing ranks # see example below for -F/--fill-miss-rank # $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --fill-miss-rank \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------------ ----------------------------------------------------------------------------- ------------------------------------------------------------------------- 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain unclassified Akkermansia muciniphila subspecies unclassified Akkermansia muciniphila strain 83333 strain Escherichia coli K-12 Escherichia coli K-12 unclassified Escherichia coli subspecies Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 unclassified Escherichia coli R178 strain 
2697049 no rank Severe acute respiratory syndrome coronavirus 2 unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain unclassified Severe acute respiratory syndrome-related coronavirus subspecies unclassified Severe acute respiratory syndrome-related coronavirus strain 2605619 no rank Escherichia coli O16:H48 unclassified Escherichia coli subspecies/strain unclassified Escherichia coli subspecies unclassified Escherichia coli strain When there are no nodes of rank \"subspecies\" or \"strain\", you can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, provided its rank is lower than \"species\" . Using v0.14.1 or a later version is recommended. $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --pseudo-strain \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Add prefix ( -P/--add-prefix ). 
$ cat lineage.txt \\ | taxonkit reformat -P \\ | csvtk -H -t cut -f 1,3 9606 k__Eukaryota;p__Chordata;c__Mammalia;o__Primates;f__Hominidae;g__Homo;s__Homo sapiens 9913 k__Eukaryota;p__Chordata;c__Mammalia;o__Artiodactyla;f__Bovidae;g__Bos;s__Bos taurus 376619 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Francisellaceae;g__Francisella;s__Francisella tularensis 349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 239935 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 314101 k__Bacteria;p__;c__;o__;f__;g__;s__uncultured murine large bowel bacterium BAC 54B 11932 k__Viruses;p__Artverviricota;c__Revtraviricetes;o__Ortervirales;f__Retroviridae;g__Intracisternal A-particles;s__Mouse Intracisternal A-particle 1327037 k__Viruses;p__Uroviricota;c__Caudoviricetes;o__Caudovirales;f__Siphoviridae;g__;s__Croceibacter phage P2559Y 92489 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Erwinia;s__Erwinia oleae 1458427 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__Serpentinomonas;s__Serpentinomonas raichei Show corresponding taxids of reformated lineage (flag -t/--show-lineage-taxids ) $ cat lineage.txt \\ | taxonkit reformat -t \\ | csvtk -H -t cut -f 1,4 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- ------ ------- ------- ------- ------- ------- ------- 9606 2759 7711 40674 9443 9604 9605 9606 9913 2759 7711 40674 91561 9895 9903 9913 376619 2 1224 1236 72273 34064 262 263 349741 2 74201 203494 48461 1647988 239934 239935 239935 2 74201 203494 48461 1647988 239934 239935 314101 2 314101 11932 10239 2732409 2732514 2169561 11632 11749 11932 
1327037 10239 2731618 2731619 28883 10699 1327037 92489 2 1224 1236 91347 1903409 551 796334 1458427 2 1224 28216 80840 80864 2490452 1458425 Use custom symbols for unclassfied ranks ( -r/--miss-rank-repl ) $ taxonkit reformat lineage.txt -r \"__\" | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;__;__;__;__;__;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;__;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei $ taxonkit reformat lineage.txt -r Unassigned | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Unassigned;Unassigned;Unassigned;Unassigned;Unassigned;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 
Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;Unassigned;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Estimate and fill missing rank with original lineage information ( -F, --fill-miss-rank , very useful for formatting input data for LEfSe ). You can change the prefix \"unclassified\" using flag -p/--miss-rank-repl-prefix . $ cat lineage.txt \\ | taxonkit reformat -F \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- ---------------------------- --------------------------- --------------------------- ---------------------------- ------------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria unclassified Bacteria phylum unclassified Bacteria class unclassified Bacteria order unclassified Bacteria family unclassified Bacteria genus uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae unclassified Siphoviridae genus Croceibacter phage P2559Y 92489 Bacteria Proteobacteria 
Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei Do not add prefix or suffix for estimated nodes: $ echo 314101 | taxonkit reformat -I 1 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B $ echo 314101 | taxonkit reformat -I 1 -F -p \"\" -s \"\" 314101 Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;uncultured murine large bowel bacterium BAC 54B Only some ranks. $ cat lineage.txt \\ | taxonkit reformat -F -f \"{s};{p}\"\\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,species,phylum \\ | csvtk pretty -t taxid species phylum ------- ----------------------------------------------- ---------------------------- 9606 Homo sapiens Chordata 9913 Bos taurus Chordata 376619 Francisella tularensis Proteobacteria 349741 Akkermansia muciniphila Verrucomicrobia 239935 Akkermansia muciniphila Verrucomicrobia 314101 uncultured murine large bowel bacterium BAC 54B unclassified Bacteria phylum 11932 Mouse Intracisternal A-particle Artverviricota 1327037 Croceibacter phage P2559Y Uroviricota 92489 Erwinia oleae Proteobacteria 1458427 Serpentinomonas raichei Proteobacteria For TaxIds whose rank is higher than the lowest rank in -f/--format , use -T/--trim to avoid filling missing ranks lower than the current rank . 
$ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;unclassified Bacteria species 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;unclassified Akkermansia species 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Support tab in format string $ echo 9606 \\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{S}\" \\ | csvtk cut -t -f -2 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens List seven-level lineage for all TaxIds. 
# replace empty taxon with \"Unassigned\" $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned | gzip -c > all.lineage.tsv.gz # tab-delimited seven-levels $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # 8-level $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # Fill and trim $ memusg -t -s ' taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | gzip -c > all.lineage.tsv.gz ' elapsed time: 19.930s peak rss: 6.25 GB From taxid to 7-ranks lineage: $ cat taxids.txt | taxonkit lineage | taxonkit reformat # for taxonkit v0.8.0 or later versions $ cat taxids.txt | taxonkit reformat -I 1 Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result. see #42 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. 
But you can use -a/--output-ambiguous-result to return one possible result 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t -a 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530 name2taxid Usage Convert scientific names to TaxIds Attention: 1. Some TaxIds share the same scientific names, e.g, Drosophila. These input lines are duplicated with multiple TaxIds. $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus Usage: taxonkit name2taxid [flags] Flags: -h, --help help for name2taxid -i, --name-field int field index of name. 
data should be tab-separated (default 1) -s, --sci-name only searching scientific names -r, --show-rank show rank Examples Example data $ cat names.txt Homo sapiens Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y Default. # taxonkit name2taxid names.txt $ cat names.txt | taxonkit name2taxid | csvtk pretty -H -t Homo sapiens 9606 Akkermansia muciniphila ATCC BAA-835 349741 Akkermansia muciniphila 239935 Mouse Intracisternal A-particle 11932 Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 Croceibacter phage P2559Y 1327037 Show rank. $ cat names.txt | taxonkit name2taxid --show-rank | csvtk pretty -H -t Homo sapiens 9606 species Akkermansia muciniphila ATCC BAA-835 349741 strain Akkermansia muciniphila 239935 species Mouse Intracisternal A-particle 11932 species Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 species Croceibacter phage P2559Y 1327037 species From name to lineage. 
$ cat names.txt | taxonkit name2taxid | taxonkit lineage --taxid-field 2 Homo sapiens 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Akkermansia muciniphila ATCC BAA-835 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Mouse Intracisternal A-particle 11932 Viruses;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Some TaxIds share the same scientific names , e.g, Drosophila. $ echo Drosophila \\ | taxonkit name2taxid \\ | taxonkit lineage -i 2 -r \\ | taxonkit reformat -i 3 \\ | csvtk cut -H -t -f 1,2,4,5 \\ | csvtk pretty -H -t Drosophila 7215 genus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 32281 subgenus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 2081351 genus Eukaryota;Basidiomycota;Agaricomycetes;Agaricales;Psathyrellaceae;Drosophila; filter Usage Filter TaxIds by taxonomic rank range Attentions: 1. Flag -L/--lower-than and -H/--higher-than are exclusive, and can be used along with -E/--equal-to which values can be different. 2. 
A list of pre-ordered ranks is in ~/.taxonkit/ranks.txt, you can use your list by -r/--rank-file, the format specification is below. 3. All ranks in taxonomy database should be defined in rank file. 4. Ranks can be removed with black list via -B/--black-list. 5. TaxIDs with no rank are kept by default!!! They can be optionally discarded by -N/--discard-noranks. 6. [Recommended] When filtering with -L/--lower-than, you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff. Rank file: 1. Blank lines or lines starting with \"#\" are ignored. 2. Ranks are in decending order and case ignored. 3. Ranks with same order should be in one line separated with comma (\",\", no space). 4. Ranks without order should be assigned a prefix symbol \"!\" for each rank. Usage: taxonkit filter [flags] Flags: -B, --black-list strings black list of ranks to discard, e.g., '-B \"no rank\" -B \"clade\" -N, --discard-noranks discard all ranks without order, type \"taxonkit filter --help\" for details -R, --discard-root discard root taxid, defined by --root-taxid -E, --equal-to strings output TaxIds with rank equal to some ranks, multiple values can be separated with comma \",\" (e.g., -E \"genus,species\"), or give multiple times (e.g., -E genus -E species) -h, --help help for filter -H, --higher-than string output TaxIds with rank higher than a rank, exclusive with --lower-than --list-order list user defined ranks in order, from \"$HOME/.taxonkit/ranks.txt\" --list-ranks list ordered ranks in taxonomy database, sorted in user defined order -L, --lower-than string output TaxIds with rank lower than a rank, exclusive with --higher-than -r, --rank-file string user-defined ordered taxonomic ranks, type \"taxonkit filter --help\" for details --root-taxid uint32 root taxid (default 1) -n, --save-predictable-norank do not discard some special ranks without order when using -L, where rank of the closest 
higher node is still lower than rank cutoff -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1) Examples Example data $ echo 349741 | taxonkit lineage -t | cut -f 3 | sed 's/;/\\n/g' > taxids2.txt $ cat taxids2.txt 131567 2 1783257 74201 203494 48461 1647988 239934 239935 349741 $ cat taxids2.txt | taxonkit lineage -r | csvtk -Ht cut -f 1,3,2 | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom cellular organisms;Bacteria 1783257 clade cellular organisms;Bacteria;PVC group 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae 48461 order cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales 1647988 family cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae 239934 genus cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 349741 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Equal to certain rank(s) ( -E/--equal-to ) $ cat taxids2.txt \\ | taxonkit filter -E Phylum -E Class \\ | taxonkit lineage -r \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae Lower than a rank ( -L/--lower-than ) $ cat taxids2.txt \\ | taxonkit filter -L genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Higher than a rank ( 
-H/--higher-than ) $ cat taxids2.txt \\ | taxonkit filter -H phylum \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 2 superkingdom Bacteria TaxIDs with no rank are kept by default!!! \"no rank\" and \"clade\" have no rank and can be filtered out via -N/--discard-noranks . Further ranks can be removed with a black list via -B/--black-list . # 562 is the TaxId of Escherichia coli $ taxonkit list --ids 562 \\ | taxonkit filter -L species \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t strain 2950 no rank 149 serotype 141 serogroup 95 isolate 1 subspecies 1 $ taxonkit list --ids 562 \\ | taxonkit filter -L species -N -B strain \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t serotype 141 serogroup 95 isolate 1 subspecies 1 Combining -L/-H with -E . $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Special cases of \"no rank\" ( -n/--save-predictable-norank ). When filtering with -L/--lower-than , you can use -n/--save-predictable-norank to keep some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff. 
$ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom Bacteria 1224 phylum Proteobacteria 1236 class Gammaproteobacteria 91347 order Enterobacterales 543 family Enterobacteriaceae 561 genus Escherichia 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 10239 superkingdom Viruses 2731341 clade Duplodnaviria 2731360 clade Heunggongvirae 2731618 phylum Uroviricota 2731619 class Caudoviricetes 28883 order Caudovirales 10699 family Siphoviridae 196894 no rank unclassified Siphoviridae 1327037 species Croceibacter phage P2559Y # save taxids $ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | tee taxids4.txt 131567 2 1224 1236 91347 543 561 562 2605619 10239 2731341 2731360 2731618 2731619 28883 10699 196894 1327037 Now, filter nodes of rank <= species. $ cat taxids4.txt \\ | taxonkit filter -L species -E species -N -n \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 1327037 species Croceibacter phage P2559Y Note that 2605619 (no rank) is saved because its parent node 562 is <= species. lca Usage Compute lowest common ancestor (LCA) for TaxIds Attention: 1. This command computes LCA TaxId for a list of TaxIds in a field (\"-i/--taxids-field) of tab-delimited file or STDIN. 2. TaxIDs should have the same separator (\"-s/--separator\"), single charactor separator is prefered. 3. Empty lines or lines without valid TaxIds in the field are omitted. 4. If some TaxIds are not found in database, it returns 0. 
Examples: $ echo 239934, 239935, 349741 | taxonkit lca -s \", \" 239934, 239935, 349741 239934 $ time echo 239934 239935 349741 9606 | taxonkit lca 239934 239935 349741 9606 131567 Usage: taxonkit lca [flags] Flags: -b, --buffer-size string size of line buffer, supported unit: K, M, G. You need to increase the value when \"bufio.Scanner: token too long\" error occured (default \"1M\") -h, --help help for lca --separater string separater for TaxIds. This flag is same to --separator. (default \" \") -s, --separator string separator for TaxIds (default \" \") -D, --skip-deleted skip deleted TaxIds and compute with left ones -U, --skip-unfound skip unfound TaxIds and compute with left ones -i, --taxids-field int field index of TaxIds. Input data should be tab-separated (default 1) Examples: Example data $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample Simple one $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 Custom field ( -i/--taxids-field ) and separator ( -s/--separator ). $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606 Merged TaxIds. # merged $ echo 92487 92488 92489 | taxonkit lca 10:08:26.578 [WARN] taxid 92489 was merged into 796334 92487 92488 92489 1236 Deleted TaxIds: you can omit these and continue computing with the remaining ones ( -D/--skip-deleted ). 
$ echo 1 2 3 | taxonkit lca 10:30:17.678 [WARN] taxid 3 not found 1 2 3 0 $ time echo 1 2 3 | taxonkit lca -D 10:29:31.828 [WARN] taxid 3 was deleted 1 2 3 1 TaxIDs not found in the database: you can omit these and continue computing with the remaining ones ( -U/--skip-unfound ). $ echo 61021 61022 11111111 | taxonkit lca 10:31:44.929 [WARN] taxid 11111111 not found 61021 61022 11111111 0 $ echo 61021 61022 11111111 | taxonkit lca -U 10:32:02.772 [WARN] taxid 11111111 not found 61021 61022 11111111 2628496 taxid-changelog Usage Create TaxId changelog from dump archives Steps: # dependencies: # rush - https://github.com/shenwei356/rush/ mkdir -p archive; cd archive; # --------- download --------- # option 1 # for fast network connection wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp*.zip # option 2 # for slow network connection url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/ wget $url -O - -o /dev/null \\ | grep taxdmp | perl -ne '/(taxdmp_.+?.zip)/; print \"$1\\n\";' \\ | rush -j 2 -v url=$url 'axel -n 5 {url}/{}' \\ --immediate-output -c -C download.rush # --------- unzip --------- ls taxdmp*.zip | rush -j 1 'unzip {} names.dmp nodes.dmp merged.dmp delnodes.dmp -d {@_(.+)\\.}' # optionally compress .dmp files with pigz, for saving disk space fd .dmp$ | rush -j 4 'pigz {}' # --------- create log --------- cd .. 
taxonkit taxid-changelog -i archive -o taxid-changelog.csv.gz --verbose Output format (CSV): # fields comments taxid # taxid version # version / time of archive, e.g, 2019-07-01 change # change, values: # NEW newly added # REUSE_DEL deleted taxids being reused # REUSE_MER merged taxids being reused # DELETE deleted # MERGE merged into another taxid # ABSORB other taxids merged into this one # CHANGE_NAME scientific name changed # CHANGE_RANK rank changed # CHANGE_LIN_LIN lineage taxids remain but lineage remain # CHANGE_LIN_TAX lineage taxids changed # CHANGE_LIN_LEN lineage length changed change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage # you can use csvtk to investigate them. e.g., csvtk grep -f taxid -p 1390515 taxid-changelog.csv.gz Usage: taxonkit taxid-changelog [flags] Flags: -i, --archive string directory containing uncompressed dumped archives -h, --help help for taxid-changelog Details Example 1 ( E.coli with taxid 562 ) $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 562 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 562 2014-08-01 NEW Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2014-08-01 ABSORB 662101;662104 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2015-11-01 ABSORB 1637691 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2016-10-01 CHANGE_LIN_LIN Escherichia coli species 
cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2018-06-01 ABSORB 469598 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 # merged taxids $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 662101,662104,1637691,469598 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 469598 2014-08-01 NEW Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2016-10-01 CHANGE_LIN_LIN Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2018-06-01 MERGE 562 Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 662101 2014-08-01 MERGE 562 662104 2014-08-01 MERGE 562 1637691 2015-04-01 DELETE 1637691 2015-05-01 REUSE_DEL Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 1637691 2015-11-01 MERGE 562 Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 Example 2 (SARS-CoV-2). 
$ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 2697049 2020-02-01 NEW Wuhan seafood market pneumonia virus species Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;unclassified Betacoronavirus;Wuhan seafood market pneumonia virus 10239;2559587;76804;2499399;11118;2501931;694002;696098;2697049 2697049 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 real 0m7.644s user 0m16.749s sys 0m3.985s Example 3 (All subspecies and strain in Akkermansia muciniphila 239935) # species in Akkermansia $ taxonkit list --show-rank --show-name --indent \" \" --ids 239935 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 # check them all $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -P <(taxonkit list --indent \"\" --ids 239935) \\ | csvtk pretty lineage-taxids taxid version change change-value name rank lineage lineage-taxids 239935 2014-08-01 NEW Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;203557;239934;239935 239935 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 
131567;2;51290;74201;203494;48461;1647988;239934;239935 239935 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 239935 2016-05-01 ABSORB 1834199 Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 349741 2014-08-01 NEW Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;203557;239934;239935;349741 349741 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;1647988;239934;239935;349741 349741 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 349741 2020-07-01 CHANGE_RANK Akkermansia muciniphila ATCC BAA-835 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 More create-taxdump Usage Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and 
ICTV Input format: 0. For GTDB taxonomy file, just use --gtdb. We use the numeric assembly accession (without the prefix GCA_ or GCF_, and without the version number) as the taxon at subspecies rank. 1. The input file should be tab-delimited, at least one column is needed. 2. Ranks can be given either via the first row or the flag --rank-names. 3. The column containing the genome/assembly accession is recommended to generate TaxId mapping file (taxid.map, id -> taxid). -A/--field-accession, field containing genome/assembly accession --field-accession-re, regular expression to extract the accession Note that multiple TaxIds pointing to the same accession are listed as comma-separated integers. Attentions: 1. Names should be distinct in taxa of different ranks. But for those missing some taxon nodes, using names of parent nodes is allowed: GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155 It can also detect duplicate names with different ranks, e.g., the Class and Genus have the same name B47-G6, and the Order and Family between them have different names. In this case, we reassign a new TaxId by increasing the TaxId until it is distinct. GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585 2. Taxa from different parents may have the same name. We will assign different TaxIds to them. E.g., in ICTV, many viruses from different species have the same names. In practice, we set the \"Virus name(s)\" as a subspecies rank and also specify it as the accession. Species Virus name(s) Jerseyvirus SETP3 Salmonella phage SETP7 Jerseyvirus SETP7 Salmonella phage SETP7 3. 
The generated TaxIds are not consecutive numbers; however, some tools like MMseqs2 require this. You can use the script below for conversion: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py Usage: taxonkit create-taxdump [flags] Flags: -A, --field-accession int field index of assembly accession (genome ID), for outputting taxid.map --field-accession-re string regular expression to extract assembly accession (default \"^\\\\w\\\\w_(.+)$\") --force overwrite existed output directory --gtdb input files are GTDB taxonomy file --gtdb-re-subs string regular expression to extract assembly accession as the subspecies (default \"^\\\\w\\\\w_GC[AF]_(.+)\\\\.\\\\d+$\") -h, --help help for create-taxdump --line-chunk-size int number of lines to process for each thread, and 4 threads is fast enough. (default 5000) --null strings null value of taxa (default [,NULL,NA]) -x, --old-taxdump-dir string taxdump directory of the previous version, for generating merged.dmp and delnodes.dmp -O, --out-dir string output directory -R, --rank-names strings names of all ranks, leave it empty to use the first row of input as rank names Examples: GTDB. See more: https://github.com/shenwei356/gtdb-taxdump $ taxonkit create-taxdump --gtdb ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz --out-dir taxdump 16:42:35.213 [INFO] 317542 records saved to taxdump/taxid.map 16:42:35.460 [INFO] 401815 records saved to taxdump/nodes.dmp 16:42:35.611 [INFO] 401815 records saved to taxdump/names.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/merged.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/delnodes.dmp ICTV, See more: https://github.com/shenwei356/ictv-taxdump MGV . Only Order, Family, Genus information are available. 
$ cat mgv_contig_info.tsv \\ | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \\ | sed 1d \\ > mgv.tsv $ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species 23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map 23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp 23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp 23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp 23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp $ head -n 5 mgv/taxid.map MGV-GENOME-0364295 677052301 MGV-GENOME-0364296 677052301 MGV-GENOME-0364303 1414406025 MGV-GENOME-0364311 1849074420 MGV-GENOME-0364312 2074846424 $ echo 677052301 | taxonkit lineage --data-dir mgv/ 677052301 Caudovirales;crAss-phage;OTU-61123 $ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P 677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123 $ grep MGV-GENOME-0364295 mgv.tsv Caudovirales crAss-phage NULL OTU-61123 MGV-GENOME-0364295 Custom lineages with the first row as rank names and treating one column as accession. 
$ csvtk pretty -t example/taxonomy.tsv id superkingdom phylum class order family genus species --------------- ------------ -------------- ------------------- ---------------- ------------------ -------------- -------------------------- GCF_001027105.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus GCF_001096185.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus pneumoniae GCF_001544255.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecium GCF_002949675.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella dysenteriae GCF_002950215.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella flexneri GCF_006742205.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis GCF_000006945.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella Salmonella enterica GCF_000017205.1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa GCF_003697165.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli GCF_009759685.1 Bacteria Proteobacteria Gammaproteobacteria Moraxellales Moraxellaceae Acinetobacter Acinetobacter baumannii GCF_000148585.2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus mitis GCF_000392875.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis GCF_000742135.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Klebsiella Klebsiella pneumonia # the first column as accession $ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump 16:31:31.828 [INFO] I will use the first row of input as rank names 16:31:31.843 [INFO] 13 records saved 
to example/taxdump/taxid.map 16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp 16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/merged.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp $ export TAXONKIT_DB=example/taxdump $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species 3809813362 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species 4145431389 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species 1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species 1920251658 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species 3843752343 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species 72054943 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species 1678121664 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species 524994882 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species 2695851945 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species 3958205156 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species 4093283224 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species 
$ head -n 3 example/taxdump/taxid.map GCF_001027105.1 1569132721 GCF_001096185.1 2983929374 GCF_001544255.1 4145431389 Custom lineages with the first row as rank names (pure lineage data) $ csvtk cut -t -f 2- example/taxonomy.tsv | head -n 2 | csvtk pretty -t superkingdom phylum class order family genus species ------------ ---------- ------- ---------- ----------------- -------------- --------------------- Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus $ csvtk cut -t -f 2- example/taxonomy.tsv \\ | taxonkit create-taxdump -O example/taxdump2 16:53:08.604 [INFO] I will use the first row of input as rank names 16:53:08.614 [INFO] 39 records saved to example/taxdump2/nodes.dmp 16:53:08.614 [INFO] 39 records saved to example/taxdump2/names.dmp 16:53:08.614 [INFO] 0 records saved to example/taxdump2/merged.dmp 16:53:08.615 [INFO] 0 records saved to example/taxdump2/delnodes.dmp $ export TAXONKIT_DB=example/taxdump2 $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species genautocomplete Usage Generate shell autocompletion script Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Usage: taxonkit genautocomplete [flags] Flags: --file string autocompletion file (default \"/home/shenwei/.bash_completion.d/taxonkit.sh\") -h, --help help for genautocomplete --type string autocompletion type (currently only bash supported) (default \"bash\") profile2cami Usage Convert metagenomic profile table to CAMI format Input format: 1. The input file should be tab-delimited 2. At least two columns needed: a) TaxId of taxon at species or lower rank. b) Abundance (could be percentage, automatically detected or use -p/--percentage). Attentions: 1. Some TaxIds may be merged to another ones in current taxonomy version, the abundances will be summed up. 2. Some TaxIds may be deleted in current taxonomy version, the abundances can be optionally recomputed with the flag -R/--recompute-abd. Usage: taxonkit profile2cami [flags] Flags: -a, --abundance-field int field index of abundance. input data should be tab-separated (default 2) -h, --help help for profile2cami -0, --keep-zero keep taxons with abundance of zero -p, --percentage abundance is in percentage -R, --recompute-abd recompute abundance if some TaxIds are deleted in current taxonomy version -s, --sample-id string sample ID in result file -r, --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) -t, --taxonomy-id string taxonomy ID in result file Examples Test data, note that 2824115 is merged to 483329 and 1657696 is deleted in current taxonomy version. $ cat example/abundance.tsv 2824115 0.2 merged to 483329 483329 0.2 absord 2824115 239935 0.5 no change 1657696 0.1 deleted Example: $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv 13:17:40.552 [WARN] taxid is deleted in current taxonomy version: 1657696 13:17:40.552 [WARN] you may recomputed abundance with the flag -R/--recompute-abd @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 50.000000000000000 2759 superkingdom 2759 Eukaryota 40.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 50.000000000000000 6656 phylum 2759|6656 Eukaryota|Arthropoda 40.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 50.000000000000000 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 40.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 50.000000000000000 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 40.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 50.000000000000000 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 40.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 50.000000000000000 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 40.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 
Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 50.000000000000000 483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 40.000000000000000 Recompute (normalize) the abundance $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd 13:19:23.647 [WARN] taxid is deleted in current taxonomy version: 1657696 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 55.555555555555557 2759 superkingdom 2759 Eukaryota 44.444444444444450 74201 phylum 2|74201 Bacteria|Verrucomicrobia 55.555555555555557 6656 phylum 2759|6656 Eukaryota|Arthropoda 44.444444444444450 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 55.555555555555557 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 44.444444444444450 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 55.555555555555557 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 44.444444444444450 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 55.555555555555557 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 44.444444444444450 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 55.555555555555557 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 44.444444444444450 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 55.555555555555557 483329 species 
2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 44.444444444444450 See https://github.com/shenwei356/sun2021-cami-profiles cami-filter Usage Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile Input format: The CAMI (Taxonomic) Profiling Output Format - https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd - One file with multiple samples is also supported. How to: - No extra taxonomy data is needed, so the original taxonomic information is used and not changed. - A mini taxonomic tree is built from records with abundance greater than zero, and only leaves are retained for later use. The rank of leaves may be \"strain\", \"species\", or \"no rank\". - Relative abundances (in percentage) are recomputed for all leaves (reference genomes). - A new taxonomic tree is built from these leaves, and abundances are cumulatively added up from leaves to the root. Examples: 1. Remove Archaea, Bacteria, and Eukaryota, only keep Viruses: taxonkit cami-filter -t 2,2157,2759 test.profile -o test.filter.profile 2. 
Remove Viruses: taxonkit cami-filter -t 10239 test.profile -o test.filter.profile Usage: taxonkit cami-filter [flags] Flags: --field-percentage int field index of PERCENTAGE (default 5) --field-rank int field index of taxid (default 2) --field-taxid int field index of taxid (default 1) --field-taxpath int field index of TAXPATH (default 3) --field-taxpathsn int field index of TAXPATHSN (default 4) -h, --help help for cami-filter --leaf-ranks strings only consider leaves at these ranks (default [species,strain,no rank]) --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) --taxid-sep string separator of taxid in TAXPATH and TAXPATHSN (default \"|\") -t, --taxids strings the parent taxid(s) to filter out -f, --taxids-file strings file(s) for the parent taxid(s) to filter out, one taxid per line Examples: Remove Eukaryota taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd \\ | taxonkit cami-filter -t 2759 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 100.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 100.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 100.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 100.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 100.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 
100.000000000000000","title":"Usage"},{"location":"usage/#usage-and-examples","text":"Table of Contents Usage and Examples Before use taxonkit list lineage reformat name2taxid filter lca taxid-changelog profile2cami cami-filter create-taxdump genautocomplete","title":"Usage and Examples"},{"location":"usage/#before-use","text":"Download and uncompress taxdump.tar.gz : ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz Copy names.dmp , nodes.dmp , delnodes.dmp and merged.dmp to data directory: $HOME/.taxonkit , e.g., /home/shenwei/.taxonkit , Optionally copy to some other directories, and later you can refer to using flag --data-dir , or environment variable TAXONKIT_DB . 
All-in-one command: wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz mkdir -p $HOME/.taxonkit cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit Update dataset : Simply re-download the taxdump files, uncompress and override old ones.","title":"Before use"},{"location":"usage/#taxonkit","text":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit Version: 0.14.2 Author: Wei Shen Source code: https://github.com/shenwei356/taxonkit Documents : https://bioinf.shenwei.me/taxonkit Citation : https://www.sciencedirect.com/science/article/pii/S1673852721000837 Dataset: Please download and uncompress \"taxdump.tar.gz\": ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and copy \"names.dmp\", \"nodes.dmp\", \"delnodes.dmp\" and \"merged.dmp\" to data directory: \"/home/shenwei/.taxonkit\" or some other directory, and later you can refer to it using the flag --data-dir, or the environment variable TAXONKIT_DB. When environment variable TAXONKIT_DB is set, explicitly setting --data-dir will override the value of TAXONKIT_DB. 
Usage: taxonkit [command] Available Commands: cami-filter Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV filter Filter TaxIds by taxonomic rank range genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell) lca Compute lowest common ancestor (LCA) for TaxIds lineage Query taxonomic lineage of given TaxIds list List taxonomic subtrees of given TaxIds name2taxid Convert scientific names to TaxIds profile2cami Convert metagenomic profile table to CAMI format reformat Reformat lineage in canonical ranks taxid-changelog Create TaxId changelog from dump archives version print version information and check for update Flags: --data-dir string directory containing nodes.dmp and names.dmp (default \"/home/shenwei/.taxonkit\") -h, --help help for taxonkit --line-buffered use line buffering on output, i.e., immediately writing to stdin/file for every line of output -o, --out-file string out file (\"-\" for stdout, suffix .gz for gzipped out) (default \"-\") -j, --threads int number of CPUs. 4 is enough (default 4) --verbose print verbose information","title":"taxonkit"},{"location":"usage/#list","text":"Usage List taxonomic subtrees of given TaxIds Attentions: 1. When multiple taxids are given, the output may contain duplicated records if some taxids are descendants of others. Examples: $ taxonkit list --ids 9606 -n -r --indent \" \" 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' $ taxonkit list --ids 9606 --indent \"\" 9606 63221 741158 Usage: taxonkit list [flags] Flags: -h, --help help for list -i, --ids string TaxId(s), multiple values should be separated by comma -I, --indent string indent (default \" \") -J, --json output in JSON format. 
you can save the result in file with suffix \".json\" and open with modern text editor -n, --show-name output scientific name -r, --show-rank output rank Examples Default usage. $ taxonkit list --ids 9605,239934 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 2608915 1131336 ... Removing indent. The list could be used to extract sequences from BLAST database with blastdbcmd (see tutorial ) $ taxonkit list --ids 9605,239934 --indent \"\" 9605 9606 63221 741158 1425170 2665952 2665953 239934 239935 349741 512293 512294 1131822 1262691 1263034 1679444 ... Performance: Time and memory usage for whole taxon tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --ids 1 --indent \"\" --verbose > t0.txt 21:05:01.782 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp 21:05:01.782 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp 21:05:01.816 [INFO] 61023 merged nodes parsed 21:05:01.889 [INFO] 437929 delnodes parsed 21:05:03.178 [INFO] 2303979 names parsed elapsed time: 3.290s peak rss: 742.77 MB Adding names $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample 239934 [genus] Akkermansia 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 512293 [no rank] environmental samples 512294 [species] uncultured Akkermansia sp. 1131822 [species] uncultured Akkermansia sp. SMG25 1262691 [species] Akkermansia sp. 
CAG:344 1263034 [species] Akkermansia muciniphila CAG:154 1679444 [species] Akkermansia glycaniphila 2608915 [no rank] unclassified Akkermansia 1131336 [species] Akkermansia sp. KLE1605 1574264 [species] Akkermansia sp. KLE1797 ... Performance: Time and memory usage for whole taxonomy tree: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ memusg -t taxonkit list --show-rank --show-name --ids 1 > t1.txt elapsed time: 5.341s peak rss: 1.04 GB Output in JSON format, you can easily collapse and uncollapse taxonomy tree in modern text editor. $ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 --json { \"9605 [genus] Homo\": { \"9606 [species] Homo sapiens\": { \"63221 [subspecies] Homo sapiens neanderthalensis\": { }, \"741158 [subspecies] Homo sapiens subsp. 'Denisova'\": { } }, \"1425170 [species] Homo heidelbergensis\": { } }, \"239934 [genus] Akkermansia\": { \"239935 [species] Akkermansia muciniphila\": { \"349741 [no rank] Akkermansia muciniphila ATCC BAA-835\": { } }, \"512293 [no rank] environmental samples\": { \"512294 [species] uncultured Akkermansia sp.\": { }, \"1131822 [species] uncultured Akkermansia sp. SMG25\": { }, \"1262691 [species] Akkermansia sp. CAG:344\": { }, \"1263034 [species] Akkermansia muciniphila CAG:154\": { } }, \"1679444 [species] Akkermansia glycaniphila\": { }, \"2608915 [no rank] unclassified Akkermansia\": { \"1131336 [species] Akkermansia sp. KLE1605\": { }, \"1574264 [species] Akkermansia sp. KLE1797\": { }, \"1574265 [species] Akkermansia sp. KLE1798\": { }, \"1638783 [species] Akkermansia sp. UNK.MGS-1\": { }, \"1755639 [species] Akkermansia sp. MC_55\": { } } } } Snapshot of taxonomy (taxid 1) in kate:","title":"list"},{"location":"usage/#lineage","text":"Usage Query taxonomic lineage of given TaxIds Input: - List of TaxIds, one TaxId per line. - Or tab-delimited format, please specify TaxId field with flag -i/--taxid-field (default 1). 
- Supporting (gzipped) file or STDIN. Output: 1. Input line data. 2. (Optional) Status code (-c/--show-status-code), values: - \"-1\" for queries not found in whole database. - \"0\" for deleted TaxIds, provided by \"delnodes.dmp\". - New TaxIds for merged TaxIds, provided by \"merged.dmp\". - Taxids for these found in \"nodes.dmp\". 3. Lineage, delimiter can be changed with flag -d/--delimiter. 4. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) 5. (Optional) Name (-n/--show-name) 6. (Optional) Rank (-r/--show-rank) Filter out invalid and deleted taxids, and replace merged taxids with new ones: # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. $ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Usage: taxonkit lineage [flags] Flags: -d, --delimiter string field delimiter in lineage (default \";\") -h, --help help for lineage -L, --no-lineage do not show lineage, when user just want names or/and ranks -R, --show-lineage-ranks appending ranks of all levels -t, --show-lineage-taxids appending lineage consisting of taxids -n, --show-name appending scientific name -r, --show-rank appending rank of taxids -c, --show-status-code show status code before lineage -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) Examples Full lineage: # note that 123124124 is a fake taxid, 3 was deleted, 92489,1458427 were merged $ cat taxids.txt 9606 9913 376619 349741 239935 314101 11932 1327037 123124124 3 92489 1458427 $ taxonkit lineage taxids.txt | tee lineage.txt 19:22:13.077 [WARN] taxid 92489 was merged into 796334 19:22:13.077 [WARN] taxid 1458427 was merged into 1458425 19:22:13.077 [WARN] taxid 123124124 not found 19:22:13.077 [WARN] taxid 3 was deleted 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # wrapped table with csvtk pretty (>v0.26.0) $ taxonkit lineage taxids.txt | csvtk pretty -Ht -x ';' -W 70 -S bold \u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513 \u2503 9606 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 
Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates; \u2503 \u2503 \u2503 Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae; \u2503 \u2503 \u2503 Homo;Homo sapiens \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 9913 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503 \u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503 \u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503 \u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla; \u2503 \u2503 \u2503 Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 376619 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Thiotrichales;Francisellaceae;Francisella;Francisella tularensis; \u2503 \u2503 \u2503 Francisella tularensis subsp. 
holarctica; \u2503 \u2503 \u2503 Francisella tularensis subsp. holarctica LVS \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 349741 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 239935 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503 \u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503 \u2503 \u2503 Akkermansia muciniphila \u2503 
\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 314101 \u2503 cellular organisms;Bacteria;environmental samples; \u2503 \u2503 \u2503 uncultured murine large bowel bacterium BAC 54B \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 11932 \u2503 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes; \u2503 \u2503 \u2503 Ortervirales;Retroviridae;unclassified Retroviridae; \u2503 \u2503 \u2503 Intracisternal A-particles;Mouse Intracisternal A-particle \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1327037 \u2503 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes; 
\u2503 \u2503 \u2503 Caudovirales;Siphoviridae;unclassified Siphoviridae; \u2503 \u2503 \u2503 Croceibacter phage P2559Y \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 92489 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503 \u2503 \u2503 Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae \u2503 \u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b \u2503 1458427 \u2503 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria; \u2503 \u2503 \u2503 Burkholderiales;Comamonadaceae;Serpentinomonas; \u2503 \u2503 \u2503 Serpentinomonas raichei \u2503 
\u2517\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u253b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u251b Speed. $ time echo 9606 | taxonkit lineage 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens real 0m1.190s user 0m2.365s sys 0m0.170s # all TaxIds $ time taxonkit list --ids 1 --indent \"\" | taxonkit lineage > t real 0m4.249s user 0m16.418s sys 0m1.221s Checking deleted or merged taxids $ taxonkit lineage --show-status-code taxids.txt | tee lineage.withcode.txt # valid $ cat lineage.withcode.txt | awk '$2 > 0' | cut -f 1,2 9606 9606 9913 9913 376619 376619 349741 349741 239935 239935 314101 314101 11932 11932 1327037 1327037 92489 796334 1458427 1458425 # merged $ cat lineage.withcode.txt | awk '$2 > 0 && $2 != $1' | cut -f 1,2 92489 796334 1458427 1458425 # deleted $ cat lineage.withcode.txt | awk '$2 == 0' | cut -f 1 3 # invalid $ cat lineage.withcode.txt | awk '$2 < 0' | cut -f 1 123124124 Filter out invalid and deleted taxids, and replace merged taxids with new ones , you may install csvtk . # input is one-column-taxid $ taxonkit lineage -c taxids.txt \\ | awk '$2>0' \\ | cut -f 2- # taxids are in 3rd field in a 4-columns tab-delimited file, # for $5, where 5 = 4 + 1. 
$ cat input.txt \\ | taxonkit lineage -c -i 3 \\ | csvtk filter2 -H -t -f '$5>0' \\ | csvtk -H -t cut -f -3 Only show name and rank. $ taxonkit lineage -r -n -L taxids.txt \\ | csvtk pretty -H -t 9606 Homo sapiens species 9913 Bos taurus species 376619 Francisella tularensis subsp. holarctica LVS strain 349741 Akkermansia muciniphila ATCC BAA-835 strain 239935 Akkermansia muciniphila species 314101 uncultured murine large bowel bacterium BAC 54B species 11932 Mouse Intracisternal A-particle species 1327037 Croceibacter phage P2559Y species 123124124 3 92489 Erwinia oleae species 1458427 Serpentinomonas raichei species Show lineage consisting of taxids: $ taxonkit lineage -t taxids.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314146;9443;376913;314293;9526;314295;9604;207598;9605;9606 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314145;91561;9845;35500;9895;27592;9903;9913 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 131567;2;1224;1236;72273;34064;262;263;119857;376619 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 131567;2;48479;314101 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;2559587;2732397;2732409;2732514;2169561;11632;35276;11749;11932 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 10239;2731341;2731360;2731618;2731619;28883;10699;196894;1327037 123124124 3 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 131567;2;1224;1236;91347;1903409;551;796334 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei 131567;2;1224;28216;80840;80864;2490452;1458425 or read taxids from STDIN: $ cat taxids.txt | taxonkit lineage And ranks of all nodes: $ echo 2697049 \\ | taxonkit lineage -t -R \\ | csvtk transpose -Ht 2697049 Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 
superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank Another way to show lineage detail of a TaxId $ echo 2697049 \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 10239 superkingdom Viruses 2559587 clade Riboviria 2732396 kingdom Orthornavirae 2732408 phylum Pisuviricota 2732506 class Pisoniviricetes 76804 order Nidovirales 2499399 suborder Cornidovirineae 11118 family Coronaviridae 2501931 subfamily Orthocoronavirinae 694002 genus Betacoronavirus 2509511 subgenus Sarbecovirus 694009 species Severe acute respiratory syndrome-related coronavirus 2697049 no rank Severe acute respiratory syndrome coronavirus 2","title":"lineage"},{"location":"usage/#reformat","text":"Usage Reformat lineage in canonical ranks Input: - List of TaxIds or lineages, one record per line. The lineage can be a complete lineage or only one taxonomy name. - Or tab-delimited format. Plese specify the lineage field with flag -i/--lineage-field (default 2). Or specify the TaxId field with flag -I/--taxid-field (default 0), which overrides -i/--lineage-field. - Supporting (gzipped) file or STDIN. Output: 1. Input line data. 2. Reformated lineage. 3. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids) Ambiguous names: - Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result Output format can be formated by flag --format, available placeholders: {k}: superkingdom {K}: kingdom {p}: phylum {c}: class {o}: order {f}: family {g}: genus {s}: species {t}: subspecies/strain {S}: subspecies {T}: strain When these're no nodes of rank \"subspecies\" nor \"strain\", you can switch on -S/--pseudo-strain to use the node with lowest rank as subspecies/strain name, if which rank is lower than \"species\". 
This flag affects {t}, {S}, {T}. Output format can contains some escape charactors like \"\\t\". Usage: taxonkit reformat [flags] Flags: -P, --add-prefix add prefixes for all ranks, single prefix for a rank is defined by flag --prefix-X -d, --delimiter string field delimiter in input lineage (default \";\") -F, --fill-miss-rank fill missing rank with lineage information of the next higher rank -f, --format string output format, placeholders of rank are needed (default \"{k};{p};{c};{o};{f};{g};{s}\") -h, --help help for reformat -i, --lineage-field int field index of lineage. data should be tab-separated (default 2) -r, --miss-rank-repl string replacement string for missing rank -p, --miss-rank-repl-prefix string prefix for estimated taxon level (default \"unclassified \") -s, --miss-rank-repl-suffix string suffix for estimated taxon names. \"rank\" for rank name, \"\" for no suffix (default \"rank\") -R, --miss-taxid-repl string replacement string for missing taxid -a, --output-ambiguous-result output one of the ambigous result --prefix-K string prefix for kingdom, used along with flag -P/--add-prefix (default \"K__\") --prefix-S string prefix for subspecies, used along with flag -P/--add-prefix (default \"S__\") --prefix-T string prefix for strain, used along with flag -P/--add-prefix (default \"T__\") --prefix-c string prefix for class, used along with flag -P/--add-prefix (default \"c__\") --prefix-f string prefix for family, used along with flag -P/--add-prefix (default \"f__\") --prefix-g string prefix for genus, used along with flag -P/--add-prefix (default \"g__\") --prefix-k string prefix for superkingdom, used along with flag -P/--add-prefix (default \"k__\") --prefix-o string prefix for order, used along with flag -P/--add-prefix (default \"o__\") --prefix-p string prefix for phylum, used along with flag -P/--add-prefix (default \"p__\") --prefix-s string prefix for species, used along with flag -P/--add-prefix (default \"s__\") --prefix-t string prefix 
for subspecies/strain, used along with flag -P/--add-prefix (default \"t__\") -S, --pseudo-strain use the node with lowest rank as strain name, only if which rank is lower than \"species\" and not \"subpecies\" nor \"strain\". It affects {t}, {S}, {T}. This flag needs flag -F -t, --show-lineage-taxids show corresponding taxids of reformated lineage -I, --taxid-field int field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field -T, --trim do not fill or add prefix for missing rank lower than current rank Examples: For versions > 0.8.0, reformat accepts TaxIds as input via the flag -I/--taxid-field . $ echo 239935 | taxonkit reformat -I 1 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo 349741 | taxonkit reformat -I 1 -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}|{t}\" -F -t 349741 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila|Akkermansia muciniphila ATCC BAA-835 2|74201|203494|48461|1647988|239934|239935|349741 Example lineage (produced by: taxonkit lineage taxids.txt | awk '$2!=\"\"' > lineage.txt ). 
$ cat lineage.txt 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Default output format ( 
\"{k};{p};{c};{o};{f};{g};{s}\" ). # reformated lineages are appended to the input data $ taxonkit reformat lineage.txt ... 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila ... $ $ taxonkit reformat lineage.txt | tee lineage.txt.reformat $ cut -f 1,3 lineage.txt.reformat 9606 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens 9913 Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus 376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis 349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 92489 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 1458427 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei # aligned $ cat lineage.txt \\ | taxonkit reformat \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- --------------- ------------------- ------------------ --------------- -------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo 
sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Croceibacter phage P2559Y 92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei And subspecies/strain ( {t} ), subspecies ( {S} ), and strain ( {T} ) are also available. 
# default operation $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- --------------------- --------------------- --------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 # fill missing ranks # see example below for -F/--fill-miss-rank # $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --fill-miss-rank \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------------ ----------------------------------------------------------------------------- ------------------------------------------------------------------------- 239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain unclassified Akkermansia muciniphila subspecies unclassified Akkermansia muciniphila strain 83333 strain Escherichia coli K-12 Escherichia coli K-12 unclassified Escherichia coli subspecies Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 unclassified Escherichia coli R178 strain 
2697049 no rank Severe acute respiratory syndrome coronavirus 2 unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain unclassified Severe acute respiratory syndrome-related coronavirus subspecies unclassified Severe acute respiratory syndrome-related coronavirus strain 2605619 no rank Escherichia coli O16:H48 unclassified Escherichia coli subspecies/strain unclassified Escherichia coli subspecies unclassified Escherichia coli strain When there are no nodes of rank \"subspecies\" or \"strain\", you can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, provided its rank is lower than \"species\" . Using v0.14.1 or a later version is recommended. $ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\ | taxonkit lineage -n -r \\ | taxonkit reformat -f '{t};{S};{T}' --pseudo-strain \\ | csvtk -H -t cut -f 1,4,3,5 \\ | csvtk -H -t sep -f 4 -s ';' -R \\ | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\ | csvtk pretty -t taxid rank name subspecies/strain subspecies strain ------- ---------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- 239935 species Akkermansia muciniphila 83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12 1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 2697049 no rank Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 2605619 no rank Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Add prefix ( -P/--add-prefix ). 
$ cat lineage.txt \\ | taxonkit reformat -P \\ | csvtk -H -t cut -f 1,3 9606 k__Eukaryota;p__Chordata;c__Mammalia;o__Primates;f__Hominidae;g__Homo;s__Homo sapiens 9913 k__Eukaryota;p__Chordata;c__Mammalia;o__Artiodactyla;f__Bovidae;g__Bos;s__Bos taurus 376619 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Francisellaceae;g__Francisella;s__Francisella tularensis 349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 239935 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila 314101 k__Bacteria;p__;c__;o__;f__;g__;s__uncultured murine large bowel bacterium BAC 54B 11932 k__Viruses;p__Artverviricota;c__Revtraviricetes;o__Ortervirales;f__Retroviridae;g__Intracisternal A-particles;s__Mouse Intracisternal A-particle 1327037 k__Viruses;p__Uroviricota;c__Caudoviricetes;o__Caudovirales;f__Siphoviridae;g__;s__Croceibacter phage P2559Y 92489 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Erwinia;s__Erwinia oleae 1458427 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__Serpentinomonas;s__Serpentinomonas raichei Show corresponding taxids of reformated lineage (flag -t/--show-lineage-taxids ) $ cat lineage.txt \\ | taxonkit reformat -t \\ | csvtk -H -t cut -f 1,4 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- ------ ------- ------- ------- ------- ------- ------- 9606 2759 7711 40674 9443 9604 9605 9606 9913 2759 7711 40674 91561 9895 9903 9913 376619 2 1224 1236 72273 34064 262 263 349741 2 74201 203494 48461 1647988 239934 239935 239935 2 74201 203494 48461 1647988 239934 239935 314101 2 314101 11932 10239 2732409 2732514 2169561 11632 11749 11932 
1327037 10239 2731618 2731619 28883 10699 1327037 92489 2 1224 1236 91347 1903409 551 796334 1458427 2 1224 28216 80840 80864 2490452 1458425 Use custom symbols for unclassfied ranks ( -r/--miss-rank-repl ) $ taxonkit reformat lineage.txt -r \"__\" | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;__;__;__;__;__;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;__;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei $ taxonkit reformat lineage.txt -r Unassigned | cut -f 3 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Unassigned;Unassigned;Unassigned;Unassigned;Unassigned;uncultured murine large bowel bacterium BAC 54B Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 
Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;Unassigned;Croceibacter phage P2559Y Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei Estimate and fill missing rank with original lineage information ( -F, --fill-miss-rank , very useful for formatting input data for LEfSe ). You can change the prefix \"unclassified\" using flag -p/--miss-rank-repl-prefix . $ cat lineage.txt \\ | taxonkit reformat -F \\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\ | csvtk pretty -t taxid kindom phylum class order family genus species ------- --------- ---------------------------- --------------------------- --------------------------- ---------------------------- ------------------------------- ----------------------------------------------- 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens 9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus 376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila 314101 Bacteria unclassified Bacteria phylum unclassified Bacteria class unclassified Bacteria order unclassified Bacteria family unclassified Bacteria genus uncultured murine large bowel bacterium BAC 54B 11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle 1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae unclassified Siphoviridae genus Croceibacter phage P2559Y 92489 Bacteria Proteobacteria 
Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae 1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei Do not add prefix or suffix for estimated nodes: $ echo 314101 | taxonkit reformat -I 1 314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B $ echo 314101 | taxonkit reformat -I 1 -F -p \"\" -s \"\" 314101 Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;uncultured murine large bowel bacterium BAC 54B Only some ranks. $ cat lineage.txt \\ | taxonkit reformat -F -f \"{s};{p}\"\\ | csvtk -H -t cut -f 1,3 \\ | csvtk -H -t sep -f 2 -s ';' -R \\ | csvtk add-header -t -n taxid,species,phylum \\ | csvtk pretty -t taxid species phylum ------- ----------------------------------------------- ---------------------------- 9606 Homo sapiens Chordata 9913 Bos taurus Chordata 376619 Francisella tularensis Proteobacteria 349741 Akkermansia muciniphila Verrucomicrobia 239935 Akkermansia muciniphila Verrucomicrobia 314101 uncultured murine large bowel bacterium BAC 54B unclassified Bacteria phylum 11932 Mouse Intracisternal A-particle Artverviricota 1327037 Croceibacter phage P2559Y Uroviricota 92489 Erwinia oleae Proteobacteria 1458427 Serpentinomonas raichei Proteobacteria For taxids whose rank is higher than the lowest rank in -f/--format , use -T/--trim to avoid filling missing ranks lower than the current rank . 
$ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;unclassified Bacteria species 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;unclassified Akkermansia species 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila $ echo -ne \"2\\n239934\\n239935\\n\" \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | csvtk -H -t cut -f 1,3 2 Bacteria 239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Support tab in format string $ echo 9606 \\ | taxonkit lineage \\ | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{S}\" \\ | csvtk cut -t -f -2 9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens List seven-level lineage for all TaxIds. 
# replace empty taxon with \"Unassigned\" $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned | gzip -c > all.lineage.tsv.gz # tab-delimited seven-levels $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # 8-level $ taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\ | csvtk cut -H -t -f -2 \\ | head -n 5 \\ | csvtk pretty -H -t # Fill and trim $ memusg -t -s ' taxonkit list --ids 1 \\ | taxonkit lineage \\ | taxonkit reformat -F -T \\ | sed -r \"s/;+$//\" \\ | gzip -c > all.lineage.tsv.gz ' elapsed time: 19.930s peak rss: 6.25 GB From taxid to 7-ranks lineage: $ cat taxids.txt | taxonkit lineage | taxonkit reformat # for taxonkit v0.8.0 or later versions $ cat taxids.txt | taxonkit reformat -I 1 Some TaxIds have the same complete lineage, empty result is returned by default. You can use the flag -a/--output-ambiguous-result to return one possible result. see #42 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result 19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. 
But you can use -a/--output-ambiguous-result to return one possible result 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 $ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t -a 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530 2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530","title":"reformat"},{"location":"usage/#name2taxid","text":"Usage Convert scientific names to TaxIds Attention: 1. Some TaxIds share the same scientific names, e.g, Drosophila. These input lines are duplicated with multiple TaxIds. $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L Drosophila 7215 genus Drosophila 32281 subgenus Drosophila 2081351 genus Usage: taxonkit name2taxid [flags] Flags: -h, --help help for name2taxid -i, --name-field int field index of name. 
data should be tab-separated (default 1) -s, --sci-name only searching scientific names -r, --show-rank show rank Examples Example data $ cat names.txt Homo sapiens Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y Default. # taxonkit name2taxid names.txt $ cat names.txt | taxonkit name2taxid | csvtk pretty -H -t Homo sapiens 9606 Akkermansia muciniphila ATCC BAA-835 349741 Akkermansia muciniphila 239935 Mouse Intracisternal A-particle 11932 Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 Croceibacter phage P2559Y 1327037 Show rank. $ cat names.txt | taxonkit name2taxid --show-rank | csvtk pretty -H -t Homo sapiens 9606 species Akkermansia muciniphila ATCC BAA-835 349741 strain Akkermansia muciniphila 239935 species Mouse Intracisternal A-particle 11932 species Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 species Croceibacter phage P2559Y 1327037 species From name to lineage. 
$ cat names.txt | taxonkit name2taxid | taxonkit lineage --taxid-field 2 Homo sapiens 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Akkermansia muciniphila ATCC BAA-835 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Akkermansia muciniphila 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Mouse Intracisternal A-particle 11932 Viruses;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Wei Shen uncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Some TaxIds share the same scientific names , e.g, Drosophila. $ echo Drosophila \\ | taxonkit name2taxid \\ | taxonkit lineage -i 2 -r \\ | taxonkit reformat -i 3 \\ | csvtk cut -H -t -f 1,2,4,5 \\ | csvtk pretty -H -t Drosophila 7215 genus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 32281 subgenus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila; Drosophila 2081351 genus Eukaryota;Basidiomycota;Agaricomycetes;Agaricales;Psathyrellaceae;Drosophila;","title":"name2taxid"},{"location":"usage/#filter","text":"Usage Filter TaxIds by taxonomic rank range Attentions: 1. 
Flags -L/--lower-than and -H/--higher-than are mutually exclusive; either can be combined with -E/--equal-to, whose values may differ. 2. A list of pre-ordered ranks is in ~/.taxonkit/ranks.txt; you can supply your own list via -r/--rank-file, the format specification is below. 3. All ranks in the taxonomy database should be defined in the rank file. 4. Ranks can be removed with a black list via -B/--black-list. 5. TaxIDs with no rank are kept by default!!! They can be optionally discarded by -N/--discard-noranks. 6. [Recommended] When filtering with -L/--lower-than, you can use -n/--save-predictable-norank to save some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff. Rank file: 1. Blank lines or lines starting with \"#\" are ignored. 2. Ranks are in descending order and case is ignored. 3. Ranks with the same order should be on one line, separated with commas (\",\", no space). 4. Ranks without order should be assigned a prefix symbol \"!\" for each rank. Usage: taxonkit filter [flags] Flags: -B, --black-list strings black list of ranks to discard, e.g., '-B \"no rank\" -B \"clade\"' -N, --discard-noranks discard all ranks without order, type \"taxonkit filter --help\" for details -R, --discard-root discard root taxid, defined by --root-taxid -E, --equal-to strings output TaxIds with rank equal to some ranks, multiple values can be separated with comma \",\" (e.g., -E \"genus,species\"), or give multiple times (e.g., -E genus -E species) -h, --help help for filter -H, --higher-than string output TaxIds with rank higher than a rank, exclusive with --lower-than --list-order list user defined ranks in order, from \"$HOME/.taxonkit/ranks.txt\" --list-ranks list ordered ranks in taxonomy database, sorted in user defined order -L, --lower-than string output TaxIds with rank lower than a rank, exclusive with --higher-than -r, --rank-file string user-defined ordered taxonomic ranks, type \"taxonkit filter --help\" for details --root-taxid uint32 root taxid 
(default 1) -n, --save-predictable-norank do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1) Examples Example data $ echo 349741 | taxonkit lineage -t | cut -f 3 | sed 's/;/\\n/g' > taxids2.txt $ cat taxids2.txt 131567 2 1783257 74201 203494 48461 1647988 239934 239935 349741 $ cat taxids2.txt | taxonkit lineage -r | csvtk -Ht cut -f 1,3,2 | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom cellular organisms;Bacteria 1783257 clade cellular organisms;Bacteria;PVC group 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae 48461 order cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales 1647988 family cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae 239934 genus cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 239935 species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 349741 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Equal to certain rank(s) ( -E/--equal-to ) $ cat taxids2.txt \\ | taxonkit filter -E Phylum -E Class \\ | taxonkit lineage -r \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia 203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae Lower than a rank ( -L/--lower-than ) $ cat taxids2.txt \\ | taxonkit filter -L genus \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | 
csvtk pretty -H -t 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Higher than a rank ( -H/--higher-than ) $ cat taxids2.txt \\ | taxonkit filter -H phylum \\ | taxonkit lineage -r -n -L \\ | csvtk -Ht cut -f 1,3,2 \\ | csvtk pretty -H -t 2 superkingdom Bacteria TaxIDs with no rank are kept by default!!! \"no rank\" and \"clade\" have no rank and can be filtered out via -N/--discard-noranks . Further ranks can be removed with a black list via -B/--black-list . # 562 is the TaxId of Escherichia coli $ taxonkit list --ids 562 \\ | taxonkit filter -L species \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t strain 2950 no rank 149 serotype 141 serogroup 95 isolate 1 subspecies 1 $ taxonkit list --ids 562 \\ | taxonkit filter -L species -N -B strain \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk freq -Ht -f 2 -nr \\ | csvtk pretty -H -t serotype 141 serogroup 95 isolate 1 subspecies 1 Combination of -L/-H with -E . $ cat taxids2.txt \\ | taxonkit filter -L genus -E genus \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 239934 genus Akkermansia 239935 species Akkermansia muciniphila 349741 strain Akkermansia muciniphila ATCC BAA-835 Special cases of \"no rank\" ( -n/--save-predictable-norank ). When filtering with -L/--lower-than , you can use -n/--save-predictable-norank to save some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff. 
$ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 131567 no rank cellular organisms 2 superkingdom Bacteria 1224 phylum Proteobacteria 1236 class Gammaproteobacteria 91347 order Enterobacterales 543 family Enterobacteriaceae 561 genus Escherichia 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 10239 superkingdom Viruses 2731341 clade Duplodnaviria 2731360 clade Heunggongvirae 2731618 phylum Uroviricota 2731619 class Caudoviricetes 28883 order Caudovirales 10699 family Siphoviridae 196894 no rank unclassified Siphoviridae 1327037 species Croceibacter phage P2559Y # save taxids $ echo -ne \"2605619\\n1327037\\n\" \\ | taxonkit lineage -t \\ | csvtk cut -Ht -f 3 \\ | csvtk unfold -Ht -f 1 -s \";\" \\ | tee taxids4.txt 131567 2 1224 1236 91347 543 561 562 2605619 10239 2731341 2731360 2731618 2731619 28883 10699 196894 1327037 Now, filter nodes of rank <= species. $ cat taxids4.txt \\ | taxonkit filter -L species -E species -N -n \\ | taxonkit lineage -r -n -L \\ | csvtk cut -Ht -f 1,3,2 \\ | csvtk pretty -H -t 562 species Escherichia coli 2605619 no rank Escherichia coli O16:H48 1327037 species Croceibacter phage P2559Y Note that 2605619 (no rank) is saved because its parent node 562 is <= species.","title":"filter"},{"location":"usage/#lca","text":"Usage Compute lowest common ancestor (LCA) for TaxIds Attention: 1. This command computes the LCA TaxId for a list of TaxIds in a field (\"-i/--taxids-field\") of a tab-delimited file or STDIN. 2. TaxIDs should have the same separator (\"-s/--separator\"); a single-character separator is preferred. 3. Empty lines or lines without valid TaxIds in the field are omitted. 4. If some TaxIds are not found in the database, it returns 0. 
Examples: $ echo 239934, 239935, 349741 | taxonkit lca -s \", \" 239934, 239935, 349741 239934 $ time echo 239934 239935 349741 9606 | taxonkit lca 239934 239935 349741 9606 131567 Usage: taxonkit lca [flags] Flags: -b, --buffer-size string size of line buffer, supported unit: K, M, G. You need to increase the value when \"bufio.Scanner: token too long\" error occurred (default \"1M\") -h, --help help for lca --separater string separator for TaxIds. This flag is the same as --separator. (default \" \") -s, --separator string separator for TaxIds (default \" \") -D, --skip-deleted skip deleted TaxIds and compute with left ones -U, --skip-unfound skip unfound TaxIds and compute with left ones -i, --taxids-field int field index of TaxIds. Input data should be tab-separated (default 1) Examples: Example data $ taxonkit list --ids 9605 -nr --indent \" \" 9605 [genus] Homo 9606 [species] Homo sapiens 63221 [subspecies] Homo sapiens neanderthalensis 741158 [subspecies] Homo sapiens subsp. 'Denisova' 1425170 [species] Homo heidelbergensis 2665952 [no rank] environmental samples 2665953 [species] Homo sapiens environmental sample Simple one $ echo 63221 2665953 | taxonkit lca 63221 2665953 9605 Custom field ( -i/--taxids-field ) and separator ( -s/--separator ). $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" a 63221,2665953 b 63221, 741158 $ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\ | taxonkit lca -i 2 -s \",\" a 63221,2665953 9605 b 63221, 741158 9606 Merged TaxIds. # merged $ echo 92487 92488 92489 | taxonkit lca 10:08:26.578 [WARN] taxid 92489 was merged into 796334 92487 92488 92489 1236 Deleted TaxIds: you can omit these and continue computing with the remaining ones ( -D/--skip-deleted ). 
$ echo 1 2 3 | taxonkit lca 10:30:17.678 [WARN] taxid 3 not found 1 2 3 0 $ time echo 1 2 3 | taxonkit lca -D 10:29:31.828 [WARN] taxid 3 was deleted 1 2 3 1 TaxIDs not found in the database: you can omit these and continue computing with the remaining ones ( -U/--skip-unfound ). $ echo 61021 61022 11111111 | taxonkit lca 10:31:44.929 [WARN] taxid 11111111 not found 61021 61022 11111111 0 $ echo 61021 61022 11111111 | taxonkit lca -U 10:32:02.772 [WARN] taxid 11111111 not found 61021 61022 11111111 2628496","title":"lca"},{"location":"usage/#taxid-changelog","text":"Usage Create TaxId changelog from dump archives Steps: # dependencies: # rush - https://github.com/shenwei356/rush/ mkdir -p archive; cd archive; # --------- download --------- # option 1 # for fast network connection wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp*.zip # option 2 # for slow network connection url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/ wget $url -O - -o /dev/null \\ | grep taxdmp | perl -ne '/(taxdmp_.+?.zip)/; print \"$1\\n\";' \\ | rush -j 2 -v url=$url 'axel -n 5 {url}/{}' \\ --immediate-output -c -C download.rush # --------- unzip --------- ls taxdmp*.zip | rush -j 1 'unzip {} names.dmp nodes.dmp merged.dmp delnodes.dmp -d {@_(.+)\\.}' # optionally compress .dmp files with pigz, for saving disk space fd .dmp$ | rush -j 4 'pigz {}' # --------- create log --------- cd .. 
taxonkit taxid-changelog -i archive -o taxid-changelog.csv.gz --verbose Output format (CSV): # fields comments taxid # taxid version # version / time of archive, e.g., 2019-07-01 change # change, values: # NEW newly added # REUSE_DEL deleted taxids being reused # REUSE_MER merged taxids being reused # DELETE deleted # MERGE merged into another taxid # ABSORB other taxids merged into this one # CHANGE_NAME scientific name changed # CHANGE_RANK rank changed # CHANGE_LIN_LIN lineage taxids remain but lineage changed # CHANGE_LIN_TAX lineage taxids changed # CHANGE_LIN_LEN lineage length changed change-value # variable values for changes: # 1) new taxid for MERGE # 2) merged taxids for ABSORB # 3) empty for others name # scientific name rank # rank lineage # complete lineage of the taxid lineage-taxids # taxids of the lineage # you can use csvtk to investigate them. e.g., csvtk grep -f taxid -p 1390515 taxid-changelog.csv.gz Usage: taxonkit taxid-changelog [flags] Flags: -i, --archive string directory containing uncompressed dumped archives -h, --help help for taxid-changelog Details Example 1 ( E. coli with taxid 562 ) $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 562 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 562 2014-08-01 NEW Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2014-08-01 ABSORB 662101;662104 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2015-11-01 ABSORB 1637691 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2016-10-01 CHANGE_LIN_LIN Escherichia coli species 
cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 562 2018-06-01 ABSORB 469598 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562 # merged taxids $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 662101,662104,1637691,469598 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 469598 2014-08-01 NEW Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2016-10-01 CHANGE_LIN_LIN Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 469598 2018-06-01 MERGE 562 Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598 662101 2014-08-01 MERGE 562 662104 2014-08-01 MERGE 562 1637691 2015-04-01 DELETE 1637691 2015-05-01 REUSE_DEL Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 1637691 2015-11-01 MERGE 562 Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691 Example 2 (SARS-CoV-2). 
$ time pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -p 2697049 \\ | csvtk pretty taxid version change change-value name rank lineage lineage-taxids 2697049 2020-02-01 NEW Wuhan seafood market pneumonia virus species Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;unclassified Betacoronavirus;Wuhan seafood market pneumonia virus 10239;2559587;76804;2499399;11118;2501931;694002;696098;2697049 2697049 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 2697049 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049 real 0m7.644s user 0m16.749s sys 0m3.985s Example 3 (All subspecies and strain in Akkermansia muciniphila 239935) # species in Akkermansia $ taxonkit list --show-rank --show-name --indent \" \" --ids 239935 239935 [species] Akkermansia muciniphila 349741 [strain] Akkermansia muciniphila ATCC BAA-835 # check them all $ pigz -cd taxid-changelog.csv.gz \\ | csvtk grep -f taxid -P <(taxonkit list --indent \"\" --ids 239935) \\ | csvtk pretty lineage-taxids taxid version change change-value name rank lineage lineage-taxids 239935 2014-08-01 NEW Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;203557;239934;239935 239935 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 
131567;2;51290;74201;203494;48461;1647988;239934;239935 239935 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 239935 2016-05-01 ABSORB 1834199 Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935 349741 2014-08-01 NEW Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;203557;239934;239935;349741 349741 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;1647988;239934;239935;349741 349741 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 349741 2020-07-01 CHANGE_RANK Akkermansia muciniphila ATCC BAA-835 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741 More","title":"taxid-changelog"},{"location":"usage/#create-taxdump","text":"Usage Create 
NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Input format: 0. For GTDB taxonomy file, just use --gtdb. We use the numeric assembly accession as the taxon at subspecies rank. (without the prefix GCA_ and GCF_, and version number). 1. The input file should be tab-delimited, at least one column is needed. 2. Ranks can be given either via the first row or the flag --rank-names. 3. The column containing the genome/assembly accession is recommended to generate TaxId mapping file (taxid.map, id -> taxid). -A/--field-accession, field contaning genome/assembly accession --field-accession-re, regular expression to extract the accession Note that mutiple TaxIds pointing to the same accession are listed as comma-seperated integers. Attentions: 1. Names should be distinct in taxa of different ranks. But for these missing some taxon nodes, using names of parent nodes is allowed: GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155 It can also detect duplicate names with different ranks, e.g., the Class and Genus have the same name B47-G6, and the Order and Family between them have different names. In this case, we reassign a new TaxId by increasing the TaxId until it being distinct. GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585 2. Taxa from different parents may have the same name. We will assign different TaxIds to them. E.g., in ICTV, many viruses from different species have the same names. In practice, we set the \"Virus names(s)\" as a subspecies rank and also specify it as the accession. Species Virus name(s) Jerseyvirus SETP3 Salmonella phage SETP7 Jerseyvirus SETP7 Salmonella phage SETP7 3. 
The generated TaxIds are not consecutive numbers, however some tools like MMSeqs2 required this, you can use the script below for convertion: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py Usage: taxonkit create-taxdump [flags] Flags: -A, --field-accession int field index of assembly accession (genome ID), for outputting taxid.map --field-accession-re string regular expression to extract assembly accession (default \"^\\\\w\\\\w_(.+)$\") --force overwrite existed output directory --gtdb input files are GTDB taxonomy file --gtdb-re-subs string regular expression to extract assembly accession as the subspecies (default \"^\\\\w\\\\w_GC[AF]_(.+)\\\\.\\\\d+$\") -h, --help help for create-taxdump --line-chunk-size int number of lines to process for each thread, and 4 threads is fast enough. (default 5000) --null strings null value of taxa (default [,NULL,NA]) -x, --old-taxdump-dir string taxdump directory of the previous version, for generating merged.dmp and delnodes.dmp -O, --out-dir string output directory -R, --rank-names strings names of all ranks, leave it empty to use the first row of input as rank names Examples: GTDB. See more: https://github.com/shenwei356/gtdb-taxdump $ taxonkit create-taxdump --gtdb ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz --out-dir taxdump 16:42:35.213 [INFO] 317542 records saved to taxdump/taxid.map 16:42:35.460 [INFO] 401815 records saved to taxdump/nodes.dmp 16:42:35.611 [INFO] 401815 records saved to taxdump/names.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/merged.dmp 16:42:35.611 [INFO] 0 records saved to taxdump/delnodes.dmp ICTV, See more: https://github.com/shenwei356/ictv-taxdump MGV . Only Order, Family, Genus information are available. 
$ cat mgv_contig_info.tsv \\ | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \\ | sed 1d \\ > mgv.tsv $ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species 23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map 23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp 23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp 23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp 23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp $ head -n 5 mgv/taxid.map MGV-GENOME-0364295 677052301 MGV-GENOME-0364296 677052301 MGV-GENOME-0364303 1414406025 MGV-GENOME-0364311 1849074420 MGV-GENOME-0364312 2074846424 $ echo 677052301 | taxonkit lineage --data-dir mgv/ 677052301 Caudovirales;crAss-phage;OTU-61123 $ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P 677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123 $ grep MGV-GENOME-0364295 mgv.tsv Caudovirales crAss-phage NULL OTU-61123 MGV-GENOME-0364295 Custom lineages with the first row as rank names and treating one column as accession. 
$ csvtk pretty -t example/taxonomy.tsv id superkingdom phylum class order family genus species --------------- ------------ -------------- ------------------- ---------------- ------------------ -------------- -------------------------- GCF_001027105.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus GCF_001096185.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus pneumoniae GCF_001544255.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecium GCF_002949675.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella dysenteriae GCF_002950215.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella flexneri GCF_006742205.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis GCF_000006945.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella Salmonella enterica GCF_000017205.1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa GCF_003697165.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli GCF_009759685.1 Bacteria Proteobacteria Gammaproteobacteria Moraxellales Moraxellaceae Acinetobacter Acinetobacter baumannii GCF_000148585.2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus mitis GCF_000392875.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis GCF_000742135.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Klebsiella Klebsiella pneumonia # the first column as accession $ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump 16:31:31.828 [INFO] I will use the first row of input as rank names 16:31:31.843 [INFO] 13 records saved 
to example/taxdump/taxid.map 16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp 16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/merged.dmp 16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp $ export TAXONKIT_DB=example/taxdump $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species 3809813362 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species 4145431389 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species 1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species 1920251658 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species 3843752343 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species 72054943 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species 1678121664 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species 524994882 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species 2695851945 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species 3958205156 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species 4093283224 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species 
$ head -n 3 example/taxdump/taxid.map GCF_001027105.1 1569132721 GCF_001096185.1 2983929374 GCF_001544255.1 4145431389 Custom lineages with the first row as rank names (pure lineage data) $ csvtk cut -t -f 2- example/taxonomy.tsv | head -n 2 | csvtk pretty -t superkingdom phylum class order family genus species ------------ ---------- ------- ---------- ----------------- -------------- --------------------- Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus $ csvtk cut -t -f 2- example/taxonomy.tsv \\ | taxonkit create-taxdump -O example/taxdump2 16:53:08.604 [INFO] I will use the first row of input as rank names 16:53:08.614 [INFO] 39 records saved to example/taxdump2/nodes.dmp 16:53:08.614 [INFO] 39 records saved to example/taxdump2/names.dmp 16:53:08.614 [INFO] 0 records saved to example/taxdump2/merged.dmp 16:53:08.615 [INFO] 0 records saved to example/taxdump2/delnodes.dmp $ export TAXONKIT_DB=example/taxdump2 $ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2 1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species 2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species","title":"create-taxdump"},{"location":"usage/#genautocomplete","text":"Usage Generate shell autocompletion script Supported shell: bash|zsh|fish|powershell Bash: # generate completion shell taxonkit genautocomplete --shell bash # configure if never did. # install bash-completion if the \"complete\" command is not found. 
echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion echo \"source ~/.bash_completion\" >> ~/.bashrc Zsh: # generate completion shell taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit # configure if never did echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc echo \"autoload -U compinit; compinit\" >> ~/.zshrc fish: taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish Usage: taxonkit genautocomplete [flags] Flags: --file string autocompletion file (default \"/home/shenwei/.bash_completion.d/taxonkit.sh\") -h, --help help for genautocomplete --type string autocompletion type (currently only bash supported) (default \"bash\")","title":"genautocomplete"},{"location":"usage/#profile2cami","text":"Usage Convert metagenomic profile table to CAMI format Input format: 1. The input file should be tab-delimited 2. At least two columns needed: a) TaxId of taxon at species or lower rank. b) Abundance (could be percentage, automatically detected or use -p/--percentage). Attentions: 1. Some TaxIds may be merged to another ones in current taxonomy version, the abundances will be summed up. 2. Some TaxIds may be deleted in current taxonomy version, the abundances can be optionally recomputed with the flag -R/--recompute-abd. Usage: taxonkit profile2cami [flags] Flags: -a, --abundance-field int field index of abundance. input data should be tab-separated (default 2) -h, --help help for profile2cami -0, --keep-zero keep taxons with abundance of zero -p, --percentage abundance is in percentage -R, --recompute-abd recompute abundance if some TaxIds are deleted in current taxonomy version -s, --sample-id string sample ID in result file -r, --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) -i, --taxid-field int field index of taxid. 
input data should be tab-separated (default 1) -t, --taxonomy-id string taxonomy ID in result file Examples Test data, note that 2824115 is merged to 483329 and 1657696 is deleted in current taxonomy version. $ cat example/abundance.tsv 2824115 0.2 merged to 483329 483329 0.2 absord 2824115 239935 0.5 no change 1657696 0.1 deleted Example: $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv 13:17:40.552 [WARN] taxid is deleted in current taxonomy version: 1657696 13:17:40.552 [WARN] you may recomputed abundance with the flag -R/--recompute-abd @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 50.000000000000000 2759 superkingdom 2759 Eukaryota 40.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 50.000000000000000 6656 phylum 2759|6656 Eukaryota|Arthropoda 40.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 50.000000000000000 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 40.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 50.000000000000000 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 40.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 50.000000000000000 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 40.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 50.000000000000000 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 40.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 
Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 50.000000000000000 483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 40.000000000000000 Recompute (normalize) the abundance $ taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd 13:19:23.647 [WARN] taxid is deleted in current taxonomy version: 1657696 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 55.555555555555557 2759 superkingdom 2759 Eukaryota 44.444444444444450 74201 phylum 2|74201 Bacteria|Verrucomicrobia 55.555555555555557 6656 phylum 2759|6656 Eukaryota|Arthropoda 44.444444444444450 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 55.555555555555557 50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 44.444444444444450 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 55.555555555555557 7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 44.444444444444450 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 55.555555555555557 57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 44.444444444444450 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 55.555555555555557 57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 44.444444444444450 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 55.555555555555557 483329 species 
2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 44.444444444444450 See https://github.com/shenwei356/sun2021-cami-profiles","title":"profile2cami"},{"location":"usage/#cami-filter","text":"Usage Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile Input format: The CAMI (Taxonomic) Profiling Output Format - https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd - One file with mutiple samples is also supported. How to: - No extra taxonomy data needed, so the original taxonomic information are used and not changed. - A mini taxonomic tree is built from records with abundance greater than zero, and only leaves are retained for later use. The rank of leaves may be \"strain\", \"species\", or \"no rank\". - Relative abundances (in percentage) are recomputed for all leaves (reference genome). - A new taxonomic tree is built from these leaves, and abundances are cumulatively added up from leaves to the root. Examples: 1. Remove Archaea, Bacteria, and EukaryoteS, only keep Viruses: taxonkit cami-filter -t 2,2157,2759 test.profile -o test.filter.profile 2. 
Remove Viruses: taxonkit cami-filter -t 10239 test.profile -o test.filter.profile Usage: taxonkit cami-filter [flags] Flags: --field-percentage int field index of PERCENTAGE (default 5) --field-rank int field index of taxid (default 2) --field-taxid int field index of taxid (default 1) --field-taxpath int field index of TAXPATH (default 3) --field-taxpathsn int field index of TAXPATHSN (default 4) -h, --help help for cami-filter --leaf-ranks strings only consider leaves at these ranks (default [species,strain,no rank]) --show-rank strings only show TaxIds and names of these ranks (default [superkingdom,phylum,class,order,family,genus,species,strain]) --taxid-sep string separator of taxid in TAXPATH and TAXPATHSN (default \"|\") -t, --taxids strings the parent taxid(s) to filter out -f, --taxids-file strings file(s) for the parent taxid(s) to filter out, one taxid per line Examples: Remove Eukaryota taxonkit profile2cami -s sample1 -t 2021-10-01 \\ example/abundance.tsv --recompute-abd \\ | taxonkit cami-filter -t 2759 @SampleID:sample1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @TaxonomyID:2021-10-01 @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.000000000000000 74201 phylum 2|74201 Bacteria|Verrucomicrobia 100.000000000000000 203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 100.000000000000000 48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 100.000000000000000 1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 100.000000000000000 239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 100.000000000000000 239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 
100.000000000000000","title":"cami-filter"},{"location":"bench/","text":"Benchmark Benchmark 1: Getting lineage Data set NCBI taxonomy , version 2021-01-21 TaxIDs. Root node 1 is removed. These data should be updated along with the NCBI taxonomy dataset. Seven sizes of TaxIds are sampled from nodes.dmp . 
# shuffle all taxids cut -f 1 nodes.dmp | grep -w -v 1 | shuf > ids.txt # extract n taxids for testing for n in 1 10 100 1000 2000 4000 6000 8000 10000 20000 40000 60000 80000 100000; do head -n $n ids.txt > taxids.n$n.txt done Software Loading database from local database: ETE, version: 3.1.2 Directly parsing dump files: taxopy, version: 0.5.0 TaxonKit, version: 0.7.2 Environment OS: Linux 5.4.89-1-MANJARO CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7GHz RAM: 64GB DDR4 3000MHz SSD: Samsung 970EVO 500G NVMe SSD Installation and Configurations ETE sudo pip3 install ete3 # create database # http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#upgrading-the-local-database from ete3 import NCBITaxa ncbi = NCBITaxa() ncbi.update_taxonomy_database() TaxonKit mkdir -p $HOME/.taxonkit mkdir -p $HOME/bin/ # data wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz -C $HOME/.taxonkit # binary wget https://github.com/shenwei356/taxonkit/releases/download/v0.7.2/taxonkit_linux_amd64.tar.gz tar -zxvf taxonkit_linux_amd64.tar.gz -C $HOME/bin/ taxopy sudo pip3 install -U taxopy # taxopy: identical dump files copied from taxonkit mkdir -p ~/.taxopy cp ~/.taxonkit/{nodes.dmp,names.dmp} ~/.taxopy Scripts and Commands Scripts and commands are listed below. Python scripts were written following the official documentation, and parallelized querying was not used for any tool, including TaxonKit . ETE get_lineage.ete.py < $infile > $outfile taxopy get_lineage.taxopy.py < $infile > $outfile taxonkit taxonkit lineage --threads 1 --delimiter \"; \" < $infile > $outfile A Python script memusg was used to compute running time and peak memory usage of a process. A Perl script run.pl was used to automatically run the tests and generate data for plotting. 
Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" time perl run.pl -n 3 run_benchmark.sh -o bench.get_lineage.tsv Checking result: $ md5sum taxids.n*.lineage # clear $ rm *.lineage *.out Plotting benchmark result. R libraries dplyr , ggplot2 , scales , ggthemes , ggrepel are needed. # reformat dataset # tools: https://github.com/shenwei356/csvtk/ for f in taxids.n*.txt; do wc -l $f; done \\ | sort -k 1,1n \\ | awk '{ print($2\"\\t\"$1) }' \\ > dataset_rename.tsv cat bench.get_lineage.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.get_lineage.reformat.tsv ./plot2.R -i bench.get_lineage.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Tools\" Result Benchmark 2: TaxonKit multi-threaded scalability Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ time perl run.pl -n 3 run_benchmark_taxonkit.sh -o bench.taxonkit.tsv $ rm *.lineage *.out Plotting benchmark result. cat bench.taxonkit.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.taxonkit.reformat.tsv ./plot_threads2.R -i bench.taxonkit.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Threads\" Result","title":"Benchmark"},{"location":"bench/#benchmark","text":"","title":"Benchmark"},{"location":"bench/#benchmark-1-getting-lineage","text":"","title":"Benchmark 1: Getting lineage"},{"location":"bench/#data-set","text":"NCBI taxonomy , version 2021-01-21 TaxIDs. Root node 1 is removed. These data should be updated along with the NCBI taxonomy dataset. Seven sizes of TaxIds are sampled from nodes.dmp . 
# shuffle all taxids cut -f 1 nodes.dmp | grep -w -v 1 | shuf > ids.txt # extract n taxids for testing for n in 1 10 100 1000 2000 4000 6000 8000 10000 20000 40000 60000 80000 100000; do head -n $n ids.txt > taxids.n$n.txt done","title":"Data set"},{"location":"bench/#software","text":"Loading database from local database: ETE, version: 3.1.2 Directly parsing dump files: taxopy, version: 0.5.0 TaxonKit, version: 0.7.2","title":"Software"},{"location":"bench/#environment","text":"OS: Linux 5.4.89-1-MANJARO CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7GHz RAM: 64GB DDR4 3000MHz SSD: Samsung 970EVO 500G NVMe SSD","title":"Environment"},{"location":"bench/#installation-and-configurations","text":"ETE sudo pip3 install ete3 # create database # http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#upgrading-the-local-database from ete3 import NCBITaxa ncbi = NCBITaxa() ncbi.update_taxonomy_database() TaxonKit mkdir -p $HOME/.taxonkit mkdir -p $HOME/bin/ # data wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz tar -zxvf taxdump.tar.gz -C $HOME/.taxonkit # binary wget https://github.com/shenwei356/taxonkit/releases/download/v0.7.2/taxonkit_linux_amd64.tar.gz tar -zxvf taxonkit_linux_amd64.tar.gz -C $HOME/bin/ taxopy sudo pip3 install -U taxopy # taxopy: identical dump files copied from taxonkit mkdir -p ~/.taxopy cp ~/.taxonkit/{nodes.dmp,names.dmp} ~/.taxopy","title":"Installation and Configurations"},{"location":"bench/#scripts-and-commands","text":"Scripts and commands are listed below. Python scripts were written following the official documentation, and parallelized querying was not used for any tool, including TaxonKit . ETE get_lineage.ete.py < $infile > $outfile taxopy get_lineage.taxopy.py < $infile > $outfile taxonkit taxonkit lineage --threads 1 --delimiter \"; \" < $infile > $outfile A Python script memusg was used to compute running time and peak memory usage of a process. 
A Perl script run.pl was used to automatically run the tests and generate data for plotting. Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" time perl run.pl -n 3 run_benchmark.sh -o bench.get_lineage.tsv Checking result: $ md5sum taxids.n*.lineage # clear $ rm *.lineage *.out Plotting benchmark result. R libraries dplyr , ggplot2 , scales , ggthemes , ggrepel are needed. # reformat dataset # tools: https://github.com/shenwei356/csvtk/ for f in taxids.n*.txt; do wc -l $f; done \\ | sort -k 1,1n \\ | awk '{ print($2\"\\t\"$1) }' \\ > dataset_rename.tsv cat bench.get_lineage.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.get_lineage.reformat.tsv ./plot2.R -i bench.get_lineage.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Tools\" Result","title":"Scripts and Commands"},{"location":"bench/#benchmark-2-taxonkit-multi-threaded-scalability","text":"Running benchmark: $ # emptying the buffers cache $ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\" $ time perl run.pl -n 3 run_benchmark_taxonkit.sh -o bench.taxonkit.tsv $ rm *.lineage *.out Plotting benchmark result. cat bench.taxonkit.tsv \\ | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\ | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\ > bench.taxonkit.reformat.tsv ./plot_threads2.R -i bench.taxonkit.reformat.tsv --width 6 --height 4 --dpi 600 \\ --labcolor \"log10(queries)\" --labshape \"Threads\" Result","title":"Benchmark 2: TaxonKit multi-threaded scalability"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index f972da1..859a567 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ