classify_taxonomy
Warning: classify_taxonomy is under active development and testing.
classify_taxonomy parses taxonomy strings from the Q_ID of records in the stream. For each Q_ID a taxonomy tree
is created with nodes for each level (kingdom, phylum, class, etc.) containing the taxonomic information at each
node as well as the count and mean identity score. Using the -l switch will trim the taxonomic trees so that
the lowest common ancestor is output. Using the -s switch will include the cluster size from the Q_ID in the
size, where the Q_ID may be suffixed with _<cluster count>.
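The tree described above can be illustrated with a minimal Python sketch. This is not the classify_taxonomy implementation. The names `Node`, `add_hit` and `lowest_common_ancestor` are hypothetical, and the rule used for the lowest common ancestor (descend while a single child carries every hit of its parent) is an assumption about how such trimming could work.

```python
from dataclasses import dataclass, field

# Level names in tree order; assumed for illustration.
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


@dataclass
class Node:
    level: str
    name: str
    count: int = 0
    score_sum: float = 0.0
    children: dict = field(default_factory=dict)

    @property
    def mean_score(self):
        # Mean identity score of all hits that passed through this node.
        return self.score_sum / self.count if self.count else 0.0


def add_hit(root, names, score):
    """Add one hit: update the root, then one node per taxonomic level."""
    root.count += 1
    root.score_sum += score
    node = root
    for level, name in zip(LEVELS, names):
        node = node.children.setdefault(name, Node(level, name))
        node.count += 1
        node.score_sum += score


def lowest_common_ancestor(root):
    """Descend while exactly one child exists and it carries all hits."""
    node = root
    while len(node.children) == 1:
        child = next(iter(node.children.values()))
        if child.count != node.count:
            break  # some hits end at this node, so it is the LCA
        node = child
    return node


# Usage: three hits that agree down to phylum level only.
root = Node("root", "root")
add_hit(root, ["Archaea", "Euryarchaeota", "Methanococci"], 0.9)
add_hit(root, ["Archaea", "Euryarchaeota", "Halobacteria"], 0.7)
add_hit(root, ["Archaea", "Euryarchaeota"], 0.8)
lca = lowest_common_ancestor(root)
print(lca.level, lca.name, lca.count)
```

Each node keeps a running sum rather than a list of scores, so the mean can be computed at output time without storing every hit.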
classify_taxonomy only works on headers of the GreenGenes format where the sequence name contains a taxonomy string of the format:
k__Archaea; p__Euryarchaeota; c__Methanococci; o__Methanococcales [...]
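As a rough illustration of the format, the following Python sketch splits such a taxonomy string into (level, name) pairs and separates a trailing _<cluster count> from an ID. The function names `parse_taxonomy` and `split_cluster_size` are hypothetical, and the sketch assumes that the cluster count sits at the very end of the ID and that empty levels (such as `g__`) should be skipped. It is not taken from the classify_taxonomy source.

```python
import re

# Single-letter GreenGenes prefixes mapped to level names.
PREFIX_TO_LEVEL = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy(tax_string):
    """Return [(level, name), ...] for each non-empty 'x__Name' field."""
    result = []
    for part in tax_string.split(";"):
        match = re.fullmatch(r"([kpcofgs])__(.*)", part.strip())
        if not match:
            continue  # not a taxonomy field, ignore it
        prefix, name = match.groups()
        if name:  # skip levels without a name, e.g. 'g__'
            result.append((PREFIX_TO_LEVEL[prefix], name))
    return result


def split_cluster_size(q_id):
    """Split a trailing _<cluster count> from an ID; size defaults to 1."""
    match = re.fullmatch(r"(.*)_(\d+)", q_id)
    if match:
        return match.group(1), int(match.group(2))
    return q_id, 1


# Usage with the first four levels of the example above.
levels = parse_taxonomy(
    "k__Archaea; p__Euryarchaeota; c__Methanococci; o__Methanococcales"
)
print(levels)
```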
The records look like this:
REC_TYPE: Classification
LEVEL: phylum
NAME: SM2F11
COUNT: 3
SCORE: 0.65
---
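For downstream processing outside Biopieces, records of this shape can be read with a few lines of Python. The sketch below assumes only what the example shows: one `KEY: value` pair per line and `---` closing each record. The function name `parse_records` is hypothetical, and all values are kept as strings.

```python
def parse_records(text):
    """Parse 'KEY: value' lines into dicts; '---' ends a record."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if current:
                records.append(current)
                current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    if current:  # tolerate a final record without a closing '---'
        records.append(current)
    return records


example = """REC_TYPE: Classification
LEVEL: phylum
NAME: SM2F11
COUNT: 3
SCORE: 0.65
---
"""

for record in parse_records(example):
    print(record["LEVEL"], record["NAME"], int(record["COUNT"]), float(record["SCORE"]))
```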
... | classify_taxonomy [options]
[-? | --help] # Print full usage description.
[-m <uint> | --min_count=<uint>] # Debranch nodes where count <= min_count.
[-l | --LCA] # Output lowest common ancestor.
[-s <uint> | --size=<uint>] # Parse cluster size from Q_IDs.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Here is an example of a complete taxonomic pipeline:
read_sff -ci data.sff |
extract_seq -l 500 |
trim_seq -l 10 |
grab -e 'SEQ_LEN >= 50' |
denoise_seq -vi 1 -r 0.6 |
denoise_seq -vi 0.98 -c 2 |
findsim_seq -vSQd sequences_16S_all_gg_2011_1_unaligned.fasta.gz |
grab -e 'REC_TYPE eq findsim' |
classify_taxonomy -ls |
grab -e 'REC_TYPE eq Classification' |
write_tab -ck COUNT,SCORE,LEVEL,NAME -o result.tab -x
Martin Asser Hansen - Copyright (C) - All rights reserved.
October 2012
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
classify_taxonomy is part of the Biopieces framework.