uclust_seq
Sequences in the stream can be clustered at a specified similarity using uclust_seq. For each record in the stream containing a SEQ and SEQ_NAME key, a CLUSTER key is added whose value is the number of the cluster the record belongs to. An IDENT key is also added, showing the similarity in percent to the seed (or representative) sequence of the cluster. If IDENT is *, that sequence is the seed:
SEQ_NAME: test3
SEQ: ggggttggtgtgtggtcgtctcgtgtctcgctcctctgcgttcgctctcgctgctgctctgctgctcgct
SEQ_LEN: 70
CLUSTER: 1
IDENT: *
---
SEQ_NAME: test4
SEQ: ggggttggtgtgtggtcgttcgtgtctcgctcctctgcgttcgctctcgctgctgctctgctgctcgct
SEQ_LEN: 69
CLUSTER: 1
IDENT: 100
---
Note that records are output in input order, which may differ from cluster order because sequences are sorted during clustering. To output records in cluster order, add sort_records to your pipe.
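The record layout described above (one KEY: value pair per line, records terminated by ---) is simple to post-process outside the pipe. The following is a minimal Python sketch, not part of Biopieces, that parses such text, groups records by CLUSTER, and finds each seed via IDENT being *. The function names parse_records, group_by_cluster and seeds are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Sample text copied from the two example records shown above.
SAMPLE = """\
SEQ_NAME: test3
SEQ: ggggttggtgtgtggtcgtctcgtgtctcgctcctctgcgttcgctctcgctgctgctctgctgctcgct
SEQ_LEN: 70
CLUSTER: 1
IDENT: *
---
SEQ_NAME: test4
SEQ: ggggttggtgtgtggtcgttcgtgtctcgctcctctgcgttcgctctcgctgctgctctgctgctcgct
SEQ_LEN: 69
CLUSTER: 1
IDENT: 100
---
"""


def parse_records(text):
    """Parse 'KEY: value' lines into dicts; a '---' line ends a record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    if current:  # tolerate a missing final '---'
        records.append(current)
    return records


def group_by_cluster(records):
    """Map cluster number -> member records, with the seed listed first."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[int(rec["CLUSTER"])].append(rec)
    return {
        num: sorted(members, key=lambda r: r["IDENT"] != "*")
        for num, members in sorted(clusters.items())
    }


def seeds(records):
    """Map cluster number -> SEQ_NAME of the record whose IDENT is '*'."""
    return {int(r["CLUSTER"]): r["SEQ_NAME"] for r in records if r["IDENT"] == "*"}


if __name__ == "__main__":
    recs = parse_records(SAMPLE)
    for num, members in group_by_cluster(recs).items():
        print(num, [m["SEQ_NAME"] for m in members])
```

Within a Biopieces pipe, sort_records remains the intended way to reorder records; this sketch is only meant for inspecting saved output.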
Usearch v7.0.1001 or later must be installed in order for uclust_seq to work.
Read more here:
http://www.drive5.com/usearch/
... | uclust_seq [options]
[-? | --help] # Print full usage description.
[-c | --comp] # Match reverse-complement strand as well.
[-i <float> | --identity=<float>] # Minimum global identity - Default=0.9
[-C <uint> | --cpus=<uint>] # Number of CPUs to use - Default=1
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
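When uclust_seq is driven from a script, the options listed above can be assembled programmatically. The sketch below is a hypothetical Python helper (uclust_seq_command is not part of Biopieces) that builds a command string using only the short flags documented in the usage text. It assumes the identity is given as a fraction between 0 and 1, as the default of 0.9 suggests.

```python
import shlex


def uclust_seq_command(identity=0.9, comp=False, cpus=1, verbose=False):
    """Build a uclust_seq command string from the documented short options.

    Assumption: identity is a fraction in [0, 1], matching Default=0.9.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    args = ["uclust_seq", "-i", str(identity)]
    if comp:
        args.append("-c")  # match reverse-complement strand as well
    if cpus != 1:
        args += ["-C", str(cpus)]  # number of CPUs to use
    if verbose:
        args.append("-v")
    return shlex.join(args)


def fasta_pipeline(fasta_path, **options):
    """Prefix the command with read_fasta so the result forms a full pipe."""
    reader = shlex.join(["read_fasta", "-i", fasta_path])
    return reader + " | " + uclust_seq_command(**options)


if __name__ == "__main__":
    print(fasta_pipeline("test.fna", identity=0.85, comp=True))
```

The resulting string can be passed to a shell, for example with subprocess.run(..., shell=True), provided Biopieces and Usearch are installed.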
Consider the following FASTA entries in the file test.fna:
>test1
tgtacgtagctagctagctagctagctagctagctagctagctgactatcgtgatcgtg
>test1_100
tgtacgtagctagctagctagctagctagctagctagctagctgactatcgtgatcgtg
>test1_2
tgtacgtagctagctagctagctagcGagctagctagcAagctgactatcgtgatcgtg
>test2
ggttgtgtgtgtgtatcgatgtagtctacatcgtctatctgtactgacttactgactac
>test2_100
ggttgtgtgtgtgtatcgatgtagtctacatcgtctatctgtactgacttactgactac
>test2_1
ggtAgtgtgtgAgtatcgatgtagtctacatcgtctatctgtactgacttactgactac
>test1_rc
cacgatcacgatagtcagctagctagctagctagctagctagctagctagctacgtaca
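The name test1_rc suggests that this entry is the reverse complement of test1, which is the case the -c option is meant to handle. A small Python sketch, independent of Biopieces, can be used to check that relationship; reverse_complement is a hypothetical helper written for this illustration and handles only the bases a, c, g and t.

```python
# Translation table for complementing DNA bases (both cases).
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (a/c/g/t only)."""
    return seq.translate(_COMPLEMENT)[::-1]


# Sequences copied from test.fna above.
test1 = "tgtacgtagctagctagctagctagctagctagctagctagctgactatcgtgatcgtg"
test1_rc = "cacgatcacgatagtcagctagctagctagctagctagctagctagctagctacgtaca"

if __name__ == "__main__":
    print(reverse_complement(test1) == test1_rc)
```

Without -c, only the given strand is compared, so a sequence and its reverse complement would not be expected to cluster together.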
Now we can cluster these sequences using uclust_seq:
read_fasta -i test.fna | uclust_seq -i 0.85
SEQ_NAME: test1
SEQ: tgtacgtagctagctagctagctagctagctagctagctagctgactatcgtgatcgtg
SEQ_LEN: 59
CLUSTER: 2
IDENT: *
---
SEQ_NAME: test1_100
SEQ: tgtacgtagctagctagctagctagctagctagctagctagctgactatcgtgatcgtg
SEQ_LEN: 59
CLUSTER: 2
IDENT: 100
---
SEQ_NAME: test1_2
SEQ: tgtacgtagctagctagctagctagcGagctagctagcAagctgactatcgtgatcgtg
SEQ_LEN: 59
CLUSTER: 2
IDENT: 96
---
SEQ_NAME: test2
SEQ: ggttgtgtgtgtgtatcgatgtagtctacatcgtctatctgtactgacttactgactac
SEQ_LEN: 59
CLUSTER: 0
IDENT: 100
---
SEQ_NAME: test2_100
SEQ: ggttgtgtgtgtgtatcgatgtagtctacatcgtctatctgtactgacttactgactac
SEQ_LEN: 59
CLUSTER: 0
IDENT: *
---
SEQ_NAME: test2_1
SEQ: ggtAgtgtgtgAgtatcgatgtagtctacatcgtctatctgtactgacttactgactac
SEQ_LEN: 59
CLUSTER: 0
IDENT: 96
---
SEQ_NAME: test1_rc
SEQ: cacgatcacgatagtcagctagctagctagctagctagctagctagctagctacgtaca
SEQ_LEN: 59
CLUSTER: 1
IDENT: *
---
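As a rough sanity check on the IDENT values above, the percentage of matching positions between two equal-length sequences can be computed directly. This Python sketch is not how Usearch computes identity (Usearch works from an alignment, so its values may differ, especially with gaps); ungapped_identity is a hypothetical helper that simply compares position by position, ignoring case.

```python
def ungapped_identity(a, b):
    """Percent of positions that match between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for this check")
    matches = sum(x == y for x, y in zip(a.lower(), b.lower()))
    return 100.0 * matches / len(a)


# Sequences copied from test.fna above.
test1 = "tgtacgtagctagctagctagctagctagctagctagctagctgactatcgtgatcgtg"
test1_2 = "tgtacgtagctagctagctagctagcGagctagctagcAagctgactatcgtgatcgtg"

if __name__ == "__main__":
    print(round(ungapped_identity(test1, test1_2), 2))
```

test1 and test1_2 differ at two of 59 positions, which gives about 96.6 percent under this simple measure. That is in line with the reported IDENT of 96 if the value is truncated to an integer, which is an assumption here.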
Martin Asser Hansen - Copyright (C) - All rights reserved.
November 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
uclust_seq is part of the Biopieces framework.