uniq_seq
uniq_seq counts all unique sequences in the stream and emits each unique sequence together with its count as a separate record. Using the -c flag also checks uniqueness against reverse-complement sequences. This is useful for creating a non-redundant set of, e.g., sequence reads.
Note that uniq_seq if necessary.
... | uniq_seq [options]
[-? | --help] # Print full usage description.
[-c | --complement] # Complement sequences.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA file test.fna:
>test1
ATGC
>test2
ATGC
>test3
GCAT
To locate all unique sequences we read the file with read_fasta and pipe the records to uniq_seq:
read_fasta -i test.fna | uniq_seq
SEQ: ATGC
SEQ_LEN: 4
SEQ_COUNT: 2
---
SEQ: GCAT
SEQ_LEN: 4
SEQ_COUNT: 1
---
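For readers who want to see the counting logic spelled out, here is a minimal Python sketch of the same idea. It is not the Biopieces implementation: the function names read_fasta and count_unique are hypothetical helpers written for this illustration, and sequences are compared as exact strings.

```python
from collections import Counter


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_unique(records):
    """Count identical sequences; keys keep first-seen order."""
    counts = Counter()
    for _name, seq in records:
        counts[seq] += 1
    return counts


fasta = ">test1\nATGC\n>test2\nATGC\n>test3\nGCAT\n"
counts = count_unique(read_fasta(fasta))
for seq, n in counts.items():
    # Mimic the record layout shown above.
    print(f"SEQ: {seq}\nSEQ_LEN: {len(seq)}\nSEQ_COUNT: {n}\n---")
```

Note that the sequence names (test1, test2, test3) are dropped once identical sequences are collapsed, which matches the records shown above.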
Using the -c switch we can further reduce the records by checking the reverse-complement sequences for uniqueness:
read_fasta -i test.fna | uniq_seq -c
SEQ: GCAT
SEQ_LEN: 4
SEQ_COUNT: 3
---
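The reason all three records collapse into one is that GCAT is the reverse complement of ATGC. A small Python sketch of strand-aware counting follows. It is an illustration only: reverse_complement and count_unique_both_strands are hypothetical names, and using the lexicographically smaller strand as the key is an arbitrary assumption, so this sketch reports ATGC where uniq_seq reports GCAT above. The count is the same.

```python
from collections import Counter

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_unique_both_strands(seqs):
    """Count sequences, treating a sequence and its reverse complement as equal."""
    counts = Counter()
    for seq in seqs:
        # Assumption: the smaller of the two strands serves as the shared key.
        key = min(seq, reverse_complement(seq))
        counts[key] += 1
    return counts


counts = count_unique_both_strands(["ATGC", "ATGC", "GCAT"])
print(reverse_complement("ATGC"))  # GCAT
print(dict(counts))                # {'ATGC': 3}
```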
If you want to output these unique, counted sequences in FASTA format, first add a sequence name with add_ident and then merge the count into that name with merge_vals, since the count would otherwise be lost:
read_fasta -i test.fna | uniq_seq -c | add_ident -k SEQ_NAME | merge_vals -k SEQ_NAME,SEQ_COUNT | write_fasta -x
>ID00000000_3
GCAT
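The header construction can be sketched in Python as well. This is a hedged illustration, not Biopieces code: to_fasta is a hypothetical helper, and the header layout (the prefix ID, an eight-digit zero-padded counter starting at 0, an underscore, then the count) is inferred from the single output line above rather than from the add_ident or merge_vals documentation.

```python
def to_fasta(counts, prefix="ID"):
    """Format {sequence: count} as FASTA text with the count in the header."""
    lines = []
    for i, (seq, n) in enumerate(counts.items()):
        # Assumed layout: <prefix><8-digit index>_<count>
        lines.append(f">{prefix}{i:08d}_{n}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


fasta_text = to_fasta({"GCAT": 3})
print(fasta_text, end="")
```

Keeping the count in the header means a downstream tool can recover the abundance by splitting the name on the last underscore.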
Martin Asser Hansen - Copyright (C) - All rights reserved.
February 2010
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
uniq_seq is part of the Biopieces framework.