Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: uniq_seq

Description

uniq_seq counts all unique sequences in the stream and emits each unique sequence together with its count as a separate record. With the -c flag, reverse-complement sequences are also checked for uniqueness. This is useful for creating a non-redundant set of, e.g., sequence reads.

Note that uniq_seq is case sensitive, so use uppercase_seq first if necessary.

Usage

... | uniq_seq [options]

Options

[-?          | --help]               #  Print full usage description.
[-c          | --complement]         #  Complement sequences.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following FASTA file test.fna:

>test1
ATGC
>test2
ATGC
>test3
GCAT

To locate all unique sequences we use uniq_seq:

read_fasta -i test.fna | uniq_seq

SEQ: ATGC
SEQ_LEN: 4
SEQ_COUNT: 2
---
SEQ: GCAT
SEQ_LEN: 4
SEQ_COUNT: 1
---
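For readers without Biopieces at hand, the counting step can be approximated with standard Unix tools. This is only a sketch of the idea, not how uniq_seq is implemented: the function name count_unique_seqs is hypothetical, it assumes every sequence sits on a single line (no wrapped FASTA), and it prints tab-separated sequence/count pairs rather than Biopieces records.

```shell
# Hypothetical helper: count identical sequences in single-line FASTA on stdin.
# Assumption: each sequence occupies exactly one line after its header.
count_unique_seqs() {
    awk '!/^>/ && NF { count[$0]++ }
         END { for (seq in count) printf "%s\t%s\n", seq, count[seq] }' | sort
}

# Feed in the same three records as test.fna.
printf '>test1\nATGC\n>test2\nATGC\n>test3\nGCAT\n' | count_unique_seqs
```

On this input the sketch reports ATGC with a count of 2 and GCAT with a count of 1, the same counts as the SEQ_COUNT values above.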

Using the -c switch we can further reduce the records by also checking the reverse-complement sequences for uniqueness:

read_fasta -i test.fna | uniq_seq -c

SEQ: GCAT
SEQ_LEN: 4
SEQ_COUNT: 3
---
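The effect of folding a sequence and its reverse complement together can be sketched with awk. This is an illustration under stated assumptions, not the uniq_seq implementation: the function name count_unique_seqs_rc is hypothetical, input is single-line FASTA containing only uppercase A, C, G and T, and the representative kept for each pair is simply the lexically smaller string, which need not be the one uniq_seq emits.

```shell
# Hypothetical helper: count sequences, treating a sequence and its
# reverse complement as the same sequence.
count_unique_seqs_rc() {
    awk '
        # Return the reverse complement of an uppercase ACGT string.
        # Any other character is dropped (a limitation of this sketch).
        function revcomp(seq,    i, out) {
            out = ""
            for (i = length(seq); i >= 1; i--)
                out = out comp[substr(seq, i, 1)]
            return out
        }
        BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
        !/^>/ && NF {
            rc = revcomp($0)
            key = ($0 < rc) ? $0 : rc   # assumption: keep the lexically smaller form
            count[key]++
        }
        END { for (seq in count) printf "%s\t%s\n", seq, count[seq] }
    ' | sort
}

printf '>test1\nATGC\n>test2\nATGC\n>test3\nGCAT\n' | count_unique_seqs_rc
```

Because GCAT is the reverse complement of ATGC, all three records collapse into one entry with a count of 3. The sketch labels that entry ATGC where the uniq_seq example shows GCAT, so only the count is directly comparable.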

To output these unique, counted sequences in FASTA format, first give each record a sequence name with add_ident and then append the count to that name with merge_vals; otherwise the count would be lost:

read_fasta -i test.fna | uniq_seq -c | add_ident -k SEQ_NAME | merge_vals -k SEQ_NAME,SEQ_COUNT | write_fasta -x

>ID00000000_3
GCAT
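The naming step can also be mimicked outside Biopieces. The sketch below assumes tab-separated sequence/count pairs on standard input (a hypothetical intermediate format, not something Biopieces produces) and uses a hypothetical function name, pairs_to_fasta. It numbers the records from zero and appends the count to each name, imitating the identifier pattern shown in the output above.

```shell
# Hypothetical helper: turn "sequence<TAB>count" lines into FASTA entries
# named ID<8-digit index>_<count>, with the index starting at zero.
pairs_to_fasta() {
    awk -F '\t' '{ printf ">ID%08d_%s\n%s\n", NR - 1, $2, $1 }'
}

printf 'GCAT\t3\n' | pairs_to_fasta
```

For the single pair given here the sketch prints the header >ID00000000_3 followed by GCAT.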

See also

read_fasta

uniq_vals

uppercase_seq

merge_vals

write_fasta

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

February 2010

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

uniq_seq is part of the Biopieces framework.

http://www.biopieces.org