Skip to content
Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: find_homopolymers

Description

find_homopolymers locates the stretches of homopolymeric redidues in DNA sequences in the stream.

SEQ_NAME: test0
SEQ: ACCAACCTCAAGGCACACAGGGGATACGGG
SEQ_LEN: 30
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 19
---

Usage

... | find_homopolymers [options]

Options

[-?          | --help]               #  Print full usage description.
[-m <uint>   | --min=<uint>]         #  Minimum homopolymer length to find  -  Default=1
[-l <uint>   | --limit=<uint>]       #  Limit number of homopolymers to output.
[-L          | --longest]            #  Output longest homopolymer found.
[-I <file!>  | --stream_in=<file!>]  #  Read input stream from file         -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output stream to file         -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following FASTA entries in the file test.fna:

>test1
attcccggggnnnnn
>test2
nnnnnggggccctta
>test3
a
>test4
aA

We read the file with read_fasta:

read_fasta -i test.fna | find_homopolymers

SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: A
HOMOPOL_LEN: 1
HOMOPOL_POS: 0
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: TT
HOMOPOL_LEN: 2
HOMOPOL_POS: 1
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 3
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 6
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 10
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 5
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 9
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: TT
HOMOPOL_LEN: 2
HOMOPOL_POS: 12
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: A
HOMOPOL_LEN: 1
HOMOPOL_POS: 14
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
HOMOPOL_PAT: A
HOMOPOL_LEN: 1
HOMOPOL_POS: 0
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
HOMOPOL_PAT: AA
HOMOPOL_LEN: 2
HOMOPOL_POS: 0
---

Or to locate homopolymer stretches longer than a minimum length use the -m switch:

read_fasta -i test.fna | find_homopolymers -m 3

SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 3
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 6
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 10
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 5
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 9
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
---

We can speedup the process by limiting the number of homopolymers found in each sequencing using the -h switch to limit to e.g. a single hit like this:

read_fasta -i test.fna | find_homopolymers -m 3 -l 1

SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 3
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
---

Alternatively, you can get the longest homopolymer in a sequence using the -L switch:

read_fasta -i test.fna | find_homopolymers -m 3 -L  

read_fasta -i test.fna | find_homopolymers -m 3 -L
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 10
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
---

See also

read_fasta

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

December 2010/June 2013

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

find_homopolymers is part of the Biopieces framework.

http://www.biopieces.org

Clone this wiki locally