-
Notifications
You must be signed in to change notification settings - Fork 23
find_homopolymers
Martin Asser Hansen edited this page Oct 2, 2015
·
6 revisions
find_homopolymers locates the stretches of homopolymeric redidues in DNA sequences in the stream.
SEQ_NAME: test0
SEQ: ACCAACCTCAAGGCACACAGGGGATACGGG
SEQ_LEN: 30
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 19
---
... | find_homopolymers [options]
[-? | --help] # Print full usage description.
[-m <uint> | --min=<uint>] # Minimum homopolymer length to find - Default=1
[-l <uint> | --limit=<uint>] # Limit number of homopolymers to output.
[-L | --longest] # Output longest homopolymer found.
[-I <file!> | --stream_in=<file!>] # Read input stream from file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entries in the file test.fna
:
>test1
attcccggggnnnnn
>test2
nnnnnggggccctta
>test3
a
>test4
aA
We read the file with read_fasta:
read_fasta -i test.fna | find_homopolymers
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: A
HOMOPOL_LEN: 1
HOMOPOL_POS: 0
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: TT
HOMOPOL_LEN: 2
HOMOPOL_POS: 1
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 3
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 6
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 10
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 5
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 9
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: TT
HOMOPOL_LEN: 2
HOMOPOL_POS: 12
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: A
HOMOPOL_LEN: 1
HOMOPOL_POS: 14
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
HOMOPOL_PAT: A
HOMOPOL_LEN: 1
HOMOPOL_POS: 0
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
HOMOPOL_PAT: AA
HOMOPOL_LEN: 2
HOMOPOL_POS: 0
---
Or to locate homopolymer stretches longer than a minimum length use the -m
switch:
read_fasta -i test.fna | find_homopolymers -m 3
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 3
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 6
---
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 10
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: GGGG
HOMOPOL_LEN: 4
HOMOPOL_POS: 5
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 9
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
---
We can speedup the process by limiting the number of homopolymers found in each
sequencing using the -h
switch to limit to e.g. a single hit like this:
read_fasta -i test.fna | find_homopolymers -m 3 -l 1
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: CCC
HOMOPOL_LEN: 3
HOMOPOL_POS: 3
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
---
Alternatively, you can get the longest homopolymer in a sequence using the -L
switch:
read_fasta -i test.fna | find_homopolymers -m 3 -L
read_fasta -i test.fna | find_homopolymers -m 3 -L
SEQ_NAME: test1
SEQ: attcccggggnnnnn
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 10
---
SEQ_NAME: test2
SEQ: nnnnnggggccctta
SEQ_LEN: 15
HOMOPOL_PAT: NNNNN
HOMOPOL_LEN: 5
HOMOPOL_POS: 0
---
SEQ_NAME: test3
SEQ: a
SEQ_LEN: 1
---
SEQ_NAME: test4
SEQ: aA
SEQ_LEN: 2
---
Martin Asser Hansen - Copyright (C) - All rights reserved.
December 2010/June 2013
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
find_homopolymers is part of the Biopieces framework.