Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: split_seq

Description

split_seq splits each sequence from records in the stream into subsequences based on the word and step sizes. This results in overlapping or non-overlapping subsequences, which are output as separate records; the original sequence is not output to the stream. The positions (1-based) of each subsequence relative to the original sequence are appended in brackets to the subsequence name.

Note that the sequence is trimmed so that only subsequences of full word length are output: any trailing residues that do not fill a complete word are dropped.
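
The windowing described above can be sketched in Python. This is only an illustration of the behaviour documented on this page, not the Biopieces implementation; the function name split_windows and its parameters are hypothetical.

```python
def split_windows(name, seq, word_size=7, step_size=1):
    """Yield (subsequence_name, subsequence) pairs for full-length windows.

    Hypothetical helper: mirrors the documented behaviour of split_seq,
    not its actual source code.
    """
    # The last valid start is the one that still leaves a full word;
    # a shorter tail is never emitted.
    for start in range(0, len(seq) - word_size + 1, step_size):
        end = start + word_size
        # Positions in the name are 1-based and inclusive.
        yield f"{name}[{start + 1}-{end}]", seq[start:end]


for sub_name, sub_seq in split_windows("test", "ATGCACATTCGACTAGCA",
                                       word_size=8, step_size=4):
    print(sub_name, sub_seq)
```

With the 18-residue test sequence used in the examples on this page, a word size of 8 and a step size of 4 yield test[1-8], test[5-12] and test[9-16]; positions 17-18 cannot start a full word and are dropped.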

If quality scores are present as values of the SCORES key, these are split in the same way.
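
As a rough sketch of how scores can follow the sequence, the Python below slices a SCORES string with the same coordinates as SEQ. Representing a record as a dict, the function name split_record, and the sample score string are all assumptions made for illustration; only the key names come from this page.

```python
def split_record(record, word_size=7, step_size=1):
    """Return sub-records; SCORES is sliced alongside SEQ when present.

    Hypothetical helper: a record is modelled here as a plain dict.
    """
    seq = record["SEQ"]
    scores = record.get("SCORES")  # None if the record has no scores
    out = []
    for start in range(0, len(seq) - word_size + 1, step_size):
        end = start + word_size
        sub = {
            "SEQ_NAME": f"{record['SEQ_NAME']}[{start + 1}-{end}]",
            "SEQ": seq[start:end],
            "SEQ_LEN": word_size,
        }
        if scores is not None:
            # Same slice as the sequence, so each base keeps its own score.
            sub["SCORES"] = scores[start:end]
        out.append(sub)
    return out


# Made-up score string, one character per base.
record = {"SEQ_NAME": "test", "SEQ": "ATGCACATTCGA", "SCORES": "IIIIHHHHGGGG"}
for sub in split_record(record, word_size=8, step_size=4):
    print(sub)
```

Slicing both strings with one pair of indices keeps each base aligned with its score in every sub-record.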

Usage

... | split_seq [options]

Options

[-?         | --help]               #  Print full usage description.
[-w <uint>  | --word_size=<uint>]   #  Word size of subsequences       -  Default=7
[-s <uint>  | --step_size=<uint>]   #  Step size of sequence overlaps  -  Default=1
[-I <file!> | --stream_in=<file!>]  #  Read input from stream file     -  Default=STDIN
[-O <file>  | --stream_out=<file>]  #  Write output to stream file     -  Default=STDOUT
[-v         | --verbose]            #  Verbose output.

Examples

Consider the following FASTA entry in the file test.fna

>test
ATGCACATTCGACTAGCA

Read the sequence with read_fasta and pipe it to split_seq, using the -w switch to choose a word size of 12 (the step size defaults to 1):

read_fasta -i test.fna | split_seq -w 12

SEQ: ATGCACATTCGA
SEQ_LEN: 12
SEQ_NAME: test[1-12]
---
SEQ: TGCACATTCGAC
SEQ_LEN: 12
SEQ_NAME: test[2-13]
---
SEQ: GCACATTCGACT
SEQ_LEN: 12
SEQ_NAME: test[3-14]
---
SEQ: CACATTCGACTA
SEQ_LEN: 12
SEQ_NAME: test[4-15]
---
SEQ: ACATTCGACTAG
SEQ_LEN: 12
SEQ_NAME: test[5-16]
---
SEQ: CATTCGACTAGC
SEQ_LEN: 12
SEQ_NAME: test[6-17]
---
SEQ: ATTCGACTAGCA
SEQ_LEN: 12
SEQ_NAME: test[7-18]
---

Use the -s switch to set the step size, here 4 with a word size of 8, so that consecutive subsequences overlap by 4 positions:

read_fasta -i test.fna | split_seq -w 8 -s 4

SEQ: ATGCACAT
SEQ_LEN: 8
SEQ_NAME: test[1-8]
---
SEQ: ACATTCGA
SEQ_LEN: 8
SEQ_NAME: test[5-12]
---
SEQ: TCGACTAG
SEQ_LEN: 8
SEQ_NAME: test[9-16]
---

Or get non-overlapping subsequences by setting the step size equal to the word size:

read_fasta -i test.fna | split_seq -w 9 -s 9

SEQ: ATGCACATT
SEQ_LEN: 9
SEQ_NAME: test[1-9]
---
SEQ: CGACTAGCA
SEQ_LEN: 9
SEQ_NAME: test[10-18]
---
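
The number of records in each example above can be checked with a little arithmetic: for a sequence of length L, word size w and step size s, there are floor((L - w) / s) + 1 full-length windows when L >= w, and none otherwise. The Python below is a small sketch of that count, inferred from the documented behaviour rather than taken from the split_seq source; the function name window_count is hypothetical.

```python
def window_count(seq_len, word_size, step_size):
    """Number of full-length windows (hypothetical helper)."""
    if seq_len < word_size:
        return 0  # sequence too short for even one full word
    return (seq_len - word_size) // step_size + 1


print(window_count(18, 12, 1))  # 7 records for -w 12
print(window_count(18, 8, 4))   # 3 records for -w 8 -s 4
print(window_count(18, 9, 9))   # 2 records for -w 9 -s 9
```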

See also

read_fasta

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

split_seq is part of the Biopieces framework.

http://www.biopieces.org
