split_seq
split_seq splits each sequence from records in the stream into subsequences based on the word and step sizes. This results in overlapping or non-overlapping subsequences, which are output as separate records; the original sequence is not output to the stream. The positions (1-based) of each subsequence relative to the original sequence are appended in brackets to the subsequence name.
Note that the sequence is trimmed so only subsequences of full word length are output.
If quality scores are present as values of the SCORES key, these are split as well.
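The windowing described above can be sketched in Python. This is an illustration only, not the Biopieces implementation: the function below, the use of a plain dict as a record, and the treatment of SCORES as a string with one character per base are all assumptions made for the example.

```python
def split_seq(record, word_size=7, step_size=1):
    """Yield one record per full-length window of record["SEQ"].

    Hypothetical helper: models a stream record as a dict with the keys
    SEQ_NAME, SEQ and (optionally) SCORES.
    """
    seq = record["SEQ"]
    scores = record.get("SCORES")  # assumed: one score character per base

    # Stop at the last start position that still leaves a full word,
    # so trailing bases that cannot fill a word are dropped.
    for start in range(0, len(seq) - word_size + 1, step_size):
        end = start + word_size
        sub = {
            # Positions in the name are 1-based and inclusive.
            "SEQ_NAME": f'{record["SEQ_NAME"]}[{start + 1}-{end}]',
            "SEQ": seq[start:end],
            "SEQ_LEN": word_size,
        }
        if scores is not None:
            sub["SCORES"] = scores[start:end]  # split scores the same way
        yield sub


record = {"SEQ_NAME": "test", "SEQ": "ATGCACATTCGACTAGCA"}
for sub in split_seq(record, word_size=8, step_size=4):
    print(sub["SEQ_NAME"], sub["SEQ"])
# test[1-8] ATGCACAT
# test[5-12] ACATTCGA
# test[9-16] TCGACTAG
```

Note that the original record is never yielded, and that the last two bases of the 18-base sequence are dropped here because they cannot start a full 8-base word.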
... | split_seq [options]
[-? | --help] # Print full usage description.
[-w <uint> | --word_size=<uint>] # Word size of subsequences - Default=7
[-s <uint> | --step_size=<uint>] # Step size of sequence overlaps - Default=1
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entry in the file test.fna:
>test
ATGCACATTCGACTAGCA
To read the sequence use read_fasta, and pipe the result to split_seq using the -w switch to choose a word size of 12:
read_fasta -i test.fna | split_seq -w 12
SEQ: ATGCACATTCGA
SEQ_LEN: 12
SEQ_NAME: test[1-12]
---
SEQ: TGCACATTCGAC
SEQ_LEN: 12
SEQ_NAME: test[2-13]
---
SEQ: GCACATTCGACT
SEQ_LEN: 12
SEQ_NAME: test[3-14]
---
SEQ: CACATTCGACTA
SEQ_LEN: 12
SEQ_NAME: test[4-15]
---
SEQ: ACATTCGACTAG
SEQ_LEN: 12
SEQ_NAME: test[5-16]
---
SEQ: CATTCGACTAGC
SEQ_LEN: 12
SEQ_NAME: test[6-17]
---
SEQ: ATTCGACTAGCA
SEQ_LEN: 12
SEQ_NAME: test[7-18]
---
Use the -s switch to set the step size. A step size smaller than the word size gives overlapping subsequences:
read_fasta -i test.fna | split_seq -w 8 -s 4
SEQ: ATGCACAT
SEQ_LEN: 8
SEQ_NAME: test[1-8]
---
SEQ: ACATTCGA
SEQ_LEN: 8
SEQ_NAME: test[5-12]
---
SEQ: TCGACTAG
SEQ_LEN: 8
SEQ_NAME: test[9-16]
---
Or non-overlapping subsequences if the step_size is equal to the word_size:
read_fasta -i test.fna | split_seq -w 9 -s 9
SEQ: ATGCACATT
SEQ_LEN: 9
SEQ_NAME: test[1-9]
---
SEQ: CGACTAGCA
SEQ_LEN: 9
SEQ_NAME: test[10-18]
---
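The number of records in each of the three examples follows from the sequence length, word size and step size. A small sketch of that arithmetic, where `count_subsequences` is a hypothetical helper and not part of Biopieces:

```python
def count_subsequences(seq_len, word_size, step_size):
    """Number of full-length windows that fit in a sequence of seq_len bases."""
    if seq_len < word_size:
        return 0  # sequence too short for even one full word
    return (seq_len - word_size) // step_size + 1


# The 18-base test sequence with the settings used in the examples above.
for word_size, step_size in [(12, 1), (8, 4), (9, 9)]:
    print(word_size, step_size, count_subsequences(18, word_size, step_size))
# 12 1 7
# 8 4 3
# 9 9 2
```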
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
split_seq is part of the Biopieces framework.