slice_align
slice_align slices an alignment to extract a subsequence from all sequences in the stream.
This is done by specifying begin and end coordinates (1-based) using the -b and -e switches for cutting out the subsequences **including** gaps. Alternatively, it is possible to specify forward and reverse primers, which are searched for in the first sequence in the stream (allowing for a specified maximum number of mismatches, insertions, and deletions using the -m, -i, and -d switches, respectively); the positions of the matches are then used for slicing all sequences in the stream.
It is also possible to specify a template file using the -t switch. The template file should
be a file with one FASTA formatted sequence from the alignment (with gaps). If a template file
is specified, the begin and end coordinates use the nucleotide numbering of the
ungapped template. If both a template file and primers are specified, the template sequence is
used for the primer search and the hit positions are used for slicing.
The sequences in the stream are replaced with the sliced subsequences.
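To illustrate what a mismatch limit means in the primer search described above, here is a minimal Python sketch. It is not the slice_align implementation: the function name find_with_mismatches is hypothetical, and the sketch only counts substitutions, so it corresponds to running with insertions and deletions set to 0.

```python
def find_with_mismatches(sequence, primer, max_mismatches):
    """Return the 0-based start of the first window that differs from
    primer in at most max_mismatches positions, or -1 if there is none.

    Assumption: substitutions only; insertions and deletions are not handled.
    """
    for start in range(len(sequence) - len(primer) + 1):
        window = sequence[start:start + len(primer)]
        # Count positions where the window and the primer disagree.
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches:
            return start
    return -1


# Exact hit, one substitution allowed, and the same primer with none allowed.
print(find_with_mismatches("CCGCATACG", "CGCATACG", 0))  # 1
print(find_with_mismatches("CCGCATACG", "CGGATACG", 1))  # 1
print(find_with_mismatches("CCGCATACG", "CGGATACG", 0))  # -1
```

A full search that also tolerates insertions and deletions would need an edit-distance style comparison rather than this position-by-position count.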
... | slice_align [options]
[-? | --help] # Print full usage description.
[-b <uint> | --beg=<uint>] # Begin position of subsequence (first residue=1)
[-e <uint> | --end=<uint>] # End position of subsequence
[-t <file!> | --template_file=<string>] # File with one aligned sequence in FASTA format.
[-f <string> | --forward=<string>] # Forward primer (5'-3')
[-F <string> | --forward_rc=<string>] # Forward primer (3'-5')
[-r <string> | --reverse=<string>] # Reverse primer (3'-5')
[-R <string> | --reverse_rc=<string>] # Reverse primer (5'-3')
[-m <uint> | --mismatches=<uint>] # Max number of mismatches - Default=2
[-i <uint> | --insertions=<uint>] # Max number of insertions - Default=1
[-d <uint> | --deletions=<uint>] # Max number of deletions - Default=1
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following alignment in the file align.fna:
>ID00000000
CCGCATACG-------CCCTGAGGGG----
>ID00000001
CCGCATGAT-------ACCTGAGGGT----
>ID00000002
CCGCATATACTCTTGACGCTAAAGCGTAGT
>ID00000003
CCGTATGTG-------CCCTTCGGGG----
>ID00000004
CCGGATAAG-------CCCTTACGGG----
>ID00000005
CCGGATAAG-------CCCTTACGGG----
We can slice the alignment with slice_align using coordinates:
read_fasta -i align.fna | slice_align -b 15 -e 28 | write_align -x
.
ID00000000 --CCCTGAGGGG--
ID00000001 --ACCTGAGGGT--
ID00000002 GACGCTAAAGCGTA
ID00000003 --CCCTTCGGGG--
ID00000004 --CCCTTACGGG--
ID00000005 --CCCTTACGGG--
Consensus: 50% --CCCT-A-GGG--
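The coordinate slicing above amounts to cutting the same column range out of every aligned sequence, with gap characters counted as ordinary columns. A minimal Python sketch of that idea follows; the function name slice_columns is hypothetical and the code is an illustration, not the slice_align source.

```python
def slice_columns(records, beg, end):
    """Cut columns beg..end (1-based, inclusive) out of every aligned sequence."""
    if beg < 1 or end < beg:
        raise ValueError("expected 1 <= beg <= end")
    # Gaps are treated like any other column, so they are kept in the slice.
    return [(name, seq[beg - 1:end]) for name, seq in records]


alignment = [
    ("ID00000000", "CCGCATACG-------CCCTGAGGGG----"),
    ("ID00000001", "CCGCATGAT-------ACCTGAGGGT----"),
    ("ID00000002", "CCGCATATACTCTTGACGCTAAAGCGTAGT"),
    ("ID00000003", "CCGTATGTG-------CCCTTCGGGG----"),
    ("ID00000004", "CCGGATAAG-------CCCTTACGGG----"),
    ("ID00000005", "CCGGATAAG-------CCCTTACGGG----"),
]

# Same interval as the -b 15 -e 28 example.
for name, seq in slice_columns(alignment, 15, 28):
    print(name, seq)
```

The printed sequences reproduce the sliced rows shown above; the consensus line is produced by write_align and is not part of this sketch.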
Or we could slice the alignment using a set of primers:
read_fasta -i align.fna | slice_align -f CGCATACG -r GAGGGG -m 0 -i 0 -d 0 | write_align -x
. .
ID00000000 CGCATACG-------CCCTGAGGGG
ID00000001 CGCATGAT-------ACCTGAGGGT
ID00000002 CGCATATACTCTTGACGCTAAAGCG
ID00000003 CGTATGTG-------CCCTTCGGGG
ID00000004 CGGATAAG-------CCCTTACGGG
ID00000005 CGGATAAG-------CCCTTACGGG
Consensus: 50% CG-AT----------CCCT-A-GGG
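The primer-based slicing above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the slice_align implementation: the function name primer_slice is hypothetical, only exact matches are handled (as with -m 0 -i 0 -d 0), both primers are matched literally as given, and the search is done on the gap-stripped first sequence with hits mapped back to alignment columns. How slice_align itself treats gaps during the search is not stated here.

```python
def primer_slice(records, forward, reverse):
    """Slice every sequence between exact primer hits found in the first sequence."""
    first = records[0][1]
    # Alignment column (0-based) of each residue in the gap-stripped first sequence.
    columns = [i for i, ch in enumerate(first) if ch != "-"]
    ungapped = first.replace("-", "")

    fwd = ungapped.find(forward)
    if fwd < 0:
        raise ValueError("forward primer not found")
    # Look for the reverse primer downstream of the forward primer hit.
    rev = ungapped.find(reverse, fwd + len(forward))
    if rev < 0:
        raise ValueError("reverse primer not found")

    beg = columns[fwd]                      # first column of the forward hit
    end = columns[rev + len(reverse) - 1]   # last column of the reverse hit
    return [(name, seq[beg:end + 1]) for name, seq in records]


alignment = [
    ("ID00000000", "CCGCATACG-------CCCTGAGGGG----"),
    ("ID00000001", "CCGCATGAT-------ACCTGAGGGT----"),
    ("ID00000002", "CCGCATATACTCTTGACGCTAAAGCGTAGT"),
    ("ID00000003", "CCGTATGTG-------CCCTTCGGGG----"),
    ("ID00000004", "CCGGATAAG-------CCCTTACGGG----"),
    ("ID00000005", "CCGGATAAG-------CCCTTACGGG----"),
]

for name, seq in primer_slice(alignment, "CGCATACG", "GAGGGG"):
    print(name, seq)
```

With the primers from the example, the hits span alignment columns 2 to 26 of the first sequence, and that column range is applied to all sequences.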
If we have a template file with the following FASTA entry:
>template
CTGAATACG-------CCATTCGATGG---
and specify primers, these are matched to the template and the hit positions are used for slicing:
read_fasta -i align.fna | slice_align -t template.fna -f GAATACG -r ATTCGAT -m 0 -i 0 -d 0 | write_align -x
. .
ID00000000 GCATACG-------CCCTGAGGG
ID00000001 GCATGAT-------ACCTGAGGG
ID00000002 GCATATACTCTTGACGCTAAAGC
ID00000003 GTATGTG-------CCCTTCGGG
ID00000004 GGATAAG-------CCCTTACGG
ID00000005 GGATAAG-------CCCTTACGG
Consensus: 50% G-AT----------CCCT-A-GG
When a template file and an interval are specified, the positions used for slicing are the ungapped positions from the template sequence. This is useful if you are slicing 16S rRNA alignments and want the E. coli numbering: in that case, use the E. coli sequence as template.
read_fasta -i align.fna | slice_align -t template.fna -b 5 -e 15 | write_align -x
.
ID00000000 ATACG-------CCCTGA
ID00000001 ATGAT-------ACCTGA
ID00000002 ATATACTCTTGACGCTAA
ID00000003 ATGTG-------CCCTTC
ID00000004 ATAAG-------CCCTTA
ID00000005 ATAAG-------CCCTTA
Consensus: 50% AT----------CCCT-A
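The translation from ungapped template positions to alignment columns can be sketched in Python as below. The function name template_columns is hypothetical, and the code only illustrates the numbering described above; it is not taken from slice_align.

```python
def template_columns(template, beg, end):
    """Translate ungapped template positions (1-based, inclusive)
    into alignment columns (1-based, inclusive)."""
    # 1-based alignment column of every non-gap residue in the template.
    columns = [i + 1 for i, ch in enumerate(template) if ch != "-"]
    if not 1 <= beg <= end <= len(columns):
        raise ValueError("interval lies outside the ungapped template")
    return columns[beg - 1], columns[end - 1]


template = "CTGAATACG-------CCATTCGATGG---"
col_beg, col_end = template_columns(template, 5, 15)
print(col_beg, col_end)  # 5 22

# Apply the resulting column range to an aligned sequence from align.fna.
sequence = "CCGCATACG-------CCCTGAGGGG----"
print(sequence[col_beg - 1:col_end])  # ATACG-------CCCTGA
```

Template residue 15 sits in alignment column 22 because the seven gap columns are skipped when counting, which is why the slice is 18 columns wide even though the interval covers 11 template residues.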
Martin Asser Hansen - Copyright (C) - All rights reserved.
December 2013
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
slice_align is part of the Biopieces framework.