Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: slice_align

Description

slice_align slices an alignment to extract a subsequence from every sequence in the stream. This is done by specifying begin and end coordinates (1-based) with the -b and -e switches; the subsequences are cut out **including** gaps. Alternatively, forward and reverse primers can be specified. These are searched for in the first sequence in the stream (allowing for a specified maximum number of mismatches, insertions, and deletions via the -m, -i, and -d switches, respectively), and the positions of the matches are used for slicing all sequences in the stream.

It is also possible to specify a template file using the -t switch. The template file should contain one FASTA formatted sequence from the alignment (with gaps). If a template file is specified, the begin and end coordinates refer to the nucleotide numbering of the ungapped template. If both a template file and primers are specified, the template sequence is used for the primer search and the match positions are used for slicing.

The sequences in the stream are replaced with the sliced subsequences.

Usage

... | slice_align [options]

Options

[-?          | --help]                     #  Print full usage description.
[-b <uint>   | --beg=<uint>]               #  Begin position of subsequence (first residue=1)
[-e <uint>   | --end=<uint>]               #  End position of subsequence
[-t <file!>  | --template_file=<string>]   #  File with one aligned sequence in FASTA format.
[-f <string> | --forward=<string>]         #  Forward primer (5'-3')
[-F <string> | --forward_rc=<string>]      #  Forward primer (3'-5')
[-r <string> | --reverse=<string>]         #  Reverse primer (3'-5')
[-R <string> | --reverse_rc=<string>]      #  Reverse primer (5'-3')
[-m <uint>   | --mismatches=<uint>]        #  Max number of mismatches     -  Default=2
[-i <uint>   | --insertions=<uint>]        #  Max number of insertions     -  Default=1
[-d <uint>   | --deletions=<uint>]         #  Max number of deletions      -  Default=1
[-I <file!>  | --stream_in=<file!>]        #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]        #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]                  #  Verbose output.

Examples

Consider the following alignment in the file align.fna:

>ID00000000
CCGCATACG-------CCCTGAGGGG----
>ID00000001
CCGCATGAT-------ACCTGAGGGT----
>ID00000002
CCGCATATACTCTTGACGCTAAAGCGTAGT
>ID00000003
CCGTATGTG-------CCCTTCGGGG----
>ID00000004
CCGGATAAG-------CCCTTACGGG----
>ID00000005
CCGGATAAG-------CCCTTACGGG----

We can slice the alignment with slice_align using coordinates:

read_fasta -i align.fna | slice_align -b 15 -e 28 | write_align -x
                          .
ID00000000       --CCCTGAGGGG--
ID00000001       --ACCTGAGGGT--
ID00000002       GACGCTAAAGCGTA
ID00000003       --CCCTTCGGGG--
ID00000004       --CCCTTACGGG--
ID00000005       --CCCTTACGGG--
Consensus: 50%   --CCCT-A-GGG--
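
To make the coordinate arithmetic concrete, here is a minimal Python sketch of what slicing by columns amounts to. It is an illustration only, not the Biopieces implementation; the `alignment` dictionary and the `slice_columns` helper are hypothetical names introduced for this example.

```python
# Illustration only: plain Python, not how slice_align is implemented.
# The sequences are copied from align.fna above.
alignment = {
    "ID00000000": "CCGCATACG-------CCCTGAGGGG----",
    "ID00000001": "CCGCATGAT-------ACCTGAGGGT----",
    "ID00000002": "CCGCATATACTCTTGACGCTAAAGCGTAGT",
    "ID00000003": "CCGTATGTG-------CCCTTCGGGG----",
    "ID00000004": "CCGGATAAG-------CCCTTACGGG----",
    "ID00000005": "CCGGATAAG-------CCCTTACGGG----",
}


def slice_columns(seq, beg, end):
    """Return alignment columns beg..end (1-based, inclusive), gaps included."""
    return seq[beg - 1:end]


# Same interval as -b 15 -e 28 in the command above.
sliced = {name: slice_columns(seq, 15, 28) for name, seq in alignment.items()}

for name, seq in sliced.items():
    print(name, seq)
```

Because the coordinates refer to alignment columns, every sequence is cut at the same columns and gap characters inside the interval are kept.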

Or we could slice the alignment using a set of primers:

read_fasta -i align.fna | slice_align -f CGCATACG -r GAGGGG -m 0 -i 0 -d 0 | write_align -x
                          .         .
ID00000000       CGCATACG-------CCCTGAGGGG
ID00000001       CGCATGAT-------ACCTGAGGGT
ID00000002       CGCATATACTCTTGACGCTAAAGCG
ID00000003       CGTATGTG-------CCCTTCGGGG
ID00000004       CGGATAAG-------CCCTTACGGG
ID00000005       CGGATAAG-------CCCTTACGGG
Consensus: 50%   CG-AT----------CCCT-A-GGG
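
The primer-based slice can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's actual search: it handles exact matches only (the -m 0 -i 0 -d 0 case), it assumes both primers are matched as written against the reference sequence with its gaps removed, and it assumes the reverse primer lies downstream of the forward primer. The function name `primer_columns` is hypothetical.

```python
def primer_columns(reference, forward, reverse, gap="-"):
    """Return the 1-based, inclusive alignment columns spanned by two primers.

    Assumptions (not confirmed by the slice_align documentation): exact
    matching only, primers matched as written, and the reverse primer is
    searched for downstream of the forward primer hit.
    """
    # Alignment column (1-based) of every non-gap residue in the reference.
    columns = [i + 1 for i, ch in enumerate(reference) if ch != gap]
    ungapped = reference.replace(gap, "")

    fwd = ungapped.find(forward)
    if fwd < 0:
        raise ValueError("forward primer not found")

    rev = ungapped.find(reverse, fwd + len(forward))
    if rev < 0:
        raise ValueError("reverse primer not found")

    # First column of the forward hit, last column of the reverse hit.
    return columns[fwd], columns[rev + len(reverse) - 1]


# The first sequence in the stream serves as the search reference.
first = "CCGCATACG-------CCCTGAGGGG----"
beg, end = primer_columns(first, "CGCATACG", "GAGGGG")
print(beg, end, first[beg - 1:end])
```

The resulting column pair is then applied to every sequence in the alignment, exactly as with -b and -e. Passing a gapped template sequence as `reference` instead of the first sequence gives the columns for the template case.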

If we have a template file with the following FASTA entry:

>template
CTGAATACG-------CCATTCGATGG---

and specify primers, these will be matched against the template and the hit positions used for slicing:

read_fasta -i align.fna | slice_align -t template.fna -f GAATACG -r ATTCGAT -m 0 -i 0 -d 0 | write_align -x
                          .         .
ID00000000       GCATACG-------CCCTGAGGG
ID00000001       GCATGAT-------ACCTGAGGG
ID00000002       GCATATACTCTTGACGCTAAAGC
ID00000003       GTATGTG-------CCCTTCGGG
ID00000004       GGATAAG-------CCCTTACGG
ID00000005       GGATAAG-------CCCTTACGG
Consensus: 50%   G-AT----------CCCT-A-GG

When a template file and an interval are specified, the positions used for slicing are the ungapped positions in the template sequence. This is useful if you are slicing 16S rRNA alignments and want the E. coli numbering: use the E. coli sequence as the template.

read_fasta -i align.fna | slice_align -t template.fna -b 5 -e 15 | write_align -x
                          .
ID00000000       ATACG-------CCCTGA
ID00000001       ATGAT-------ACCTGA
ID00000002       ATATACTCTTGACGCTAA
ID00000003       ATGTG-------CCCTTC
ID00000004       ATAAG-------CCCTTA
ID00000005       ATAAG-------CCCTTA
Consensus: 50%   AT----------CCCT-A
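
The mapping from ungapped template positions to alignment columns can be sketched in Python as below. This is an illustration of the coordinate conversion only, not the Biopieces code; `template_to_columns` is a hypothetical helper name.

```python
def template_to_columns(template, beg, end, gap="-"):
    """Map ungapped template positions (1-based, inclusive) to alignment columns."""
    # Alignment column (1-based) of every non-gap residue in the template.
    columns = [i + 1 for i, ch in enumerate(template) if ch != gap]
    return columns[beg - 1], columns[end - 1]


template = "CTGAATACG-------CCATTCGATGG---"

# Same interval as -b 5 -e 15 in the command above.
col_beg, col_end = template_to_columns(template, 5, 15)

first = "CCGCATACG-------CCCTGAGGGG----"
print(col_beg, col_end, first[col_beg - 1:col_end])
```

Template position 15 falls after the seven-column gap, so it corresponds to alignment column 22 rather than column 15, which is why the sliced output is 18 columns wide.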

See also

read_fasta

write_align

extract_seq

pcr_seq

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

December 2013

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

slice_align is part of the Biopieces framework.

http://www.biopieces.org
