find_adaptor
To find adaptor sequences in the sequences in the stream, use find_adaptor. Adaptors are located by scanning each sequence from left to right, allowing for ambiguity codes as well as mismatches, insertions, and deletions.
Adaptor sequences can be specified lexically using the -f and -r switches, which correspond to the forward adaptor beginning with the 5'-end and the reverse-complement of the reverse adaptor beginning with the 3'-end. Using the -F and -R switches reverse-complements the adaptor sequences.
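To illustrate what reverse-complementing an adaptor means, here is a minimal Python sketch. It is not find_adaptor's own code; the function name reverse_complement and the table name IUPAC_COMPLEMENT are hypothetical, and the table assumes the standard IUPAC nucleotide ambiguity codes.

```python
# Complement table covering the standard IUPAC nucleotide ambiguity codes
# (illustrative only; find_adaptor's internal handling may differ).
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of a nucleotide sequence."""
    return seq.translate(IUPAC_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Reverse-complement one of the adaptors used in the examples on this page.
    print(reverse_complement("CTATGCGCCTTGCCAGCCGCCAG"))
```

Ambiguity codes map to the code for the complementary base set (for example R, meaning A or G, becomes Y, meaning C or T), so an adaptor containing them can still be reverse-complemented.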
It is possible to find partial adaptors at the ends, all the way down to length 1, by specifying the minimum length of the adaptors to match using the -l and -L switches for the left and right end, respectively.
The mismatches, insertions, and deletions are specified as percentages of the adaptor length to adjust for the reduced length of partial adaptors. In the example below we search a sequence for a reverse adaptor of length 10, allowing 20%, 10%, and 5% for mismatches, insertions, and deletions, respectively. Thus we initially allow for m=2, i=1, d=1, and the allowed counts shrink as the partial adaptor gets shorter:
TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG - sequence
TCGTATGCCG (scan all positions until the end)         - (m=2, i=1, d=1)
                                            TCGTATGCC - (m=2, i=1, d=0)
                                             TCGTATGC - (m=2, i=1, d=0)
                                              TCGTATG - (m=1, i=1, d=0)
                                               TCGTAT - (m=1, i=1, d=0)
                                                TCGTA - (m=1, i=1, d=0)
                                                 TCGT - (m=1, i=0, d=0)
                                                  TCG - (m=1, i=0, d=0) -> match!
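The per-length limits in the example above can be reproduced with a short Python sketch. It assumes the percentages are scaled by the partial adaptor length and rounded to the nearest integer with halves rounded up. This rounding rule is inferred from the numbers in the example, not taken from the find_adaptor source, and the function names allowed_edits and limits are hypothetical.

```python
def allowed_edits(length: int, percent: int) -> int:
    """Scale a percentage to an edit count for an adaptor part of this length.

    Assumption: round to nearest, halves up, done in integer arithmetic
    to avoid floating-point surprises.
    """
    return (length * percent + 50) // 100


def limits(length: int, mis_pct: int = 20, ins_pct: int = 10, del_pct: int = 5):
    """Return (mismatches, insertions, deletions) allowed at this length."""
    return (
        allowed_edits(length, mis_pct),
        allowed_edits(length, ins_pct),
        allowed_edits(length, del_pct),
    )


if __name__ == "__main__":
    adaptor = "TCGTATGCCG"
    # Walk from the full adaptor down to the 3-base prefix that matched.
    for length in range(len(adaptor), 2, -1):
        m, i, d = limits(length)
        print(f"{adaptor[:length]:<10} m={m}, i={i}, d={d}")
```

Under this assumption a deletion is only permitted for the full-length adaptor (5% of 10 is 0.5, which rounds up to 1), while a single mismatch is still tolerated for the 3-base prefix.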
If a match is found, a number of ADAPTOR_* keys are added to the record:
read_fasta -i test.fna | find_adaptor -r TCGTATGCCG -L 1 -m 20 -i 10 -d 5
SEQ_NAME: test
SEQ: TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG
SEQ_LEN: 53
ADAPTOR_POS_RIGHT: 50
ADAPTOR_LEN_RIGHT: 3
ADAPTOR_PAT_RIGHT: TCG
---
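For post-processing outside the pipeline, records like the one above can be read into dictionaries. The following Python sketch assumes every record consists of "KEY: value" lines terminated by a line containing only "---", as in the output shown. The function name parse_records is hypothetical and not part of Biopieces.

```python
import io


def parse_records(stream):
    """Yield one dict per record from 'KEY: value' lines ending in '---'."""
    record = {}
    for line in stream:
        line = line.rstrip("\n")
        if line == "---":
            # End of record: emit it and start a fresh one.
            if record:
                yield record
            record = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key] = value
    if record:
        # Tolerate a final record that lacks the '---' terminator.
        yield record


EXAMPLE = """SEQ_NAME: test
SEQ: TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG
SEQ_LEN: 53
ADAPTOR_POS_RIGHT: 50
ADAPTOR_LEN_RIGHT: 3
ADAPTOR_PAT_RIGHT: TCG
---
"""

if __name__ == "__main__":
    for rec in parse_records(io.StringIO(EXAMPLE)):
        print(rec["SEQ_NAME"], rec["ADAPTOR_POS_RIGHT"], rec["ADAPTOR_PAT_RIGHT"])
```

To read from a pipeline instead of the embedded example, pass sys.stdin to parse_records. All values are kept as strings, so numeric keys such as ADAPTOR_POS_RIGHT need an explicit int() conversion.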
Once adaptors are located with find_adaptor, the ADAPTOR_* keys can be used to remove the adaptor sequence.
... | find_adaptor <-f adaptor | -r adaptor> [options]
[-? | --help] # Print full usage description.
[-f <string> | --forward=<string>] # Forward adaptor (5'-3') to locate.
[-F <string> | --forward_rc=<string>] # Forward adaptor (3'-5') to locate.
[-r <string> | --reverse=<string>] # Reverse adaptor (3'-5') to locate.
[-R <string> | --reverse_rc=<string>] # Reverse adaptor (5'-3') to locate.
[-l <uint> | --len_forward=<uint>] # Length of forward adaptor part to locate - Default=<forward adaptor length>
[-L <uint> | --len_reverse=<uint>] # Length of reverse adaptor part to locate - Default=<reverse adaptor length>
[-m <uint> | --mismatches=<uint>] # Max mismatch percent allowed - Default=10
[-i <uint> | --insertions=<uint>] # Max insertion percent allowed - Default=5
[-d <uint> | --deletions=<uint>] # Max deletion percent allowed - Default=5
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entry in the file test.fna:
>test
TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG
To locate the adaptors GTACCGAGCT and CGGATCGCAA do:
read_fasta -i test.fna | find_adaptor -f GTACCGAGCT -r CGGATCGCAA
SEQ_NAME: test
SEQ: TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG
SEQ_LEN: 53
ADAPTOR_POS_LEFT: 3
ADAPTOR_LEN_LEFT: 10
ADAPTOR_PAT_LEFT: GTACTGAGCT
ADAPTOR_POS_RIGHT: 40
ADAPTOR_LEN_RIGHT: 10
ADAPTOR_PAT_RIGHT: CGGATGGCAA
---
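The reported positions in this record are consistent with 0-based offsets into SEQ, which the following Python sketch checks by slicing. The values are copied from the record above and the variable names are hypothetical. The mismatch count is a plain position-by-position comparison for illustration, not find_adaptor's matching algorithm.

```python
seq = "TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG"

# Values copied from the ADAPTOR_* keys of the record above.
left_pos, left_len = 3, 10
right_pos, right_len = 40, 10

# Slice out the matched regions using 0-based positions.
left_hit = seq[left_pos:left_pos + left_len]
right_hit = seq[right_pos:right_pos + right_len]

# Everything between the two adaptor hits.
between = seq[left_pos + left_len:right_pos]


def mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(left_hit)   # GTACTGAGCT
    print(right_hit)  # CGGATGGCAA
    print(between)    # ACAGTACACGATGCGTCCAGGAACCAT
    print(mismatches("GTACCGAGCT", left_hit))   # 1
    print(mismatches("CGGATCGCAA", right_hit))  # 1
```

Each hit differs from its adaptor at a single position, which fits the default of 10% mismatches for a 10-base adaptor.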
To find adaptors from both ends in an Illumina data set do:
read_fastq -i test.fq | find_adaptor -l 1 -L 1 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC ...
To find adaptors from both ends in a 454 data set do:
read_sff -i test.sff | find_adaptor -L 1 -f CGTATCGCCTCCCTCGCGCCATCAG -R CTATGCGCCTTGCCAGCCGCCAG ...
To get an overview of the adaptors found we can plot the positions of the adaptors to check if they are found at the expected positions:
read_sff -i test.sff |
find_adaptor -L 1 -f CGTATCGCCTCCCTCGCGCCATCAG -R CTATGCGCCTTGCCAGCCGCCAG |
plot_distribution -k ADAPTOR_POS_LEFT -o /dev/tty |
plot_distribution -k ADAPTOR_POS_RIGHT -x
We can also plot the length distribution of the adaptor parts to get an overview of how much of the adaptor was truncated:
read_sff -i test.sff |
find_adaptor -L 1 -f CGTATCGCCTCCCTCGCGCCATCAG -R CTATGCGCCTTGCCAGCCGCCAG |
plot_distribution -k ADAPTOR_LEN_LEFT -o /dev/tty |
plot_distribution -k ADAPTOR_LEN_RIGHT -x
Martin Asser Hansen - Copyright (C) - All rights reserved.
April 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
find_adaptor is part of the Biopieces framework.