Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: find_adaptor

Description

Use find_adaptor to locate adaptor sequences in the sequences in the stream. Adaptors are located by scanning each sequence from left to right, allowing for ambiguity codes as well as mismatches, insertions, and deletions.
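
As an illustration of what matching with ambiguity codes can mean, here is a minimal Python sketch that compares an adaptor with an equal-length stretch of sequence using the standard IUPAC nucleotide codes. The function names are hypothetical, and treating two codes as a match when they share at least one base is an assumption; the sketch does not claim to reproduce how find_adaptor itself handles ambiguity codes.

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bases_match(a: str, b: str) -> bool:
    """Return True if the two codes share at least one concrete base (assumed rule)."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


def count_mismatches(adaptor: str, window: str) -> int:
    """Count the positions where an adaptor and an equal-length window disagree."""
    if len(adaptor) != len(window):
        raise ValueError("adaptor and window must have the same length")
    return sum(not bases_match(a, b) for a, b in zip(adaptor, window))


# The forward adaptor from the Examples section against the stretch it is reported at.
print(count_mismatches("GTACCGAGCT", "GTACTGAGCT"))  # 1
# With an N at the differing position there is no mismatch left.
print(count_mismatches("GTACNGAGCT", "GTACTGAGCT"))  # 0
```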

Adaptor sequences can be specified lexically using the -f and -r switches, which correspond to the forward adaptor beginning with the 5'-end and the reverse-complement of the reverse adaptor beginning with the 3'-end, respectively. Using the -F and -R switches instead reverse-complements the given adaptor sequences.
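
For readers unsure what reverse-complementing an adaptor produces, the following Python sketch shows the operation on a plain string. The helper name is hypothetical, and the sketch only illustrates the transformation; it is not taken from the find_adaptor source.

```python
# Complement table covering A, C, G, T and the IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("TCGTATGCCG"))  # CGGCATACGA
```

Applying the function twice returns the original sequence, which is a quick sanity check when preparing adaptors by hand.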

It is possible to find partial adaptors at the ends of a sequence, all the way down to length 1, by specifying the minimum length of the adaptors to match using the -l and -L switches for the left and right end, respectively.

The allowed mismatches, insertions, and deletions are specified as percentages of the adaptor length, so that the limits adjust to the reduced length of partial adaptors. In the example below we search a sequence for a reverse adaptor of length 10, allowing 20%, 10%, and 5% for mismatches, insertions, and deletions, respectively. Thus we initially allow for m=2, i=1, d=1, with each percentage of the length rounded to the nearest whole number:

TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG - sequence
TCGTATGCCG   (scan all positions until the end)       - (m=2, i=1, d=1)
                                            TCGTATGCC - (m=2, i=1, d=0)
                                             TCGTATGC - (m=2, i=1, d=0)
                                              TCGTATG - (m=1, i=1, d=0)
                                               TCGTAT - (m=1, i=1, d=0)
                                                TCGTA - (m=1, i=1, d=0)
                                                 TCGT - (m=1, i=0, d=0)
                                                  TCG - (m=1, i=0, d=0) -> match!
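
The limits in the table above can be recomputed with a few lines of Python. This is a sketch under one assumption: that each percentage of the current adaptor length is rounded to the nearest integer with halves rounded up. That rule is inferred from the numbers in the table, not from the find_adaptor source, and the function names are hypothetical.

```python
def allowed(length: int, percent: int) -> int:
    """Percentage of `length`, rounded to the nearest integer (halves up, assumed)."""
    return (length * percent + 50) // 100


def allowances(adaptor_len: int, min_len: int, mis_pct: int, ins_pct: int, del_pct: int):
    """Return (length, m, i, d) for every adaptor length from full down to min_len."""
    rows = []
    for length in range(adaptor_len, min_len - 1, -1):
        rows.append((
            length,
            allowed(length, mis_pct),
            allowed(length, ins_pct),
            allowed(length, del_pct),
        ))
    return rows


# Adaptor of length 10 with 20% mismatches, 10% insertions, 5% deletions.
for length, m, i, d in allowances(10, 3, 20, 10, 5):
    print(f"len={length:2d}  m={m} i={i} d={d}")
```

Integer arithmetic is used instead of Python's built-in round(), which rounds halves to the nearest even number and would give d=0 for length 10.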

If a match is found a number of ADAPTOR_* keys are added to the record:

read_fasta -i test.fna | find_adaptor -r TCGTATGCCG -L 1 -m 20 -i 10 -d 5

SEQ_NAME: test
SEQ: TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG
SEQ_LEN: 53
ADAPTOR_POS_RIGHT: 50
ADAPTOR_LEN_RIGHT: 3
ADAPTOR_PAT_RIGHT: TCG
---
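
The position and length in the record above can be reproduced with a simplified scan. The Python sketch below is a deliberately reduced model: it counts mismatches only (no insertions, deletions, or ambiguity codes), assumes percentages are rounded to the nearest integer with halves up, and uses hypothetical function names. It is meant to make the scanning order concrete, not to mirror the actual find_adaptor implementation.

```python
def allowed(length: int, percent: int) -> int:
    """Percentage of `length`, rounded to the nearest integer (halves up, assumed)."""
    return (length * percent + 50) // 100


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_right_adaptor(seq: str, adaptor: str, min_len: int, mis_pct: int):
    """Return (pos, length, matched_text) for the first hit, or None."""
    full = len(adaptor)
    # Step 1: try the full-length adaptor at every position, left to right.
    for pos in range(len(seq) - full + 1):
        window = seq[pos:pos + full]
        if hamming(adaptor, window) <= allowed(full, mis_pct):
            return pos, full, window
    # Step 2: try ever shorter adaptor prefixes anchored at the right end.
    for length in range(min(full - 1, len(seq)), max(min_len, 1) - 1, -1):
        pos = len(seq) - length
        window = seq[pos:]
        if hamming(adaptor[:length], window) <= allowed(length, mis_pct):
            return pos, length, window
    return None


seq = "TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG"
print(find_right_adaptor(seq, "TCGTATGCCG", 1, 20))  # (50, 3, 'TCG')
```

With the minimum length left at the full adaptor length (10), the same call returns None, because no 10-base window of this sequence is within two mismatches of the adaptor.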

Once adaptors are located with find_adaptor, use clip_adaptor to remove the adaptor sequence.

Usage

... | find_adaptor <-f adaptor | -r adaptor> [options]

Options

[-?          | --help]                   #  Print full usage description.
[-f <string> | --forward=<string>]       #  Forward adaptor (5'-3') to locate.
[-F <string> | --forward_rc=<string>]    #  Forward adaptor (3'-5') to locate.
[-r <string> | --reverse=<string>]       #  Reverse adaptor (3'-5') to locate.
[-R <string> | --reverse_rc=<string>]    #  Reverse adaptor (5'-3') to locate.
[-l <uint>   | --len_forward=<uint>]     #  Length of forward adaptor part to locate  -  Default=<forward adaptor length>
[-L <uint>   | --len_reverse=<uint>]     #  Length of reverse adaptor part to locate  -  Default=<reverse adaptor length>
[-m <uint>   | --mismatches=<uint>]      #  Max mismatch percent allowed              -  Default=10
[-i <uint>   | --insertions=<uint>]      #  Max insertion percent allowed             -  Default=5
[-d <uint>   | --deletions=<uint>]       #  Max deletion percent allowed              -  Default=5
[-I <file!>  | --stream_in=<file!>]      #  Read input from stream file               -  Default=STDIN
[-O <file>   | --stream_out=<file>]      #  Write output to stream file               -  Default=STDOUT
[-v          | --verbose]                #  Verbose output.

Examples

Consider the following FASTA entry in the file test.fna:

>test
TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG

To locate the adaptors GTACCGAGCT and CGGATCGCAA do:

read_fasta -i test.fna | find_adaptor -f GTACCGAGCT -r CGGATCGCAA

SEQ_NAME: test
SEQ: TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG
SEQ_LEN: 53
ADAPTOR_POS_LEFT: 3
ADAPTOR_LEN_LEFT: 10
ADAPTOR_PAT_LEFT: GTACTGAGCT
ADAPTOR_POS_RIGHT: 40
ADAPTOR_LEN_RIGHT: 10
ADAPTOR_PAT_RIGHT: CGGATGGCAA
---

To find adaptors from both ends in an Illumina data set do:

read_fastq -i test.fq | find_adaptor -l 1 -L 1 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC ...

To find adaptors from both ends in a 454 data set do:

read_sff -i test.sff | find_adaptor -L 1 -f CGTATCGCCTCCCTCGCGCCATCAG -R CTATGCGCCTTGCCAGCCGCCAG ...

To get an overview of the adaptors found we can plot the positions of the adaptors to check if they are found at the expected positions:

read_sff -i test.sff |
find_adaptor -L 1 -f CGTATCGCCTCCCTCGCGCCATCAG -R CTATGCGCCTTGCCAGCCGCCAG |
plot_distribution -k ADAPTOR_POS_LEFT -o /dev/tty |
plot_distribution -k ADAPTOR_POS_RIGHT -x

We can also plot the length distribution of the adaptor parts to get an overview of how much of the adaptor was truncated:

read_sff -i test.sff |
find_adaptor -L 1 -f CGTATCGCCTCCCTCGCGCCATCAG -R CTATGCGCCTTGCCAGCCGCCAG |
plot_distribution -k ADAPTOR_LEN_LEFT -o /dev/tty |
plot_distribution -k ADAPTOR_LEN_RIGHT -x

See also

read_fasta

read_sff

clip_adaptor

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

April 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

find_adaptor is part of the Biopieces framework.

http://www.biopieces.org
