remove_adaptor
NB! remove_adaptor instead!
To remove an adaptor sequence from the sequences in the stream, use remove_adaptor, which locates the adaptor and removes it, allowing for a number of mismatches (but no indels).
The remove modes available are:
- before - Removes sequence before and including the adaptor
- after - Removes sequence after and including the adaptor
- skip - Do not remove adaptor
For records with both a sequence (SEQ) and a quality score string (SCORE), both are trimmed when the remove mode is 'before' or 'after'. Make sure the SCORE string is ASCII encoded, not semicolon-separated decimals.
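If scores arrive as semicolon-separated decimals, they need converting to one ASCII character per base before trimming. The following is a minimal Python sketch of such a conversion, not part of Biopieces. The function name `decimals_to_ascii` is hypothetical, and the encoding base (for example 33 or 64) is an assumption that depends on how the scores were produced, so it is passed explicitly rather than guessed.

```python
def decimals_to_ascii(scores, base):
    """Convert e.g. "40;30;20" into a one-character-per-base score string.

    `base` is the ASCII offset added to each decimal score. It is NOT
    taken from the remove_adaptor documentation; pick the value that
    matches your data.
    """
    values = [int(field) for field in scores.split(";") if field.strip()]
    return "".join(chr(value + base) for value in values)


if __name__ == "__main__":
    print(decimals_to_ascii("40;30;20", base=64))
```

After conversion, the SCORE string has the same length as SEQ, so both can be cut at the same position.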
NB! Only the first occurrence of the adaptor in any one sequence is located.
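The behaviour described above can be sketched in Python. This illustrates the idea (a left-to-right scan for the first window within the mismatch limit, followed by trimming) and is not the Biopieces implementation. The function names `find_adaptor` and `trim_record` are hypothetical, case-insensitive comparison is an assumption, and the 0-based ADAPTOR_POS is inferred from the examples on this page.

```python
def find_adaptor(seq, adaptor, max_mismatches=0, offset=1):
    """Return the 0-based position of the first match, or -1 if none.

    Only substitutions are counted (no indels). `offset` is 1-based,
    mirroring the --offset option. Upper-casing both strings is an
    assumption of this sketch.
    """
    seq_u, ada_u = seq.upper(), adaptor.upper()
    start = max(offset - 1, 0)
    for pos in range(start, len(seq_u) - len(ada_u) + 1):
        window = seq_u[pos:pos + len(ada_u)]
        mismatches = sum(a != b for a, b in zip(window, ada_u))
        if mismatches <= max_mismatches:
            return pos  # first occurrence only
    return -1


def trim_record(record, adaptor, max_mismatches=0, offset=1, mode="after"):
    """Return a copy of `record` with ADAPTOR_POS set and SEQ/SCORE trimmed."""
    if mode not in ("before", "after", "skip"):
        raise ValueError("mode must be 'before', 'after' or 'skip'")
    pos = find_adaptor(record["SEQ"], adaptor, max_mismatches, offset)
    out = dict(record)
    out["ADAPTOR_POS"] = pos
    if pos >= 0 and mode != "skip":
        if mode == "after":
            keep = slice(0, pos)                    # drop adaptor and what follows
        else:
            keep = slice(pos + len(adaptor), None)  # drop adaptor and what precedes
        out["SEQ"] = record["SEQ"][keep]
        if "SCORE" in record:
            out["SCORE"] = record["SCORE"][keep]    # keep scores aligned with SEQ
    out["SEQ_LEN"] = len(out["SEQ"])
    return out


if __name__ == "__main__":
    rec = {"SEQ_NAME": "CE5_ID00000000",
           "SEQ": "GAGGAAGAAGGAATATTTATCGTATGCCGTCTT"}
    print(trim_record(rec, "TCGTATGCC", max_mismatches=2))
```

Applying the same slice to SEQ and SCORE keeps the two strings the same length, which is why the SCORE string must hold one character per base.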
... | remove_adaptor [options]
[-? | --help] # Print full usage description.
[-a <string> | --adaptor=<string>] # Adaptor sequence to locate and remove.
[-m <uint> | --mismatches=<uint>] # Max number of mismatches - Default=0
[-o <uint> | --offset=<uint>] # Search sequence from offset (1-based) - Default=1
[-r <string> | --remove=<string>] # Remove mode: before|after|skip - Default=after
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entries in test.fna:
>CE5_ID00000000
GAGGAAGAAGGAATATTTATCGTATGCCGTCTT
>CE5_ID00000001
GAGGAAGAAGGAATATTTTTCGTATGCCGTCTT
>CE5_ID00000002
GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
>CE5_ID00000003
GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
>CE5_ID00000004
GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
To locate the standard Solexa 3' adaptor TCGTATGCCGTCTTCTGCTTG, use remove_adaptor with the first part of the adaptor and allow for two mismatches with the -m switch:
read_fasta -i test.fna | remove_adaptor -a TCGTATGCC -m 2
The resulting output has the adaptor sequence removed wherever it was found. An ADAPTOR_POS key is also added to each record. An ADAPTOR_POS of -1 indicates that no adaptor sequence was found, and this value can be used with grab.
SEQ: GAGGAAGAAGGAATATTTA
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 19
---
SEQ: GAGGAAGAAGGAATATTTT
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000001
SEQ_LEN: 19
---
SEQ: GAATGTAAGGAAGTGTGTGGAT
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000002
SEQ_LEN: 22
---
SEQ: GTAGGATGAGTGACTACTCAAa
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000004
SEQ_LEN: 22
---
Use the -r before switch to locate 5' adaptors:
read_fasta -i test.fna | remove_adaptor -a GAAGAAGG -r before
SEQ: AATATTTATCGTATGCCGTCTT
ADAPTOR_POS: 3
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 22
---
SEQ: AATATTTTTCGTATGCCGTCTT
ADAPTOR_POS: 3
SEQ_NAME: CE5_ID00000001
SEQ_LEN: 22
---
SEQ: GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000002
SEQ_LEN: 33
---
SEQ: GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000003
SEQ_LEN: 33
---
SEQ: GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000004
SEQ_LEN: 33
---
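Records with an ADAPTOR_POS of -1, like the last three above, can be separated from the hits downstream. Within Biopieces this is a job for grab. Outside it, the record text can be parsed directly, as in the Python sketch below. It assumes the `KEY: value` lines and `---` separators shown in the output above, and the function names `parse_records` and `split_by_adaptor` are hypothetical.

```python
def parse_records(text):
    """Yield one dict per record from 'KEY: value' lines separated by '---'."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if record:
                yield record
            record = {}
        elif line:
            key, _, value = line.partition(": ")
            record[key] = value  # values are kept as strings
    if record:
        yield record  # tolerate a missing final separator


def split_by_adaptor(records):
    """Return (hits, misses): records with and without an adaptor match."""
    hits, misses = [], []
    for record in records:
        if int(record["ADAPTOR_POS"]) == -1:
            misses.append(record)
        else:
            hits.append(record)
    return hits, misses


if __name__ == "__main__":
    import sys
    hits, misses = split_by_adaptor(parse_records(sys.stdin.read()))
    print(len(hits), "with adaptor,", len(misses), "without")
```

Keeping the misses rather than discarding them makes it possible to rerun the search on them with a higher mismatch limit.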
Using the -r skip switch will suppress the adaptor removal:
read_fasta -i test.fna | remove_adaptor -a TCGTATGCC -m 2 -r skip
SEQ: GAGGAAGAAGGAATATTTATCGTATGCCGTCTT
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 33
---
SEQ: GAGGAAGAAGGAATATTTTTCGTATGCCGTCTT
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000001
SEQ_LEN: 33
---
SEQ: GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000002
SEQ_LEN: 33
---
SEQ: GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000004
SEQ_LEN: 33
---
Martin Asser Hansen & Selene Fernandez - Copyright (C) - All rights reserved.
August 2008
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
remove_adaptor is part of the Biopieces framework.