Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: remove_adaptor

Description

NB! remove_adaptor instead!

To remove an adaptor sequence from the sequences in the stream, use remove_adaptor, which locates the adaptor and removes it, allowing for a number of mismatches (but no indels).
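
The search described above can be pictured as a sliding-window comparison that counts substitutions only, so bases are never shifted (no indels). The Python sketch below is illustrative and is not the remove_adaptor implementation: the function name `locate_adaptor`, the 0-based return value, the -1 result for "not found", and the case-insensitive comparison are all assumptions made for the example.

```python
def locate_adaptor(seq, adaptor, max_mismatches=0):
    """Return the 0-based start of the first window of seq that matches
    adaptor with at most max_mismatches substitutions, or -1 if none does."""
    seq, adaptor = seq.upper(), adaptor.upper()  # assumption: case is ignored
    width = len(adaptor)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Hamming distance: compare position by position, so no indels.
        mismatches = sum(a != b for a, b in zip(window, adaptor))
        if mismatches <= max_mismatches:
            return start  # stop at the first hit
    return -1


# First test sequence from the Examples section, two mismatches allowed.
print(locate_adaptor("GAGGAAGAAGGAATATTTATCGTATGCCGTCTT", "TCGTATGCC", 2))  # prints 19
```

Because the loop returns at the first acceptable window, a later and possibly better match in the same sequence is never considered.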

The remove modes available are:

  • before - Removes sequence before and including the adaptor
  • after - Removes sequence after and including the adaptor
  • skip - Do not remove adaptor

For records with both a sequence (SEQ) and a quality score string (SCORE), both are trimmed when the remove mode is 'before' or 'after'. Make sure the SCORE string is ASCII encoded, not semicolon-separated decimals.
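
As a rough illustration of the three remove modes and of trimming SEQ and SCORE together, here is a minimal Python sketch. It is not the tool's code: `trim_record` is a hypothetical helper, the record is modelled as a plain dict, the adaptor position is assumed to be 0-based (consistent with the ADAPTOR_POS values in the Examples section), and the SCORE string is made up.

```python
def trim_record(record, adaptor_pos, adaptor_len, mode="after"):
    """Return a copy of record with SEQ (and SCORE, if present) trimmed."""
    if mode not in ("before", "after", "skip"):
        raise ValueError(f"unknown remove mode: {mode}")
    trimmed = dict(record)
    if mode == "skip" or adaptor_pos < 0:
        return trimmed  # nothing to remove
    if mode == "after":
        # Keep everything before the adaptor.
        keep = slice(0, adaptor_pos)
    else:
        # Keep everything after the adaptor.
        keep = slice(adaptor_pos + adaptor_len, None)
    for key in ("SEQ", "SCORE"):
        if key in trimmed:
            trimmed[key] = trimmed[key][keep]  # same slice keeps both in step
    if "SEQ" in trimmed:
        trimmed["SEQ_LEN"] = len(trimmed["SEQ"])
    return trimmed


# Hypothetical record: the SCORE value is invented for the example.
record = {"SEQ": "GAGGAAGAAGGAATATTTATCGTATGCCGTCTT", "SCORE": "I" * 33}
print(trim_record(record, 19, 9, "after"))
print(trim_record(record, 3, 8, "before"))
```

Applying one slice to both strings is what keeps each base aligned with its quality character, which is also why an ASCII-encoded SCORE (one character per base) is needed.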

NB! Only the first occurrence of the adaptor in any one sequence is located.

Usage

... | remove_adaptor [options]

Options

[-?          | --help]               #  Print full usage description.
[-a <string> | --adaptor=<string>]   #  Adaptor sequence to locate and remove.
[-m <uint>   | --mismatches=<uint>]  #  Max number of mismatches               -  Default=0
[-o <uint>   | --offset=<uint>]      #  Search sequence from offset (1-based)  -  Default=1
[-r <string> | --remove=<string>]    #  Remove mode: before|after|skip         -  Default=after
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file            -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file            -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following FASTA entries in test.fna:

>CE5_ID00000000
GAGGAAGAAGGAATATTTATCGTATGCCGTCTT
>CE5_ID00000001
GAGGAAGAAGGAATATTTTTCGTATGCCGTCTT
>CE5_ID00000002
GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
>CE5_ID00000003
GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
>CE5_ID00000004
GTAGGATGAGTGACTACTCAAaTCGTATGCCGT

To locate the standard Solexa 3' adaptor TCGTATGCCGTCTTCTGCTTG, use remove_adaptor with the first part of the adaptor and allow for two mismatches with the -m switch:

read_fasta -i test.fna | remove_adaptor -a TCGTATGCC -m 2

The resulting output has the adaptor sequence removed wherever it was found. An ADAPTOR_POS key is also added to the records. An ADAPTOR_POS of -1 indicates that no adaptor sequence was found, and this value can be used with grab.

SEQ: GAGGAAGAAGGAATATTTA
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 19
---
SEQ: GAGGAAGAAGGAATATTTT
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000001
SEQ_LEN: 19
---
SEQ: GAATGTAAGGAAGTGTGTGGAT
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000002
SEQ_LEN: 22
---
SEQ: GTAGGATGAGTGACTACTCAAa
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000004
SEQ_LEN: 22
---
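
Outside Biopieces, the ADAPTOR_POS key can also be used to separate records with and without an adaptor hit. The Python sketch below assumes the text layout shown in the output on this page (one `KEY: value` pair per line, each record ended by `---`); that layout is inferred from the examples rather than from a format specification, and `parse_records` and `with_adaptor` are hypothetical helpers. The two sample records are copied from the example outputs on this page.

```python
def parse_records(text):
    """Yield one dict per record from 'KEY: value' lines ended by '---'."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if record:
                yield record
            record = {}
        elif line:
            key, _, value = line.partition(": ")
            record[key] = value
    if record:  # tolerate a missing final '---'
        yield record


def with_adaptor(records):
    """Keep only records whose ADAPTOR_POS is present and not -1."""
    return [r for r in records if int(r.get("ADAPTOR_POS", -1)) != -1]


stream = """SEQ: GAGGAAGAAGGAATATTTA
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 19
---
SEQ: GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000003
SEQ_LEN: 33
---
"""

for rec in with_adaptor(parse_records(stream)):
    print(rec["SEQ_NAME"], rec["ADAPTOR_POS"])  # prints CE5_ID00000000 19
```

All values are kept as strings here; convert them (as `with_adaptor` does for ADAPTOR_POS) only where a numeric comparison is needed.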

Use the -r before switch to locate 5' adaptors:

read_fasta -i test.fna | remove_adaptor -a GAAGAAGG -r before

SEQ: AATATTTATCGTATGCCGTCTT
ADAPTOR_POS: 3
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 22
---
SEQ: AATATTTTTCGTATGCCGTCTT
ADAPTOR_POS: 3
SEQ_NAME: CE5_ID00000001
SEQ_LEN: 22
---
SEQ: GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000002
SEQ_LEN: 33
---
SEQ: GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000003
SEQ_LEN: 33
---
SEQ: GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
ADAPTOR_POS: -1
SEQ_NAME: CE5_ID00000004
SEQ_LEN: 33
---

Using the -r skip switch will suppress the adaptor removal:

read_fasta -i test.fna | remove_adaptor -a TCGTATGCC -m 2 -r skip

SEQ: GAGGAAGAAGGAATATTTATCGTATGCCGTCTT
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000000
SEQ_LEN: 33
---
SEQ: GAGGAAGAAGGAATATTTTTCGTATGCCGTCTT
ADAPTOR_POS: 19
SEQ_NAME: CE5_ID00000001
SEQ_LEN: 33
---
SEQ: GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000002
SEQ_LEN: 33
---
SEQ: GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
ADAPTOR_POS: 22
SEQ_NAME: CE5_ID00000004
SEQ_LEN: 33
---

See also

read_fasta

grab

write_fasta

Author

Martin Asser Hansen & Selene Fernandez - Copyright (C) - All rights reserved.


August 2008

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

remove_adaptor is part of the Biopieces framework.

http://www.biopieces.org
