find_barcodes
find_barcodes can be used to find and remove barcodes in the stream. Barcodes are DNA tags of 6-11 bases that are added to experiments, allowing multiple experiments to be sequenced simultaneously and then separated after sequencing based on the barcodes.
For the Roche 454 platform, the barcodes are called MID tags and begin after the sequencing key.
find_barcodes has built-in barcodes for the Roche Genome Sequencing MIDs (GSMID) and Rapid Library MIDs (RLMID).
find_barcodes allows for up to 2 mismatches in the barcode sequence.
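To make the mismatch rule concrete, here is a minimal Python sketch of counting mismatches between a barcode and the sequence window at a given position. It is not the Biopieces implementation. The function name `count_mismatches` is hypothetical, and the case-insensitive comparison is an assumption.

```python
def count_mismatches(seq, barcode, pos=0):
    """Count differing bases between barcode and seq[pos:pos + len(barcode)].

    Returns None when the sequence is too short to hold the barcode at pos.
    Assumption: bases are compared case-insensitively.
    """
    window = seq[pos:pos + len(barcode)]
    if len(window) < len(barcode):
        return None
    return sum(1 for a, b in zip(window.upper(), barcode.upper()) if a != b)


# Sequence and barcode taken from the test.fna example in this document.
print(count_mismatches("atcgACTCGCcTCGTgtgactgact", "ACTCGCGTCGT", pos=4))  # 1
```

A barcode would then be accepted when the count is at or below the allowed number of mismatches (at most 2).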
It is possible to supply a list of names/barcodes in an external file using the -b switch.
The external file should be formatted with one name/barcode pair per line, separated by whitespace, like this:
Index_1 ATCACG
Index_2 CGATGT
Index_3 TTAGGC
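The file format above can be read with a few lines of Python. This is only an illustrative sketch. The function name `parse_barcodes` is hypothetical, and skipping blank lines and rejecting malformed lines are assumptions rather than documented behavior of find_barcodes.

```python
def parse_barcodes(lines):
    """Parse 'name barcode' lines into a dict mapping name -> barcode.

    Assumptions: blank lines are skipped, and any other line must contain
    exactly two whitespace-separated fields.
    """
    barcodes = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 'name barcode', got {line!r}")
        name, barcode = fields
        barcodes[name] = barcode
    return barcodes


example = ["Index_1 ATCACG", "Index_2 CGATGT", "Index_3 TTAGGC"]
print(parse_barcodes(example))
```

To read from a real file, pass the open file handle, which yields one line at a time.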
If a barcode is found, a number of BARCODE keys are added to the record:
SEQ_NAME: test_rl_4p
SEQ: atcgACACGACGACT
SEQ_LEN: 15
BARCODE: ACACGACGACT
BARCODE_NAME: RL1
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
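The record above can be modelled as a plain dictionary. The sketch below shows one way the BARCODE keys could be filled in. It is not the tool's actual code. The function name `annotate` is hypothetical, and both the case-insensitive comparison and the choice of the candidate with the fewest mismatches are assumptions.

```python
def annotate(record, barcodes, pos=0, max_mismatches=0):
    """Add BARCODE* keys to record if a barcode matches at pos.

    Assumptions: comparison is case-insensitive, and among acceptable
    candidates the first one with the fewest mismatches wins.
    """
    seq = record["SEQ"]
    best = None
    for name, barcode in barcodes.items():
        window = seq[pos:pos + len(barcode)].upper()
        if len(window) < len(barcode):
            continue  # sequence too short for this barcode
        mismatches = sum(a != b for a, b in zip(window, barcode.upper()))
        if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
            best = (name, barcode, mismatches)
    if best is not None:
        name, barcode, mismatches = best
        record.update({
            "BARCODE": barcode,
            "BARCODE_NAME": name,
            "BARCODE_POS": pos,
            "BARCODE_LEN": len(barcode),
            "BARCODE_MISMATCHES": mismatches,
        })
    return record


record = {"SEQ_NAME": "test_rl_4p", "SEQ": "atcgACACGACGACT", "SEQ_LEN": 15}
print(annotate(record, {"RL1": "ACACGACGACT"}, pos=4))
```

Records without an acceptable match are returned unchanged.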
... | find_barcodes [options]
[-? | --help] # Print full usage description.
[-b <file!> | --barcodes_in=<file!>] # File with names/barcodes - one per line.
[-p <uint> | --pos=<uint>] # Position of the barcode in the sequence - Default=0
[-m <uint> | --mismatches=<uint>] # Numbers of mismatches to allow (max 2) - Default=0
[-g | --gsmids] # Find Genome Sequencing MIDs (GSMIDs).
[-r | --rlmids] # Find Rapid Library MIDs (RLMIDs).
[-R | --remove] # Remove barcode (and left-hand sequence).
[-I <file!> | --stream_in=<file!>] # Read input stream from file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entries in the file test.fna:
>test_RL12_0_mismatches
atcgACTCGCGTCGTgtgactgact
>test_RL12_1_mismatches
atcgACTCGCcTCGTgtgactgact
>test_RL12_2_mismatches
atcgACTCaCcTCGTgtgactgact
>test_GS99_0_mismatches
atcgCTGTACATACgtagtagtagt
>test_GS99_1_mismatches
atcgCTGTtCATACgtagtagtagt
>test_GS99_2_mismatches
atcgCTGTtCAgACgtagtagtagt
The barcodes can be located like this (without the -m switch, only exact matches are reported):
read_fasta -i test.fna | find_barcodes -p 4 -rg
SEQ_NAME: test_RL12_0_mismatches
SEQ: atcgACTCGCGTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_RL12_1_mismatches
SEQ: atcgACTCGCcTCGTgtgactgact
SEQ_LEN: 25
---
SEQ_NAME: test_RL12_2_mismatches
SEQ: atcgACTCaCcTCGTgtgactgact
SEQ_LEN: 25
---
SEQ_NAME: test_GS99_0_mismatches
SEQ: atcgCTGTACATACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_GS99_1_mismatches
SEQ: atcgCTGTtCATACgtagtagtagt
SEQ_LEN: 25
---
SEQ_NAME: test_GS99_2_mismatches
SEQ: atcgCTGTtCAgACgtagtagtagt
SEQ_LEN: 25
---
It is also possible to allow for up to 2 mismatches using the -m switch:
read_fasta -i test.fna | find_barcodes -p 4 -rg -m 2
SEQ_NAME: test_RL12_0_mismatches
SEQ: atcgACTCGCGTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_RL12_1_mismatches
SEQ: atcgACTCGCcTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_RL12_2_mismatches
SEQ: atcgACTCaCcTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 2
---
SEQ_NAME: test_GS99_0_mismatches
SEQ: atcgCTGTACATACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_GS99_1_mismatches
SEQ: atcgCTGTtCATACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_GS99_2_mismatches
SEQ: atcgCTGTtCAgACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 2
---
Use the -R switch to remove the located barcodes (and any sequence to the left of them):
read_fasta -i test.fna | find_barcodes -p 4 -rg -m 2 -R
SEQ_NAME: test_RL12_0_mismatches
SEQ: gtgactgact
SEQ_LEN: 10
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_RL12_1_mismatches
SEQ: gtgactgact
SEQ_LEN: 10
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_RL12_2_mismatches
SEQ: gtgactgact
SEQ_LEN: 10
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 2
---
SEQ_NAME: test_GS99_0_mismatches
SEQ: gtagtagtagt
SEQ_LEN: 11
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_GS99_1_mismatches
SEQ: gtagtagtagt
SEQ_LEN: 11
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_GS99_2_mismatches
SEQ: gtagtagtagt
SEQ_LEN: 11
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 2
---
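The effect of -R on a single record can be sketched in Python as follows. This is an illustration under stated assumptions, not the Biopieces code. The function name `remove_barcode` is hypothetical, and the record is modelled as a dictionary that already carries the BARCODE_POS and BARCODE_LEN keys.

```python
def remove_barcode(record):
    """Trim the barcode and everything to its left, then update SEQ_LEN.

    Records without BARCODE_POS and BARCODE_LEN are returned unchanged.
    """
    if "BARCODE_POS" not in record or "BARCODE_LEN" not in record:
        return record
    end = record["BARCODE_POS"] + record["BARCODE_LEN"]
    record["SEQ"] = record["SEQ"][end:]
    record["SEQ_LEN"] = len(record["SEQ"])
    return record


record = {
    "SEQ_NAME": "test_RL12_0_mismatches",
    "SEQ": "atcgACTCGCGTCGTgtgactgact",
    "SEQ_LEN": 25,
    "BARCODE_POS": 4,
    "BARCODE_LEN": 11,
}
print(remove_barcode(record)["SEQ"])  # gtgactgact
```

The barcode position and length stay in the record, as in the output above, so the trimming can still be traced afterwards.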
Martin Asser Hansen - Copyright (C) - All rights reserved.
October 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
find_barcodes is part of the Biopieces framework.