Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: find_barcodes

Description

find_barcodes can be used to find and remove barcodes in the stream. Barcodes are DNA tags of 6-11 bases that are added to experiments, allowing multiple experiments to be sequenced simultaneously and then separated after sequencing based on the barcodes.

For the Roche 454 platform the barcodes are called MID tags and begin after the sequencing key.

find_barcodes has built-in barcodes for the Roche Genome Sequencing MIDs (GSMID) and Rapid Library MIDs (RLMID).

find_barcodes allows for up to 2 mismatches in the barcode sequence.
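To make the notion of a mismatch concrete, the following is a minimal Python sketch that counts differing positions between an observed sequence and a barcode of the same length. It is not the find_barcodes implementation; the function names `count_mismatches` and `matches` are hypothetical, and ignoring case during comparison is an assumption.

```python
def count_mismatches(observed: str, barcode: str) -> int:
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    if len(observed) != len(barcode):
        raise ValueError("sequences must have the same length")
    # Assumption: the comparison ignores upper/lower case.
    return sum(a != b for a, b in zip(observed.upper(), barcode.upper()))


def matches(observed: str, barcode: str, max_mismatches: int = 0) -> bool:
    """Return True if observed differs from barcode in at most max_mismatches positions."""
    return count_mismatches(observed, barcode) <= max_mismatches


rl12 = "ACTCGCGTCGT"  # barcode sequence taken from the examples on this page
print(count_mismatches("ACTCGCcTCGT", rl12))
print(matches("ACTCaCcTCGT", rl12, max_mismatches=2))
```

With a limit of 0 only exact matches pass, which would correspond to the default of the -m switch.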

It is possible to supply a list of names/barcodes in an external file using the -b switch. The external file should be formatted with one name/barcode pair per line, separated by whitespace, like this:

Index_1  ATCACG
Index_2  CGATGT
Index_3  TTAGGC
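The file format above can be read with a few lines of Python. This is only a sketch of the format, not code from Biopieces; the function name `read_barcodes` is hypothetical, and skipping blank lines and upper-casing the barcodes are assumptions.

```python
import io
from typing import Dict, TextIO


def read_barcodes(handle: TextIO) -> Dict[str, str]:
    """Map barcode sequence -> name from whitespace-separated name/barcode lines."""
    barcodes: Dict[str, str] = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # Assumption: blank lines are ignored.
        if len(fields) != 2:
            raise ValueError(f"expected 'name barcode', got: {line!r}")
        name, barcode = fields
        barcodes[barcode.upper()] = name
    return barcodes


# The three lines shown above, read from an in-memory file.
example = io.StringIO("Index_1  ATCACG\nIndex_2  CGATGT\nIndex_3  TTAGGC\n")
table = read_barcodes(example)
print(table["CGATGT"])
```

Keying the table by barcode sequence rather than by name makes the later lookup of an observed sequence a single dictionary access.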

If a barcode is found, a number of BARCODE keys are added to the record:

SEQ_NAME: test_rl_4p
SEQ: atcgACACGACGACT
SEQ_LEN: 15
BARCODE: ACACGACGACT
BARCODE_NAME: RL1
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
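As an illustration of how these keys relate to the sequence, here is a hedged Python sketch that annotates a record stored as a dictionary. The function name `annotate` is hypothetical, only exact matches are handled (as with the default of 0 mismatches), and when several barcodes would fit, the first one in the table wins, which is an assumption rather than documented behavior.

```python
from typing import Dict


def annotate(record: Dict[str, object], barcodes: Dict[str, str], pos: int = 0) -> Dict[str, object]:
    """Add BARCODE* keys if a known barcode starts exactly at position pos."""
    seq = str(record["SEQ"])
    for barcode, name in barcodes.items():
        window = seq[pos:pos + len(barcode)]
        if window.upper() == barcode:
            record.update({
                "BARCODE": barcode,
                "BARCODE_NAME": name,
                "BARCODE_POS": pos,
                "BARCODE_LEN": len(barcode),
                "BARCODE_MISMATCHES": 0,  # exact matches only in this sketch
            })
            break
    return record


# Record and barcode taken from the example above.
rec = {"SEQ_NAME": "test_rl_4p", "SEQ": "atcgACACGACGACT", "SEQ_LEN": 15}
rec = annotate(rec, {"ACACGACGACT": "RL1"}, pos=4)
for key, value in rec.items():
    print(f"{key}: {value}")
```

A record without a matching barcode is returned unchanged, which mirrors the records without BARCODE keys in the examples further down.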

Usage

... | find_barcodes [options]

Options

[-?          | --help]                 #  Print full usage description.
[-b <file!>  | --barcodes_in=<file!>]  #  File with names/barcodes - one per line.
[-p <uint>   | --pos=<uint>]           #  Position of the barcode in the sequence  -  Default=0
[-m <uint>   | --mismatches=<uint>]    #  Numbers of mismatches to allow (max 2)   -  Default=0
[-g          | --gsmids]               #  Find Genome Sequencing MIDs (GSMIDs).
[-r          | --rlmids]               #  Find Rapid Library MIDs (RLMIDs).
[-R          | --remove]               #  Remove barcode (and left-hand sequence).
[-I <file!>  | --stream_in=<file!>]    #  Read input stream from file              -  Default=STDIN
[-O <file>   | --stream_out=<file>]    #  Write output stream to file              -  Default=STDOUT
[-v          | --verbose]              #  Verbose output.

Examples

Consider the following FASTA entries in the file test.fna:

>test_RL12_0_mismatches
atcgACTCGCGTCGTgtgactgact
>test_RL12_1_mismatches
atcgACTCGCcTCGTgtgactgact
>test_RL12_2_mismatches
atcgACTCaCcTCGTgtgactgact
>test_GS99_0_mismatches
atcgCTGTACATACgtagtagtagt
>test_GS99_1_mismatches
atcgCTGTtCATACgtagtagtagt
>test_GS99_2_mismatches
atcgCTGTtCAgACgtagtagtagt

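Barcodes starting at position 4 can be located using the built-in RLMIDs and GSMIDs like this. Since no mismatches are allowed by default, only the exact matches are tagged: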

read_fasta -i test.fna | find_barcodes -p 4 -rg

SEQ_NAME: test_RL12_0_mismatches
SEQ: atcgACTCGCGTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_RL12_1_mismatches
SEQ: atcgACTCGCcTCGTgtgactgact
SEQ_LEN: 25
---
SEQ_NAME: test_RL12_2_mismatches
SEQ: atcgACTCaCcTCGTgtgactgact
SEQ_LEN: 25
---
SEQ_NAME: test_GS99_0_mismatches
SEQ: atcgCTGTACATACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_GS99_1_mismatches
SEQ: atcgCTGTtCATACgtagtagtagt
SEQ_LEN: 25
---
SEQ_NAME: test_GS99_2_mismatches
SEQ: atcgCTGTtCAgACgtagtagtagt
SEQ_LEN: 25
---

It is also possible to allow for up to 2 mismatches using the -m switch:

read_fasta -i test.fna | find_barcodes -p 4 -rg -m 2

SEQ_NAME: test_RL12_0_mismatches
SEQ: atcgACTCGCGTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_RL12_1_mismatches
SEQ: atcgACTCGCcTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_RL12_2_mismatches
SEQ: atcgACTCaCcTCGTgtgactgact
SEQ_LEN: 25
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 2
---
SEQ_NAME: test_GS99_0_mismatches
SEQ: atcgCTGTACATACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_GS99_1_mismatches
SEQ: atcgCTGTtCATACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_GS99_2_mismatches
SEQ: atcgCTGTtCAgACgtagtagtagt
SEQ_LEN: 25
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 2
---

Use the -R switch to remove the located barcodes (and any sequence to the left of them):

read_fasta -i test.fna | find_barcodes -p 4 -rg -m 2 -R

SEQ_NAME: test_RL12_0_mismatches
SEQ: gtgactgact
SEQ_LEN: 10
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_RL12_1_mismatches
SEQ: gtgactgact
SEQ_LEN: 10
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_RL12_2_mismatches
SEQ: gtgactgact
SEQ_LEN: 10
BARCODE: ACTCGCGTCGT
BARCODE_NAME: RL12
BARCODE_POS: 4
BARCODE_LEN: 11
BARCODE_MISMATCHES: 2
---
SEQ_NAME: test_GS99_0_mismatches
SEQ: gtagtagtagt
SEQ_LEN: 11
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 0
---
SEQ_NAME: test_GS99_1_mismatches
SEQ: gtagtagtagt
SEQ_LEN: 11
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 1
---
SEQ_NAME: test_GS99_2_mismatches
SEQ: gtagtagtagt
SEQ_LEN: 11
BARCODE: CTGTACATAC
BARCODE_NAME: MID99
BARCODE_POS: 4
BARCODE_LEN: 10
BARCODE_MISMATCHES: 2
---
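The effect of -R on a record can be reproduced with simple slicing: everything up to and including the barcode is dropped and SEQ_LEN is recomputed. The Python sketch below assumes records are dictionaries with the keys shown above; the function name `remove_barcode` is hypothetical and this is not the Biopieces code.

```python
from typing import Dict


def remove_barcode(record: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of record with the barcode and all sequence to its left removed."""
    if "BARCODE_POS" not in record:
        return dict(record)  # no barcode was found, nothing to trim
    end = int(record["BARCODE_POS"]) + int(record["BARCODE_LEN"])
    trimmed = dict(record)
    trimmed["SEQ"] = str(record["SEQ"])[end:]
    trimmed["SEQ_LEN"] = len(trimmed["SEQ"])
    return trimmed


# First record from the mismatch example above, before removal.
rec = {
    "SEQ_NAME": "test_RL12_0_mismatches",
    "SEQ": "atcgACTCGCGTCGTgtgactgact",
    "SEQ_LEN": 25,
    "BARCODE": "ACTCGCGTCGT",
    "BARCODE_NAME": "RL12",
    "BARCODE_POS": 4,
    "BARCODE_LEN": 11,
    "BARCODE_MISMATCHES": 0,
}
trimmed = remove_barcode(rec)
print(trimmed["SEQ"], trimmed["SEQ_LEN"])
```

The BARCODE keys are kept after trimming, so downstream steps can still separate records by BARCODE_NAME.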

See also

read_fasta

write_fasta_files

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

October 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

find_barcodes is part of the Biopieces framework.

http://www.biopieces.org
