Skip to content
Martin Asser Hansen edited this page Oct 2, 2015 · 9 revisions

Biopiece: digest_seq

Description

digest_seq split sequences in the stream at a given restriction enzyme's cleavage sites. For each digestion product a new record is created and output to the stream. digest_seq requires two arguments:

  1. The restriction enzyme recognition pattern
  2. Cut position relative to the above match

E.g. the common restriction enzyme BamHI recognizes the pattern GGATCC and the cut position for this enzyme is 1 indicating that the cleavage site is after the first G:

cut_pos      1 23456
pattern      G|GATCC
              ^

Patterns can contain IUPAC ambiguity codes.

Records with digestion products have a REC_TYPE: DIGEST key/value pair added. Furthermore, the sequence coordinates are appended to the sequence name in brackets (0-based) and as S_BEG and S_END keys.

To obtain the reverse-complement products use reverse_seq.

Usage

... | digest_seq [options]

Options

[-?          | --help]               #  Print full usage description.
[-p <string> | --pattern=<string>]   #  Restriction enzyme recognition pattern.
[-c <int>    | --cut_pos=<int>]      #  Cut position relative to pattern match.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following sequence in the file test.fna:

>test
cgatcgatcGGATCCgagagggtgtgtagtgGAATTCcgctgc

We can digest this sequence with BamHI like this:

read_fasta -i test.fna | digest_seq -p GGATCC -c 1

SEQ_NAME: test[0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42]
SEQ: GATCCgagagggtgtgtagtgGAATTCcgctgc
SEQ_LEN: 33
S_BEG: 10
S_END: 42
REC_TYPE: DIGEST
---

To obtain the - strand digestion products we re-read the file and reverse-complment the sequence before digestion like this:

read_fasta -i test.fna | reverse_seq | complement_seq | digest_seq -p GGATCC -c 1

SEQ_NAME: test[0-28]
SEQ: gcagcgGAATTCcactacacaccctctcG
SEQ_LEN: 29
S_BEG: 0
S_END: 28
REC_TYPE: DIGEST
---
SEQ_NAME: test[29-42]
SEQ: GATCCgatcgatcg
SEQ_LEN: 14
S_BEG: 29
S_END: 42
REC_TYPE: DIGEST
---

It is also possible to do restriction enzyme digestion with multipe enzymes simply by piping the result through digest_seq multiple times. Here we first digest with BamHI (pattern: GGATCC, cut_pos: 1) and then with EcoRI (pattern: GAATTC, cut_pos: 1):

read_fasta -i test.fna | digest_seq -p GGATCC -c 1 | digest_seq -p GAATTC -c 1

SEQ_NAME: test[0-9][0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][0-21]
SEQ: GATCCgagagggtgtgtagtgG
SEQ_LEN: 22
S_BEG: 0
S_END: 21
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][22-32]
SEQ: AATTCcgctgc
SEQ_LEN: 11
S_BEG: 22
S_END: 32
REC_TYPE: DIGEST
---

See also

rescan_seq

read_fasta

reverse_seq

complement_seq

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

September 2010

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

digest_seq is part of the Biopieces framework.

http://www.biopieces.org

Clone this wiki locally