-
Notifications
You must be signed in to change notification settings - Fork 23
digest_seq
digest_seq split sequences in the stream at a given restriction enzyme's cleavage sites. For each digestion product a new record is created and output to the stream. digest_seq requires two arguments:
- The restriction enzyme recognition pattern
- Cut position relative to the above match
E.g. the common restriction enzyme BamHI recognizes the pattern GGATCC
and the cut position for this enzyme is 1 indicating
that the cleavage site is after the first G
:
cut_pos 1 23456
pattern G|GATCC
^
Patterns can contain IUPAC ambiguity codes.
Records with digestion products have a REC_TYPE: DIGEST
key/value pair added.
Furthermore, the sequence coordinates are appended to the sequence name in brackets (0-based) and as
S_BEG
and S_END
keys.
To obtain the reverse-complement products use reverse_seq.
... | digest_seq [options]
[-? | --help] # Print full usage description.
[-p <string> | --pattern=<string>] # Restriction enzyme recognition pattern.
[-c <int> | --cut_pos=<int>] # Cut position relative to pattern match.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following sequence in the file test.fna
:
>test
cgatcgatcGGATCCgagagggtgtgtagtgGAATTCcgctgc
We can digest this sequence with BamHI like this:
read_fasta -i test.fna | digest_seq -p GGATCC -c 1
SEQ_NAME: test[0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42]
SEQ: GATCCgagagggtgtgtagtgGAATTCcgctgc
SEQ_LEN: 33
S_BEG: 10
S_END: 42
REC_TYPE: DIGEST
---
To obtain the - strand digestion products we re-read the file and reverse-complment the sequence before digestion like this:
read_fasta -i test.fna | reverse_seq | complement_seq | digest_seq -p GGATCC -c 1
SEQ_NAME: test[0-28]
SEQ: gcagcgGAATTCcactacacaccctctcG
SEQ_LEN: 29
S_BEG: 0
S_END: 28
REC_TYPE: DIGEST
---
SEQ_NAME: test[29-42]
SEQ: GATCCgatcgatcg
SEQ_LEN: 14
S_BEG: 29
S_END: 42
REC_TYPE: DIGEST
---
It is also possible to do restriction enzyme digestion with multipe enzymes simply by piping the result through digest_seq multiple times. Here we first digest with BamHI (pattern: GGATCC, cut_pos: 1) and then with EcoRI (pattern: GAATTC, cut_pos: 1):
read_fasta -i test.fna | digest_seq -p GGATCC -c 1 | digest_seq -p GAATTC -c 1
SEQ_NAME: test[0-9][0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][0-21]
SEQ: GATCCgagagggtgtgtagtgG
SEQ_LEN: 22
S_BEG: 0
S_END: 21
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][22-32]
SEQ: AATTCcgctgc
SEQ_LEN: 11
S_BEG: 22
S_END: 32
REC_TYPE: DIGEST
---
Martin Asser Hansen - Copyright (C) - All rights reserved.
September 2010
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
digest_seq is part of the Biopieces framework.