Martin Asser Hansen edited this page Oct 2, 2015 · 9 revisions

Biopiece: read_embl

Description

read_embl reads entries from EMBL files. An EMBL entry consists of three main parts:

  1. Generic info (such as accession number, species, references, taxonomy, version, etc.)
  2. Feature table (containing info on features within each EMBL entry)
  3. Sequence

By default, read_embl reads all of this information, but it is possible to specify which parts of the Generic info, Feature table, and Sequence are read, which results in great speed improvements.
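The three-part layout can be illustrated with a small Python sketch. This is not how read_embl is implemented; it is a minimal illustration, assuming only the two-letter line codes visible in EMBL files (`XX` spacer, `FH`/`FT` feature table, `SQ` sequence header, `//` terminator). The function name `split_embl_entry` is made up for this example, and the sample entry is heavily abbreviated.

```python
def split_embl_entry(lines):
    """Sort the lines of one EMBL entry into its three main parts."""
    parts = {"generic": [], "features": [], "sequence": []}
    in_sequence = False
    for line in lines:
        if not line.strip():
            continue
        code = line[:2]
        if code == "//":            # entry terminator
            break
        if code == "XX":            # spacer line, carries no data
            continue
        if code == "SQ":            # header line of the sequence block
            in_sequence = True
            continue
        if in_sequence:
            # Keep the residues only; drop blanks and the position counter.
            parts["sequence"].append("".join(c for c in line if c.isalpha()))
        elif code == "FT":          # feature table line
            parts["features"].append(line)
        elif code == "FH":          # feature table header, no data
            continue
        else:                       # everything else is generic info
            parts["generic"].append(line)
    parts["sequence"] = "".join(parts["sequence"])
    return parts


# Abbreviated entry, for illustration only.
entry = """ID   U49845; SV 1; linear; genomic DNA; STD; FUN; 5028 BP.
XX
AC   U49845;
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..5028
FT                   /organism="Saccharomyces cerevisiae"
XX
SQ   Sequence 5028 BP; 1510 A; 1074 C; 835 G; 1609 T; 0 other;
     gatcctccat atacaacggt        20
//
""".splitlines()

parts = split_embl_entry(entry)
print(len(parts["generic"]), len(parts["features"]), parts["sequence"])
# prints: 2 2 gatcctccatatacaacggt
```

Skipping a whole part (for example, never touching the sequence lines) is the kind of shortcut that makes restricted parsing faster.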

A Biopiece record is output per feature from the Feature table. The sequence for each feature is included.

Based on the Location of each feature, S_BEG, S_END, and STRAND keys are added to the Biopiece record.
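How a Location string maps to these three keys can be sketched in Python. This is a hedged illustration, not the read_embl code: the function name `location_to_keys` is hypothetical, and the positions are kept 1-based exactly as written in the EMBL file. Whether read_embl shifts them to another coordinate convention is not stated on this page. Locations that reference other entries (containing an accession number) are not handled.

```python
import re


def location_to_keys(location):
    """Derive begin, end and strand from an EMBL feature location string."""
    # complement(...) marks a feature on the reverse strand.
    strand = "-" if location.startswith("complement(") else "+"
    # Partial markers (< and >) and join(...) are ignored; only the
    # outermost positions are kept.
    positions = [int(p) for p in re.findall(r"\d+", location)]
    return {"S_BEG": min(positions), "S_END": max(positions), "STRAND": strand}


print(location_to_keys("<1..206"))
print(location_to_keys("complement(<3300..>4037)"))
```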

For each feature, the qualifiers are separated by semicolons, per qualifier.
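One reading of this is that a qualifier occurring several times in a feature (such as db_xref) ends up as a single value joined with semicolons. The Python sketch below illustrates that reading only; it is an assumption, not a description of the actual read_embl output, and the function name `collect_qualifiers` is made up. Continuation lines are concatenated without a space, which suits /translation but is a simplification for free-text qualifiers.

```python
def collect_qualifiers(ft_lines):
    """Group the qualifier lines of one feature into key/value pairs."""
    qualifiers = {}
    key = None
    for line in ft_lines:
        content = line[2:].strip()           # drop the FT code and padding
        if content.startswith("/"):          # a new /key=value qualifier
            key, _, value = content[1:].partition("=")
            qualifiers.setdefault(key, []).append(value.strip('"'))
        elif key is not None:                # continuation of the previous value
            qualifiers[key][-1] += content.strip('"')
    # Assumption: repeated qualifiers are joined with a semicolon.
    return {k: ";".join(v) for k, v in qualifiers.items()}


cds_qualifiers = [
    'FT                   /codon_start=3',
    'FT                   /product="TCP1-beta"',
    'FT                   /db_xref="GOA:P39076"',
    'FT                   /db_xref="InterPro:IPR002194"',
    'FT                   /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAA',
    'FT                   EVLLRVDNIIRARPRTANRQHM"',
]

record = collect_qualifiers(cds_qualifiers)
print(record["db_xref"])
# prints: GOA:P39076;InterPro:IPR002194
```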

The EMBL format is notoriously evil to parse, and read_embl makes a couple of compromises in order to focus on parsing information from the Feature table. For example, the parsing of references from the Generic info section is crude.

Usage

read_embl [options] -i <EMBL file(s)>

Options

[-?          | --help]               #  Print full usage description.
[-i <files!> | --data_in=<files!>]   #  Comma separated list of files or glob expression to read.
[-n <uint>   | --num=<uint>]         #  Limit number of records to read.
[-k <list>   | --keys=<list>]        #  Match a subset of record keys only.
[-f <list>   | --features=<list>]    #  Match a subset of features only.
[-q <list>   | --qualifiers=<list>]  #  Match a subset of qualifiers only.
[-I <file!>  | --stream_in=<file!>]  #  Read input stream from file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output stream to file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following EMBL record in the file test.embl:

ID   U49845; SV 1; linear; genomic DNA; STD; FUN; 5028 BP.
XX
AC   U49845;
XX
DT   07-MAY-1996 (Rel. 47, Created)
DT   25-MAR-2010 (Rel. 104, Last updated, Version 5)
XX
DE   Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and
DE   Rev7p (REV7) genes, complete cds.
XX
KW   .
XX
OS   Saccharomyces cerevisiae (baker's yeast)
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes;
OC   Saccharomycetales; Saccharomycetaceae; Saccharomyces.
XX
RN   [1]
RP   1-5028
RX   PUBMED; 8846915.
RA   Roemer T., Madden K., Chang J., Snyder M.;
RT   "Selection of axial growth sites in yeast requires Axl2p, a novel plasma
RT   membrane glycoprotein";
RL   Genes Dev. 10(7):777-793(1996).
XX
RN   [2]
RP   1-5028
RA   Roemer T.;
RT   ;
RL   Submitted (22-FEB-1996) to the INSDC.
RL   Biology, Yale University, New Haven, CT 06520, USA
XX
DR   Ensembl-Gn; YIL139C; Saccharomyces_cerevisiae.
DR   Ensembl-Gn; YIL140W; Saccharomyces_cerevisiae.
DR   Ensembl-Gn; YIL142W; Saccharomyces_cerevisiae.
DR   Ensembl-Tr; YIL139C; Saccharomyces_cerevisiae.
DR   Ensembl-Tr; YIL140W; Saccharomyces_cerevisiae.
DR   Ensembl-Tr; YIL142W; Saccharomyces_cerevisiae.
DR   EnsemblGenomes; YIL139C; Saccharomyces_cerevisiae.
DR   EnsemblGenomes; YIL140W; Saccharomyces_cerevisiae.
DR   EnsemblGenomes; YIL142W; Saccharomyces_cerevisiae.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..5028
FT                   /organism="Saccharomyces cerevisiae"
FT                   /chromosome="IX"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:4932"
FT   mRNA            <1..>206
FT                   /product="TCP1-beta"
FT   CDS             <1..206
FT                   /codon_start=3
FT                   /product="TCP1-beta"
FT                   /db_xref="GOA:P39076"
FT                   /db_xref="InterPro:IPR002194"
FT                   /db_xref="InterPro:IPR002423"
FT                   /db_xref="InterPro:IPR012716"
FT                   /db_xref="InterPro:IPR017998"
FT                   /db_xref="PDB:3P9D"
FT                   /db_xref="UniProtKB/Swiss-Prot:P39076"
FT                   /protein_id="AAA98665.1"
FT                   /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAA
FT                   EVLLRVDNIIRARPRTANRQHM"
FT   gene            <687..>3158
FT                   /gene="AXL2"
FT   mRNA            <687..>3158
FT                   /gene="AXL2"
FT                   /product="Axl2p"
FT   CDS             687..3158
FT                   /codon_start=1
FT                   /gene="AXL2"
FT                   /product="Axl2p"
FT                   /note="plasma membrane glycoprotein"
FT                   /db_xref="GOA:P38928"
FT                   /db_xref="InterPro:IPR006644"
FT                   /db_xref="InterPro:IPR008009"
FT                   /db_xref="InterPro:IPR013783"
FT                   /db_xref="InterPro:IPR014805"
FT                   /db_xref="InterPro:IPR015919"
FT                   /db_xref="SGD:S000001402"
FT                   /db_xref="UniProtKB/Swiss-Prot:P38928"
FT                   /protein_id="AAA98666.1"
FT                   /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFT
FT                   FQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVI
FT                   LEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFN
FT                   VTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYS
FT                   FVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDD
FT                   PISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFN
FT                   FEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKF
FT                   QSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSS
FT                   HHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGV
FT                   ILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSY
FT                   DDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKP
FT                   PVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKL
FT                   FDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQ
FT                   SGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQV
FT                   KDIHGRIPEML"
FT   gene            complement(<3300..>4037)
FT                   /gene="REV7"
FT   mRNA            complement(<3300..>4037)
FT                   /gene="REV7"
FT                   /product="Rev7p"
FT   CDS             complement(3300..4037)
FT                   /codon_start=1
FT                   /gene="REV7"
FT                   /product="Rev7p"
FT                   /db_xref="GOA:P38927"
FT                   /db_xref="InterPro:IPR003511"
FT                   /db_xref="SGD:S000001401"
FT                   /db_xref="UniProtKB/Swiss-Prot:P38927"
FT                   /protein_id="AAA98667.1"
FT                   /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQF
FT                   VPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKD
FT                   DQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVD
FT                   SLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISG
FT                   DDKILNGVYSQYEEGESIFGSLF"
XX
SQ   Sequence 5028 BP; 1510 A; 1074 C; 835 G; 1609 T; 0 other;
     gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg        60
     ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct       120
     ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa       180
     gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg       240
     ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa       300
     agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa       360
     attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat       420
     aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga       480
     gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc       540
     tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga       600
     acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta       660
     cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag       720
     ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa       780
     aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata       840
     cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga       900
     gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac       960
     tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg      1020
     acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc      1080
     tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa      1140
     acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca      1200
     ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac      1260
     ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa      1320
     actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag      1380
     gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct      1440
     ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac      1500
     ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa      1560
     acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc      1620
     cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata      1680
     cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca      1740
     ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc      1800
     cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc      1860
     aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca      1920
     agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc      1980
     tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg      2040
     caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt      2100
     acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc      2160
     cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg      2220
     ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca      2280
     gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata      2340
     atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg      2400
     atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga      2460
     aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt      2520
     ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat      2580
     tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt      2640
     cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc      2700
     tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag      2760
     aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta      2820
     tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa      2880
     caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact      2940
     ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt      3000
     ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa      3060
     gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag      3120
     ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct      3180
     taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt      3240
     agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact      3300
     taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa      3360
     attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg      3420
     tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt      3480
     aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc      3540
     tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca      3600
     ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc      3660
     ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc      3720
     tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat      3780
     aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa      3840
     agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga      3900
     acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat      3960
     acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc      4020
     tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc      4080
     tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa      4140
     gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg      4200
     atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc      4260
     ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt      4320
     tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg      4380
     cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg      4440
     ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt      4500
     agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt      4560
     tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat      4620
     tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc      4680
     atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct      4740
     tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta      4800
     gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac      4860
     ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct      4920
     ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct      4980
     tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc                   5028
//

Now, reading in the entire entry with read_embl yields:

read_embl -i test.embl

<9 records output - one for each feature>

To limit parsing of the Generic info section, specify which keys you are interested in with the -k switch (using the two-letter line code in upper case, e.g. AC):

read_embl -i test.embl -k AC

<9 records output - one for each feature>

To limit parsing of the Feature table, specify which features you are interested in with the -f switch:

read_embl -i test.embl -k AC -f CDS

<3 records output - one for each CDS feature>
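The effect of selecting by feature key can be mimicked with a short Python sketch. It is an illustration under stated assumptions rather than the read_embl implementation: the name `select_features` is hypothetical, only the feature key lines of test.embl are listed (qualifiers left out), and locations that wrap onto a second line are not handled.

```python
import re

# A feature starts on an FT line with the key in column 6; qualifier and
# continuation lines are indented further and do not match.
FEATURE_LINE = re.compile(r"^FT {3}(\S+)\s+(\S+)")


def select_features(ft_lines, wanted):
    """Return (key, location) for every feature whose key is in wanted."""
    hits = []
    for line in ft_lines:
        match = FEATURE_LINE.match(line)
        if match and match.group(1) in wanted:
            hits.append((match.group(1), match.group(2)))
    return hits


feature_table = [
    "FT   source          1..5028",
    "FT   mRNA            <1..>206",
    "FT   CDS             <1..206",
    'FT                   /product="TCP1-beta"',
    "FT   gene            <687..>3158",
    "FT   mRNA            <687..>3158",
    "FT   CDS             687..3158",
    "FT   gene            complement(<3300..>4037)",
    "FT   mRNA            complement(<3300..>4037)",
    "FT   CDS             complement(3300..4037)",
]

cds = select_features(feature_table, {"CDS"})
print(len(cds))
# prints: 3
```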

To limit parsing of the qualifiers, specify which to parse using the -q switch:

read_embl -i test.embl -k AC -f CDS -q translation

<3 records output - one for each CDS feature>

See also

read_genbank

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

December 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

read_embl is part of the Biopieces framework.

http://www.biopieces.org
