GitHub - lcscs12345/ncbi-xsl

# Versatile and Lossless Conversion of NCBI GenBank Records

I've tested a handful of Perl and Python scripts that retrieve annotations from GenBank flat files (.gbk, .gbff or .seq). However, accurate, lossless conversion by parsing GenBank flat files seems like a dream. A better option is to download GFF files from ftp://ftp.ncbi.nlm.nih.gov/genomes/, but the GFF collection is only available for a subset of RefSeq. In addition, some entries might be outdated or temporarily withdrawn during curation.

Here is the official solution: parse ASN.1 files instead of flat files, using annotwriter from the NCBI C++ Toolkit. However, there is no precompiled annotwriter binary to download (the binary is 131 MB), so it has to be compiled from source. See http://sourceforge.net/p/song/mailman/song-devel/thread/[email protected]/

Install the NCBI C++ Toolkit. Warning: the full installation takes 21 GB. To compile annotwriter only, see http://www.ncbi.nlm.nih.gov/mailman/pipermail/cpp/2015q4/002738.html

    curl -O ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/ncbi_cxx--12_0_0.tar.gz
    tar zxvf ncbi_cxx--12_0_0.tar.gz
    cd ncbi_cxx--12_0_0
    ./configure --prefix=/ANY/DIR/ncbi_cxx--12_0_0
    make
    make install
    export PATH=$PATH:/ANY/DIR/ncbi_cxx--12_0_0/bin

Download the Entrez Direct suite

    curl -O ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip
    unzip edirect.zip
    export PATH=$PATH:~/ANY/DIR/edirect
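
Before going further it can help to confirm that both tools are reachable through `PATH`. The following is a minimal sketch, not part of this repository; the function name `check_tools` is made up for illustration.

```shell
# Report which of the given commands cannot be found on PATH.
# Prints one "missing: <name>" line per absent command and
# returns non-zero if at least one is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Usage (hypothetical):
#   check_tools annotwriter efetch
```

If the call prints nothing, both directories were added to `PATH` correctly.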

Download an ASN.1 file (replace `<gi>` with a GI number)

    efetch -db nucleotide -id <gi> > <gi.asn>

Convert the ASN.1 file to a GFF3 file

    annotwriter -i <gi.asn> -format gff3 -full-annots -o <gi.gff>
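
The two commands above can be chained for a single record. This is only a sketch under assumptions: the function name `gi_to_gff3` and the output file names are illustrative, and the `efetch` and `annotwriter` options are exactly the ones shown above. Redirecting standard input from `/dev/null` is a precaution in case `efetch` tries to read piped input.

```shell
# Download the ASN.1 record for one GI and convert it to GFF3.
# Writes <gi>.asn and <gi>.gff in the current directory.
gi_to_gff3() {
    gi=$1
    efetch -db nucleotide -id "$gi" > "$gi.asn" < /dev/null || return 1
    # Stop early if the download produced an empty file.
    [ -s "$gi.asn" ] || { echo "empty download for $gi" >&2; return 1; }
    annotwriter -i "$gi.asn" -format gff3 -full-annots -o "$gi.gff"
}

# Usage (hypothetical):
#   gi_to_gff3 10313991
```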

Another, highly versatile solution is to parse INSDSeq XML files. The steps described below use viral RefSeq as an example.

Retrieve all GIs from viral.1.1.genomic.fna

    curl -O ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
    gunzip viral.1.1.genomic.fna.gz
    grep ">" viral.1.1.genomic.fna | awk 'BEGIN {FS="|"} {print $2}' > viral.1.1.genomic.gi
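
To make the `grep`/`awk` step concrete, here is a minimal sketch that wraps it in a function reading any FASTA stream. It assumes headers in the `>gi|<number>|ref|<accession>|` layout that the command above relies on; headers without a `gi` field are skipped rather than producing blank lines. The function name `extract_gi` is illustrative only.

```shell
# Print the GI (second "|"-separated field) of every FASTA header on stdin.
# Assumes headers look like ">gi|<number>|ref|<accession>| description".
# Sequence lines and headers without a leading "gi" field are ignored.
extract_gi() {
    awk -F'|' '/^>/ && $1 == ">gi" { print $2 }'
}

# Usage (hypothetical):
#   extract_gi < viral.1.1.genomic.fna > viral.1.1.genomic.gi
```

If the resulting `.gi` file is empty, the FASTA headers probably do not carry GI numbers, and the list of identifiers has to be built another way.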

Download the Entrez Direct suite (skip this step if it is already installed)

    curl -O ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip
    unzip edirect.zip
    export PATH=$PATH:~/ANY/DIR/edirect

Download viral RefSeq records in INSDSeq XML format using the list of GIs.

NCBI Website and Data Usage Policies and Disclaimers: run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time on weekdays for any series of more than 100 requests.
```
 # One request per GI, with a one-second pause between requests
 while read -r name; do
     efetch -db nucleotide -id "$name" -format gpc > "$name.xml";
     sleep 1;
 done < viral.1.1.genomic.gi
```
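
A long download is likely to be interrupted at some point, so it can be useful to make the loop restartable. The sketch below is one way to do that, not the method used in this repository: it skips GIs that already have a non-empty output file and deletes empty or failed downloads so that a later run retries them. The function name `fetch_insdseq` and the `FETCH_DELAY` variable are made up for illustration; redirecting standard input from `/dev/null` is a precaution in case `efetch` would otherwise read the GI list itself.

```shell
# Fetch one INSDSeq XML file per GI listed in the file "$1".
# Already-downloaded GIs are skipped; failed or empty downloads are removed.
fetch_insdseq() {
    while read -r gi; do
        [ -n "$gi" ] || continue            # ignore blank lines
        [ -s "$gi.xml" ] && continue        # already downloaded
        # </dev/null keeps efetch away from the GI list on stdin
        if efetch -db nucleotide -id "$gi" -format gpc > "$gi.xml" < /dev/null \
           && [ -s "$gi.xml" ]; then
            echo "fetched $gi"
        else
            rm -f "$gi.xml"
            echo "failed $gi" >&2
        fi
        sleep "${FETCH_DELAY:-1}"           # pause between requests
    done < "$1"
}

# Usage (hypothetical):
#   fetch_insdseq viral.1.1.genomic.gi
```

Running the function a second time only requests the records that are still missing.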

Install XMLStarlet (optional)

On Ubuntu:

    sudo apt-get install xmlstarlet

On RedHat/CentOS/Fedora:

    sudo yum install xmlstarlet

On Mac OS X (only the install step needs root):

    curl -O http://iweb.dl.sourceforge.net/project/xmlstar/xmlstarlet/1.6.1/xmlstarlet-1.6.1.tar.gz
    tar zxvf xmlstarlet-1.6.1.tar.gz
    cd xmlstarlet-1.6.1
    ./configure
    make
    sudo make install

View the INSDSeq XML structure (optional), which helps when coding a stylesheet. 10313991.xml is one of the fetched files.
```
 xmlstarlet el 10313991.xml
```
Parse the XML with a custom stylesheet, which is surprisingly easy to code.
```
 xsltproc --novalid insdseq2annotation.xsl 10313991.xml
```
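
To run the stylesheet over every fetched record instead of a single file, a loop along these lines could be used. It is a sketch under assumptions: the function name `transform_all` and the `.txt` output extension are arbitrary choices for illustration, and the `xsltproc` call is the same one shown above.

```shell
# Apply an XSLT stylesheet ("$1") to every *.xml file in the current
# directory, writing <name>.txt next to each input file.
transform_all() {
    xsl=$1
    for xml in *.xml; do
        [ -e "$xml" ] || continue          # no matches: the glob stays literal
        out="${xml%.xml}.txt"
        if xsltproc --novalid "$xsl" "$xml" > "$out"; then
            echo "$xml -> $out"
        else
            echo "xsltproc failed on $xml" >&2
            rm -f "$out"                   # do not keep partial output
        fi
    done
}

# Usage (hypothetical):
#   transform_all insdseq2annotation.xsl
```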
