Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with ensDbFromGff using recent ensembl gff files #97

Open
stephenrong opened this issue Apr 17, 2019 · 17 comments
Open

Problem with ensDbFromGff using recent ensembl gff files #97

stephenrong opened this issue Apr 17, 2019 · 17 comments

Comments

@stephenrong
Copy link

stephenrong commented Apr 17, 2019

I'm having troubles with ensDbFromGff for some gff3 files downloaded from Ensembl ftp. For example, Danio rerio in Ensembl versions 94, 95, and 96 (but not Ensembl versions 93, 92, 91, 90, or 85), using the following code:

gff <- "../data/test_files/Danio_rerio.GRCz11.94.gff3.gz"
DB <- ensDbFromGff(gff=gff)
edb <- EnsDb(DB)
edb
transcripts(edb)

I get the corresponding warning message:

Warning messages:
1: In [<-.factor(*tmp*, idx, value = "transcript") :
invalid factor level, NA generated

And zero transcripts show up using transcripts(edb)

GRanges object with 0 ranges and 6 metadata columns:

This also happens for Ailuropoda_melanoleuca (works for Ens version 85 but not 91), so I bet this is a widespread problem across organisms.

@jorainer
Copy link
Owner

Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the EnsDb from the GFF instead of using the official ones from AnnotationHub?

@weir12
Copy link

weir12 commented Jun 11, 2019

Hi
@jorainer
I ran into the same problem .
because 'Schizosaccharomyces pombe' does not exist on AnnotationHub. I had to download the GFF file for this species from ensembl.
The version of my ensembldb package is 2.6.8, R 3.5.1 env.
Attach the gff link: [(https://fungi.ensembl.org/Schizosaccharomyces_pombe/Info/Index)]
Here is the demo of GFF file

##gff-version 3
##sequence-region   I 1 5579133
##sequence-region   II 1 4539804
##sequence-region   III 1 2452883
##sequence-region   MT 1 19431
#!genome-build PomBase ASM294v2
#!genome-version ASM294v2
#!genome-date 2009-05
#!genome-build-accession GCA_000002945.2
#!genebuild-last-updated 2015-11
I	PomBase	gene	1	5662	.	-	.	ID=gene:SPAC212.11;Name=tlh1;biotype=protein_coding;description=RecQ type DNA helicase [Source:PomBase%3BAcc:SPAC212.11];gene_id=SPAC212.11;logic_name=pombase
I	PomBase	mRNA	1	5662	.	-	.	ID=transcript:SPAC212.11.1;Parent=gene:SPAC212.11;Name=tlh1;biotype=protein_coding;transcript_id=SPAC212.11.1
I	PomBase	exon	1	5662	.	-	.	Parent=transcript:SPAC212.11.1;Name=SPAC212.11.1:exon:1;constitutive=1;ensembl_end_phase=1;ensembl_phase=0;exon_id=SPAC212.11.1:exon:1;rank=1
I	PomBase	CDS	1	5662	.	-	0	ID=CDS:SPAC212.11.1:pep;Parent=transcript:SPAC212.11.1;protein_id=SPAC212.11.1:pep
###
I	PomBase	chromosome	1	5579133	.	.	.	ID=chromosome:I;Alias=CU329670.1,chr1
###
I	PomBase	pseudogene	5726	6331	.	-	.	ID=gene:SPAC212.10;Name=SPAC212.10;biotype=pseudogene;gene_id=SPAC212.10;logic_name=pombase
I	PomBase	pseudogenic_transcript	5726	6331	.	-	.	ID=transcript:SPAC212.10.1;Parent=gene:SPAC212.10;Name=SPAC212.10.1;biotype=pseudogene;transcript_id=SPAC212.10.1
I	PomBase	exon	5726	6331	.	-	.	Parent=transcript:SPAC212.10.1;Name=SPAC212.10.1:pseudogenic_exon:1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=SPAC212.10.1:pseudogenic_exon:1;rank=1

@jorainer
Copy link
Owner

Thanks for reporting @weir12 . Note that EnsDb databases built from GFF will miss some annotations (because they are not provided in the GFF files) I'd rather suggest to build the object directly from Ensembl with their perl API. Since installing the Perl API and using it is not trivial I could create the EnsDb for you if you want. Just let me know what Ensembl/ensemblgenomes version you need and I'll create it.

@weir12
Copy link

weir12 commented Jun 11, 2019

@jorainer
Many thanks, to be honest, I just want to get the transcript coordinates corresponding to the genomic coordinates of interest. So I found the function genomeToTranscript of the ensembldb package. Here is the NCBI link about the genome,annotation,transcript I used(https://www.ncbi.nlm.nih.gov/genome/?term=txid4896%5borgn%5d). If I am not mistaken, the genome version is ASM294v2 and the annotation file version is ASM294v2.43 in ensemble.
If there is a more convenient and quick way to solve the problem of coordinate conversion,I am glad to accept that.
thanks again

@jorainer
Copy link
Owner

I've created the EnsDb, you can download it from here.

To use it simply do:

> library(ensembldb)
> edb <- EnsDb("EnsDb.Spombe.v96.sqlite")
> genes(edb)
GRanges object with 7268 ranges and 8 metadata columns:
              seqnames      ranges strand |     gene_id   gene_name
                 <Rle>   <IRanges>  <Rle> | <character> <character>
  SPBC460.01c AB325691   1479-3197      - | SPBC460.01c SPBC460.01c
  SPBC460.02c AB325691   8856-9803      - | SPBC460.02c SPBC460.02c
          ...      ...         ...    ... .         ...         ...
     SPMTR.03      MTR 15593-15721      - |    SPMTR.03     mat3-Mm
     SPMTR.04      MTR 15856-16401      + |    SPMTR.04     mat3-Mc
                gene_biotype seq_coord_system
                 <character>      <character>
  SPBC460.01c protein_coding       chromosome
  SPBC460.02c protein_coding       chromosome
          ...            ...              ...
     SPMTR.03 protein_coding       chromosome
     SPMTR.04 protein_coding       chromosome
                                                                                                                   description
                                                                                                                   <character>
  SPBC460.01c                                                    amino-acid permease, unknown [Source:PomBase;Acc:SPBC460.01c]
  SPBC460.02c eukaryotic translation elongation factor, glutathione S-transferase (predicted) [Source:PomBase;Acc:SPBC460.02c]
          ...                                                                                                              ...
     SPMTR.03                                     silenced mating-type m-specific polypeptide Mi [Source:PomBase;Acc:SPMTR.03]
     SPMTR.04                                     silenced mating-type m-specific polypeptide Mc [Source:PomBase;Acc:SPMTR.04]
              gene_id_version      symbol            entrezid
                  <character> <character>              <list>
  SPBC460.01c     SPBC460.01c SPBC460.01c                  NA
  SPBC460.02c     SPBC460.02c SPBC460.02c                  NA
          ...             ...         ...                 ...
     SPMTR.03        SPMTR.03     mat3-Mm c(2539637, 3361261)
     SPMTR.04        SPMTR.04     mat3-Mc             2540048
  -------
  seqinfo: 6 sequences from ASM294v2 genome

You can then obviously also use the corrdinate mapping tools - and since it contains protein data you can also map to protein sequence-relative coordinates.

@weir12
Copy link

weir12 commented Jun 11, 2019

@jorainer
Cheers!The problem was solved perfectly.I was really entangled in this problem for a whole day.Sincerely thank you!

@guidohooiveld
Copy link

guidohooiveld commented Jul 11, 2019

Hi Johannes,
I face the exact same problem as reported in this thread, but for the organism 'Lottia gigantea'. This organisms does also not exists in the AnnotationHub. Any assistance would be appreciated! Thanks, Guido

http://metazoa.ensembl.org/Lottia_gigantea/Info/Index
GFF: ftp://ftp.ensemblgenomes.org/pub/metazoa/release-44/gff3/lottia_gigantea/Lottia_gigantea.Lotgi1.44.gff3.gz

> ensDbFromGff(gff="Lottia_gigantea.Lotgi1.44.gff3.gz", path="./out", organism="Lottia_gigantea", genomeVersion="Lotgi1", version="44")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
  -------------
Proceeding to create the database.
Error in connection_connect(dbname, loadable.extensions, flags, vfs) : 
  Could not connect to database:
unable to open database file
In addition: Warning message:
In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
  invalid factor level, NA generated
> 
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

> traceback()
10: stop(list(message = "Could not connect to database:\nunable to open database file", 
        call = connection_connect(dbname, loadable.extensions, flags, 
            vfs), cppstack = list(file = "", line = -1L, stack = c("/usr/lib64/R/library/RSQLite/libs/RSQLite.so(Rcpp::exception::exception(char const*, bool)+0xa7) [0x7eff9101fbc7]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(void Rcpp::stop<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)+0x54) [0x7eff9102284d]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(DbConnection::DbConnection(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x95) [0x7eff910222d5]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(connection_connect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x42) [0x7eff910319f2]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(_RSQLite_connection_connect+0x35a) [0x7eff9102ae8a]", 
        "/usr/lib64/R/lib/libR.so(+0x139ed9) [0x7effa932ced9]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14a54c) [0x7effa933d54c]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x2fc) [0x7effa933cd7c]", 
        "/usr/lib64/R/lib/libR.so(+0x142333) [0x7effa9335333]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(R_execMethod+0x258) [0x7effa933faf8]", 
        "/usr/lib64/R/library/methods/libs/methods.so(+0x53b2) [0x7eff9709f3b2]", 
        "/usr/lib64/R/lib/libR.so(+0x18e27f) [0x7effa938127f]", "/usr/lib64/R/lib/libR.so(+0x13c80e) [0x7effa932f80e]", 
        "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(R_execMethod+0x258) [0x7effa933faf8]", 
        "/usr/lib64/R/library/methods/libs/methods.so(+0x53b2) [0x7eff9709f3b2]", 
        "/usr/lib64/R/lib/libR.so(+0x18e27f) [0x7effa938127f]", "/usr/lib64/R/lib/libR.so(+0x13c80e) [0x7effa932f80e]", 
        "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(Rf_eval+0x2ba) [0x7effa933cd3a]", 
        "/usr/lib64/R/lib/libR.so(+0x14e3e7) [0x7effa93413e7]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x58d) [0x7effa933d00d]", 
        "/usr/lib64/R/lib/libR.so(Rf_ReplIteration+0x1f2) [0x7effa936e6d2]", 
        "/usr/lib64/R/lib/libR.so(+0x17ba60) [0x7effa936ea60]", "/usr/lib64/R/lib/libR.so(run_Rmainloop+0x50) [0x7effa936eb20]", 
        "/usr/lib64/R/bin/exec/R(main+0x1f) [0x55b69a5a30af]", "/lib64/libc.so.6(__libc_start_main+0xf3) [0x7effa9023f33]", 
        "/usr/lib64/R/bin/exec/R(_start+0x2e) [0x55b69a5a30ee]"))))
9: connection_connect(dbname, loadable.extensions, flags, vfs)
8: initialize(value, ...)
7: initialize(value, ...)
6: new("SQLiteConnection", ptr = connection_connect(dbname, loadable.extensions, 
       flags, vfs), dbname = dbname, flags = flags, vfs = vfs, loadable.extensions = loadable.extensions, 
       ref = new.env(parent = emptyenv()), bigint = bigint)
5: .local(drv, ...)
4: dbConnect(dbDriver("SQLite"), dbname = dbname)
3: dbConnect(dbDriver("SQLite"), dbname = dbname)
2: ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, 
       genomeVersion = genFromFile, version = ensFromFile, ...)
1: ensDbFromGff(gff = "Lottia_gigantea.Lotgi1.44.gff3.gz", path = "./out", 
       organism = "Lottia_gigantea", genomeVersion = "Lotgi1", version = "44")

@jorainer
Copy link
Owner

Hi @guidohooiveld , I'll create the EnsDb. I'll let you know when it is ready and from where you can get it.

@jorainer
Copy link
Owner

The EnsDb (Ensembl 97/Ensemblgenomes 44) is available here

@guidohooiveld
Copy link

Wow, within 1 hr a working database... Thanks very much Johannes, much appreciated!

@jorainer
Copy link
Owner

Let me know if you need an updated version or other too.

@Dataxplora
Copy link

Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the EnsDb from the GFF instead of using the official ones from AnnotationHub?

I am struggling with a similar issue. But 'AnnotationHub' doesn't contain information for Pseudomonas aeruginosa

@jorainer
Copy link
Owner

@Olakmephd, bacteria are a little more difficult. Ensembl does not provide a single database for each bacteria, but has databases for a collection of bacteria. I did not yet find a good way to extract the data for a single bacteria from such a database using ensembldb (and the Ensembl Perl API).

You could however download the GFF file for this bacteria e.g. from:
ftp://ftp.ensemblgenomes.org/pub/release-53/bacteria/gff3/bacteria_98_collection/pseudomonas_aeruginosa_gca_003325605 (although I don't know if that's the actuall strain you are looking for) and then create the EnsDb using

> library(ensembldb)
> db <- ensDbFromGff("Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
  -------------
Proceeding to create the database.
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
  o transcript_name ... Nope
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
  invalid factor level, NA generated
2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile,  :
  I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty!
4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion,  :
  Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths.
5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
  Unable to retrieve sequence lengths from Ensembl.

You can ignore the warning messages. Also, the function fails to retrieve the correct chromosome lengths from Ensembl, so you'll not have that information in the database. You can then load this databases with:

> edb <- EnsDb(db)
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Mon May 16 06:06:51 2022
|ensembl_version: 49
|ensembl_host: unknown
|Organism: Pseudomonas_aeruginosa_gca_003325605
|genome_build: ASM332560v1
|DBSCHEMAVERSION: 1.0
|source_file: Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz
| No. of genes: 8770.
| No. of transcripts: 8770.
> organism(edb)
[1] "Pseudomonas aeruginosa gca 003325605"

@IrshadUlHaq1
Copy link

IrshadUlHaq1 commented Apr 25, 2023

@jorainer
I am really sorry to ask you but I have been trying to make EnsDB for "Trichoderma reesei RUT-C30", however, I am getting the following warning messages. Could you please look into it and kindly assist me?

Thanks
`> gfffile <- "Trichoderma_reesei_rut_c_30_gca_000513815.TrireRUTC30_1.56.gff3.gz"

DB <- ensDbFromGff(gff = gfffile)
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
Proceeding to create the database.
Processing genes ...
Attribute availability:
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
OK
Processing transcripts ...
Attribute availability:
o transcript_id ... OK
o gene_id ... OK
o transcript_biotype ... OK
o transcript_name ... Nope
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In [<-.factor(*tmp*, idx, value = "transcript") :
invalid factor level, NA generated
2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, :
I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty!
4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion, :
Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths.
5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
Unable to retrieve sequence lengths from Ensembl.
`

@jorainer
Copy link
Owner

@IrshadUlHaq1 , the warnings just tell you that some of the content is not available in the EnsDb you created, because it is not stored/available in the GFF file. The EnsDb should however still be usable.

@IrshadUlHaq1
Copy link

@jorainer, thank you for your input. The generated 'edb' is usable, however, my end goal is to extract 'entrez' ID for a subset of the 'gene_id'. I need 'entrez' to perform pathway analyses. While I am aware that the 'gff' file does not contain 'entrez' info, I wonder if you have other suggestion for me? I tried to generate the 'edb' using Ensembl Perl API but I ran into problems on my Arch linux.

On another note, could the 'gene_id' be converted to 'entrez' id's by a different package such as 'ClusterProfiler'? Also, the 'gene_id' in the 'edb' are 'stable IDs' and I having a hard time to understand the differences here.

I appreciate your help and time.
Again, thank you.

@jorainer
Copy link
Owner

jorainer commented May 3, 2023

Unfortunately the scripts in ensembldb to create an EnsDb using the Ensembl Perl API don't work properly for bacteria, mostly because Ensembl does not provide a database for a single species/strain (they are bundled in "collections").

I guess it should be possible to convert the IDs in the EnsDb to Entrez IDs - maybe using biomaRt? Or any other package or resource that provides annotations for your species.

The gene_id in EnsDb is usually the Ensembl gene ID (ENSG...), but it depends what identifiers the GFF provides. It will be the gene identifier provided within the GFF file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants