Cant retrieve reference information #110

Open
danielcgingerich opened this issue Nov 19, 2020 · 1 comment

Comments

@danielcgingerich

Could someone please explain how to get the annotation for GRCh38 2020-A and convert it to a GRanges object?

GRCh38_2020-A<-ensDbFromGtf(gtf = "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz",
                            path = 'C:/Users/danie/Desktop/Seurat Objects/snATAC seq preliminary analysis/ref.genome/',
                            organism = "Homo_sapiens",
                            genomeVersion = 'GRCh38',
                            version = 98)

Importing GTF file ... trying URL 'http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz'
Content type 'application/octet-stream' length 43107903 bytes (41.1 MB)
downloaded 41.1 MB

OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... Nope
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o source ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... Error: UNIQUE constraint failed: exon.exon_id
In addition: Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 2 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 3 appears to contain an embedded nul
4: In readLines(gtf, n = 10) : line 6 appears to contain an embedded nul
5: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid','gene_biotype'. The corresponding database column(s) will be empty!
6: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion,  :
  Could not determine length for all seqnames.

Why?

@jorainer
Owner

Hi, sorry for the late reply!

According to the error message, it seems that the exon identifiers in the GTF file are not unique, and there is not much we can do about that. Generally, creating EnsDb objects/databases from GTF is tricky because the GTF file format is not well standardized. Creating databases from Ensembl GTF files should work; for the ones from Gencode I don't know.

Note that there are pre-built annotation resources for all Ensembl releases:
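
To check whether non-unique exon identifiers are really what the GTF contains, something like the following sketch might help. It is only an illustration: `find_conflicting_exon_ids` is a hypothetical helper (not part of ensembldb), the column names are assumed to match what you get from converting an imported GTF to a data frame, and the toy data is made up. An exon that is simply shared between several transcripts is not reported; only IDs that appear at more than one distinct location are.

```r
# Hypothetical helper (not an ensembldb function): report exon IDs that are
# attached to more than one distinct genomic location.
find_conflicting_exon_ids <- function(exons) {
  # One row per distinct (exon_id, location) combination, so an exon that is
  # listed once per transcript collapses to a single row.
  loc <- unique(exons[, c("exon_id", "seqnames", "start", "end", "strand")])
  counts <- table(loc$exon_id)
  # IDs that are still present more than once map to different locations.
  names(counts)[counts > 1]
}

# Made-up toy data: "E1" sits on two chromosomes, "E2" is merely listed twice.
toy <- data.frame(
  exon_id  = c("E1", "E1", "E2", "E2"),
  seqnames = c("chrX", "chrY", "chr1", "chr1"),
  start    = c(100L, 100L, 500L, 500L),
  end      = c(200L, 200L, 600L, 600L),
  strand   = c("+", "+", "-", "-"),
  stringsAsFactors = FALSE
)
find_conflicting_exon_ids(toy)

# With a real file (assumption: rtracklayer is installed and the GTF has an
# 'exon_id' attribute), the input could be built roughly like this:
# gtf   <- as.data.frame(rtracklayer::import("path/to/annotation.gtf.gz"))
# exons <- gtf[gtf$type == "exon", ]
# find_conflicting_exon_ids(exons)
```

If this returns any IDs for the Gencode file, those are candidates for the rows that trigger the `UNIQUE constraint failed: exon.exon_id` error, although the exact uniqueness rule applied when building the database may differ from this check.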

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2020-11-02
> query(ah, "EnsDb.Hsapiens.v98")
AnnotationHub with 1 record
# snapshotDate(): 2020-11-02
# names(): AH75011
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 98 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH75011"]]' 

Since Gencode 32 is based on Ensembl 98, would this work for you?
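
For the conversion to a GRanges object asked about above, a minimal sketch could look like the one below. It assumes the AnnotationHub and ensembldb packages are installed and that record `AH75011` (the one shown in the query output) is still available in your AnnotationHub snapshot; `load_ensdb_genes` is just an illustrative wrapper name, not an existing function.

```r
# Illustrative wrapper (hypothetical name): fetch an EnsDb from AnnotationHub
# and return its gene annotations as a GRanges object.
load_ensdb_genes <- function(record_id = "AH75011") {
  ah  <- AnnotationHub::AnnotationHub()
  # Downloads the resource on first use and caches it locally afterwards.
  edb <- ah[[record_id]]
  # genes() on an EnsDb returns a GRanges with one range per gene.
  ensembldb::genes(edb)
}

# Usage (needs an internet connection on the first call):
# gr <- load_ensdb_genes()
# gr
#
# Ensembl uses chromosome names such as "1" and "X", while Gencode uses
# "chr1" and "chrX". If the "chr" style is needed, this may help:
# GenomeInfoDb::seqlevelsStyle(gr) <- "UCSC"
```

`ensembldb::transcripts()` and `ensembldb::exons()` can be used in the same way if transcript-level or exon-level ranges are needed instead of genes.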
