Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The output from proteinToGenome has empty value #157

Closed
zyh4482 opened this issue Jun 17, 2024 · 6 comments
Closed

The output from proteinToGenome has empty value #157

zyh4482 opened this issue Jun 17, 2024 · 6 comments

Comments

@zyh4482
Copy link

zyh4482 commented Jun 17, 2024

As an example, I would like to map the 1-28 amino acid of ENSP00000518892 back to its genomic coordinates. According to biomaRt query result, the respective transcript sequence should be:

ATGCCCAGGGCTCAGTTGCCTGAGGACAGCAGTGCAGTTGACATGGATATTCTCTTTCCTCTGGACAGTGTTATTGGTACAGAGCTGTGCCCCAGCCCCATTCCCCAGATCATCCATTTCGTCCTCTTTGTTGTGTTCAGCCTGGTGATCCTGATTATCTTACGCCTCTACATTCCCAGGGAGCCGTCCTCAGTGCCTCCCAGAGAGGAGGACAGCGAGAATGATCAAGCTGAAGTGGGGGAATGGCTCAGGATCGGAAATAAATATATCACTTTGAAAGATTACAGAATTCTCTTGAAAGAACTGGAGAACCTTGAGATCTACACTTTCCTGTCGAAAAAGTGCCTGAAGAAGCTCTCTAGGGAGGGCAGCTCCCATCACCTTCCACGCCAAGTCCGCCCAGGGCCAGTGTACAAACCAGCACCTGCTAGGAACCACCGGCCACGTGGGGGGCGTGGGAAAGCTTCTCCCACCAGCTTCCATGTGTCCCCACGGGCTCCCCTGGCTCCTCTGGCCTCCATGCCGTCATCAGTCCCGAAGACCTCCGTAGAGTCCTTGGGGTCTCCATCATCCCTGAGCTCCTCCAAGCCACGAGAGCCTCTGTGTCCCCTGAAGCACCCTTCACACCAGCCACCTGCGAGCACCCTATCACCAAACCCGACCAGCTCCACAGAATCCTTGGGGTATCTGTCATCCCTGAGCTCCTCCCAGCCACCAGAGCCTTTGCGTCCCCTGAAGCACCCTTCACACAAGCCACGTGGGCGTTCCCTTCCCCGACGACGGAATCCTGGCTGGGTGTCCTGGTCCGACTCCATGCAGGCTGATTCCGAAACTGACACCATAATATGCCCAATGTGCAAGGCCCCTGAGCGCTCCTGTCCACACACCTGGTGGGTGCCTTCTAGCCCTCGAGTGATCCGAGGCGTTGGTCGCTGCAGTGATCCCAACCTGGGCCTCTCCTGGAGGCAGGAGGCTGCTAGAGCCTGGTGCCACTGCACCTCCTCACAGTTCCCATTCAAGCACCCTAATCTTCCCACCCACCTACCAAAGGCTTCCTTCTAG

The peptide sequence is as below:

MPRAQLPEDSSAVDMDILFPLDSVIGTELCPSPIPQIIHFVLFVVFSLVILIILRLYIPREPSSVPPREEDSENDQAEVGEWLRIGNKYITLKDYRILLKELENLEIYTFLSKKCLKKLSREGSSHHLPRQVRPGPVYKPAPARNHRPRGGRGKASPTSFHVSPRAPLAPLASMPSSVPKTSVESLGSPSSLSSSKPREPLCPLKHPSHQPPASTLSPNPTSSTESLGYLSSLSSSQPPEPLRPLKHPSHKPRGRSLPRRRNPGWVSWSDSMQADSETDTIICPMCKAPERSCPHTWWVPSSPRVIRGVGRCSDPNLGLSWRQEAARAWCHCTSSQFPFKHPNLPTHLPKASF

But I cannot get this query from the proteinToGenome output. It returns an empty record:

> genoloci.ah[["ENSP00000518892"]]
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

May I ask why this happens? I don't understand why the sequence can all be found in the Ensembl database but cannot be mapped using proteinToGenome.

Thank you.

@jorainer
Copy link
Owner

jorainer commented Jul 8, 2024

I tried with current Bioconductor package versions (3.19) and Ensembl release 112 and it seems to work:

library(ensembldb)
library(AnnotationHub)
edb <- ah[["AH116860"]]
rng <- IRanges(start = 1, end = 28)
names(rng) <- "ENSP00000518892"
proteinToGenome(rng, edb)
Fetching CDS for 1 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/1 OK
$ENSP00000518892
GRanges object with 1 range and 7 metadata columns:
      seqnames              ranges strand |      protein_id           tx_id
         <Rle>           <IRanges>  <Rle> |     <character>     <character>
  [1]        5 179644937-179645020      - | ENSP00000518892 ENST00000511063
              exon_id exon_rank    cds_ok protein_start protein_end
          <character> <integer> <logical>     <integer>   <integer>
  [1] ENSE00003874488         1      TRUE             1          28
  -------
  seqinfo: 1 sequence from GRCh38 genome

maybe you used a different Ensembl release?

@mschubert
Copy link

I stumbled over this yesterday too: The reason was that it wasn't obvious to me that the ranges need to have protein ID names.

The names are set in the Examples section of ?proteinToGenome, but it would have saved me a bit of time if either:

  1. That the records should be named with protein IDs was stated in the Description field
  2. An IRanges object with positive length and no names would raise an error

@jorainer
Copy link
Owner

thanks for the comment @mschubert ! When you write

was stated in the Description field

do you mean the help page of the proteinToGenome() function?

Regarding the second point, yes that makes sense. I'll look into it - and see if that has no other unwanted effects.

@zyh4482
Copy link
Author

zyh4482 commented Aug 21, 2024

@mschubert This is quite unexpected... I've never thought about this before. Thank you!

jorainer added a commit that referenced this issue Aug 21, 2024
- Improve documentation of `proteinToGenome()` describing the input parameter
  better.
- Throw error if provided `IRanges` does not contain protein identifiers.
jorainer added a commit that referenced this issue Aug 21, 2024
- Improve `proteinToGenome()` documentation.
- Check input parameters of `proteinToGenome()` to ensure protein identifiers
  were provided (issue #157).
@mschubert
Copy link

do you mean the help page of the proteinToGenome() function?

Yes, exactly. I see you already added that, great, thank you!

@jorainer
Copy link
Owner

I'm closing this issue now - feel free to reopen if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants