Will the API offer an alias to digest conversion endpoint? #4

Open
nsheff opened this issue Nov 12, 2020 · 4 comments

Comments

@nsheff
Member

nsheff commented Nov 12, 2020

One of the use cases brought up was this: what if a user wants to get the sequence collection checksum(s) from the name of the collection (e.g. grch38)?

We determined that Sequence collections should be congruent with the approach taken by refget in terms of allowing human-readable alias-based queries.

In this issue (samtools/hts-specs/issues/329), it seems clear that refget was not intended to do this.

@andrewyatz says:

I viewed the aliases section as a bit where an API can say "I believe this is a known alias for this ID". Nothing more. Those known aliases could be other checksums e.g. if UniParc implemented this they could provide their crc64 checksums as an alias. Part of me feels that this is a buyer beware situation.

Secondly refget is not built to support sequence retrieval using an alias. Imagine the following URL /sequence/alias/chr1 and how impossible this is to resolve without additional metadata. Refget is trying to resolve this situation by using checksums so supporting alias lookup feels like it's going against refget's ethos.

That hopefully puts clear water between aliases e.g. chr1 and alternative methods of generating the checksum identifiers. We never intended to query the server by alias.

In light of this, I'd propose the seqcol spec specifically not provide endpoints that operate on human-readable aliases.

On the other hand, 'chr1' is a much more universal identifier than something like 'hg38', so perhaps there is some value in returning a list of identifiers that include "hg38" under "aliases".

@andrewyatz
Collaborator

The big issue for me is what we mean by GRCh38 or hg38 (apologies for my GRCh38 bias). Because this could be:

  • GRCh38.p13 (GCA_000001405.28) - latest patch version
  • GRCh38 (GCA_000001405.15) - first release of GRCh38
  • Ensembl and UCSC's representation of GRCh38
  • Just the chromosomes
  • All of the above

In all likelihood, all of the above is the right answer. Since GRC patches are additive, referring to p13 means referring to all prior patches, but since patches are frequently released you want to use the "tag". It also means GRCh38 applies equally to all of these.

So there is an imprecise query coming in "I want the assembly that refers to hg38" which we cannot give an exact answer to because seqcol is going to be very precise about what you're going to work with.

@tcezard
Collaborator

tcezard commented Nov 23, 2020

This is the reverse lookup use case, and it is similar to the discussion in the refget reverse lookup workstream, so I'll add my current thinking here.
For me, the reverse lookup (which is a generalisation of the alias lookup) is a search in the metadata (as defined in issue #3).
So if the metadata contains an alias field, you should be able to filter on it and get all the results that have this alias.
The results would be implementation-dependent because different implementations might have different metadata associated with the same collection.

  • A UCSC implementation could recognise hg38 as GCA_000001405.15 as they describe here
  • An NCBI implementation might not have this alias and only recognise GRCh38, and could choose to return either only GCA_000001405.15 or a list of all versions from GCA_000001405.15 to GCA_000001405.28

We could specify in the /metadata-schema endpoint a set of fields that would be searchable via the reverse lookup endpoint so clients can know what filters are available.
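As a rough illustration of the metadata-search idea, here is a minimal Python sketch. Everything in it is an assumption made for the example, not part of any seqcol or refget specification: the `Collection` class, the `SEARCHABLE_FIELDS` set (standing in for what a /metadata-schema response might advertise), the `reverse_lookup` function, and the placeholder digests `digest-A` and `digest-B`.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    digest: str  # placeholder value, not a real seqcol digest
    metadata: dict = field(default_factory=dict)


# Hypothetical set of fields a server might advertise as searchable.
SEARCHABLE_FIELDS = {"aliases", "accession"}


def reverse_lookup(collections, field_name, value):
    """Return the digests of all collections whose metadata field contains value."""
    if field_name not in SEARCHABLE_FIELDS:
        raise ValueError(f"field {field_name!r} is not searchable")
    hits = []
    for coll in collections:
        stored = coll.metadata.get(field_name, [])
        # Treat a scalar field as a one-element list.
        values = stored if isinstance(stored, list) else [stored]
        if value in values:
            hits.append(coll.digest)
    return hits


# A UCSC-like store that knows "hg38", and an NCBI-like store that does not.
ucsc_store = [
    Collection("digest-A", {"aliases": ["hg38"], "accession": "GCA_000001405.15"}),
]
ncbi_store = [
    Collection("digest-A", {"aliases": ["GRCh38"], "accession": "GCA_000001405.15"}),
    Collection("digest-B", {"aliases": ["GRCh38"], "accession": "GCA_000001405.28"}),
]

print(reverse_lookup(ucsc_store, "aliases", "hg38"))    # ['digest-A']
print(reverse_lookup(ncbi_store, "aliases", "hg38"))    # []
print(reverse_lookup(ncbi_store, "aliases", "GRCh38"))  # ['digest-A', 'digest-B']
```

The same query gives different answers against the two stores, which is the implementation-dependent behaviour described above.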

@andrewyatz
Collaborator

I think this is the right way to think about the issue so we can combine our thinking for sequence reverse lookup and this. Having this be an implementation specific issue is a good way around the problem, but I do think any service that's worth its salt will register all known aliases.

The bigger problem now will be how to handle the ambiguity and pass back the "correct" and precise collection or sequence from an imprecise query. I don't think that's this API's business but something that'll have to be an out of scope manual curation process. Though I can see someone from a genome provider like UCSC, Ensembl or INSDC making those calls.

@nsheff
Member Author

nsheff commented Nov 25, 2020

The bigger problem now will be how to handle the ambiguity and pass back the "correct" and precise collection or sequence from an imprecise query.

Well, if it doesn't want to make an authoritative claim on what a human-readable alias means, it would pass back all the possible matches. Or, if it does want to make an authoritative claim, it would pass back just the one it claims is the match.
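
The two behaviours could be sketched like this in Python. It is only an illustration under assumed names: `resolve_alias`, the `candidates` and `authoritative` mappings, and the placeholder digests are hypothetical and not drawn from any spec.

```python
def resolve_alias(alias, candidates, authoritative=None):
    """Resolve a human-readable alias to a list of collection digests.

    candidates:    dict mapping alias -> list of every digest known under it
    authoritative: optional dict mapping alias -> the single digest the
                   service claims is the match
    """
    authoritative = authoritative or {}
    if alias in authoritative:
        # The service makes an authoritative claim: return just that one.
        return [authoritative[alias]]
    # No claim is made: return every possible match.
    return list(candidates.get(alias, []))


candidates = {"hg38": ["digest-A", "digest-B"]}

print(resolve_alias("hg38", candidates))                        # ['digest-A', 'digest-B']
print(resolve_alias("hg38", candidates, {"hg38": "digest-A"}))  # ['digest-A']
```

Either way the response is a list, so a client can handle both kinds of server the same way.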
