Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pathogenicity predictions and allele to frequency #108

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

costero-e
Copy link
Collaborator

@costero-e costero-e commented Aug 17, 2023

This is the work that @dglemos has been working on and already deployed in the genomicVariations feature branch. This is ready to be merged into the develop branch IMO as the work seems finished.
Can you please review, @mbaudis and @redmitry ?

@costero-e costero-e requested review from mbaudis and redmitry August 17, 2023 09:01
@jrambla
Copy link
Contributor

jrambla commented Jan 9, 2024

I have some questions about the suggested changes. I know the PR has been waiting for some months.

  1. Why do we need to add the allele if, actually, the section is inside a given genomicVariation?
  2. The pathogenicityPrediction should be linked to an specific phenotypic effect, is not generic to any condition.
  3. The "phred_score:0.381" example doesn't seem compliant with the ontologyTerm, is it?

@costero-e
Copy link
Collaborator Author

@dglemos Can you check @jrambla questions?

@dglemos
Copy link
Collaborator

dglemos commented May 15, 2024

Hi @jrambla,

For the following query: assemblyId=GRCh38;referenceName=9;start=22125503;referenceBases=G;alternateBases=C
The Ensembl Beacon returns a match: rs1333049 -> this variant has two alt alleles A and C. As some of our data is not attached to a specific allele (only attached to ID rs1333049) our beacon returns both alleles in the response.
This is how the alleles are returned in the response:

"variation": {
              "referenceBases": "G",
              "alternateBases": "A,C",
              "variantType": "SNP",
              "location": {
                "interval": {
                  "type": "SequenceInterval",
                  "end": {
                    "value": 22125504,
                    "type": "Number"
                  },
                  "start": {
                    "type": "Number",
                    "value": 22125503
                  }
                },
                "type": "SequenceLocation",
                "sequence_id": "9"
              }
            }
  1. We prefer to attach the specific alt allele associated with the frequency in the response. This way we make sure the user knows which one has the frequency.
  2. The same applies here. We don't have pathogenic predictions directly linked to phenotypic effects.
  3. I'll review this one.

@jrambla
Copy link
Contributor

jrambla commented Oct 8, 2024

@mbaudis I'm not convinced that the approach here matches the overall genomicVariations approach and I fear the we could be creating confusion. Opinions?

@mbaudis
Copy link
Member

mbaudis commented Oct 8, 2024

@jordi @dglemos The root cause here is the treatment of variants with the strange VCF concept where you are able to capture haplotypes for observations but each VCFvar (i.e. line) can express multiple alternative possibilities (0|1|2... alleles changed etc.). Therefore when storing/expressing variants as VCF lines you always will have ambiguous responses (the 0 wouldn't match, the 1 or 2 alleles would and the might be different). So reporting back becomes an issue - in the current Beacon way the caseLevelData would be ambiguous:

Do we report all the samples which have a match on the request but where the observation might differ as one or as different variants depending on their allelic composition (i.e. here separate C,C, A,C variants and their samples)? The latter option makes sense but compared to VCF represents a kind of de-composition. And we do not have this defined. And makes sense comes with a big caveat: Are we talking about haplotypes (we shouldn't) or alleles?

My take:

  • Beacon queries are for alleles so any variant (allele or haplotype etc.) matching the query should be returned.
  • The Beacon legacy variation does not allow for the representation of haplotypes, so in this case all matches would look the same.
  • We have to work on a new variant schema (I'd started a draft... which basically should be lifted from current VRS v2.n.
  • We have to revise the confusing mix of variants and calls and e.g. get rid of the caseLevelData embedding in favour of reporting individual observations (no ambiguities there) and referencing common identifiers etc. (and @redmitry has made some technical points why this is necessary, beyond the logic of a hierarchical model currently being broken).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants