Question: can't locate complexes specified in our GPI #910

Open
ValWood opened this issue Aug 6, 2024 · 22 comments

Comments

@ValWood

ValWood commented Aug 6, 2024

I can't find this complex in Noctua:

ComplexAc: CPX-566

Even though it has been in our GPI since 2024-05-04

Can you let us know if we are doing anything wrong in the file, or is it a Noctua loading issue?
I tried both the "activity unit" and the "protein complex" entity annotons.

@ValWood
Author

ValWood commented Aug 6, 2024

cc @kimrutherford

@kltm
Member

kltm commented Aug 6, 2024

Noting that this should be ComplexPortal:CPX-566 (as http://noctua-amigo.berkeleybop.org/amigo/term/ComplexPortal:CPX-566 in our side instance)

I am able to find this term in at least some locations in the Noctua interface.
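For anyone scripting against the file, here is a minimal sketch of turning a bare accession into the prefixed form noted above. The `CPX-<digits>` pattern is only inferred from the two accessions mentioned in this thread, and `to_curie` is a hypothetical helper, not part of any GO tooling. Whether the GPI itself should carry the prefix is still an open question here.

```python
import re

# Assumption: Complex Portal accessions look like "CPX-" followed by digits,
# inferred only from CPX-566 and CPX-1852 in this thread.
CPX_RE = re.compile(r"^CPX-\d+$")


def to_curie(identifier: str, prefix: str = "ComplexPortal") -> str:
    """Return a Prefix:Local_ID form for a bare complex accession (hypothetical helper)."""
    ident = identifier.strip()
    if ":" in ident:
        # Already prefixed; leave it untouched.
        return ident
    if CPX_RE.match(ident):
        return f"{prefix}:{ident}"
    raise ValueError(f"not a recognised complex accession: {identifier!r}")


print(to_curie("CPX-566"))
print(to_curie("ComplexPortal:CPX-566"))
```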

@ValWood
Author

ValWood commented Aug 7, 2024

So I should always put ComplexPortal:CPX-566?
I find other complex IDs without the prefix, and we don't need the prefix to locate other entities?

cc @PCarme

@kltm
Member

kltm commented Aug 7, 2024

@ValWood We can dig into this a little when @vanaukenk is back, but it may be that the difference is what is supplied in the synonyms, etc.

@ValWood
Author

ValWood commented Aug 8, 2024

Are there any docs for how to specify complexes in GPI? @kimrutherford can check that we are doing it correctly. No hurry until @vanaukenk is back.

@ValWood
Author

ValWood commented Sep 5, 2024

Hi @vanaukenk can you let us know how complexes should be specified in the GPAD so we can check that we are doing it correctly?
Thanks,
val

ValWood changed the title from "can't locate complexes specified in our GPI" to "Question: can't locate complexes specified in our GPI" on Sep 5, 2024
@hattrill

hattrill commented Sep 5, 2024

@vanaukenk I have just got our devs to add complexes to our gpi (not in production yet) based on SGD's gpi and would like to check that the file is spec'd correctly as well.

@vanaukenk

@ValWood @hattrill

There are some issues surrounding use of ComplexPortal ids in GO-CAMs that need to be definitively resolved.
I propose that we use next week's Pathways2GO and GO-CAM call time slots to focus on that and then we will know better what to do wrt the gpi file.

Are you both available next Thursday?

@hattrill

hattrill commented Sep 5, 2024

@vanaukenk that is good for me. Thanks

@pgaudet

pgaudet commented Sep 9, 2024

> Hi @vanaukenk can you let us know how complexes should be specified in the GPAD so we can check that we are doing it correctly?

Is this helpful?

https://geneontology.org/docs/gene-product-information-gpi-format/

We can add a protein complex example.

@kimrutherford

> Is this helpful?
> https://geneontology.org/docs/gene-product-information-gpi-format/
> We can add a protein complex example.

Thanks Pascale.

We've been using the GPI 2.0 spec:
https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

Perhaps that's a problem?

@kimrutherford

> We've been using the GPI 2.0 spec:

We're putting the complex members in column 9 ("Protein_Containing_Complex_Members"), following the spec. The spec says Prefix ':' Local_ID separated by pipes. Prefix in our case is "PomBase". The example in the spec has UniProt IDs. Maybe that causes a problem?

This is an example of column 9 from our GPI file:

PomBase:SPAC29E6.08|PomBase:SPBC13E7.10c|PomBase:SPCC1919.14c
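A quick way to sanity-check a column 9 value against the "Prefix ':' Local_ID separated by pipes" rule quoted above is sketched below. The regular expression is an illustrative guess at that shape, not the spec's actual grammar, and `check_complex_members` is a hypothetical helper name.

```python
import re

# Assumption: a member ID is a prefix starting with a letter, a colon, then a
# non-empty local ID with no whitespace or pipes. The spec's real grammar may differ.
CURIE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:[^\s|]+$")


def check_complex_members(column_9: str) -> list[str]:
    """Return the pipe-separated members that do not look like Prefix:Local_ID."""
    if not column_9:
        return []
    return [member for member in column_9.split("|") if not CURIE_RE.match(member)]


example = "PomBase:SPAC29E6.08|PomBase:SPBC13E7.10c|PomBase:SPCC1919.14c"
print(check_complex_members(example))                          # no offenders
print(check_complex_members("PomBase:SPAC29E6.08|S000005274"))  # bare ID is flagged
```

By this check the PomBase value above is well-formed, so if it causes a problem, the cause is more likely the prefix itself than the syntax.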

suzialeksander added a commit to geneontology/geneontology.github.io that referenced this issue Sep 20, 2024
removing excess IDs, note also that this "example" doesn't match what SGD will be providing after next ingestion of SGD gpi- see geneontology/noctua#910
@suzialeksander

Noting here there are still some issues with SGD's complexes:

https://release.geneontology.org/2024-09-08/annotations/sgd.gpi.gz has

SGD:S000218003 CPX-1852 RPD3L histone deacetylase complex|RPD3L complex|RPD3(L)|RPD3/SIN3 large histone deacetylase complex|3.5.1.98|8GA8|EMD-29892|8HPO|EMD-34935 protein_complex taxon:559292

https://release.geneontology.org/2024-09-08/products/upstream_and_raw_data/sgd-src.gpi.gz has:

SGD:S000218003 ASH1:CTI6:DEP1:PHO23:RPD3:RXT2:RXT3:SAP30:SDS3:SIN3:UME1:UME6 RPD3L histone deacetylase complex GO:0032991 taxon:559292 S000005274|S000005274|S000001346|S000001346|S000001346|S000005364|S000005364|S000005364|S000005364|S000000299|S000000299|S000000299|S000000299|S000000299|S000006060|S000006102|S000004876|S000004876|S000005041|S000005041|S000005041|S000002234|S000002234|S000000011|S000000011|S000000011 ComplexPortal:CPX-1852

To find this complex in Noctua, the only current way is to enter S000218003 or SGD:S000218003 in the Term box, where the entity pops up as ASH1CTI6DEP1PHO23RPD3RXT2RXT3SAP30SDS3SIN3UME1UME6 Scer. Curators expect CPX-1852 to work, but it doesn't. I've found that searching for ASH in the Term box also works, but that's not an obvious name and isn't in the GPI provided by SGD.

SGD is modifying the supplied GPI, and the next available GPI from SGD will look more like the /annotations/sgd.gpi:

SGD:S000218003 CPX-1852 RPD3L histone deacetylase complex GO:0032991 taxon:559292 S000005274|S000005274|S000001346|S000001346|S000001346|S000005364|S000005364|S000005364|S000005364|S000000299|S000000299|S000000299|S000000299|S000000299|S000006060|S000006102|S000004876|S000004876|S000005041|S000005041|S000005041|S000002234|S000002234|S000000011|S000000011|S000000011 ComplexPortal:ASH1:CTI6:DEP1:PHO23:RPD3:RXT2:RXT3:SAP30:SDS3:SIN3:UME1:UME6

Strongly related ticket #914

@ValWood
Author

ValWood commented Oct 16, 2024

Update: I have been able to locate complexes only if I omit the hyphen from the identifier.
So, if I search for "CPX 566" instead of the actual ID "CPX-566" I find it???

@ValWood
Author

ValWood commented Oct 16, 2024

...but the has_parts are not automatically imported

@vanaukenk

I'm looking into this some more today.

So far, what I find when searching in the gene product field is:

CPX-566 does return the right complex, but it is very far down on the autocomplete selection list, i.e. the 40th entity listed
CPX 566 floats the entry to the top of the list
ComplexPortal:CPX-566 floats the entry to the top of the list
CPX-566 SPom has the entry second in the list

The search behavior is the same in the VPE as well as the standard annotation editor. I'll ask @tmushayahama about the search criteria to see if there's anything we can do to bump the right entry to the top of the search list when using CPX-566, as I am assuming that's the entry you'd most likely make? @ValWood

@suzialeksander I'm still looking into the SGD issues, as I can't find the SGD complexes in noctua-amigo, suggesting that this is a different problem.

@vanaukenk

vanaukenk commented Oct 22, 2024

@suzialeksander

I've been looking into the SGD gpi and protein complexes and honestly don't understand what's happening here. I see the exact same behavior you see.

I'll need some help troubleshooting from @kltm and @tmushayahama

@vanaukenk

> ...but the has_parts are not automatically imported

@ValWood - we haven't done any work yet to implement this functionality, but are aware it would be very helpful.

@ValWood
Author

ValWood commented Oct 22, 2024

> if there's anything we can do to bump the right entry to the top of the search list when using CPX-566, as I am assuming that's the entry you'd most likely make?

Yes. It's strange that IDs with spaces take priority over the correct identifier. As far as I'm aware, identifiers never have spaces?

@kltm
Member

kltm commented Oct 22, 2024

I wanted to clarify a little about what is going on here wrt CPX-566. I'm not justifying it or saying it's good--issues with the autocomplete are well known and numerous (geneontology/amigo#120, geneontology/amigo#102), but I wanted to give context for the mechanisms here.

I'd have to look into the exact math to be sure, but essentially, when looking for http://noctua-amigo.berkeleybop.org/amigo/term/ComplexPortal:CPX-566 , there are a few ways to get at it.

If we look at the general index search on the noctua autocomplete AmiGO instance (http://noctua-amigo.berkeleybop.org, upper-right):

ComplexPortal:CPX-566: first result
CPX-566: first result
CPX 566: second result, with first result being "understandable"
"CPX-566": first and only result
CPX-566 Spom: first result

If we look at the "Filter by Term" "ontology" search on the Noctua landing page:

ComplexPortal:CPX-566: first result
CPX-566: hard to find
CPX 566: first result
"CPX-566": first and only result
CPX-566 Spom: first result

First, to reiterate, this should not be an issue and we would like to prioritize fully fixing our search so that we don't need to have these conversations. That aside, for context for what we're seeing here today:

The two indexes here treat a couple of things a little differently, which is why we get the different results. What is likely happening in the second case (that is being used by the Noctua interface) is that when the CPX-566 string is being read in the dash is removed and the index looks for things that have "CPX"-ness or "566"-ness. Having a lot of "CPX"-ness outweighs having a little "566"-ness, so the desired result here gets lost. When the string CPX 566 is used, the string is read in and it understands that there needs to be "CPX"-ness and "566"-ness; there is only one thing that fits both of those criteria best and the desired result gets returned.
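The "or" versus "and" behaviour described above can be illustrated with a toy model. This is not the actual index or its scoring; the two extra documents and the count-based score are invented for illustration, and the `require_all` flag simply stands in for the difference between how the dashed and spaced queries are read.

```python
import re

# Hypothetical mini-index. Only CPX-566 comes from this thread; the other
# entries are invented to have a lot of "CPX"-ness and no "566"-ness.
DOCS = {
    "ComplexPortal:CPX-566": "CPX-566 Spom",
    "ComplexPortal:CPX-100": "CPX-100 CPX CPX complex",
    "ComplexPortal:CPX-200": "CPX-200 CPX related",
}


def tokens(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit (so the dash is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query: str, require_all: bool) -> list[str]:
    """Rank documents by how often the query tokens occur in their label."""
    query_tokens = tokens(query)
    hits = []
    for doc_id, label in DOCS.items():
        doc_tokens = tokens(label)
        counts = [doc_tokens.count(t) for t in query_tokens]
        if require_all and not all(counts):
            continue  # "and" semantics: every token must be present
        if sum(counts) > 0:
            hits.append((sum(counts), doc_id))
    return [doc_id for _, doc_id in sorted(hits, key=lambda hit: -hit[0])]


# Dashed query read as "CPX"-ness OR "566"-ness: the target is outranked.
print(search("CPX-566", require_all=False))
# Spaced query read as "CPX"-ness AND "566"-ness: only the target survives.
print(search("CPX 566", require_all=True))
```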

Technically speaking, there are things one can do in a case like this to ensure better results (e.g. when there is a dash also search for the quoted string or something), but we will need to weigh the effort needed to make and tune that versus the effort to just "start over" on the autocomplete with a newer and more robust system.
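As a minimal sketch of the "also search for the quoted string" idea: the hypothetical helper below produces the query variants a client could try, most specific first. It assumes only that quoting a string makes the index treat it as an exact phrase, as the "CPX-566" results above suggest. It does not call any real Noctua or AmiGO API.

```python
def expand_query(raw: str) -> list[str]:
    """Return query variants to try in order, putting a quoted form first for dashed input."""
    query = raw.strip()
    already_quoted = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    variants = []
    if "-" in query and not already_quoted:
        # A dashed string is probably an identifier, so try it as an exact phrase first.
        variants.append(f'"{query}"')
    variants.append(query)
    return variants


print(expand_query("CPX-566"))
print(expand_query("CPX 566"))
```

Results from the quoted variant would be listed ahead of those from the raw string, so ordinary free-text searches are unaffected.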

EDIT:

Noting that we have a redo NEO pipeline (geneontology/project-management#52) and some notes on redoing AmiGO, it might be worth it to spec out redoing NEO and Noctua autocomplete as a separate standalone project that could be almost a drop-in replacement, then use that to inform future AmiGO and GO API work (or feed it into the GO API first).

@ValWood
Author

ValWood commented Oct 23, 2024

Just to say, shouldn't gene product searches always be exact matches, exactly as the user typed them (i.e. no 'fuzziness')?
v

@kltm
Member

kltm commented Oct 23, 2024

@ValWood Again, I'm not talking about what should be--I think we all agree on that--just clarifying the mechanics of what is now for anybody diving into this. A fix can be applied either in the backend or frontend, with the immediate issue being around the mishandling of the dash in the identifier (which is essentially being treated like whitespace in this case). Special-case coding could likely be added to fix this edge case, but it might be worth weighing that against longer-term fixes and other fixes that are being queued up for Noctua.
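One possible shape for a frontend-side special case, sketched under assumptions: re-order whatever the autocomplete returns so that an exact identifier match comes first. `promote_exact` is a hypothetical helper, and results are assumed to be plain Prefix:Local_ID strings; the real Noctua result objects will look different.

```python
def local_id(curie: str) -> str:
    """Return the part after the first colon, or the whole string if there is none."""
    return curie.split(":", 1)[-1]


def promote_exact(query: str, results: list[str]) -> list[str]:
    """Move results whose full ID or local ID equals the query to the front, keeping order otherwise."""
    wanted = query.strip().strip('"').casefold()

    def is_exact(curie: str) -> bool:
        folded = curie.casefold()
        return wanted == folded or wanted == local_id(folded)

    exact = [r for r in results if is_exact(r)]
    rest = [r for r in results if not is_exact(r)]
    return exact + rest


# Invented result list in which the desired entry is ranked low.
returned = ["ComplexPortal:CPX-100", "ComplexPortal:CPX-200", "ComplexPortal:CPX-566"]
print(promote_exact("CPX-566", returned))
```

This only helps when the desired entry is somewhere in the returned list, as in the "40th entity listed" case reported earlier.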

8 participants