Resolve Shex failures in MGI annotations due to invalid identifiers for binding input #79

Open
3 of 4 tasks
ukemi opened this issue Mar 20, 2020 · 38 comments

@ukemi

ukemi commented Mar 20, 2020

When annotations are imported into GO-CAM, binding annotations are transformed. If the annotation has an IPI evidence code, the value of the 'with' field is converted to a has_input for the GO-CAM binding function.
The rationale for this is that historically, curators captured the binding partners using IPI and the partner in the 'with' field.
Since the 'with' field was traditionally entered as free text in MGI annotations, the values were never confirmed to be valid identifiers with respect to the MGI GPI file. Over the years, many of those identifiers have been entered as UniProtKB ids to represent proteins. These identifiers fail the Shex because valid identifiers for proteins for MGI come from PRO.
I agree with @goodb that this should be fixed in the MGI source file rather than having @dustine32 manipulate the file downstream. This allows for MGI curators to check the validity of the identifier with respect to the gene objects at MGI.

  • 1. QC will be done at MGI to be sure that all the identifiers in the 'with' field for binding annotations are valid objects.
  • 2. If a 'with' field identifier at MGI corresponds to a mouse marker and is a UniProt id, it will be converted to a PRO id and we will check that they are in the GPI file as annotatable objects. This should allow all of these to pass the Shex.
  • 3. If the PRO identifiers for the UniProt IDs are not available or are not in the MGI GPI file, we will investigate this and either change the 'with' field appropriately or modify the GPI file.
  • 4. In some cases, the value of the 'with' field represents an object from another species because the experiment was done with multiple species. In these cases we do not want the value of the 'with' field to be converted to an input for the GO-CAM binding function. The annotation should be imported without conversion: it keeps the IPI evidence code, and the value stays in the 'with' field. Note that these annotations are valid in this format because, in this case, the 'with' field is also supporting the evidence for the binding function.
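
The QC described in items 1 to 3 can be sketched roughly as follows. This is only an illustration, not the MGI pipeline: the function name `classify_with_id`, the category labels, and the example GPI set are hypothetical, and it assumes the PR id for a mouse protein can be formed by swapping the `UniProtKB:` prefix for `PR:`, which may not hold for every protein.

```python
def classify_with_id(with_id, gpi_ids):
    """Classify one 'with' identifier against the ids listed in the GPI file.

    Returns a (category, identifier) pair, where category is one of:
      "valid"   - the id is already an annotatable object in the GPI file
      "convert" - a UniProtKB id whose PR counterpart is in the GPI file
      "review"  - anything else (non-mouse, no PR id, typo, ...)
    """
    if with_id in gpi_ids:
        return ("valid", with_id)
    if with_id.startswith("UniProtKB:"):
        # Assumption: the PR id reuses the UniProtKB accession.
        candidate = "PR:" + with_id.split(":", 1)[1]
        if candidate in gpi_ids:
            return ("convert", candidate)
    return ("review", with_id)


# Hypothetical GPI contents, for illustration only.
gpi_ids = {"MGI:MGI:1917115", "PR:Q91WT8"}

for example in ["PR:Q91WT8", "UniProtKB:Q91WT8", "UniProtKB:P04637"]:
    print(example, "->", classify_with_id(example, gpi_ids))
```

Anything that lands in the "review" bucket corresponds to items 3 and 4: it needs a curator to decide whether to fix the identifier, modify the GPI file, or leave the value in the 'with' field.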

In anticipation of a question I expect from @goodb: the reason we are not converting the annotations at source to make the 'with' fields inputs is that curators and users have informed us that they use the 'with' field values in legacy tools, and the change would break their procedures. Therefore we have decided to do the conversion so that the GO-CAM models are 'correct', and we will need to back-convert when we generate conventional annotations (GAF and GPAD) from the GO-CAM models.

@ukemi ukemi self-assigned this Mar 20, 2020
@ukemi
Author

ukemi commented Mar 20, 2020

A few numbers from @mdolanme
Number of binding annotations using IPI and the 'with' column (non-chebi): 3968
Number whose values are already in the GPI file: 77
Number whose values will be in the GPI file when we switch the 'UniProtKB:' prefix to 'PR:': 3111
Number that are either non-mouse, don't have a PRO id or have typos: 780
Number of these that are UniProt: 726/780
Number of these that are Refseq: 37/780
Number of these that are PRO ids: 9/780
Number of these that are typos: 7/780
Number of these that are EMBL: 1/780
Number of these UniProt ids that are definitely non-mouse: 304/726
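
As a quick sanity check, the counts above are internally consistent. The snippet below only restates the numbers from this comment; the variable names are made up for illustration.

```python
total = 3968                # binding annotations using IPI with a non-ChEBI 'with' value
already_in_gpi = 77         # values already in the GPI file
in_gpi_after_switch = 3111  # values in the GPI file after UniProtKB: -> PR:
needs_review = 780          # non-mouse, no PRO id, or typos

# Breakdown of the 780 values that need review.
breakdown = {"UniProt": 726, "RefSeq": 37, "PRO": 9, "typo": 7, "EMBL": 1}

# The three top-level buckets account for every annotation.
assert already_in_gpi + in_gpi_after_switch + needs_review == total
# The per-source breakdown accounts for every value needing review.
assert sum(breakdown.values()) == needs_review

review_fraction = needs_review / total
print(f"{review_fraction:.1%} of the binding annotations need manual review")
```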

@dustine32
Collaborator

non_mouse_uniprot_with_ids.txt
Attached this list from @ukemi of non-mouse UniProtKB identifiers that appear in with/from field in some MGI binding annotations. These are examples of task 4 in @ukemi's initial issue description, which should not be converted to has_input edges and instead live in the with/from field of the evidence individual.

@leemdi

leemdi commented Apr 29, 2020

04/29/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

@ukemi
Author

ukemi commented Apr 29, 2020

Hi @dustine32. In this version we have replaced the 'with' field values in binding annotations with PRO ids. They should now be valid for the Shex checks since they will be in our GPI file and therefore in NEO. @loricorbani replaced several thousand automatically and I replaced close to a thousand manually. I'd like to see how many models are still failing the Shex checks to determine where I have missed any. We can discuss in more detail on the GO-CAM specs call.

@dustine32
Collaborator

@ukemi @loricorbani Awesome! I tested a one-off model for MGI:MGI:1917115 with this binding GPAD line:

MGI     MGI:1917115     enables GO:0005515      MGI:MGI:5613094|PMID:24916387   ECO:0000353     PR:Q91WT8               20150211        MGI             contributor=http://orcid.org/0000-0001-7476-6306

And the with/from value PR:Q91WT8 at least appears to be in NEO since it resolves the ID to a label:
[image: screenshot showing PR:Q91WT8 resolved to a label]
I'll run the ShEx minerva validator on this model to see if it is valid. Currently having trouble getting my Noctua instance "Use reasoner" option to work so heads up in case you try it too.
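
For readers unfamiliar with the format, the GPAD line quoted above can be split into named fields. This is a minimal sketch, assuming the twelve-column, tab-separated GPAD 1.x layout (the tabs show up as runs of spaces in the line as quoted here). The `GPAD_COLUMNS` names and the `parse_gpad_line` helper are illustrative and not part of any GO tool.

```python
# Column names are informal labels chosen for this sketch.
GPAD_COLUMNS = [
    "db", "db_object_id", "qualifier", "go_id", "references",
    "evidence_code", "with_from", "interacting_taxon", "date",
    "assigned_by", "annotation_extension", "annotation_properties",
]


def parse_gpad_line(line):
    """Split one tab-separated GPAD 1.x line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GPAD_COLUMNS):
        raise ValueError(f"expected {len(GPAD_COLUMNS)} columns, got {len(values)}")
    return dict(zip(GPAD_COLUMNS, values))


# The line from the comment above, rebuilt with explicit tabs; the two empty
# strings stand for the blank interacting-taxon and annotation-extension columns.
line = "\t".join([
    "MGI", "MGI:1917115", "enables", "GO:0005515",
    "MGI:MGI:5613094|PMID:24916387", "ECO:0000353", "PR:Q91WT8", "",
    "20150211", "MGI", "",
    "contributor=http://orcid.org/0000-0001-7476-6306",
])

record = parse_gpad_line(line)
# ECO:0000353 is the ECO term that corresponds to the IPI evidence code.
print(record["go_id"], record["evidence_code"], record["with_from"])
```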

@goodb
Contributor

goodb commented Apr 29, 2020

@dustine32 let me know if you need help with the noctua/minerva reasoner situation. It's been in flux, so if you are running from a dev branch you may need to adjust some command line params. Should be stabilizing soon...

@dustine32
Collaborator

Thanks @goodb ! Yeah, I figured I would eventually need to hit you up on gitter or something to sort that out. I can probably attack that after the noon call today.

@ukemi
Author

ukemi commented May 15, 2020

Thanks @loricorbani. @dustine32, in this set of files we have changed the way we are retrieving PRO identifiers for proteins that correspond to MGI genes. I suggest that for further testing we use the gpi files supplied at this location.

@leemdi

leemdi commented May 22, 2020

05/22/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz
http://www.informatics.jax.org/downloads/custom/mgi.gpi.gz

These contain MGI/Production from 05/21.
Using PRO/GPI file.
With converted "contributor" values.

@ukemi
Author

ukemi commented May 22, 2020

Hi @dustine32. In this version I have fixed the binding inputs for all of the MFs with the IPI evidence code that you are converting to inputs. Once you have updated to the complete cell and anatomy ontology, could you run these through the Shex validation again and I will clean up any stragglers. There still may be a few. Thanks!!!

@dustine32
Collaborator

non_mouse_uniprot_with_ids.txt

Attaching newest list of non-mouse identifiers described above in #79 (comment).

@ukemi
Author

ukemi commented Jun 4, 2020

Hi @dustine32. I just noticed another issue in the models with respect to item 4. If you look at the model titled MGI:MGI:1929601, you will notice that in some cases, human proteins have been converted to inputs and are valid presumably because they are in Neo. These should be treated like the ones above #79 (comment). Clearly, my trying to find these manually is not an optimal strategy. Maybe we should switch to converting only identifiers that are found in the mouse GPI file to inputs and leaving all the rest in the 'with' field. This will get around the hand-built list in the comment above and the few more that I found today. I am pretty convinced that I have caught most of the 'true' errors that existed and those identifiers have all been corrected as either valid MGI identifiers or PR identifiers from the mouse GPI file. We can chat about this if it's not clear.
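
The rule proposed in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual import code: `split_with_values` and the example id sets are hypothetical, and how a 'with' field holding a mix of mouse and non-mouse ids should be handled is a guess here.

```python
IPI_ECO = "ECO:0000353"  # ECO term corresponding to the IPI evidence code


def split_with_values(evidence_code, with_ids, mouse_gpi_ids):
    """Decide which 'with' ids become has_input and which stay in 'with'.

    Only ids found in the mouse GPI file are converted to inputs; every
    other id is left in the with/from field of the evidence.
    Returns (has_input_ids, remaining_with_ids).
    """
    if evidence_code != IPI_ECO:
        # Not an IPI annotation: nothing is converted.
        return [], list(with_ids)
    has_input_ids = [i for i in with_ids if i in mouse_gpi_ids]
    remaining_with_ids = [i for i in with_ids if i not in mouse_gpi_ids]
    return has_input_ids, remaining_with_ids


# Hypothetical mouse GPI contents, for illustration only.
mouse_gpi_ids = {"PR:Q91WT8"}

print(split_with_values(IPI_ECO, ["PR:Q91WT8"], mouse_gpi_ids))
print(split_with_values(IPI_ECO, ["UniProtKB:P04637"], mouse_gpi_ids))
```

A membership test like this would replace the hand-built list of non-mouse identifiers: an id that is valid in Neo but absent from the mouse GPI file would simply stay in the 'with' field.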

@leemdi

leemdi commented Jun 8, 2020

06/08/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

@ukemi
Author

ukemi commented Jun 8, 2020

This version should have fixed the Shex errors.

@leemdi

leemdi commented Jun 9, 2020

06/09/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

@ukemi
Author

ukemi commented Jun 9, 2020

It should be very special. It should pass both the logic and Shex checks.

@leemdi

leemdi commented Jun 15, 2020

06/15/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

@ukemi
Author

ukemi commented Jun 15, 2020

Fixed a typo.

@leemdi

leemdi commented Jun 29, 2020

06/29/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

@leemdi

leemdi commented Aug 5, 2020

08/05/2020 : for dustin
This is GPI version 2 from MGI
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi2.gpi

@ukemi
Author

ukemi commented Aug 5, 2020

This version does not have values in column 8 yet and does not have protein complexes yet. Once we (GOC) decide exactly what is supposed to go into column 8, we (MGI) will populate it. Adding the complexes should also be straightforward and we can run a test with them at some point to make sure they are then available for curation in Noctua. Thanks @loricorbani !!!!!

@leemdi

leemdi commented Aug 12, 2020

08/12/2020 : for dustin
This is GPI version 2 from MGI
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi2.gpi

@ukemi
Author

ukemi commented Aug 12, 2020

This version has mouse protein complexes from PRO along with all the other changes from the previous version.

@leemdi

leemdi commented Sep 11, 2020

09/11/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi2.gpi
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

@ukemi
Author

ukemi commented Sep 11, 2020

This GPAD2.0 file has everything including all of the properties that we will use for the initial import. We can go over the details on the call next Tuesday.

@leemdi

leemdi commented Sep 17, 2020

09/17/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

http://www.informatics.jax.org/downloads/custom/mgi2.gpi

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

@ukemi
Author

ukemi commented Sep 17, 2020

This file has been filtered so it only contains the annotations made by MGI curators using the MGI editorial interface.

@ukemi
Author

ukemi commented Sep 17, 2020

See geneontology/go-site#1553

@leemdi

leemdi commented Sep 24, 2020

09/24/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

@ukemi
Author

ukemi commented Sep 24, 2020

Hi @dustine32 and @dougli1sqrd
In this version we fixed the bug where contributes_to (RO:0002326) wasn't being added to the file correctly.

@leemdi

leemdi commented Oct 7, 2020

10/07/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

@ukemi
Author

ukemi commented Oct 7, 2020

Note that both of these files are filtered for annotations made by MGI curators. The previous versions were filtered for annotations made by MGI, but included ones made automatically by our orthology pipeline.

@leemdi

leemdi commented Oct 23, 2020

10/23/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

@ukemi
Author

ukemi commented Oct 23, 2020

@dustine32 In this file I have (I hope) cleaned up all of the MGI annotations and @loricorbani has replaced all of the annotation extension relations with the new ones that we decided on. So this is a test for 'the real thing'.

@leemdi

leemdi commented Feb 25, 2021

New versions

!gaf-version: 2.2
!gpa-version: 2.0
!gpi-version: 2.0

David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/go_cam_mgi.gpad
http://www.informatics.jax.org/downloads/custom/go_cam_gene_association.mgi
http://www.informatics.jax.org/downloads/custom/mgi.gpi

@ukemi
Author

ukemi commented Feb 25, 2021

@dustine32. This version has fixed the bug where we were using commas instead of pipes. It should resolve a lot of the nesting issues that I saw with our annotations because those will now be separated. Thanks @loricorbani
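
To illustrate why the separator matters: in GAF/GPAD-style multi-valued fields, a pipe separates independent entries while a comma joins values that are meant to be read together, so a comma where a pipe was intended merges separate statements into one group. The comment does not say which column was affected, so the sketch below is generic; `parse_groups` is a hypothetical helper and the field values are placeholders, not real annotations.

```python
def parse_groups(field):
    """Split a multi-valued field into groups.

    Pipes separate independent groups; commas join members of one group.
    Returns a list of groups, each group being a list of member strings.
    """
    if not field:
        return []
    return [group.split(",") for group in field.split("|")]


# Placeholder values for illustration only.
with_pipes = "rel_a(X:1)|rel_b(Y:2)"
with_commas = "rel_a(X:1),rel_b(Y:2)"

print(parse_groups(with_pipes))   # two independent groups of one member each
print(parse_groups(with_commas))  # a single group with two members
```

With pipes, each entry can be imported on its own; with commas, an importer would treat both members as parts of one combined statement, which is one way the nesting described above could arise.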

@leemdi

leemdi commented Feb 25, 2021

I have installed an automated script that will generate/copy the go_cam files from our production database to:

http://www.informatics.jax.org/downloads/custom

on a daily basis. You can pick them up at your convenience/when you need fresh copies.


4 participants