Resolve Shex failures in MGI annotations due to invalid identifiers for binding input #79
Comments
A few numbers from @mdolanme |
non_mouse_uniprot_with_ids.txt |
04/29/2020 : for dustin |
Hi @dustine32. In this version we have replaced the 'with' field values in binding annotations with PRO ids. They should now be valid for the Shex checks, since they will be in our GPI file and therefore in NEO. @loricorbani replaced several thousand automatically and I replaced close to a thousand manually. I'd like to see how many models are still failing the Shex checks so I can determine where I have missed any. We can discuss in more detail on the GO-CAM specs call. |
@ukemi @loricorbani Awesome! I tested a one-off model for MGI:MGI:1917115 with this binding GPAD line:
And the |
@dustine32 let me know if you need help with the noctua/minerva reasoner situation. It's been in flux, so if you are running from a dev branch you may need to adjust some command line params. Should be stabilizing soon... |
Thanks @goodb ! Yeah, I figured I would eventually need to hit you up on gitter or something to sort that out. I can probably attack that after the noon call today. |
05/15/2020 : for dustin |
Thanks @loricorbani. @dustine32, in this set of files we have changed the way we are retrieving PRO identifiers for proteins that correspond to MGI genes. I suggest that for further testing we use the gpi files supplied at this location. |
05/22/2020 : for dustin These contain MGI/Production from 05/21. |
Hi @dustine32. In this version I have fixed the binding inputs for all of the MFs with the IPI evidence code that you are converting to inputs. Once you have updated to the complete cell and anatomy ontology, could you run these through the Shex validation again and I will clean up any stragglers. There still may be a few. Thanks!!! |
non_mouse_uniprot_with_ids.txt Attaching newest list of non-mouse identifiers described above in #79 (comment). |
Hi @dustine32. I just noticed another issue in the models with respect to item 4. If you look at the model titled MGI:MGI:1929601, you will notice that in some cases, human proteins have been converted to inputs and are valid presumably because they are in Neo. These should be treated like the ones above #79 (comment). Clearly, my trying to find these manually is not an optimal strategy. Maybe we should switch to converting only identifiers that are found in the mouse GPI file to inputs and leaving all the rest in the 'with' field. This will get around the hand-built list in the comment above and the few more that I found today. I am pretty convinced that I have caught most of the 'true' errors that existed and those identifiers have all been corrected as either valid MGI identifiers or PR identifiers from the mouse GPI file. We can chat about this if it's not clear. |
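The strategy proposed above (convert only identifiers found in the mouse GPI file to inputs and leave everything else in the 'with' field) could be sketched roughly as follows. This is a minimal illustration, not the actual import code: the function name `split_with_values` and all identifiers in the example are hypothetical, and it assumes the mouse GPI identifiers have already been loaded into a set.

```python
from typing import Iterable, List, Set, Tuple


def split_with_values(
    with_values: Iterable[str], mouse_gpi_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition 'with' values into (inputs, remaining_with).

    Values present in the mouse GPI set become inputs; everything else
    (for example non-mouse proteins) stays in the 'with' field.
    """
    inputs: List[str] = []
    remaining: List[str] = []
    for value in with_values:
        if value in mouse_gpi_ids:
            inputs.append(value)
        else:
            remaining.append(value)
    return inputs, remaining


# Made-up identifiers, for illustration only.
mouse_ids = {"MGI:MGI:0000001", "PR:000000001"}
inputs, remaining = split_with_values(
    ["PR:000000001", "UniProtKB:X00000"], mouse_ids
)
```

With this approach no hand-built list of non-mouse identifiers is needed, because anything absent from the mouse GPI set is left untouched by default.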
06/08/2020 : for dustin |
This version should have fixed the Shex errors. |
06/09/2020 : for dustin |
It should be very special. It should pass both the logic and Shex checks. |
06/15/2020 : for dustin |
Fixed a typo. |
06/29/2020 : for dustin |
08/05/2020 : for dustin |
This version does not have values in column 8 yet and does not have protein complexes yet. Once we (GOC) decide exactly what is supposed to go into column 8, we (MGI) will populate it. Adding the complexes should also be straightforward and we can run a test with them at some point to make sure they are then available for curation in Noctua. Thanks @loricorbani !!!!! |
08/12/2020 : for dustin |
This version has mouse protein complexes from PRO along with all the other changes from the previous version. |
09/11/2020 : for dustin |
This GPAD2.0 file has everything including all of the properties that we will use for the initial import. We can go over the details on the call next Tuesday. |
09/17/2020 : for dustin http://www.informatics.jax.org/downloads/custom/mgi2.gpi MGI-curated only |
This file has been filtered so it only contains the annotations made by MGI curators using the MGI editorial interface. |
09/24/2020 : for dustin MGI-curated only |
Hi @dustine32 and @dougli1sqrd |
10/07/2020 : for dustin MGI-curated only |
Note that both of these files are filtered for annotations made by MGI curators. The previous versions were filtered for annotations made by MGI, but included ones made automatically by our orthology pipeline. |
10/23/2020 : for dustin MGI-curated only |
@dustine32 In this file I have (I hope) cleaned up all of the MGI annotations and @loricorbani has replaced all of the annotation extension relations with the new ones that we decided on. So this is a test for 'the real thing'. |
New versions !gaf-version: 2.2 David H : add any comments about what is special about this version MGI-curated only |
@dustine32. This version has fixed the bug where we were using commas instead of pipes. It should resolve a lot of the nesting issues that I saw with our annotations because those will now be separated. Thanks @loricorbani |
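To illustrate why the separator matters, here is a small sketch. It assumes the affected field follows the usual GAF convention in which pipes separate independent entries and commas group values that belong together; the comment above does not say which column was affected, so that is an assumption, and `parse_grouped_field` is a hypothetical helper.

```python
from typing import List


def parse_grouped_field(value: str) -> List[List[str]]:
    """Split a field into pipe-separated groups of comma-joined values."""
    if not value:
        return []
    return [group.split(",") for group in value.split("|")]


# Made-up identifiers: a comma keeps both values in one group ...
comma_joined = parse_grouped_field("PR:000000001,PR:000000002")
# ... while a pipe yields two separate groups.
pipe_joined = parse_grouped_field("PR:000000001|PR:000000002")
```

Under that convention, a comma where a pipe was intended makes two unrelated values look like a single combined entry, which would be consistent with the nesting problems described.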
I have installed an automated script that will generate/copy the go_cam files from our production database to: http://www.informatics.jax.org/downloads/custom on a daily basis. You can pick them up at your convenience/when you need fresh copies. |
When annotations are imported into GO-CAM, binding annotations are transformed. If the annotation has an IPI evidence code, the value of the 'with' field is converted to a has_input for the GO-CAM binding function.
The rationale for this is that historically, curators captured the binding partners using IPI and the partner in the 'with' field.
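As a rough sketch of the transformation described above: the record type, the function `with_to_input`, and the example identifiers below are all hypothetical and heavily simplified (real GPAD lines carry more columns), and the caller is assumed to supply the set of GO terms that count as binding.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Annotation:
    """Hypothetical, simplified annotation record."""
    subject: str
    go_term: str
    evidence: str
    with_from: List[str] = field(default_factory=list)
    has_input: List[str] = field(default_factory=list)


def with_to_input(ann: Annotation, binding_terms: Set[str]) -> Annotation:
    """Move 'with' values to has_input for IPI binding annotations only."""
    if ann.evidence != "IPI" or ann.go_term not in binding_terms:
        return ann  # all other annotations are left unchanged
    return Annotation(
        subject=ann.subject,
        go_term=ann.go_term,
        evidence=ann.evidence,
        with_from=[],
        has_input=ann.has_input + ann.with_from,
    )


# GO:0005515 is protein binding; the gene and protein IDs are made up.
binding = {"GO:0005515"}
before = Annotation("MGI:MGI:0000001", "GO:0005515", "IPI", ["PR:000000001"])
after = with_to_input(before, binding)
```

The function returns a new record rather than mutating the original, so the conventional annotation stays available for comparison.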
Since the 'with' field was traditionally entered as free text in MGI annotations, the values were never confirmed to be valid identifiers with respect to the MGI GPI file. Over the years, many of those identifiers have been entered as UniProtKB ids to represent proteins. These identifiers fail the Shex checks because valid identifiers for proteins for MGI come from PRO.
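A check along these lines could flag the offending values before import. This is only a sketch under stated assumptions: it assumes a tab-separated GPI layout with the database prefix in column 1 and the local identifier in column 2, with comment lines starting with `!`; the helper names and the sample rows are hypothetical.

```python
from typing import Iterable, List, Set


def load_gpi_ids(lines: Iterable[str]) -> Set[str]:
    """Collect 'DB:ID' identifiers from GPI lines (assumed layout)."""
    ids: Set[str] = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def invalid_with_values(with_values: Iterable[str], gpi_ids: Set[str]) -> List[str]:
    """Return the 'with' values that are not present in the GPI file."""
    return [value for value in with_values if value not in gpi_ids]


# Made-up GPI rows, for illustration only.
sample_gpi = [
    "!gpi-version: 1.2\n",
    "PR\t000000001\tSym1\tname one\t\tprotein\ttaxon:10090\n",
    "MGI\tMGI:0000001\tSym2\tname two\t\tgene\ttaxon:10090\n",
]
gpi_ids = load_gpi_ids(sample_gpi)
bad = invalid_with_values(["PR:000000001", "UniProtKB:X00000"], gpi_ids)
```

Here `bad` would list the UniProtKB-style value, which is the kind of identifier that needs to be replaced with a PRO id at the source.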
I agree with @goodb that this should be fixed in the MGI source file rather than having @dustine32 manipulate the file downstream. This allows for MGI curators to check the validity of the identifier with respect to the gene objects at MGI.
In anticipation of a question I expect from @goodb: the reason we are not converting the annotations at source to make the 'with' fields inputs is that curators and users have informed us that they use the 'with' field values in legacy tools, and the change would break their procedures. Therefore we have decided to convert on import to make the GO-CAM models 'correct', and we will need to back-convert when we generate conventional annotations (gafs and gpad) from the GO-CAM models.