Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After refactoring the subclass of sync, a lot of changes #8432

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

matentzn
Copy link
Member

@matentzn matentzn commented Nov 21, 2024

Build PR for #8431

I eyeballed some of the, and they seem correct to me.. To review this:

  1. Check out this file: https://raw.githubusercontent.com/monarch-initiative/mondo-ingest/main/src/ontology/reports/sync-subClassOf.confirmed.tsv (this is what is merged into mondo-edit)
  2. Confirm that the changes reflect the changes you see to the edit file
  3. Confirm that the changes to the sync-subClassOf.confirmed.tsv are intended

The most important is (3), as I already did 1 and a few spotchecks re 2.

I suggest @joeflack4 you take a look here is well.

I eyeballed some of the, and they seem correct to me.. To review this:

1. Check out this file: https://raw.githubusercontent.com/monarch-initiative/mondo-ingest/main/src/ontology/reports/sync-subClassOf.confirmed.tsv (this is what is merged into mondo-edit)
2. Confirm that the changes reflect the changes you see to the edit file
3. Confirm that the changes to the sync-subClassOf.confirmed.tsv are intended

The most important is (3), as I already did 1 and a few spotchecks re 2.
@twhetzel
Copy link
Collaborator

twhetzel commented Nov 22, 2024

I'm not sure I am looking at this in the "right" direction. For example, for MONDO:0000022 it has these changes:

- is_a: MONDO:0024290 {source="NCIT:C118172", source="icd11.foundation:1048673005"} ! enuresis
+ is_a: MONDO:0024290 {source="icd11.foundation:1048673005"} ! enuresis

So in Mondo MONDO:0000022 'nocturnal enuresis' is a subclassof MONDO:0024290 'enuresis'. The Mondo term has xref: icd11.foundation:1048673005 {source="MONDO:equivalentTo"} and icd11.foundation has this same subclassof relation between 'nocturnal enuresis' and 'enuresis' and is a source for subclassof relationship in Mondo. But the Mondo term also has xref: NCIT:C118172 {source="MONDO:equivalentTo"} and the subclassof relationship between between 'nocturnal enuresis' and 'enuresis' exists in NCIT, but NCIT:C118172 was removed as a source for the subclassof relationship in Mondo.

Also, many of the subclassof sources that were removed are from DOID and NCIT. When I looked in the mondo-ingest repo at tmp/component-download-ncit.owl.owl I could find 'nocturnal enuresis' and 'enuresis', but when I looked at components/ncit.owl I could not find either class 'nocturnal enuresis' and 'enuresis'.

@matentzn
Copy link
Member Author

matentzn commented Nov 22, 2024

Also, many of the subclassof sources that were removed are from DOID and NCIT. When I looked in the mondo-ingest repo at tmp/component-download-ncit.owl.owl I could find 'nocturnal enuresis' and 'enuresis', but when I looked at components/ncit.owl I could not find either class 'nocturnal enuresis' and 'enuresis'

At first I found this concerning, but then I checked our specs and they clearly state that we only ever sync the neoplasm branch with NCIT. enuresis is in the psychiatric disorder branch..

The DO is more interesting/concerning:

[Term]
id: MONDO:0000248
name: dengue shock syndrome
xref: DOID:0050125 {source="MONDO:equivalentTo"}
is_a: MONDO:0005502 {source="DOID:0050125", source="MONDO:Redundant"} ! dengue disease
[Term]
id: MONDO:0005502
name: dengue disease
xref: DOID:12205 {source="MONDO:equivalentTo", source="EFO:0005547"}

And in DO:

<owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0050125">
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_12205"/>
    <rdfs:label xml:lang="en">dengue shock syndrome</rdfs:label>
</owl:Class>

The confirmed subclass table is clearly missing this, only containing these two entries related to dengue:

MONDO:0005358	Dengue hemorrhagic fever	MONDO:0005502	DOID:12206	DOID:12205	dengue disease
MONDO:0000259	asymptomatic dengue	MONDO:0005502	DOID:0050143	DOID:12205	dengue disease

In my opinion its a bug in the subclass sync, one that we didn't notice because my previous subclass pipeline didn't delete all the evidence..

So, I think we need to:

let me know what you think

@twhetzel
Copy link
Collaborator

I added the ticket monarch-initiative/mondo-ingest#708, unclear what availability @joeflack4 has to fix this as an urgent priority. No, we're generally not ditching the develop branch. BUT in this specific case for the December release may need to make some careful adjustments to get these bug fixes addressed.

@twhetzel twhetzel marked this pull request as draft November 26, 2024 17:15
@twhetzel
Copy link
Collaborator

Converting to Draft until this is fixed monarch-initiative/mondo-ingest#708

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General review

I spot checked some things. Looks like I'd expect; the deletions and lack thereof.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checking the is_a source

Just using the example case from monarch-initiative/mondo-ingest#708, even though that is an direct-source-indirect-mondo and not part of this PR.

subject_mondo_id subject_mondo_label object_mondo_id subject_source_id object_source_id object_mondo_label
ID SC % >A oboInOwl:source
MONDO:0000248 dengue shock syndrome MONDO:0005502 DOID:0050125 DOID:12205 dengue disease
id: MONDO:0000248
name: dengue shock syndrome
xref: DOID:0050125 {source="MONDO:equivalentTo"}
is_a: MONDO:0005502 {source="DOID:0050125", source="MONDO:Redundant"} ! dengue disease

I'm pretty sure source="DOID:0050125" is correct but I just wanted to double check that this shouldn't be `source="DOID:12205", because in a way I feel like it makes sense to express the evidence that way as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-added?

The PR title just says "subclass sync", but your comments are only about -confirmed, so that's what I checked. Have we not implemented -added for the subclass sync, or has it just not been run for this PR?

Copy link
Collaborator

@joeflack4 joeflack4 Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirm that the changes to the sync-subClassOf.confirmed.tsv are intended

@matentzn What are you asking precisely? I don't think you mean to the TSV; Like I don't think you're asking me to go to mondo-ingest and look at the diff of that file over time.

It sounds like what you're asking is that the changes by the TSV are intended.

By intended, it sounds like you want to take this time to double check the mondo-ingest subclass sync pipeline itself; to spot check a few cases where we can match up what we'd expect to see appear in the -confirmed TSV, and then check to see if those entries do indeed appear in the TSV. Is that what you are asking?

I don't really have any doubts about this, but I did do a couple checks:

  • Checked a random SCR that wasn't removed by this PR, verified the matches in mondo.sssom.tsv exist, and checked that the component .owl does corroborate the SCR.
  • Checked a couple cases like this: A deletion, (MONDO:0017626 is_a: MONDO:0018100 {source="Orphanet:306522"} ! familial primary hypomagnesemia. Checked and saw that that MONDO:0018100 has no Orphanet mapping in mondo.sssom.tsv, so makes sense that this one was removed.
  • I wanted to check a case like: Mondo is_a evidence for a source was removed, and exist in the mondo.sssom.tsv, but it was only removed because the SCR didn't exist in the source, but it's best if I don't spend time digging around for such a case unless requested.

@twhetzel
Copy link
Collaborator

twhetzel commented Jan 18, 2025

Following up from the Tech call and analysis of the missing subclassOf source annotations from NCIT (https://github.com/monarch-initiative/mondo-curation-analysis/tree/main/mondo-subclass-sync-analysis), the decision was to add the source provenance of MONDO:notVerified. I also decided to keep the existing source provenance as well since multiple provenance annotations can be maintained and if this information is ever curated it's easier to remove the MONDO:notVerified after verification of the existing subclass provenance than start from scratch to find the provenance. That decision can be easily changed by removing from the ROBOT header from the file here, header column ncitSource2. Also, this currently static file can be moved into mondo-ingest, for time sake it was easier to pull from mondo-curation-analysis.

I ran the subclass sync locally as sh run.sh make update-subclass-sync -B.

Here are some data check results:
Regex pattern to find a subclassOf relationship with no provenance:
grep -c -E 'is_a: MONDO:[0-9]+ ! ' mondo-edit.obo
--> before fix for missing NCIT non-neoplasm sources there are 6106
--> after NCIT fix: 5287

Regex pattern to find subclassOf relationship with provenance:
grep -c -E 'is_a: MONDO:[0-9]+ \{source=.*?\} ! ' mondo-edit.obo
--> before fix for missing NCIT non-neoplasm source there are 33621
--> after NCIT fix: 34442

These number changes match the content of ncit_subclass_provenance_robot_df.tsv.

@matentzn can you let me know if you prefer any changes to this?

@twhetzel twhetzel marked this pull request as ready for review January 18, 2025 02:19
@twhetzel twhetzel marked this pull request as draft January 18, 2025 07:17
@matentzn
Copy link
Member Author

@matentzn can you let me know if you prefer any changes to this?

Thank you @twhetzel for the NCIT stuff this is fine, and indeed the most correct way; thanks for going the extra mile!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants