Generate Overall Mondo Stats #8570

Open
wants to merge 11 commits into
base: master
Conversation

@twhetzel twhetzel commented Jan 14, 2025

Generate Mondo Stats as requested here. See the README for stats and how they were generated.

@sabrinatoro As future work, it would be helpful to know if any of these stats should be generated as report(s) for the Mondo releases. This was mentioned at the very end of the Tech call.

@twhetzel twhetzel marked this pull request as draft January 14, 2025 18:15
@twhetzel twhetzel requested review from matentzn and removed request for nicolevasilevsky January 15, 2025 01:23
@twhetzel twhetzel marked this pull request as ready for review January 15, 2025 01:25
Collaborator

@sabrinatoro sabrinatoro left a comment

@twhetzel
Thank you for these statistics.
I added a few comments, mostly to get confirmation on a few points, though there are a few questions too.

Would it be possible to get these numbers in a table, please? It takes careful reading to make sure we understand what the numbers refer to. (If that is too complicated, we can make do with the current format for now.)

- ICD10CM 1116
- NCIT 3452
- OMIM 9141
- Orphanet 8176
Collaborator

For Orphanet, we should limit our analysis to the Orphanet disease and disease subtypes.
I don't think this is the case here (based on how many are not reported to be in Mondo).

Collaborator Author

I added another query for Orphanet to get the count limited to only ordo_disorder and ordo_subtype_of_a_disorder subsets. I am not sure what you mean by "I don't think this is the case here (based on how many are not reported to be in Mondo)".
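
To make the intended restriction concrete, here is a minimal Python sketch of counting only Orphanet terms in the `ordo_disorder` and `ordo_subtype_of_a_disorder` subsets. It is not the query from this PR: the function name, the input shapes, the toy identifiers, and the out-of-scope subset label are all assumptions for illustration.

```python
# Subsets named in the comment above; everything else here is hypothetical.
ALLOWED_SUBSETS = {"ordo_disorder", "ordo_subtype_of_a_disorder"}


def orphanet_coverage(source_subsets, mondo_xrefs):
    """Return (in_scope, in_scope_and_mapped) for Orphanet terms.

    source_subsets: dict mapping an Orphanet term id to its set of subset names.
    mondo_xrefs: iterable of Orphanet ids that Mondo terms cross-reference.
    """
    # Keep only terms that belong to at least one allowed subset.
    in_scope = {
        term for term, subsets in source_subsets.items()
        if subsets & ALLOWED_SUBSETS
    }
    # Of those, keep the ones Mondo points to.
    mapped = in_scope & set(mondo_xrefs)
    return len(in_scope), len(mapped)


# Toy data with made-up identifiers.
source_subsets = {
    "Orphanet:1": {"ordo_disorder"},
    "Orphanet:2": {"ordo_subtype_of_a_disorder"},
    "Orphanet:3": {"some_other_subset"},  # out of scope, ignored
    "Orphanet:4": {"ordo_disorder"},
}
mondo_xrefs = ["Orphanet:1", "Orphanet:3", "Orphanet:4"]

print(orphanet_coverage(source_subsets, mondo_xrefs))  # (3, 2)
```

Restricting the source side first keeps out-of-scope Orphanet entries from inflating the "not in Mondo" number.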

- OMIM
- Orphanet
- ICD10CM
- UMLS (as provided by MedGen)
Collaborator

How do we count the UMLS?
Counting how many are given to MedGen is not representative of how many are in the source itself.

Collaborator Author

@twhetzel twhetzel Jan 17, 2025

It covers xrefs whose value is from UMLS. I don't recall why my notes from my 1:1 with Nico mention MedGen here. All of the information on how the data was generated and which query was run is included in this PR.

- DOID
- NCIT (neoplasm branch) --> special processing needed
- OMIM
- Orphanet
Collaborator

Is this the "whole" orphanet or focused on the Orphanet disease and disease subtypes?
We should count only the ones that are in the Orphanet disease and disease subtypes.

Collaborator Author

Good catch, it is currently all of Orphanet.

Collaborator Author

It is updated now.

- GARD
- MedGen
- NANDO
- NORD
Collaborator

Shouldn't UMLS be included here?

Collaborator Author

Sure, this might be a better place since that data is in the EMC pipeline. However, UMLS was listed as a Source twice in the document. Was there a reason for that?

1071
?cancerClassCount
4710
?hereditaryClassCount
Collaborator

How was hereditary class counted?

Collaborator Author

children of MONDO_0003847

15641
?infectiousClassCount
1071
?cancerClassCount
Collaborator

How was the cancer class counted?

Collaborator Author

children of MONDO_0045024

?rareClassCount
15641
?infectiousClassCount
1071
Collaborator

How was the infectious class counted?

Collaborator Author

children of MONDO_0005550. I updated the README to highlight where these queries can be found.
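
For readers unfamiliar with this kind of count, the following is a rough Python sketch of counting the classes below a root term. It assumes that "children" means all descendants via transitive subclass links, which the thread does not confirm; the function name, the edge list, and the `EX:` identifiers are made up and this is not the query used in the PR.

```python
from collections import defaultdict


def count_descendants(subclass_edges, root):
    """Count distinct classes below `root`.

    subclass_edges: iterable of (child, parent) pairs.
    Assumption: descendants are followed transitively, not just direct children.
    """
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)

    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:  # a class with several parents is counted once
                seen.add(child)
                stack.append(child)
    seen.discard(root)  # guard against a cycle leading back to the root
    return len(seen)


# Toy hierarchy with made-up identifiers.
edges = [
    ("EX:2", "EX:1"),
    ("EX:3", "EX:1"),
    ("EX:4", "EX:2"),
    ("EX:4", "EX:3"),  # EX:4 has two parents
    ("EX:5", "EX:9"),  # outside the EX:1 branch
]

print(count_descendants(edges, "EX:1"))  # 3
```

Whether the count uses direct children or all descendants, and the asserted or the inferred hierarchy, changes the numbers, so it is worth stating next to each reported count.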

?broadSynonymCount
1385
?rareClassCount
15641
Collaborator

How was rare counted?

Collaborator

If this is the number of diseases in the rare subset, it is OK.

Collaborator Author

children of MONDO_0000001 in the rare subset

These can be generated as: `sh run.sh make create-mondo-stats`.

#### Results - General Statistics
All Mondo Stats created on: Tue Jan 14 05:02:59 UTC 2025
Collaborator

Can you please confirm that these are for human diseases?

Collaborator Author

For the section Request 4 - general statistics the doc said "Summary statistics across all Mondo concepts" and did not mention this should be limited to 'human disease'. These are counts for all children of MONDO_0000001.


#### Results - Number of Sources for Synonyms
| Number of Sources | Count of Synonyms |
| --- | --- |
| 0 | 12389 |

Collaborator

Does it mean that we have 12389 synonyms that do not have any provenance/x-ref?

Collaborator Author

@twhetzel twhetzel Jan 17, 2025

Yes, there are 12389 exact synonyms without provenance/xref for Mondo terms that are children of 'human disease'.
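
As an illustration of how a "Number of Sources" breakdown like this could be tallied, here is a small Python sketch. It assumes each synonym carries a list of xref CURIEs and that a "source" is the prefix before the colon; the thread does not say whether the README counts distinct prefixes or individual xrefs. The function name and the data are hypothetical.

```python
from collections import Counter


def synonyms_by_source_count(synonyms):
    """Tally synonyms by how many distinct source prefixes back them.

    synonyms: iterable of (term_id, synonym_text, xrefs), where xrefs is a
    list of CURIE strings such as "SRC_A:1".
    """
    tally = Counter()
    for _term, _text, xrefs in synonyms:
        prefixes = {xref.split(":", 1)[0] for xref in xrefs}
        tally[len(prefixes)] += 1
    return tally


# Toy synonyms with made-up identifiers and sources.
synonyms = [
    ("EX:1", "foo disease", []),                       # no provenance
    ("EX:1", "foo syndrome", ["SRC_A:1"]),
    ("EX:2", "bar disease", ["SRC_A:2", "SRC_B:7"]),
    ("EX:3", "baz disease", []),                       # no provenance
]

histogram = synonyms_by_source_count(synonyms)
for n_sources in sorted(histogram):
    print(n_sources, histogram[n_sources])
```

The row for 0 sources is the one discussed above: synonyms with no provenance at all.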

@sabrinatoro
Collaborator

@twhetzel, I think it would be fantastic if these numbers could be generated and updated in our document with each release.

- NANDO
- NORD

#### Results for Externally Managed Content
Collaborator

Can you please confirm what these numbers are showing? Is this the number of Mondo terms that have an x-ref to GARD?

For MedGen, is that the number of terms that have an x-ref to MedGen? How about the number of UMLS x-refs that we get through MedGen?

For NANDO: is this the number of Mondo terms with at least one x-ref to NANDO, or the number of NANDO x-refs?

Collaborator Author

It is the count of Mondo terms from the inferred hierarchy that are children of 'human disease' that have an xref to each of these sources.
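
The two readings raised in the question (terms with at least one xref versus the number of xrefs) can give different totals. A minimal Python sketch of both, using made-up data and a hypothetical function name, and ignoring the 'human disease' and inferred-hierarchy restriction for brevity:

```python
from collections import defaultdict


def xref_counts(term_xrefs):
    """Return {prefix: (terms_with_xref, distinct_xrefs)}.

    term_xrefs: dict mapping a term id to its list of xref CURIEs.
    """
    terms = defaultdict(set)
    xrefs = defaultdict(set)
    for term, refs in term_xrefs.items():
        for ref in refs:
            prefix = ref.split(":", 1)[0]
            terms[prefix].add(term)  # term counted once per source
            xrefs[prefix].add(ref)   # each distinct xref counted once
    return {prefix: (len(terms[prefix]), len(xrefs[prefix])) for prefix in terms}


# Toy data: identifiers are made up; only the prefixes come from the thread.
term_xrefs = {
    "EX:1": ["GARD:1", "NANDO:10", "NANDO:11"],  # two NANDO xrefs on one term
    "EX:2": ["GARD:2"],
    "EX:3": ["NANDO:12"],
    "EX:4": [],
}

for prefix, (n_terms, n_xrefs) in sorted(xref_counts(term_xrefs).items()):
    print(prefix, n_terms, n_xrefs)
```

In this toy data NANDO has 2 terms but 3 xrefs, which is why the README should say which of the two is being reported.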

Collaborator

@sabrinatoro sabrinatoro left a comment

one more comment

@twhetzel
Collaborator Author

@sabrinatoro I believe I have addressed all comments; let me know if additional query changes are needed. Let's talk more about which of these stats would be useful to include for the monthly Mondo releases.

@twhetzel twhetzel requested a review from sabrinatoro January 25, 2025 16:10
2 participants