Generate Overall Mondo Stats #8570
@twhetzel
Thank you for these statistics.
I added a few comments, mostly to get confirmation on a few points, though there are a few questions too.
Would it be possible to get these numbers in a table, please? It takes careful reading to make sure we understand what the numbers refer to. (If that is too complicated, we can work with the current format for now.)
- ICD10CM 1116
- NCIT 3452
- OMIM 9141
- Orphanet 8176
For Orphanet, we should limit our analysis to the Orphanet disease and disease subtypes.
I don't think this is the case here (based on how many are not reported to be in Mondo).
I added another query for Orphanet to get the count limited to only the `ordo_disorder` and `ordo_subtype_of_a_disorder` subsets. I am not sure what you mean by "I don't think this is the case here (based on how many are not reported to be in Mondo)".
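For readers following the thread, here is a minimal sketch of that subset restriction. It assumes each record carries the subset labels attached to it; the record layout, the function name, and the toy identifiers are hypothetical, and the actual count comes from the query in this PR.

```python
# Subset names taken from the comment above.
ORDO_SUBSETS = {"ordo_disorder", "ordo_subtype_of_a_disorder"}


def count_in_subsets(terms, allowed=ORDO_SUBSETS):
    """Count records that belong to at least one of the allowed subsets."""
    return sum(1 for term in terms if allowed & set(term["subsets"]))


# Hypothetical records, not real Orphanet or Mondo data.
toy_terms = [
    {"id": "EX:1", "subsets": ["ordo_disorder"]},
    {"id": "EX:2", "subsets": ["ordo_subtype_of_a_disorder", "other_subset"]},
    {"id": "EX:3", "subsets": ["other_subset"]},
    {"id": "EX:4", "subsets": []},
]

print(count_in_subsets(toy_terms))
```

Records with no subset label, or only unrelated labels, drop out of the count, which is the effect the restricted query is meant to have.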
- OMIM
- Orphanet
- ICD10CM
- UMLS (as provided by MedGen)
How do we count the UMLS?
Counting how many are given to MedGen is not representative of how many are in the source itself.
It counts the cases where the xref value is from UMLS. I don't recall why my notes from the 1:1 with Nico mention MedGen here. All of the information on how the data was generated and which query was run is included in this PR.
- DOID
- NCIT (neoplasm branch) --> special processing needed
- OMIM
- Orphanet
Is this the "whole" Orphanet or focused on the Orphanet disease and disease subtypes?
We should count only the ones that are in the Orphanet disease and disease subtypes.
Good catch, it is currently all of Orphanet.
It is updated now.
- GARD
- MedGen
- NANDO
- NORD
Shouldn't UMLS be included here?
Sure, this might be a better place since that data is in the EMC pipeline. However, UMLS was listed as a Source twice in the document. Was there a reason for that?
1071
?cancerClassCount
4710
?hereditaryClassCount
How was hereditary class counted?
children of MONDO_0003847
15641
?infectiousClassCount
1071
?cancerClassCount
how was cancer class counted?
children of MONDO_0045024
?rareClassCount
15641
?infectiousClassCount
1071
how was infectious class counted?
children of MONDO_0005550. I updated the README to highlight where these queries can be found.
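The hereditary, cancer, and infectious counts above are all described as "children of" a given Mondo class. As a rough illustration only, the sketch below counts every distinct class under a root, assuming "children" means all descendants in the hierarchy rather than direct children only. The function name and the toy edges are hypothetical; the real numbers come from the queries referenced in the README.

```python
from collections import defaultdict, deque


def count_descendants(edges, root):
    """Count distinct classes reachable from root via child links (root excluded)."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)

    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen and child != root:
                seen.add(child)
                queue.append(child)
    return len(seen)


# Hypothetical (child, parent) subclass edges, not real Mondo data.
toy_edges = [
    ("EX:2", "EX:1"),
    ("EX:3", "EX:1"),
    ("EX:4", "EX:2"),
    ("EX:4", "EX:3"),  # a class with two parents is still counted once
]

print(count_descendants(toy_edges, "EX:1"))
```

Tracking visited classes in a set matters here because a disease class can sit under more than one parent, and a naive sum over branches would count it twice.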
?broadSynonymCount
1385
?rareClassCount
15641
How was rare counted?
if these are the number of diseases in the rare subset, it is ok.
children of MONDO_0000001 in the `rare` subset
These can be generated as: `sh run.sh make create-mondo-stats`.

#### Results - General Statistics
All Mondo Stats created on: Tue Jan 14 05:02:59 UTC 2025
can you please confirm that these are for human diseases?
For the section "Request 4 - general statistics", the doc said "Summary statistics across all Mondo concepts" and did not mention this should be limited to 'human disease'. These are counts for all children of MONDO_0000001.
#### Results - Number of Sources for Synonyms
Number of Sources Count of Synonyms
0 12389
Does it mean that we have 12389 synonyms that do not have any provenance/x-ref?
yes, there are 12389 exact synonyms without provenance/xref for Mondo terms that are children of 'human disease'
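To make the "Number of Sources" breakdown concrete, here is a small sketch of how such a histogram could be computed, assuming each exact synonym is available as a record with a list of its provenance xrefs. The record layout, function name, and toy values are hypothetical and do not reproduce the query used in this PR.

```python
from collections import Counter


def synonyms_by_source_count(synonyms):
    """Map number of distinct sources -> number of synonyms with that many sources."""
    return Counter(len(set(syn["sources"])) for syn in synonyms)


# Hypothetical synonym records, not real Mondo data.
toy_synonyms = [
    {"text": "example synonym a", "sources": []},
    {"text": "example synonym b", "sources": ["EX:1"]},
    {"text": "example synonym c", "sources": ["EX:1", "EX:2"]},
    {"text": "example synonym d", "sources": []},
]

histogram = synonyms_by_source_count(toy_synonyms)
for n_sources in sorted(histogram):
    print(n_sources, histogram[n_sources])
```

The row for 0 sources corresponds to synonyms with no provenance at all, which is the case the question above asks about.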
@twhetzel, I think it would be fantastic if these numbers could be generated and updated in our document with each release.
- NANDO
- NORD

#### Results for Externally Managed Content
Can you please confirm what these numbers are showing? Is this the number of Mondo terms that have an x-ref to GARD?
For MedGen, is that the number of terms that have an x-ref to MedGen? How about the number of UMLS x-refs that we get through MedGen?
For NANDO: is this the number of Mondo terms with at least one x-ref to NANDO, or the number of NANDO x-refs?
It is the count of Mondo terms from the inferred hierarchy that are children of 'human disease' that have an xref to each of these sources.
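The two readings raised in the question (terms with at least one xref versus the number of xrefs) can give different numbers. The sketch below shows the difference on toy data, assuming xrefs are available as (term, CURIE) pairs and that the source is the CURIE prefix. The function name, prefixes, and identifiers are hypothetical.

```python
from collections import defaultdict


def xref_stats(xrefs):
    """Return {prefix: (distinct terms with an xref, distinct xref targets)}."""
    terms = defaultdict(set)
    targets = defaultdict(set)
    for term, xref in xrefs:
        prefix = xref.split(":", 1)[0]
        terms[prefix].add(term)
        targets[prefix].add(xref)
    return {prefix: (len(terms[prefix]), len(targets[prefix])) for prefix in terms}


# Hypothetical (term, xref) pairs, not real Mondo data.
toy_xrefs = [
    ("EX:1", "SRC_A:10"),
    ("EX:1", "SRC_A:11"),
    ("EX:1", "SRC_A:12"),
    ("EX:2", "SRC_A:10"),
    ("EX:2", "SRC_B:5"),
]

for prefix, (n_terms, n_targets) in sorted(xref_stats(toy_xrefs).items()):
    print(prefix, n_terms, n_targets)
```

The first number in each pair matches the definition given in the reply above (terms that have an xref to the source); the second is the alternative reading.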
one more comment
@sabrinatoro I believe I have addressed all comments; let me know if additional query changes are needed. Let's talk more about which of these stats would be useful to include for the monthly Mondo releases.
Generate Mondo Stats as requested here. See the README for stats and how they were generated.
@sabrinatoro As future work, it would be helpful to know if any of these stats should be generated as report(s) for the Mondo releases. This was mentioned at the very end of the Tech call.