curategpt includes an annotate command that functions as traditional text annotation / concept recognition (CR). annotate has a --method option with values: inline, concept_list, two_pass. Under the hood it uses https://github.com/monarch-initiative/curate-gpt/blob/main/src/curate_gpt/agents/concept_recognition_agent.py

Give it some text, and it will give back the ontology term IDs. It's not guaranteed to find the spans, but that could be done as post-processing; the priority has been to find the concepts.
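(In the outputs below, start/end come back as null. A span-recovery post-processing step could be as simple as the sketch below; this is a hypothetical helper for illustration, not part of curategpt.)

```python
# Hypothetical post-processing: locate each returned concept mention in the
# original text to recover character offsets (curategpt itself returns
# start/end as null in the examples below).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    text: str
    concept_id: str
    start: Optional[int] = None
    end: Optional[int] = None

def find_spans(source_text: str, mentions: List[Span]) -> List[Span]:
    """Fill in start/end by searching for each mention (case-insensitive)."""
    lowered = source_text.lower()
    for span in mentions:
        pos = lowered.find(span.text.lower())
        if pos >= 0:
            span.start, span.end = pos, pos + len(span.text)
    return mentions
```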
We really need to (a) have better docstrings and (b) expose this via Sphinx... but for now this issue serves as temporary docs.
We'll use these texts as a running example:
A minimum diagnostic criterion is the combination of either the skin tumours
or multiple odontogenic keratocysts plus a positive family history for this disorder,
bifid ribs, lamellar calcification of the falx cerebri or any one of the skeletal
abnormalities typical of this syndrome

A clinical concept has been produced, with a diagnostic check list including
a genetic and a dermatological routine work up as well as a radiological survey
of the jaws and skeleton
And assume we have HPO pre-indexed using the standard curate-gpt loader.
inline
This is a RAG-based approach that finds the N most relevant concepts in the given ontology (pre-indexed in chromadb). It then presents this in the (system) prompt as a CSV of id, label pairs.
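Roughly, the retrieval half could look like the sketch below. This is illustrative only (the real logic lives in concept_recognition_agent.py); the collection name ont_hp, the stagedb path, and the metadata keys original_id and label are assumptions taken from the command-line examples later in this issue.

```python
# Illustrative sketch of the RAG step: query a pre-indexed chromadb collection
# for the N concepts nearest to the input text and format them as id,label CSV
# rows for the system prompt. Collection and metadata key names are assumptions.
import chromadb

def candidate_concepts_csv(text: str, n: int = 50,
                           db_path: str = "stagedb",
                           collection_name: str = "ont_hp") -> str:
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_collection(collection_name)
    results = collection.query(query_texts=[text], n_results=n,
                               include=["metadatas"])
    rows = []
    for meta in results["metadatas"][0]:
        # "original_id" / "label" are assumed metadata keys.
        rows.append(f'{meta.get("original_id")},{meta.get("label")}')
    return "\n".join(rows)
```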
This method is designed to return the annotated spans "inlined" into the existing text, via this prompt:
Your role is to annotate the supplied text with selected concepts.
return the original text with each conceptID in square brackets.
After the occurrence of that concept.
You can use synonyms. For example, if the concept list contains
zucchini // DB:12345
Then for the text 'I love courgettes!' you should return
'I love [courgettes DB:12345]!'
Always try and match the longest span.
the concept ID should come only from the list of candidate concepts supplied to you.
Ideally the system prompt will include DB:12345,courgettes, but the chance of this diminishes with the size of the input document and, to a lesser extent, the size of the ontology.

Example output from one run of:

curategpt annotate -M inline -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 50 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt
A minimum diagnostic criterion is the combination of either the skin
tumours or multiple [odontogenic keratocysts HP:0010603] plus a positive family
history for this disorder, [bifid ribs HP:0030280], [lamellar calcification of the
falx cerebri HP:0005462] or any one of the skeletal abnormalities typical of this
syndrome
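(As an aside, the bracketed annotations can be pulled back out of this inline format, and mapped to offsets in the unannotated text, with a small regex pass; parse_inline below is a hypothetical helper sketch, not part of curategpt.)

```python
# Hypothetical parser for the inline format '[mention CURIE]' shown above.
# Returns (mention, concept_id, start, end) tuples, with offsets computed
# against the original (unannotated) text.
import re

PATTERN = re.compile(r"\[([^\[\]]+?)\s+([A-Za-z]+:\d+)\]")

def parse_inline(annotated: str):
    spans = []
    removed = 0  # characters of markup stripped so far
    for m in PATTERN.finditer(annotated):
        mention, concept_id = m.group(1), m.group(2)
        start = m.start() - removed          # offset in the unannotated text
        end = start + len(mention)
        spans.append((mention, concept_id, start, end))
        # markup removed for this match: '[', ' CURIE', ']'
        removed += len(m.group(0)) - len(mention)
    return spans

print(parse_inline("I love [courgettes DB:12345]!"))
# [('courgettes', 'DB:12345', 7, 17)]
```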
That was an easy one, since the mentions in the text are more or less exact matches with HPO.
For the other text:
A clinical concept has been produced, with a diagnostic check list
including a genetic and a [dermatological routine work up HP:0001005] as well as
a radiological survey of the [jaws HP:0000347] and [skeleton HP:0033127].
Hmm. We can see the concepts here:
spans:
- text: dermatological routine work up
  start: null
  end: null
  concept_id: HP:0001005
  concept_label: Dermatological manifestations of systemic disorders
  is_suspect: false
- text: jaws
  start: null
  end: null
  concept_id: HP:0000347
  concept_label: Micrognathia
  is_suspect: false
- text: skeleton
  start: null
  end: null
  concept_id: HP:0033127
  concept_label: Abnormality of the musculoskeletal system
  is_suspect: false
So it's getting creative, and this is wrong: the actual phenotype in GG is jaw cysts, not small jaws...
concept list
This is similar to inline but doesn't attempt to relate the match to a span of text.
This is a RAG-based approach that finds the N most relevant concepts in the given ontology (pre-indexed in chromadb). It then presents this in the (system) prompt as a CSV of id, label pairs.
It uses the prompt:
Your role is to list all instances of the supplied candidate concepts in the supplied text.
Return the concept instances as a CSV of ID,label,text pairs, where the ID
is the concept ID, label is the concept label, and text is the mention of the
concept in the text.
The concept ID and label should come only from the list of candidate concepts supplied to you.
Only include a row if the meaning of the text section is that same as the concept.
If there are no instances of a concept in the text, return an empty string.
Do not include additional verbiage.
- text: '"odontogenic keratocysts"'start: nullend: nullconcept_id: HP:0010603concept_label: Odontogenic keratocysts of the jawis_suspect: false
- text: '"calcification of the falx cerebri"'start: nullend: nullconcept_id: HP:0005462concept_label: Calcification of falx cerebriis_suspect: false
- text: '"bifid ribs"'start: nullend: nullconcept_id: HP:0030280concept_label: Rib gapis_suspect: false
For the harder text it has lower recall, but nothing is IMO outright wrong:
spans:
- text: radiological survey of the jaws
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: radiological survey of the skeleton
  start: null
  end: null
  concept_id: HP:0033127
  concept_label: Abnormality of the musculoskeletal system
  is_suspect: false
two pass
This does a first pass where it asks for all concepts found in the doc to be listed (no RAG, very vanilla ChatGPT usage). These are requested as human-readable terms, not IDs, to limit hallucination. It asks for the concepts to be inlined in square brackets.
Then a second pass is done on each concept, essentially grounding them. The grounding DOES use the concept_list/inline RAG method above, but in theory this should be more accurate and include the relevant concepts within the retrieval cutoff, since we are just grounding a short mention rather than presenting the whole text.
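To make the flow concrete, here is a rough sketch of the two passes. This is illustrative only (the real code is in concept_recognition_agent.py); the first-pass prompt shown is a paraphrase, and llm_complete/ground are hypothetical stand-ins for the model call and the RAG grounding step.

```python
# Illustrative sketch of the two-pass flow, NOT the actual implementation
# (see concept_recognition_agent.py). llm_complete() and ground() are
# hypothetical stand-ins for the model call and the chromadb-backed grounding.
import re

FIRST_PASS_PROMPT = (
    "List all medical concepts in the text, inlining each mention "
    "in square brackets. Return the annotated text only.\n\nText: {text}"
)  # paraphrase, not the real first-pass prompt

def two_pass_annotate(text: str, llm_complete, ground) -> list:
    # Pass 1: ask the model to bracket concept mentions (no RAG, no IDs).
    annotated = llm_complete(FIRST_PASS_PROMPT.format(text=text))
    mentions = re.findall(r"\[([^\[\]]+)\]", annotated)
    # Pass 2: ground each mention individually against retrieved candidates,
    # e.g. using the grounding prompt shown below.
    spans = []
    for mention in mentions:
        concept = ground(mention)  # returns e.g. {"concept_id": ..., "concept_label": ...}
        if concept:
            spans.append({"text": mention, **concept})
    return spans
```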
The grounding prompt is:
Your role is to assign a concept ID that best matches the supplied text, using
the supplied list of candidate concepts.
Return as a string "CONCEPT NAME // CONCEPT ID".
Only return a result if the input text represents the same or equivalent
concept, in the provided context.
If there is no match, return an empty string.
(We can use a lower value for -l, as for grounding the top 5 is likely to include the right concept [untested].)

Let's see how it does on the easier text:

curategpt annotate -M two_pass -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 5 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt
annotated_text: A minimum diagnostic criterion is the combination of either the [skin tumours] or multiple [odontogenic keratocysts] plus a positive family history for this disorder, [bifid ribs], [lamellar calcification of the falx cerebri] or any one of the [skeletal abnormalities] typical of this syndrome
spans:
- text: skin tumours
  start: null
  end: null
  concept_id: HP:0008069
  concept_label: Neoplasm of the skin
  is_suspect: false
- text: odontogenic keratocysts
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: bifid ribs
  start: null
  end: null
  concept_id: HP:0000892
  concept_label: Bifid ribs
  is_suspect: false
- text: lamellar calcification of the falx cerebri
  start: null
  end: null
  concept_id: HP:0005462
  concept_label: Calcification of falx cerebri
  is_suspect: false
- text: skeletal abnormalities
  start: null
  end: null
  concept_id: HP:0011842
  concept_label: Abnormal skeletal morphology
  is_suspect: false
And the harder one:
annotated_text: A clinical concept has been produced, with a diagnostic check list including a [genetic] and a [dermatological] routine work up as well as a [radiological] survey of the [jaws] and [skeleton].
spans:
- text: jaws
  start: null
  end: null
  concept_id: HP:0012802
  concept_label: Broad jaw
  is_suspect: false
- text: skeleton
  start: null
  end: null
  concept_id: C0037253
  concept_label: null
  is_suspect: false