curategpt includes an annotate command that functions as traditional text annotation / concept recognition (CR). annotate has a --method option with values: inline, concept_list, two_pass. Under the hood it uses https://github.com/monarch-initiative/curate-gpt/blob/main/src/curate_gpt/agents/concept_recognition_agent.py

Give it some text, and it will give back the ontology term IDs. It's not guaranteed to find the spans, but that could be done as post-processing; the priority has been to find the concepts.
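(In the outputs below, start/end come back as null. A span-recovery post-processing step could be as simple as the sketch below; this is a hypothetical helper for illustration, not part of curategpt.)

```python
# Hypothetical post-processing: locate each returned concept mention in the
# original text to recover character offsets (curategpt itself returns
# start/end as null in the examples below).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    text: str
    concept_id: str
    start: Optional[int] = None
    end: Optional[int] = None

def find_spans(source_text: str, mentions: List[Span]) -> List[Span]:
    """Fill in start/end by searching for each mention (case-insensitive)."""
    lowered = source_text.lower()
    for span in mentions:
        pos = lowered.find(span.text.lower())
        if pos >= 0:
            span.start, span.end = pos, pos + len(span.text)
    return mentions
```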
We really need to (a) have better docstrings and (b) expose this via Sphinx... but for now this issue serves as temporary docs.
We'll use these texts as a running example:
A minimum diagnostic criterion is the combination of either the skin tumours
or multiple odontogenic keratocysts plus a positive family history for this disorder,
bifid ribs, lamellar calcification of the falx cerebri or any one of the skeletal
abnormalities typical of this syndrome

A clinical concept has been produced, with a diagnostic check list including
a genetic and a dermatological routine work up as well as a radiological survey
of the jaws and skeleton
And assume we have HPO pre-indexed using the standard curate-gpt loader.
inline
This is a RAG-based approach that finds the N most relevant concepts in the given ontology (pre-indexed in chromadb). It then presents this in the (system) prompt as a CSV of id, label pairs.
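Roughly, the retrieval half could look like the sketch below. This is illustrative only (the real logic lives in concept_recognition_agent.py); the collection name ont_hp, the stagedb path, and the metadata keys original_id and label are assumptions taken from the command-line examples later in this issue.

```python
# Illustrative sketch of the RAG step: query a pre-indexed chromadb collection
# for the N concepts nearest to the input text and format them as id,label CSV
# rows for the system prompt. Collection and metadata key names are assumptions.
import chromadb

def candidate_concepts_csv(text: str, n: int = 50,
                           db_path: str = "stagedb",
                           collection_name: str = "ont_hp") -> str:
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_collection(collection_name)
    results = collection.query(query_texts=[text], n_results=n,
                               include=["metadatas"])
    rows = []
    for meta in results["metadatas"][0]:
        # "original_id" / "label" are assumed metadata keys.
        rows.append(f'{meta.get("original_id")},{meta.get("label")}')
    return "\n".join(rows)
```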
This method is designed to return the annotated spans "inlined" into the existing text, via this prompt:
Your role is to annotate the supplied text with selected concepts.
return the original text with each conceptID in square brackets.
After the occurrence of that concept.
You can use synonyms. For example, if the concept list contains
zucchini // DB:12345
Then for the text 'I love courgettes!' you should return
'I love [courgettes DB:12345]!'
Always try and match the longest span.
the concept ID should come only from the list of candidate concepts supplied to you.
Ideally the system prompt will include DB:12345,courgettes, but the chance of this diminishes with the size of the input document and, to a lesser extent, the size of the ontology.

Example output from one run of:

curategpt annotate -M inline -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 50 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt
A minimum diagnostic criterion is the combination of either the skin
tumours or multiple [odontogenic keratocysts HP:0010603] plus a positive family
history for this disorder, [bifid ribs HP:0030280], [lamellar calcification of the
falx cerebri HP:0005462] or any one of the skeletal abnormalities typical of this
syndrome
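(As an aside, the bracketed annotations can be pulled back out of this inline format, and mapped to offsets in the unannotated text, with a small regex pass; parse_inline below is a hypothetical helper sketch, not part of curategpt.)

```python
# Hypothetical parser for the inline format '[mention CURIE]' shown above.
# Returns (mention, concept_id, start, end) tuples, with offsets computed
# against the original (unannotated) text.
import re

PATTERN = re.compile(r"\[([^\[\]]+?)\s+([A-Za-z]+:\d+)\]")

def parse_inline(annotated: str):
    spans = []
    removed = 0  # characters of markup stripped so far
    for m in PATTERN.finditer(annotated):
        mention, concept_id = m.group(1), m.group(2)
        start = m.start() - removed          # offset in the unannotated text
        end = start + len(mention)
        spans.append((mention, concept_id, start, end))
        # markup removed for this match: '[', ' CURIE', ']'
        removed += len(m.group(0)) - len(mention)
    return spans

print(parse_inline("I love [courgettes DB:12345]!"))
# [('courgettes', 'DB:12345', 7, 17)]
```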
That was an easy one, since the mentions in the text are more or less exact matches with HPO.
For the other text:
A clinical concept has been produced, with a diagnostic check list
including a genetic and a [dermatological routine work up HP:0001005] as well as
a radiological survey of the [jaws HP:0000347] and [skeleton HP:0033127].
Hmm. We can see the concepts here:
spans:
- text: dermatological routine work up
  start: null
  end: null
  concept_id: HP:0001005
  concept_label: Dermatological manifestations of systemic disorders
  is_suspect: false
- text: jaws
  start: null
  end: null
  concept_id: HP:0000347
  concept_label: Micrognathia
  is_suspect: false
- text: skeleton
  start: null
  end: null
  concept_id: HP:0033127
  concept_label: Abnormality of the musculoskeletal system
  is_suspect: false
So it's getting creative, and this is wrong: the actual phenotype in GG is jaw cysts, not small jaws...
concept list
This is similar to inline but doesn't attempt to relate the match to a span of text.
This is a RAG-based approach that finds the N most relevant concepts in the given ontology (pre-indexed in chromadb). It then presents this in the (system) prompt as a CSV of id, label pairs.
It uses the prompt:
Your role is to list all instances of the supplied candidate concepts in the supplied text.
Return the concept instances as a CSV of ID,label,text pairs, where the ID
is the concept ID, label is the concept label, and text is the mention of the
concept in the text.
The concept ID and label should come only from the list of candidate concepts supplied to you.
Only include a row if the meaning of the text section is that same as the concept.
If there are no instances of a concept in the text, return an empty string.
Do not include additional verbiage.
- text: '"odontogenic keratocysts"'start: nullend: nullconcept_id: HP:0010603concept_label: Odontogenic keratocysts of the jawis_suspect: false
- text: '"calcification of the falx cerebri"'start: nullend: nullconcept_id: HP:0005462concept_label: Calcification of falx cerebriis_suspect: false
- text: '"bifid ribs"'start: nullend: nullconcept_id: HP:0030280concept_label: Rib gapis_suspect: false
For the harder text it has lower recall, but nothing is IMO outright wrong:
spans:
- text: radiological survey of the jaws
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: radiological survey of the skeleton
  start: null
  end: null
  concept_id: HP:0033127
  concept_label: Abnormality of the musculoskeletal system
  is_suspect: false
two pass
This does a first pass where it asks for all concepts found in the doc to be listed (no RAG, very vanilla ChatGPT usage). These are requested as human-readable terms, not IDs, to limit hallucination. It asks for the concepts to be inlined in square brackets.
Then a second pass is done on each concept, essentially grounding them. The grounding DOES use the concept_list/inline RAG method above, but in theory this should be more accurate and include the relevant concepts within the retrieval cutoff, since we are just grounding a short mention rather than presenting the whole text.
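To make the flow concrete, here is a rough sketch of the two passes. This is illustrative only (the real code is in concept_recognition_agent.py); the first-pass prompt shown is a paraphrase, and llm_complete/ground are hypothetical stand-ins for the model call and the RAG grounding step.

```python
# Illustrative sketch of the two-pass flow, NOT the actual implementation
# (see concept_recognition_agent.py). llm_complete() and ground() are
# hypothetical stand-ins for the model call and the chromadb-backed grounding.
import re

FIRST_PASS_PROMPT = (
    "List all medical concepts in the text, inlining each mention "
    "in square brackets. Return the annotated text only.\n\nText: {text}"
)  # paraphrase, not the real first-pass prompt

def two_pass_annotate(text: str, llm_complete, ground) -> list:
    # Pass 1: ask the model to bracket concept mentions (no RAG, no IDs).
    annotated = llm_complete(FIRST_PASS_PROMPT.format(text=text))
    mentions = re.findall(r"\[([^\[\]]+)\]", annotated)
    # Pass 2: ground each mention individually against retrieved candidates,
    # e.g. using the grounding prompt shown below.
    spans = []
    for mention in mentions:
        concept = ground(mention)  # returns e.g. {"concept_id": ..., "concept_label": ...}
        if concept:
            spans.append({"text": mention, **concept})
    return spans
```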
The grounding prompt is:
Your role is to assign a concept ID that best matches the supplied text, using
the supplied list of candidate concepts.
Return as a string "CONCEPT NAME // CONCEPT ID".
Only return a result if the input text represents the same or equivalent
concept, in the provided context.
If there is no match, return an empty string.
(We can use a lower value for -l, as for grounding the top 5 is likely to include the right concept [untested].)

Let's see how it does on the easier text:

curategpt annotate -M two_pass -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 5 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt
annotated_text: A minimum diagnostic criterion is the combination of either the [skin tumours] or multiple [odontogenic keratocysts] plus a positive family history for this disorder, [bifid ribs], [lamellar calcification of the falx cerebri] or any one of the [skeletal abnormalities] typical of this syndrome
spans:
- text: skin tumours
  start: null
  end: null
  concept_id: HP:0008069
  concept_label: Neoplasm of the skin
  is_suspect: false
- text: odontogenic keratocysts
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: bifid ribs
  start: null
  end: null
  concept_id: HP:0000892
  concept_label: Bifid ribs
  is_suspect: false
- text: lamellar calcification of the falx cerebri
  start: null
  end: null
  concept_id: HP:0005462
  concept_label: Calcification of falx cerebri
  is_suspect: false
- text: skeletal abnormalities
  start: null
  end: null
  concept_id: HP:0011842
  concept_label: Abnormal skeletal morphology
  is_suspect: false
And the harder one:
annotated_text: A clinical concept has been produced, with a diagnostic check list including a [genetic] and a [dermatological] routine work up as well as a [radiological] survey of the [jaws] and [skeleton].
spans:
- text: jaws
  start: null
  end: null
  concept_id: HP:0012802
  concept_label: Broad jaw
  is_suspect: false
- text: skeleton
  start: null
  end: null
  concept_id: C0037253
  concept_label: null
  is_suspect: false