
Fun with Forms and LLMs

  • Day 1: Creating FHIR Questionnaires from unstructured sources (a sketch of the generate/validate loop follows this list)
```mermaid
stateDiagram

    w: Welcome Screen
    u: Upload form
    g: Generate Questionnaire
    v: Validate Questionnaire
    r: Render Questionnaire
    i: Iterate with User

    [*] --> w
    w --> u

    u --> g
    g --> v
    v --> g

    v --> r
    r --> i
    i --> g

    i --> [*]
```
  • Day 2: Auto-populating items based on EHR data (a TypeScript sketch of the per-question pipeline also follows this list)
```mermaid
flowchart TD
    A[Questionnaire] --> C[Break into individual items]
    subgraph Processing for Each Question
        D["Generate *keyword query tree* to identify\nCandidate EHR Snippets\n(Alt: Use Embeddings)"]
        E["Generate *interface FactModel {}* for abstracting\nCandidate EHR Snippets"]
        D --> F["Abstract all Candidate Snippets into\nFactModel[]"]
        E --> F
        F --> H[Summarize set of facts into a\nPrepopulation Suggestion\nas QuestionnaireResponse.item]
    end
    C --> D
    C --> E
    H --> I[End]
```
  • Day 3: UX for filling forms (with auto-population assistance)
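
A minimal sketch of the Day 1 loop above, assuming the LLM call is supplied as a plain `complete(prompt)` function and that validation uses the standard FHIR `$validate` operation on some FHIR server; the base URL, prompt wording, and retry count are illustrative, not the code in this repo.

```typescript
// Day 1 loop: generate a Questionnaire, validate it, and feed validator
// errors back into the next prompt. `complete` is any prompt-in/text-out
// LLM call supplied by the caller; the FHIR base URL is an assumption.
type CompleteFn = (prompt: string) => Promise<string>;

interface OperationOutcomeIssue {
  severity: "fatal" | "error" | "warning" | "information";
  diagnostics?: string;
}

async function validateQuestionnaire(
  fhirBase: string,
  questionnaire: unknown
): Promise<OperationOutcomeIssue[]> {
  // Standard FHIR $validate operation; returns an OperationOutcome.
  const res = await fetch(`${fhirBase}/Questionnaire/$validate`, {
    method: "POST",
    headers: { "Content-Type": "application/fhir+json" },
    body: JSON.stringify(questionnaire),
  });
  const outcome = await res.json();
  return (outcome.issue ?? []) as OperationOutcomeIssue[];
}

async function generateValidQuestionnaire(
  complete: CompleteFn,
  fhirBase: string,
  sourceText: string,
  maxAttempts = 3
): Promise<unknown> {
  let prompt =
    `Convert the following form into a FHIR R4 Questionnaire resource. ` +
    `Respond with JSON only.\n\n${sourceText}`;
  let candidate: unknown = null;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    candidate = JSON.parse(await complete(prompt));
    const errors = (await validateQuestionnaire(fhirBase, candidate)).filter(
      (i) => i.severity === "error" || i.severity === "fatal"
    );
    if (errors.length === 0) return candidate; // valid: hand off to rendering

    // Close the loop; clearer diagnostics here produced better corrections.
    prompt =
      `The previous Questionnaire failed validation with these issues:\n` +
      errors.map((e) => `- ${e.diagnostics}`).join("\n") +
      `\n\nReturn a corrected Questionnaire as JSON only:\n` +
      JSON.stringify(candidate);
  }
  return candidate; // best effort after maxAttempts
}
```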
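
For the Day 2 flow above, a hedged per-question sketch in the same spirit: the `FactModel` fields, the naive keyword filter standing in for the query-tree/embeddings step, and the helper names are assumptions. The point is abstracting candidate EHR snippets into a small typed shape, then summarizing the facts into a `QuestionnaireResponse.item` suggestion.

```typescript
// Day 2, one question at a time: find candidate EHR snippets, abstract them
// into FactModel[], then summarize into a prepopulation suggestion.
// Field names on FactModel are assumptions; the diagram only names the interface.
interface FactModel {
  statement: string;       // one fact abstracted from a snippet
  sourceSnippetId: string; // which candidate snippet it came from
  recordedDate?: string;   // ISO date, if the snippet carried one
  confidence: number;      // 0..1, model-reported
}

interface QuestionnaireItem { linkId: string; text?: string; }
interface QuestionnaireResponseItem {
  linkId: string;
  answer?: { valueString: string }[];
}

type CompleteFn = (prompt: string) => Promise<string>;

async function suggestAnswer(
  complete: CompleteFn,
  question: QuestionnaireItem,
  ehrSnippets: { id: string; text: string }[]
): Promise<QuestionnaireResponseItem> {
  // Stand-in for the "keyword query tree" (or embeddings) step: a naive keyword filter.
  const keywords = (question.text ?? "")
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 3);
  const candidates = ehrSnippets.filter((s) =>
    keywords.some((k) => s.text.toLowerCase().includes(k))
  );

  // Abstract all candidate snippets into FactModel[].
  const facts: FactModel[] = JSON.parse(
    await complete(
      `For the question "${question.text}", extract facts from these snippets as a JSON array ` +
      `matching interface FactModel { statement; sourceSnippetId; recordedDate?; confidence }:\n` +
      JSON.stringify(candidates)
    )
  );

  // Summarize the set of facts into a QuestionnaireResponse.item suggestion.
  const summary = await complete(
    `Given these facts ${JSON.stringify(facts)}, suggest a concise answer to "${question.text}".`
  );
  return { linkId: question.linkId, answer: [{ valueString: summary.trim() }] };
}
```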

Themes

(In hackmd for editing.)


Overview Video run-of-show

  • Overview of "hack week" goals

  • This week's topic: Clinical forms & structuring healthcare data

    • Day 1 / Day 2 / Day 3 goals
    • Build demos and build our own intuitions
    • We know full well these are each deep problems that we could get lost in
    • Try to keep moving forward and learning; catalog problems and set aside
  • Day 1: Structuring free-text data into FHIR Questionnaires

    • Demo: a clinical instrument that has never been "LOINC-ified" (http://www.tbi-impact.org/cde/mod_templates/12_F_06_Rivermead.pdf)
    • https://github.com/microsoft-healthcare-madison/hackweek-2023-12/tree/main/day-1-fhir-questionnaire-creation/examples
    • Key theme: Lots of forms go through a formal standards process with, e.g., LOINC (~100K concepts and growing)... but there is still a very long tail of clinical content
      • Specific disease areas / clinical research
      • Bespoke intake forms and local processes
    • Key theme: With in-context schema documentation, zero-shot generation is impressive
    • Key theme: Closed loop with FHIR validator improves output quality in complex cases
      • Better quality validation --> Better results
      • Better, clearer error messages --> Better results (Bonus: good for humans too)
    • Key theme: Iterating on large forms leads to the model's output generalizing or coming back as snippets, making it hard to apply in code or edit - need to reduce scope and be more explicit when defining changes
    • Key theme: Need to QA results - hallucinations can be subtle (e.g., inventing questions instead of reading text)
      • Can be developer's fault (e.g., not providing enough context)
      • Can be system's fault (e.g., request failure results in explicit "I'll make this up")
      • Output includes only a "snippet" or theme for the change e.g. "... apply same to other questions..."
    • Gino: LOINC Coding Experiments
  • Day 2: Automatically gathering and populating data from EHR

  • Day 3: Brian screenshare on entering / pre-populating responses, pulling from existing content

    • SDC prepop works if editors write expressions ahead of time (an example extension follows this list)
    • Could produce these extensions automatically
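
For the Day 3 SDC prepopulation point above: an SDC-aware renderer pre-fills an item when it carries an `initialExpression` extension holding a FHIRPath expression, and these are the extensions a generator could emit automatically. A small illustrative item follows; the `linkId`, question text, and expression are made up, and only the extension URL and structure follow the SDC spec.

```typescript
// One Questionnaire item carrying the SDC initialExpression extension, so an
// SDC-aware renderer can pre-fill it from the launch context (%patient).
// The linkId, text, and expression are illustrative.
const birthDateItem = {
  linkId: "patient-birth-date",
  text: "Date of birth",
  type: "date",
  extension: [
    {
      url: "http://hl7.org/fhir/uv/sdc/StructureDefinition/sdc-questionnaire-initialExpression",
      valueExpression: {
        language: "text/fhirpath",
        expression: "%patient.birthDate",
      },
    },
  ],
};
```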

Retrospective & Lessons Learned

Josh

  • UX Paradigms for interacting with LLMs are still being invented
    • We explore some structured and unstructured examples
  • Getting the best out of your LLM
    • GPT-4 has strong built-in capabilities
    • GPT-4 samples at T=0.7-1.0 have tremendous variability
    • Taming the variability without stifling the creativity
      • In-context guidance through detailed prompting is table stakes
      • In-context learning (e.g., with FHIR schema) helps steer generation tasks
      • External feedback loops (e.g., with FHIR validator or code interpreter) can be magical
      • "Narrow tasks" are helpful
        • Give LLM a specific, narrow task when "creative" or "insightful" work is involved
        • Use structured approaches to prevent errors downstream
          • e.g., gating application of changes with a human-facing decision
          • e.g., working out implications in a structured way, rather than asking the LLM
  • Real/realistic data are hard to come by
    • Critical for testing and intuition-building
    • Critical for evaluating a model's performance
    • Synthetic clinical notes can be very pragmatic... maybe for both

Brian

  • Also found that the approach of interacting via the TypeScript "interface" style worked VERY well (see the sketch after this list).
  • Getting lists from the LLM wasn't obvious, though it was simple in the end: just include an array of objects in the interface.
  • The code comments on the interface actually worked very well as part of the prompt without other instructions.
  • When receiving large questionnaires with similar fields, the model can have trouble evaluating the difference between them. Further research required here.
    • It wasn't able to distinguish between full name, title, and preferred name. However, it was able to chunk out given, middle, and family names.
    • Including confidence, reasoning, and source fields does help inform the UI after the AI responds, for better post-processing and display.
  • Getting the LLM to try to summarize a form's purpose with limited information really goes to show that poor form design can lead to more challenging use - which I guess is consistent with real usage.
    • Maybe an area that could be investigated further is continuing to review/refine a definition to improve usability/clarity.
  • Using the LLM as a closed box and just sending it everything makes it difficult to understand why something isn't working.
  • Breaking the process up into smaller tasks helps reduce unknown failures, and requests with smaller token counts appear to be faster too.
  • Breaking up questionnaires and providing subsets will be common, and may need multiple calls when trying to process a form
    • Investigated a few approaches to dicing up large form definitions with many groups/questions.
    • The approach we ended up with was to chain through the groups, while also including the chain from the base of the questionnaire (sketched after this list).
  • As a UX experience, just blasting the result into the form without confirmation wasn't great, particularly without the ability to discuss where it came from.
  • UX feedback that the AI is still processing, and what it is working on, is important, as it can take quite some time.
    • This matters more when doing multiple steps, as you need to know which steps are taking the time.
  • Open source SDC-based tooling to help work with FHIR Questionnaires could be improved (and more of it is needed).
  • When generating the form definitions, the engine would often give "snippets" with "do the same here" instructions embedded, which then needed to be split out again - this makes processing into a real form via code difficult, and keeping a large form in context can be challenging/slow.
  • The quality of error messages from the validator can greatly improve the chances that GPT will correct things properly.
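
The exact interface Brian used isn't shown in this README; below is an illustrative version of the pattern he describes, where the interface source (comments included) is pasted into the prompt, lists come back as arrays of objects, and confidence/reasoning/source fields support UI review. The field names and prompt wording are assumptions.

```typescript
// Illustrative "interface as prompt" shape: the comments double as the
// instructions, lists come back as arrays of objects, and the
// confidence/reasoning/source fields drive review in the UI.
// Field names and prompt wording are assumptions, not the repo's actual code.
interface ExtractedAnswer {
  linkId: string;     // the Questionnaire item this answer belongs to
  value: string;      // the proposed answer, as it should be entered
  confidence: number; // 0..1, how sure the model is
  reasoning: string;  // one sentence explaining why this value was chosen
  source: string;     // quote or pointer to the input text it came from
}

interface ExtractionResult {
  answers: ExtractedAnswer[]; // one entry per item that could be answered
}

// The prompt is then little more than the interface source plus the input document.
const buildPrompt = (interfaceSource: string, documentText: string) =>
  `Fill in the following TypeScript interface from the document below. ` +
  `Respond with a single JSON object matching ExtractionResult.\n\n` +
  `${interfaceSource}\n\nDocument:\n${documentText}`;
```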
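
Likewise, the chunking code isn't in this README; here is one plausible reading of "chain through the groups while including the chain from the base": emit one chunk per group, carrying that group's own questions plus the titles of its ancestor groups back to the questionnaire root. The item shape is a minimal subset of `Questionnaire.item`.

```typescript
// One reading of the chunking approach: walk the Questionnaire group by group,
// emitting each group's own questions together with the chain of ancestor
// group titles back to the questionnaire root, so context survives the split.
// QItem is a minimal subset of FHIR Questionnaire.item.
interface QItem {
  linkId: string;
  text?: string;
  type: string; // "group", "string", "date", ...
  item?: QItem[];
}

interface Chunk {
  ancestorChain: string[]; // titles from the root down to (not including) this group
  group: QItem;            // the group to send in one call, minus nested groups
}

function chunkByGroup(items: QItem[], chain: string[] = []): Chunk[] {
  const chunks: Chunk[] = [];
  for (const item of items) {
    if (item.type !== "group") continue;
    const nextChain = [...chain, item.text ?? item.linkId];
    // This chunk carries the group's own (non-group) questions...
    chunks.push({
      ancestorChain: chain,
      group: { ...item, item: (item.item ?? []).filter((c) => c.type !== "group") },
    });
    // ...and nested groups become their own chunks, keeping the chain intact.
    chunks.push(...chunkByGroup(item.item ?? [], nextChain));
  }
  return chunks;
}
```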

Gino

  • GPT-4 feels less like an 'expert' and more like a 'junior/senior pretending to be an expert'.
    • It can do a lot of things.
    • When right, it feels like magic.
      • Extracting the questions from the PRAPARE-English PDF file.
      • Selecting correct LOINC code based on a related-text query
    • When wrong, it can be extremely blatant or very subtle.
      • Extracting most of the questions of PRAPARE, but making up the last five.
      • Inventing a new LOINC code
      • Using an existing LOINC code, but inventing a new description.
    • All output needs to be verified and validated.
  • GPT-4 performance and cost mean that it is best to lean on AI for isolated and well-defined tasks instead of entire workflows
  • Testing is time-consuming and time-sensitive (contamination problem)
    • It is hard to know if your test is something it knows specifically (e.g., seen that brainteaser before)
    • Publishing test scenarios properly means that data could be included in a future set of models.
    • Care should be taken to ensure that tests are novel.

Rough ideas for current appropriate uses:

  • Developer assistant
    • Models are good at writing code, somewhere between 80% and 100% correct.
    • Developer (in theory) has expertise to evaluate and fix.
  • Concept derivation
    • Distilling free-form text into concise language from a particular context.
  • Concept explanation
    • Expounding on a terse description (iteratively) with a user to achieve understanding.
  • Visual data extraction
    • Great at pulling data, but needs visual comparison/review to ensure correctness.
  • Feedback looping
    • Makes a good sounding board, even when results are incorrect.
  • Overall
    • Processes that tolerate/integrate non-determinism and fuzziness.