aphp-datascience/eds-diabetes-date

Diabetes diagnosis date extraction from French clinical notes

License: MIT · Python 3.12+

Benchmarking LLMs and rule-based methods for diabetes diagnosis date extraction from French clinical notes.

Key features

  • LLM inference: vLLM with continuous batching, tensor parallelism, thinking mode support
  • French NLP: Age normalization ("diabétique depuis l'âge de 45 ans", i.e. "diabetic since age 45" → birth_year + 45)
  • Evaluation: Bootstrap CIs, stratified metrics by date type, MAE, hallucination rate
  • Rule-based baseline: EDS-NLP ContextualMatcher with diverse context window configurations
  • Visualization: Bar plots with CIs, F1/time trade-off, Bland-Altman agreement

Installation

git clone https://github.com/aphp-datascience/eds-diabetes-date
cd eds-diabetes-date

This project uses uv for dependency management:

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

For AP-HP users (default):

The project uses pyspark-revival, a patched version of PySpark 2.4.3 that works with recent Python versions (3.12+). This is required on the AP-HP infrastructure.

On the AP-HP cluster, uv must be installed inside a conda environment:

# Create and activate conda environment
conda create -n link_dates python=3.12
conda activate link_dates

# Install uv inside the conda environment
pip install uv

# Install project dependencies
uv sync

# Install with GPU support (for LLM inference with vLLM)
uv sync --group gpu

For users outside AP-HP:

If you don't have access to the AP-HP GitLab repository, remove the uv.lock file and install without custom sources:

# Install core dependencies
uv sync --no-sources

# Install with GPU support (for LLM inference with vLLM)
uv sync --no-sources --group gpu

This will install a recent version of PySpark (3.x or 4.x) from PyPI instead of the AP-HP-specific version.

Note on GPU dependencies:

  • vLLM is in the optional gpu dependency group since it requires CUDA and doesn't work on macOS
  • Core functionality (evaluation, visualization, tests) works without vLLM
  • LLM inference requires --group gpu installation on GPU-enabled systems

Pipeline execution

1. Dataset creation

uv run scripts/create_datasets.py

Creates train/validation/test splits from raw HDFS notes, truncating each note to a context window around its diabetes mentions. Raw notes are not shared (see DATA_AVAILABILITY.md).
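The truncation step can be sketched as follows. This is a toy word-level version (the helper name and matching logic are assumptions); the real pipeline in scripts/create_datasets.py operates on HDFS notes:

```python
# Toy sketch: keep only a window of words around the first diabetes mention.
def truncate_around_mention(text: str, keyword: str, window: int = 5) -> str:
    words = text.split()
    for i, w in enumerate(words):
        if keyword in w.lower():
            start = max(0, i - window)
            return " ".join(words[start : i + window + 1])
    return text  # no mention found: return the note unchanged

note = "Patient suivi pour un diabète de type 2 découvert en 2010 lors d'un bilan"
print(truncate_around_mention(note, "diabète", window=3))
# → suivi pour un diabète de type 2
```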

2. LLM inference

Single experiment:

# Run locally
uv run scripts/run_llm_inference.py --experiment qwen3_8B_few_shot --dataset_type train

# List all experiments
uv run scripts/run_llm_inference.py --list_experiments

# Run on SLURM (single experiment)
sbatch scripts/run_llm_inference_single.sh --experiment qwen3_8B_few_shot --dataset_type train

All LLM experiments (batch):

# Run all LLM experiments in parallel on SLURM
sbatch scripts/run_llm_inference_all.sh train

# Experiments list can be found in src/link_dates/configs/experiments.py

Custom configuration:

uv run scripts/run_llm_inference.py \
  --model_id Qwen/Qwen2.5-8B-Instruct \
  --file_prompt custom_prompt.json \
  --window 300 \
  --batch_size 128 \
  --dataset_type validation
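After generation, the pipeline extracts a year from the model's free-text answer via regex (src/link_dates/inference.py). A simplified sketch of that post-processing step, with an assumed pattern and function name:

```python
import re

# Illustrative sketch: pull a 4-digit year out of the LLM's answer,
# falling back to the unknown sentinel when no year is present.
UNKNOWN_YEAR = 9999
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")

def extract_year(answer: str) -> int:
    match = YEAR_PATTERN.search(answer)
    return int(match.group(1)) if match else UNKNOWN_YEAR

print(extract_year("Le diabète a été diagnostiqué en 2015."))  # 2015
print(extract_year("Aucune date de diagnostic trouvée."))      # 9999
```

The actual extraction is likely more robust (e.g. handling reasoning traces before the final answer); this only shows the principle.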

3. ContextualMatcher (rule-based)

Single configuration:

# Run locally
uv run scripts/run_contextual_matcher.py \
  --experiment contextualmatcher_wmin-15_wmax15_smin-1_smax1 \
  --dataset_type train

# List all configurations
uv run scripts/run_contextual_matcher.py --list_experiments

# Manual parameters
uv run scripts/run_contextual_matcher.py \
  --word_min -30 --word_max 30 --sent_min -1 --sent_max 1 \
  --dataset_type train

All ContextualMatcher configurations (batch):

# Run all configurations
bash scripts/run_contextual_matcher_all.sh train

# Experiments list can be found in src/link_dates/configs/experiments.py

4. Evaluation

uv run scripts/evaluate_model.py \
  --eval_path hdfs://results/test_type_1/qwen3_few_shot_regex_agg.parquet \
  --dataset_type test_type_1 \
  --tol 1

Arguments:

  • --eval_path: HDFS path to model predictions (Parquet format)
  • --dataset_type: Dataset to evaluate (train/validation/test_type_1/test_type_2)
  • --tol: Tolerance in years for matching (default: 1)

Metrics computed:

  • Classification metrics: F1, precision, recall, specificity, balanced accuracy
    • TP: Prediction matches annotation (within ±tolerance years)
    • TN: Both prediction and annotation are unknown (UNKNOWN_YEAR = 9999)
    • FP: Prediction made but annotation is unknown, or outside tolerance
    • FN: Prediction missed an existing date
    • Stratified by date type (absolute, relative, age, no_date)
  • Mean Absolute Error (MAE): Average year difference (pairs with unknown dates [UNKNOWN_YEAR = 9999] excluded)
  • Hallucination rate: Proportion of no-date notes where model predicted a date
  • No prediction rate: Proportion of notes with real dates where model predicted unknown (UNKNOWN_YEAR = 9999)
  • Bootstrap confidence intervals: 10,000 iterations for uncertainty estimates (95% CI, seed=12345)
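The matching rules above can be illustrated with a small worked example (toy code, not the project's evaluation.py):

```python
# Toy illustration of the tolerance-based matching rules and MAE exclusion.
UNKNOWN_YEAR = 9999

def classify(pred: int, gold: int, tol: int = 1) -> str:
    if gold == UNKNOWN_YEAR and pred == UNKNOWN_YEAR:
        return "TN"
    if gold == UNKNOWN_YEAR:          # prediction made but no real date
        return "FP"
    if pred == UNKNOWN_YEAR:          # missed an existing date
        return "FN"
    return "TP" if abs(pred - gold) <= tol else "FP"

# (prediction, annotation) pairs
pairs = [(2010, 2011), (2005, 2005), (1999, 9999), (9999, 2003), (9999, 9999)]
labels = [classify(p, g) for p, g in pairs]
print(labels)  # ['TP', 'TP', 'FP', 'FN', 'TN']

# MAE over pairs where both years are known
known = [(p, g) for p, g in pairs if UNKNOWN_YEAR not in (p, g)]
mae = sum(abs(p - g) for p, g in known) / len(known)
print(mae)  # 0.5
```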

5. Metrics Table (All Models)

Create comprehensive metrics tables comparing all models on a dataset:

uv run scripts/create_metrics_table.py \
  --dataset_type test_type_1 \
  --tol 1 \
  --output_format table  # or markdown, csv

Displays two tables:

  1. Overall metrics: precision, recall, F1, balanced accuracy, hallucination rate, no prediction rate, MAE, mean inference time
  2. Stratified metrics: precision, recall, F1 for each date type (absolute, relative, age)

6. Visualization

uv run scripts/create_plots.py \
  --dataset_type test_type_1 \
  --bootstrap_iterations 10000 \
  --tolerance 1

Generates: bar plots (F1/precision/recall + CI), F1 vs inference time, Bland-Altman agreement.
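The quantities behind the Bland-Altman plot can be computed in a few lines. The data here are invented for illustration; only the formulas (bias and limits of agreement at bias ± 1.96 × SD of the differences) are standard:

```python
import statistics

# Sketch of the Bland-Altman agreement quantities: for each note, the plot
# shows the mean of the two years against their difference, with limits of
# agreement at bias ± 1.96 * sd of the differences.
pred = [2010, 2004, 1999, 2016]  # illustrative predicted years
gold = [2011, 2004, 2001, 2015]  # illustrative annotated years

diffs = [p - g for p, g in zip(pred, gold)]
means = [(p + g) / 2 for p, g in zip(pred, gold)]
bias = statistics.mean(diffs)
sd = statistics.stdev(diffs)
loa = (bias - 1.96 * sd, bias + 1.96 * sd)
print(round(bias, 2), [round(x, 2) for x in loa])
```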

Experiment configurations

Defined in src/link_dates/configs/experiments.py:

LLM experiments (12):

Model                                     Prompt     Reasoning  Window  Temperature  top_p
Qwen/Qwen3-8B                             zero_shot  No         200     0.7          0.8
Qwen/Qwen3-8B                             zero_shot  Yes        200     0.6          0.95
Qwen/Qwen3-8B                             few_shot   No         200     0.7          0.8
Qwen/Qwen3-8B                             few_shot   Yes        200     0.6          0.95
meta-llama/Llama-3.1-8B-Instruct          zero_shot  No         200     0.6          0.9
meta-llama/Llama-3.1-8B-Instruct          few_shot   No         200     0.6          0.9
mistralai/Ministral-3-8B-Reasoning-2512   zero_shot  Yes        200     0.7          -
mistralai/Ministral-3-8B-Reasoning-2512   few_shot   Yes        200     0.7          -
mistralai/Ministral-3-14B-Reasoning-2512  zero_shot  Yes        200     1            -
mistralai/Ministral-3-14B-Reasoning-2512  few_shot   Yes        200     1            -
google/medgemma-27b-text-it               zero_shot  No         200     0            -
google/medgemma-27b-text-it               few_shot   No         200     0            -

All experiments: seed=42, max_tokens=2048

ContextualMatcher experiments (12):

Context Type   word_min  word_max  sent_min  sent_max
Left           -5        0         0         0
Left           -10       0         0         0
Left           -15       0         0         0
Left           -15       0         -1        0
Right          0         5         0         0
Right          0         10        0         0
Right          0         15        0         0
Right          0         15        0         1
Bidirectional  -5        5         0         0
Bidirectional  -10       10        0         0
Bidirectional  -15       15        0         0
Bidirectional  -15       15        -1        1

Negative values = left context (before entity), positive values = right context (after entity)
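The sign convention can be illustrated with a toy word-window helper (a hypothetical function, not the EDS-NLP API):

```python
# Toy illustration of the window convention: negative bounds extend left
# of the entity, positive bounds extend right.
def context_window(words: list[str], entity_idx: int,
                   word_min: int, word_max: int) -> list[str]:
    start = max(0, entity_idx + word_min)
    end = min(len(words), entity_idx + word_max + 1)
    return words[start:end]

words = "diagnostic de diabète posé en 2012".split()
# Window (-1, +2) around "diabète" (index 2)
print(context_window(words, 2, -1, 2))  # ['de', 'diabète', 'posé', 'en']
```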

Important Constants

The codebase uses centralized constants defined in src/link_dates/constants.py:

  • UNKNOWN_YEAR = 9999: Represents unknown/missing diagnosis dates in code

    • Used throughout Python code for consistency
    • Note: Prompt files (prompts/*.json) use the literal value "9999" in their French instructions to the LLM (e.g. "retourne 9999", i.e. "return 9999"). This is intentional, since prompts are instructions to the model, not Python code.
  • DATASET_TYPES: Valid dataset splits ["train", "validation", "test_type_1", "test_type_2"]

Project structure

eds-diabetes-date/
├── pyproject.toml              # uv project configuration
├── uv.lock                     # exact packages versions
├── src/link_dates/
│   ├── inference.py            # vLLM pipeline, chat templates, regex extraction
│   ├── evaluation.py           # Metrics (F1, MAE, hallucination, no prediction), bootstrap CI
│   ├── visualization.py        # Plots with bootstrap CI
│   ├── age_normalization.py    # French age expressions detection & conversion
│   ├── contextual_matcher.py   # EDS-NLP pipeline creation
│   ├── converters.py           # Doc → DataFrame conversion
│   ├── constants.py            # Centralized constants (UNKNOWN_YEAR, DATASET_TYPES)
│   ├── hadoop_setup.py         # HDFS classpath configuration
│   └── configs/
│       ├── experiments.py      # All experiment configs (LLM + ContextualMatchers)
│       ├── paths.py            # Centralized paths management
│       └── plotting_config.py  # Visualization settings
├── scripts/
│   ├── create_datasets.py               # Dataset creation from HDFS
│   ├── run_llm_inference.py             # LLM inference script
│   ├── run_llm_inference_all.sh         # SLURM: run all LLM experiments
│   ├── run_llm_inference_single.sh      # SLURM: run single LLM experiment
│   ├── run_contextual_matcher.py        # ContextualMatcher inference script
│   ├── run_contextual_matcher_all.sh    # Bash: run all ContextualMatcher configs
│   ├── evaluate_model.py                # Single model evaluation
│   ├── create_metrics_table.py          # Comprehensive metrics table for all models
│   └── create_plots.py                  # Visualization
├── prompts/
│   ├── zero_shot.json          # Zero-shot prompt template
│   └── few_shot.json           # Few-shot with examples
└── tests/                      # Unit tests (no GPU/HDFS required)

Data Availability

Due to patient privacy regulations, the clinical notes dataset cannot be publicly shared. Synthetic test data is provided for code validation. See DATA_AVAILABILITY.md for details.

Reproducibility

  • Environment: Python 3.12+, uv, dependencies in pyproject.toml (exact versions in uv.lock)
  • Models: HuggingFace (Qwen/Qwen3-8B, meta-llama/Llama-3.1-8B-Instruct, mistralai/Ministral-3-8B-Reasoning-2512, mistralai/Ministral-3-14B-Reasoning-2512, google/medgemma-27b-text-it)
  • Hardware: 2× NVIDIA A100 for paper results, 30GB+ RAM, 8+ cores
  • Seeds: Model sampling=42, Bootstrap CI=12345

Quick start:

uv sync                           # Inside AP-HP (core dependencies)
uv sync --group gpu               # Inside AP-HP (with vLLM for GPU inference)
uv sync --no-sources              # Outside AP-HP (core dependencies)
uv sync --no-sources --group gpu  # Outside AP-HP (with vLLM for GPU inference)

export HF_TOKEN=<your_token>      # Or add HF_TOKEN to a .env file at the repository root
uv run --frozen pytest tests/ -v  # Verify the installation (works without GPUs)
