Benchmarking LLMs and rule-based methods for diabetes diagnosis date extraction from French clinical notes.
- LLM inference: vLLM with continuous batching, tensor parallelism, thinking mode support
- French NLP: Age normalization ("diabétique depuis l'âge de 45 ans" → birth_year + 45)
- Evaluation: Bootstrap CIs, stratified metrics by date type, MAE, hallucination rate
- Rule-based baseline: EDS-NLP ContextualMatcher with diverse context window configurations
- Visualization: Bar plots with CIs, F1/time trade-off, Bland-Altman agreement
```bash
git clone https://github.com/aphp-datascience/eds-diabetes-date
cd eds-diabetes-date
```

This project uses uv for dependency management:

```bash
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
```

For AP-HP users (default):
The project uses pyspark-revival, a patched version of PySpark 2.4.3 that works with recent Python versions (3.12+). This is required on the AP-HP infrastructure.
On the AP-HP cluster, uv must be installed inside a conda environment:
```bash
# Create and activate conda environment
conda create -n link_dates python=3.12
conda activate link_dates

# Install uv inside the conda environment
pip install uv

# Install project dependencies
uv sync

# Install with GPU support (for LLM inference with vLLM)
uv sync --group gpu
```

For users outside AP-HP:
If you don't have access to the AP-HP GitLab repository, remove the uv.lock file and install without custom sources:
```bash
# Install core dependencies
uv sync --no-sources

# Install with GPU support (for LLM inference with vLLM)
uv sync --no-sources --group gpu
```

This will install a recent version of PySpark (3.x or 4.x) from PyPI instead of the AP-HP-specific version.
Note on GPU dependencies:
- vLLM is in the optional `gpu` dependency group since it requires CUDA and doesn't work on macOS
- Core functionality (evaluation, visualization, tests) works without vLLM
- LLM inference requires `--group gpu` installation on GPU-enabled systems
```bash
uv run scripts/create_datasets.py
```

Creates train/validation/test splits from raw HDFS notes. Notes are truncated to context windows around diabetes mentions. Raw notes are not shared (see DATA_AVAILABILITY.md).
Single experiment:
```bash
# Run locally
uv run scripts/run_llm_inference.py --experiment qwen3_8B_few_shot --dataset_type train

# List all experiments
uv run scripts/run_llm_inference.py --list_experiments

# Run on SLURM (single experiment)
sbatch scripts/run_llm_inference_single.sh --experiment qwen3_8B_few_shot --dataset_type train
```

All LLM experiments (batch):
```bash
# Run all LLM experiments in parallel on SLURM
sbatch scripts/run_llm_inference_all.sh train

# Experiments list can be found in src/link_dates/configs/experiments.py
```

Custom configuration:
```bash
uv run scripts/run_llm_inference.py \
    --model_id Qwen/Qwen2.5-8B-Instruct \
    --file_prompt custom_prompt.json \
    --window 300 \
    --batch_size 128 \
    --dataset_type validation
```

Single configuration:
```bash
# Run locally
uv run scripts/run_contextual_matcher.py \
    --experiment contextualmatcher_wmin-15_wmax15_smin-1_smax1 \
    --dataset_type train

# List all configurations
uv run scripts/run_contextual_matcher.py --list_experiments

# Manual parameters
uv run scripts/run_contextual_matcher.py \
    --word_min -30 --word_max 30 --sent_min -1 --sent_max 1 \
    --dataset_type train
```

All ContextualMatcher configurations (batch):
```bash
# Run all configurations
bash scripts/run_contextual_matcher_all.sh train

# Experiments list can be found in src/link_dates/configs/experiments.py
```

```bash
uv run scripts/evaluate_model.py \
    --eval_path hdfs://results/test_type_1/qwen3_few_shot_regex_agg.parquet \
    --dataset_type test_type_1 \
    --tol 1
```

Arguments:
- `--eval_path`: HDFS path to model predictions (Parquet format)
- `--dataset_type`: Dataset to evaluate (train/validation/test_type_1/test_type_2)
- `--tol`: Tolerance in years for matching (default: 1)
Metrics computed:
- Classification metrics: F1, precision, recall, specificity, balanced accuracy
- TP: Prediction matches annotation (within ±tolerance years)
- TN: Both prediction and annotation are unknown (UNKNOWN_YEAR = 9999)
- FP: Prediction made but annotation is unknown, or outside tolerance
- FN: Prediction missed an existing date
- Stratified by date type (absolute, relative, age, no_date)
- Mean Absolute Error (MAE): Average year difference (pairs with unknown dates [UNKNOWN_YEAR = 9999] excluded)
- Hallucination rate: Proportion of no-date notes where model predicted a date
- No prediction rate: Proportion of notes with real dates where model predicted unknown (UNKNOWN_YEAR = 9999)
- Bootstrap confidence intervals: 10,000 iterations for uncertainty estimates (95% CI, seed=12345)
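A minimal sketch of the matching logic behind these metrics (illustrative only; the function name `classify` is an assumption, not the project's actual `evaluation.py` API):

```python
UNKNOWN_YEAR = 9999  # sentinel for unknown/missing diagnosis dates

def classify(pred: int, gold: int, tol: int = 1) -> str:
    """Label one (prediction, annotation) pair as TP/TN/FP/FN."""
    if gold == UNKNOWN_YEAR:
        # No real date annotated: predicting unknown is correct
        return "TN" if pred == UNKNOWN_YEAR else "FP"
    if pred == UNKNOWN_YEAR:
        return "FN"  # a real date exists but the model predicted unknown
    # Real date on both sides: match within ±tol years
    return "TP" if abs(pred - gold) <= tol else "FP"

pairs = [(2005, 2005), (2004, 2005), (2010, 2005), (9999, 2005), (9999, 9999)]
labels = [classify(p, g) for p, g in pairs]
# -> TP, TP (within ±1 year), FP (outside tolerance), FN, TN

tp, fp, fn = labels.count("TP"), labels.count("FP"), labels.count("FN")
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

The hallucination and no-prediction rates follow from the same sentinel: a prediction other than 9999 on a no-date note counts toward the former, and a 9999 prediction on a real-date note counts toward the latter.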
Create comprehensive metrics tables comparing all models on a dataset:
```bash
uv run scripts/create_metrics_table.py \
    --dataset_type test_type_1 \
    --tol 1 \
    --output_format table  # or markdown, csv
```

Displays two tables:
- Overall metrics: precision, recall, F1, balanced accuracy, hallucination rate, no prediction rate, MAE, mean inference time
- Stratified metrics: precision, recall, F1 for each date type (absolute, relative, age)
```bash
uv run scripts/create_plots.py \
    --dataset_type test_type_1 \
    --bootstrap_iterations 10000 \
    --tolerance 1
```

Generates: bar plots (F1/precision/recall + CI), F1 vs inference time, Bland-Altman agreement.
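The percentile-bootstrap CIs behind these plots can be sketched as follows (a simplified stand-in using the mean of a score list; the project may instead bootstrap F1 over resampled notes):

```python
import random

def bootstrap_ci(values, n_iter=10_000, alpha=0.05, seed=12345):
    """Percentile bootstrap CI for the mean of `values` (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed for reproducibility, as in the paper
    n = len(values)
    # Resample with replacement n_iter times and record each resample's mean
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_iter)
    )
    lo = means[int(n_iter * alpha / 2)]
    hi = means[int(n_iter * (1 - alpha / 2)) - 1]
    return lo, hi

low, high = bootstrap_ci([0.0, 1.0, 1.0, 1.0], n_iter=1000)
```

With 10,000 iterations and alpha=0.05 this yields the 95% interval between the 2.5th and 97.5th percentiles of the resampled statistic.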
Defined in src/link_dates/configs/experiments.py:
LLM experiments (12):
| Model | Prompt | Reasoning | Window | Temperature | top_p |
|---|---|---|---|---|---|
| Qwen/Qwen3-8B | zero_shot | No | 200 | 0.7 | 0.8 |
| Qwen/Qwen3-8B | zero_shot | Yes | 200 | 0.6 | 0.95 |
| Qwen/Qwen3-8B | few_shot | No | 200 | 0.7 | 0.8 |
| Qwen/Qwen3-8B | few_shot | Yes | 200 | 0.6 | 0.95 |
| meta-llama/Llama-3.1-8B-Instruct | zero_shot | No | 200 | 0.6 | 0.9 |
| meta-llama/Llama-3.1-8B-Instruct | few_shot | No | 200 | 0.6 | 0.9 |
| mistralai/Ministral-3-8B-Reasoning-2512 | zero_shot | Yes | 200 | 0.7 | - |
| mistralai/Ministral-3-8B-Reasoning-2512 | few_shot | Yes | 200 | 0.7 | - |
| mistralai/Ministral-3-14B-Reasoning-2512 | zero_shot | Yes | 200 | 1 | - |
| mistralai/Ministral-3-14B-Reasoning-2512 | few_shot | Yes | 200 | 1 | - |
| google/medgemma-27b-text-it | zero_shot | No | 200 | 0 | - |
| google/medgemma-27b-text-it | few_shot | No | 200 | 0 | - |
All experiments: seed=42, max_tokens=2048
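For illustration, one row of this table could be represented as a plain config dict before being handed to the vLLM pipeline (a hypothetical shape; the actual structure in `experiments.py` may differ):

```python
# Hypothetical config for the Qwen3-8B few-shot, non-reasoning experiment
qwen3_8B_few_shot = {
    "model_id": "Qwen/Qwen3-8B",
    "prompt_file": "few_shot.json",
    "enable_thinking": False,  # "Reasoning" column
    "window": 200,             # context window around the diabetes mention
    "temperature": 0.7,
    "top_p": 0.8,
    "seed": 42,                # shared across all experiments
    "max_tokens": 2048,
}
```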
ContextualMatcher experiments (12):
| Context Type | word_min | word_max | sent_min | sent_max |
|---|---|---|---|---|
| Left | -5 | 0 | 0 | 0 |
| Left | -10 | 0 | 0 | 0 |
| Left | -15 | 0 | 0 | 0 |
| Left | -15 | 0 | -1 | 0 |
| Right | 0 | 5 | 0 | 0 |
| Right | 0 | 10 | 0 | 0 |
| Right | 0 | 15 | 0 | 0 |
| Right | 0 | 15 | 0 | 1 |
| Bidirectional | -5 | 5 | 0 | 0 |
| Bidirectional | -10 | 10 | 0 | 0 |
| Bidirectional | -15 | 15 | 0 | 0 |
| Bidirectional | -15 | 15 | -1 | 1 |
Negative values = left context (before entity), positive values = right context (after entity)
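The word-window semantics can be illustrated with a toy extractor (a simplified stand-in for EDS-NLP's context handling, not the actual ContextualMatcher API):

```python
def word_window(tokens: list[str], i: int, word_min: int, word_max: int) -> list[str]:
    """Return the tokens from offset word_min to word_max around token i.

    Negative offsets reach left of the entity, positive offsets reach right,
    mirroring the word_min/word_max columns above.
    """
    lo = max(0, i + word_min)
    hi = min(len(tokens), i + word_max + 1)
    return tokens[lo:hi]

tokens = "diabète de type 2 diagnostiqué en 2005".split()

# Entity "diabète" at index 0 with a Right window (word_min=0, word_max=5):
# the date token "2005" at offset 6 falls just outside this window.
right = word_window(tokens, 0, 0, 5)
```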
The codebase uses centralized constants defined in src/link_dates/constants.py:
- `UNKNOWN_YEAR = 9999`: Represents unknown/missing diagnosis dates in code
  - Used throughout the Python code for consistency
  - Note: Prompt files (`prompts/*.json`) use the literal value "9999" in their French instructions to the LLM (e.g., "retourne 9999", i.e., "return 9999"). This is intentional, as prompts are instructions to the model, not Python code.
- `DATASET_TYPES`: Valid dataset splits: `["train", "validation", "test_type_1", "test_type_2"]`
```
eds-diabetes-date/
├── pyproject.toml                        # uv project configuration
├── uv.lock                               # exact package versions
├── src/link_dates/
│   ├── inference.py                      # vLLM pipeline, chat templates, regex extraction
│   ├── evaluation.py                     # Metrics (F1, MAE, hallucination, no prediction), bootstrap CI
│   ├── visualization.py                  # Plots with bootstrap CI
│   ├── age_normalization.py              # French age expression detection & conversion
│   ├── contextual_matcher.py             # EDS-NLP pipeline creation
│   ├── converters.py                     # Doc → DataFrame conversion
│   ├── constants.py                      # Centralized constants (UNKNOWN_YEAR, DATASET_TYPES)
│   ├── hadoop_setup.py                   # HDFS classpath configuration
│   └── configs/
│       ├── experiments.py                # All experiment configs (LLM + ContextualMatchers)
│       ├── paths.py                      # Centralized paths management
│       └── plotting_config.py            # Visualization settings
├── scripts/
│   ├── create_datasets.py                # Dataset creation from HDFS
│   ├── run_llm_inference.py              # LLM inference script
│   ├── run_llm_inference_all.sh          # SLURM: run all LLM experiments
│   ├── run_llm_inference_single.sh       # SLURM: run single LLM experiment
│   ├── run_contextual_matcher.py         # ContextualMatcher inference script
│   ├── run_contextual_matcher_all.sh     # Bash: run all ContextualMatcher configs
│   ├── evaluate_model.py                 # Single model evaluation
│   ├── create_metrics_table.py           # Comprehensive metrics table for all models
│   └── create_plots.py                   # Visualization
├── prompts/
│   ├── zero_shot.json                    # Zero-shot prompt template
│   └── few_shot.json                     # Few-shot with examples
└── tests/                                # Unit tests (no GPU/HDFS required)
```
Due to patient privacy regulations, the clinical notes dataset cannot be publicly shared. Synthetic test data is provided for code validation. See DATA_AVAILABILITY.md for details.
- Environment: Python 3.12+, uv, dependencies in `pyproject.toml` (exact versions in `uv.lock`)
- Models: HuggingFace (Qwen/Qwen3-8B, meta-llama/Llama-3.1-8B-Instruct, mistralai/Ministral-3-8B-Reasoning-2512, mistralai/Ministral-3-14B-Reasoning-2512, google/medgemma-27b-text-it)
- Hardware: 2× NVIDIA A100 for paper results, 30GB+ RAM, 8+ cores
- Seeds: Model sampling=42, Bootstrap CI=12345
```bash
uv sync                           # Inside AP-HP (core dependencies)
uv sync --group gpu               # Inside AP-HP (with vLLM for GPU inference)
uv sync --no-sources              # Outside AP-HP (core dependencies)
uv sync --no-sources --group gpu  # Outside AP-HP (with vLLM for GPU inference)
export HF_TOKEN=<your_token>      # Or add your HF_TOKEN to a .env file at the root of the repository
uv run --frozen pytest tests/ -v  # Verify installation (works without GPUs)
```