Benchmarking LLMs and rule-based methods for diabetes diagnosis date extraction from French clinical notes.
- LLM inference: vLLM with continuous batching, tensor parallelism, thinking mode support
- French NLP: Age normalization ("diabétique depuis l'âge de 45 ans" → birth_year + 45)
- Evaluation: Bootstrap CIs, stratified metrics by date type, MAE, hallucination rate
- Rule-based baseline: EDS-NLP ContextualMatcher with diverse context window configurations
- Visualization: Bar plots with CIs, F1/time trade-off, Bland-Altman agreement
```bash
git clone https://github.com/aphp-datascience/eds-diabetes-date
cd eds-diabetes-date
```

This project uses uv for dependency management:

```bash
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
```

For AP-HP users (default):
The project uses pyspark-revival, a patched version of PySpark 2.4.3 that works with recent Python versions (3.12+). This is required on the AP-HP infrastructure.
On the AP-HP cluster, uv must be installed inside a conda environment:
```bash
# Create and activate conda environment
conda create -n link_dates python=3.12
conda activate link_dates

# Install uv inside the conda environment
pip install uv

# Install project dependencies
uv sync

# Install with GPU support (for LLM inference with vLLM)
uv sync --group gpu
```

For users outside AP-HP:
If you don't have access to the AP-HP GitLab repository, remove the uv.lock file and install without custom sources:
```bash
# Install core dependencies
uv sync --no-sources

# Install with GPU support (for LLM inference with vLLM)
uv sync --no-sources --group gpu
```

This will install a recent version of PySpark (3.x or 4.x) from PyPI instead of the AP-HP-specific version.
Note on GPU dependencies:
- vLLM is in the optional `gpu` dependency group since it requires CUDA and doesn't work on macOS
- Core functionality (evaluation, visualization, tests) works without vLLM
- LLM inference requires `--group gpu` installation on GPU-enabled systems
```bash
uv run scripts/create_datasets.py
```

Creates train/validation/test splits from raw HDFS notes. Notes are truncated to context windows around diabetes mentions. Raw notes are not shared (see DATA_AVAILABILITY.md).
Single experiment:
```bash
# Run locally
uv run scripts/run_llm_inference.py --experiment qwen3_8B_few_shot --dataset_type train

# List all experiments
uv run scripts/run_llm_inference.py --list_experiments

# Run on SLURM (single experiment)
sbatch scripts/run_llm_inference_single.sh --experiment qwen3_8B_few_shot --dataset_type train
```

All LLM experiments (batch):
```bash
# Run all LLM experiments in parallel on SLURM
sbatch scripts/run_llm_inference_all.sh train

# Experiments list can be found in src/link_dates/configs/experiments.py
```

Custom configuration:
```bash
uv run scripts/run_llm_inference.py \
    --model_id Qwen/Qwen2.5-8B-Instruct \
    --file_prompt custom_prompt.json \
    --window 300 \
    --batch_size 128 \
    --dataset_type validation
```

Single configuration:
```bash
# Run locally
uv run scripts/run_contextual_matcher.py \
    --experiment contextualmatcher_wmin-15_wmax15_smin-1_smax1 \
    --dataset_type train

# List all configurations
uv run scripts/run_contextual_matcher.py --list_experiments

# Manual parameters
uv run scripts/run_contextual_matcher.py \
    --word_min -30 --word_max 30 --sent_min -1 --sent_max 1 \
    --dataset_type train
```

All ContextualMatcher configurations (batch):
```bash
# Run all configurations
bash scripts/run_contextual_matcher_all.sh train

# Experiments list can be found in src/link_dates/configs/experiments.py
```

```bash
uv run scripts/evaluate_model.py \
    --eval_path hdfs://results/test_type_1/qwen3_few_shot_regex_agg.parquet \
    --dataset_type test_type_1 \
    --tol 1
```

Arguments:
- `--eval_path`: HDFS path to model predictions (Parquet format)
- `--dataset_type`: Dataset to evaluate (train/validation/test_type_1/test_type_2)
- `--tol`: Tolerance in years for matching (default: 1)
Metrics computed:
- Classification metrics: F1, precision, recall, specificity, balanced accuracy
- TP: Prediction matches annotation (within ±tolerance years)
- TN: Both prediction and annotation are unknown (UNKNOWN_YEAR = 9999)
- FP: Prediction made but annotation is unknown, or outside tolerance
- FN: Prediction missed an existing date
- Stratified by date type (absolute, relative, age, no_date)
- Mean Absolute Error (MAE): Average year difference (pairs with unknown dates [UNKNOWN_YEAR = 9999] excluded)
- Hallucination rate: Proportion of no-date notes where model predicted a date
- No prediction rate: Proportion of notes with real dates where model predicted unknown (UNKNOWN_YEAR = 9999)
- Bootstrap confidence intervals: 10,000 iterations for uncertainty estimates (95% CI, seed=12345)
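A minimal sketch of the matching logic behind these metrics (illustrative only; the function name `classify` is an assumption, not the project's actual `evaluation.py` API):

```python
UNKNOWN_YEAR = 9999  # sentinel for unknown/missing diagnosis dates

def classify(pred: int, gold: int, tol: int = 1) -> str:
    """Label one (prediction, annotation) pair as TP/TN/FP/FN."""
    if gold == UNKNOWN_YEAR:
        # No real date annotated: predicting unknown is correct
        return "TN" if pred == UNKNOWN_YEAR else "FP"
    if pred == UNKNOWN_YEAR:
        return "FN"  # a real date exists but the model predicted unknown
    # Real date on both sides: match within ±tol years
    return "TP" if abs(pred - gold) <= tol else "FP"

pairs = [(2005, 2005), (2004, 2005), (2010, 2005), (9999, 2005), (9999, 9999)]
labels = [classify(p, g) for p, g in pairs]
# -> TP, TP (within ±1 year), FP (outside tolerance), FN, TN

tp, fp, fn = labels.count("TP"), labels.count("FP"), labels.count("FN")
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

The hallucination and no-prediction rates follow from the same sentinel: a prediction other than 9999 on a no-date note counts toward the former, and a 9999 prediction on a real-date note counts toward the latter.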
Create comprehensive metrics tables comparing all models on a dataset:
```bash
uv run scripts/create_metrics_table.py \
    --dataset_type test_type_1 \
    --tol 1 \
    --output_format table  # or markdown, csv
```

Displays two tables:
- Overall metrics: precision, recall, F1, balanced accuracy, hallucination rate, no prediction rate, MAE, mean inference time
- Stratified metrics: precision, recall, F1 for each date type (absolute, relative, age)
```bash
uv run scripts/create_plots.py \
    --dataset_type test_type_1 \
    --bootstrap_iterations 10000 \
    --tolerance 1
```

Generates: bar plots (F1/precision/recall + CI), F1 vs inference time, Bland-Altman agreement.
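The percentile-bootstrap CIs behind these plots can be sketched as follows (a simplified stand-in using the mean of a score list; the project may instead bootstrap F1 over resampled notes):

```python
import random

def bootstrap_ci(values, n_iter=10_000, alpha=0.05, seed=12345):
    """Percentile bootstrap CI for the mean of `values` (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed for reproducibility, as in the paper
    n = len(values)
    # Resample with replacement n_iter times and record each resample's mean
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_iter)
    )
    lo = means[int(n_iter * alpha / 2)]
    hi = means[int(n_iter * (1 - alpha / 2)) - 1]
    return lo, hi

low, high = bootstrap_ci([0.0, 1.0, 1.0, 1.0], n_iter=1000)
```

With 10,000 iterations and alpha=0.05 this yields the 95% interval between the 2.5th and 97.5th percentiles of the resampled statistic.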
Defined in src/link_dates/configs/experiments.py:
LLM experiments (12):
| Model | Prompt | Reasoning | Window | Temperature | top_p |
|---|---|---|---|---|---|
| Qwen/Qwen3-8B | zero_shot | No | 200 | 0.7 | 0.8 |
| Qwen/Qwen3-8B | zero_shot | Yes | 200 | 0.6 | 0.95 |
| Qwen/Qwen3-8B | few_shot | No | 200 | 0.7 | 0.8 |
| Qwen/Qwen3-8B | few_shot | Yes | 200 | 0.6 | 0.95 |
| meta-llama/Llama-3.1-8B-Instruct | zero_shot | No | 200 | 0.6 | 0.9 |
| meta-llama/Llama-3.1-8B-Instruct | few_shot | No | 200 | 0.6 | 0.9 |
| mistralai/Ministral-3-8B-Reasoning-2512 | zero_shot | Yes | 200 | 0.7 | - |
| mistralai/Ministral-3-8B-Reasoning-2512 | few_shot | Yes | 200 | 0.7 | - |
| mistralai/Ministral-3-14B-Reasoning-2512 | zero_shot | Yes | 200 | 1 | - |
| mistralai/Ministral-3-14B-Reasoning-2512 | few_shot | Yes | 200 | 1 | - |
| google/medgemma-27b-text-it | zero_shot | No | 200 | 0 | - |
| google/medgemma-27b-text-it | few_shot | No | 200 | 0 | - |
All experiments: seed=42, max_tokens=2048
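For illustration, one row of this table could be represented as a plain config dict before being handed to the vLLM pipeline (a hypothetical shape; the actual structure in `experiments.py` may differ):

```python
# Hypothetical config for the Qwen3-8B few-shot, non-reasoning experiment
qwen3_8B_few_shot = {
    "model_id": "Qwen/Qwen3-8B",
    "prompt_file": "few_shot.json",
    "enable_thinking": False,  # "Reasoning" column
    "window": 200,             # context window around the diabetes mention
    "temperature": 0.7,
    "top_p": 0.8,
    "seed": 42,                # shared across all experiments
    "max_tokens": 2048,
}
```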
ContextualMatcher experiments (12):
| Context Type | word_min | word_max | sent_min | sent_max |
|---|---|---|---|---|
| Left | -5 | 0 | 0 | 0 |
| Left | -10 | 0 | 0 | 0 |
| Left | -15 | 0 | 0 | 0 |
| Left | -15 | 0 | -1 | 0 |
| Right | 0 | 5 | 0 | 0 |
| Right | 0 | 10 | 0 | 0 |
| Right | 0 | 15 | 0 | 0 |
| Right | 0 | 15 | 0 | 1 |
| Bidirectional | -5 | 5 | 0 | 0 |
| Bidirectional | -10 | 10 | 0 | 0 |
| Bidirectional | -15 | 15 | 0 | 0 |
| Bidirectional | -15 | 15 | -1 | 1 |
Negative values = left context (before entity), positive values = right context (after entity)
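The word-window semantics can be illustrated with a toy extractor (a simplified stand-in for EDS-NLP's context handling, not the actual ContextualMatcher API):

```python
def word_window(tokens: list[str], i: int, word_min: int, word_max: int) -> list[str]:
    """Return the tokens from offset word_min to word_max around token i.

    Negative offsets reach left of the entity, positive offsets reach right,
    mirroring the word_min/word_max columns above.
    """
    lo = max(0, i + word_min)
    hi = min(len(tokens), i + word_max + 1)
    return tokens[lo:hi]

tokens = "diabète de type 2 diagnostiqué en 2005".split()

# Entity "diabète" at index 0 with a Right window (word_min=0, word_max=5):
# the date token "2005" at offset 6 falls just outside this window.
right = word_window(tokens, 0, 0, 5)
```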
The codebase uses centralized constants defined in src/link_dates/constants.py:
- `UNKNOWN_YEAR = 9999`: Represents unknown/missing diagnosis dates in code
  - Used throughout the Python code for consistency
  - Note: Prompt files (`prompts/*.json`) use the literal value "9999" in their French instructions to the LLM (e.g., "retourne 9999", i.e., "return 9999"). This is intentional, as prompts are instructions to the model, not Python code.
- `DATASET_TYPES`: Valid dataset splits: `["train", "validation", "test_type_1", "test_type_2"]`
```
eds-diabetes-date/
├── pyproject.toml                        # uv project configuration
├── uv.lock                               # exact package versions
├── src/link_dates/
│   ├── inference.py                      # vLLM pipeline, chat templates, regex extraction
│   ├── evaluation.py                     # Metrics (F1, MAE, hallucination, no prediction), bootstrap CI
│   ├── visualization.py                  # Plots with bootstrap CI
│   ├── age_normalization.py              # French age expression detection & conversion
│   ├── contextual_matcher.py             # EDS-NLP pipeline creation
│   ├── converters.py                     # Doc → DataFrame conversion
│   ├── constants.py                      # Centralized constants (UNKNOWN_YEAR, DATASET_TYPES)
│   ├── hadoop_setup.py                   # HDFS classpath configuration
│   └── configs/
│       ├── experiments.py                # All experiment configs (LLM + ContextualMatchers)
│       ├── paths.py                      # Centralized paths management
│       └── plotting_config.py            # Visualization settings
├── scripts/
│   ├── create_datasets.py                # Dataset creation from HDFS
│   ├── run_llm_inference.py              # LLM inference script
│   ├── run_llm_inference_all.sh          # SLURM: run all LLM experiments
│   ├── run_llm_inference_single.sh       # SLURM: run single LLM experiment
│   ├── run_contextual_matcher.py         # ContextualMatcher inference script
│   ├── run_contextual_matcher_all.sh     # Bash: run all ContextualMatcher configs
│   ├── evaluate_model.py                 # Single model evaluation
│   ├── create_metrics_table.py           # Comprehensive metrics table for all models
│   └── create_plots.py                   # Visualization
├── prompts/
│   ├── zero_shot.json                    # Zero-shot prompt template
│   └── few_shot.json                     # Few-shot with examples
└── tests/                                # Unit tests (no GPU/HDFS required)
```
Due to patient privacy regulations, the clinical notes dataset cannot be publicly shared. Synthetic test data is provided for code validation. See DATA_AVAILABILITY.md for details.
- Environment: Python 3.12+, uv, dependencies in `pyproject.toml` (exact versions in `uv.lock`)
- Models: HuggingFace (Qwen/Qwen3-8B, meta-llama/Llama-3.1-8B-Instruct, mistralai/Ministral-3-8B-Reasoning-2512, mistralai/Ministral-3-14B-Reasoning-2512, google/medgemma-27b-text-it)
- Hardware: 2× NVIDIA A100 for paper results, 30GB+ RAM, 8+ cores
- Seeds: Model sampling=42, Bootstrap CI=12345
```bash
uv sync                           # Inside AP-HP (core dependencies)
uv sync --group gpu               # Inside AP-HP (with vLLM for GPU inference)
uv sync --no-sources              # Outside AP-HP (core dependencies)
uv sync --no-sources --group gpu  # Outside AP-HP (with vLLM for GPU inference)
export HF_TOKEN=<your_token>      # Or add your HF_TOKEN to a .env file at the root of the repository
uv run --frozen pytest tests/ -v  # Verify installation (works without GPUs)
```