HLIP

Official PyTorch implementation of the following paper:
Towards Scalable Language-Image Pre-training for 3D Medical Imaging
University of Michigan
TMLR · arXiv · huggingface weights

Overview

HLIP overview

Directly leveraging uncurated clinical studies enables scalable language-image pre-training in 3D medical imaging, as the scale is no longer constrained by the manual effort required from clinicians to select a single representative scan or slice from each study. This paradigm could be more effective when equipped with a hierarchical attention mechanism inspired by the natural structure of the data: slice, scan, and study. We name this framework Hierarchical attention for Language-Image Pre-training (HLIP). For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
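The hierarchy described above can be pictured as nested attention groups: patch tokens within a slice, slices within a scan, and scans within a study. A toy sketch of that grouping (the sizes are made-up illustrative numbers, and this is not the repository's implementation, which operates on batched tensors):

```python
# Toy sketch of HLIP's hierarchy: tokens -> slices -> scans -> study.
# Attention is applied within progressively larger groups.
def group(seq, size):
    """Split a flat sequence into consecutive groups of `size`."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

tokens_per_slice = 4   # illustrative, not a real config value
slices_per_scan = 3
scans_per_study = 2

flat = list(range(tokens_per_slice * slices_per_scan * scans_per_study))
slices = group(flat, tokens_per_slice)   # slice-level attention groups
scans = group(slices, slices_per_scan)   # scan-level attention groups
study = scans                            # study-level: all scans attend jointly
print(len(slices), len(scans))  # 6 2
```

At each level, attention is restricted to one group, so the cost grows with group size rather than with the full study length.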

Updates

  • (2026-03) Check out our new paper, accepted at CVPR 2026, which introduces a new strategy, beyond the dual-loss approach presented in the HLIP blog, for handling itemized text supervision in language-image pre-training. The code and model weights are available here.
  • (2026-02) The assets from 2025-11 (now deprecated) have been finalized and updated. We apologize for any inconvenience to researchers actively using this repository; this should be our last incremental update to HLIP. We have released four HLIP variants in the Hugging Face collection: huggingface weights. The model released in 2025-11 is also included in this collection, listed as hlip-2025_10_08. Technical details are provided in this blog, and the implementation is based on this code branch.
  • (2026-02) HLIP is accepted by TMLR!
  • (2025-11) We released our updated model weights (huggingface weights), along with a new code branch focused on uncurated 3D medical datasets. The technical details are described in this blog.
  • (2025-06) Completed the initial setup of the HLIP repository.
  • (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.

Getting Started

Install

open-clip

python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training

Models

Data             Objective  Patch Size  Attention     Model
CT-RATE (20K)    CLIP       8, 24, 24   slice + scan  ViT Base
BrainMRI (220K)  CLIP       8, 16, 16   scan + study  ViT Base
HeadCT (240K)    CLIP       8, 16, 16   scan + study  ViT Base

Future models will be released on HuggingFace and announced in the Updates section.
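The token counts in the model names (token2744, token1176) follow from the patch sizes above. A quick sanity check; the input shapes used here (112×336×336 for chest CT, 48×224×224 for brain MRI) are assumptions inferred to match the released names, not values confirmed by the repository:

```python
# Tokens per scan for non-overlapping 3D patchification.
# Input shapes below are assumptions chosen to match the model names.
def num_tokens(input_shape, patch_size):
    d, h, w = input_shape
    pd, ph, pw = patch_size
    assert d % pd == 0 and h % ph == 0 and w % pw == 0
    return (d // pd) * (h // ph) * (w // pw)

# chest CT, patch 8x24x24 -> clip_vit_base_slice_scan_token2744
print(num_tokens((112, 336, 336), (8, 24, 24)))  # 2744
# brain MRI, patch 8x16x16, 48 slices -> clip_vit_base_scan_study_token1176
print(num_tokens((48, 224, 224), (8, 16, 16)))   # 1176
```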

Demo

Chest CT: an example from the external Rad-ChestCT dataset.

python inference_radchestct.py \
  --model clip_vit_base_slice_scan_token2744 \
  --use-cxr-bert \
  --resume /path/to/clip_vit_base_slice_scan_token2744.pt \
  --data ../../docs/tst32751/tst32751.pt

Brain MRI: an example from the external BraTS23 dataset.

python inference_pubbrain5.py \
  --model clip_vit_base_scan_study_token1176 \
  --resume /path/to/clip_vit_base_scan_study_token1176.pt \
  --patch-size 8 16 16 \
  --num-slices 48 \
  --data ../../docs/BraTS-GLI-00459-000

Visualize the activations with --interpret. Increase --num-slices for better visualization quality.
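Under the hood, these demos score a study against a set of candidate text prompts by cosine similarity between embeddings. A minimal sketch of that zero-shot scoring with toy stand-in vectors (the real scripts embed the volume with the HLIP image tower and the prompts with the text encoder):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities (CLIP-style)."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings, not real model outputs.
image = [0.2, 0.9, 0.1]
prompts = [[0.1, 1.0, 0.0], [1.0, 0.0, 0.3]]
probs = zero_shot_probs(image, prompts)
print(probs)
```

The prompt with the highest probability is the predicted finding.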

Evaluation

CT-RATE

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 zeroshot_ctrate.py \
  --model clip_vit_base_slice_scan_token2744 \
  --resume /path/to/clip_vit_base_slice_scan_token2744.pt \
  --data-root /data/ct_rate/valid/ \
  --input-file ../../data/ct_rate/metafiles/valid_labels.csv \
  --zeroshot-template volume

Rad-ChestCT

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 zeroshot_radchestct.py \
  --model clip_vit_base_slice_scan_token2744 \
  --resume /path/to/clip_vit_base_slice_scan_token2744.pt \
  --data-root /data/rad_chestct/ \
  --input-file ../../data/rad_chestct/files/rad_chestct_labels.csv \
  --zeroshot-template volume

Pub-Brain-5

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 8 zeroshot_pubbrain5.py \
  --model clip_vit_base_scan_study_token1176 \
  --resume /path/to/vit_base_scan_study_token1176.pt \
  --data-root /data/pub_brain_5/ \
  --input-file ../../data/pub_brain_5/pub_brain_5.csv

RSNA

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 8 zeroshot_rsna.py \
  --model clip_vit_base_scan_study_token1176 \
  --resume /path/to/vit_base_scan_study_token1176.pt \
  --data-root /data/rsna/ \
  --input-file ../../data/rsna/rsna.csv

Training

Below, we provide a training script to reproduce our results on CT-RATE using the original reports as supervision. Training for 20 epochs takes approximately 6 hours on a single node with 4 A40 GPUs.

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
  --benchmark-type ct-rate \
  --logs-dir /path/to/logs/ \
  --zeroshot-frequency 1 \
  --zeroshot-template volume \
  --save-frequency 1 \
  --train-data /path/to/ct_rate/train/ \
  --train-file ../../data/ct_rate/files/raw_annotation.json \
  --image-process-cfg -1150 350 crop \
  --text-process-cfg report \
  --ct-rate data_root='"/path/to/ct_rate/valid/"' input_file='"../../data/ct_rate/metafiles/valid_labels.csv"' \
  --rad-chestct data_root='"/path/to/rad_chestct/"' input_file='"../../data/rad_chestct/files/rad_chestct_labels.csv"' \
  --report-to wandb \
  --wandb-project-name hlip \
  --warmup 47 \
  --batch-size 32 \
  --accum-batch 4 \
  --lr=8e-5 \
  --wd=0.2 \
  --force-patch-dropout 0.0 \
  --epochs=20 \
  --precision amp \
  --workers 4 \
  --local-loss \
  --gather-with-grad \
  --grad-checkpointing \
  --model clip_vit_base_slice_scan_token2744 \
  --use-cxr-bert \
  --lock-text \
  --dist-url "env://localhost:29500"
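As a sanity check on the flags above, the effective global batch size combines the per-GPU batch, the GPU count, and gradient accumulation:

```python
# Effective global batch size implied by the training flags above
# (a sanity check, not part of the repository code).
per_gpu_batch = 32   # --batch-size
num_gpus = 4         # --nproc_per_node
accum_steps = 4      # --accum-batch
effective_batch = per_gpu_batch * num_gpus * accum_steps
print(effective_batch)  # 512
```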

Our training implementation is closely aligned with open-clip, allowing us to leverage features such as patch dropout and SigLIP.

For patch dropout, add the following flags:

  --force-patch-dropout 0.5 \
  --beta2 0.95
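Conceptually, patch dropout keeps only a random subset of the patch tokens during training, cutting attention cost roughly in proportion to the drop rate. A minimal stand-alone sketch (open-clip's implementation operates on batched tensors and always keeps the class token):

```python
import random

def patch_dropout(tokens, drop_rate, rng=random.Random(0)):
    """Keep a random subset of patch tokens.

    A sketch of the idea behind --force-patch-dropout, not the
    open-clip implementation.
    """
    keep = max(1, int(round(len(tokens) * (1.0 - drop_rate))))
    return rng.sample(tokens, keep)

# With 2744 tokens per scan and a 0.5 drop rate, 1372 tokens survive.
kept = patch_dropout(list(range(2744)), drop_rate=0.5)
print(len(kept))  # 1372
```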

For SigLIP, add the following flags, but make sure to modify the model configuration beforehand:

  --beta2 0.95 \
  --siglip
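SigLIP replaces the softmax contrastive loss with a pairwise sigmoid loss, so each image-text pair becomes an independent binary classification and no global softmax over the batch is needed. A toy sketch of that loss (not the open-clip implementation, which also learns the scale and bias terms):

```python
import math

def siglip_loss(logits, labels):
    """Pairwise sigmoid loss over an image-text similarity matrix.

    `labels` is +1 for matched (diagonal) pairs and -1 otherwise.
    Uses log(1 + exp(-y * z)), the binary log-loss on signed labels.
    """
    total, n = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        for z, y in zip(row_logits, row_labels):
            total += math.log1p(math.exp(-y * z))
            n += 1
    return total / n

# Confident, correct predictions give a near-zero loss.
logits = [[10.0, -10.0], [-10.0, 10.0]]
labels = [[1, -1], [-1, 1]]
print(siglip_loss(logits, labels))
```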

Citation

If you find this repository helpful, please consider citing:

@article{zhao2026towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Chenhui Zhao and Yiwei Lyu and Asadur Zaman Chowdury and Edward S Harake and Akhil Kondepudi and Akshay T Rao and Xinhai Hou and Honglak Lee and Todd C Hollon},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=WxHf4EcBWA}
}
