Official PyTorch implementation of the following paper:
Towards Scalable Language-Image Pre-training for 3D Medical Imaging
University of Michigan
Directly leveraging uncurated clinical studies enables scalable language-image pre-training in 3D medical imaging, as the scale is no longer constrained by the manual effort required from clinicians to select a single representative scan or slice from each study. This paradigm could be more effective when equipped with a hierarchical attention mechanism inspired by the natural structure of the data: slice, scan, and study. We name this framework Hierarchical attention for Language-Image Pre-training (HLIP). For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
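The hierarchy can be sketched as two rounds of attention pooling: slice-level embeddings are pooled into a scan embedding, and scan embeddings into a study embedding. The pooling query and toy values below are hypothetical stand-ins; this is a minimal stdlib-only illustration of the idea, not the HLIP implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(query, vectors):
    # Single-query scaled dot-product attention over a list of vectors.
    d = len(query)
    scores = [sum(q * v for q, v in zip(query, vec)) / math.sqrt(d) for vec in vectors]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(d)]

# Toy study: 2 scans, each with 3 slice embeddings of dimension 4 (made-up values).
study = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
]
query = [0.5, 0.5, 0.5, 0.5]  # stand-in for a learned pooling query

scan_embeddings = [attention_pool(query, slices) for slices in study]  # slice -> scan
study_embedding = attention_pool(query, scan_embeddings)               # scan -> study
```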
- (2026-03) Check out our new paper, accepted at CVPR 2026, which introduces a new strategy, beyond the dual-loss approach presented in the HLIP blog, for handling itemized text supervision in language-image pre-training. The code and model weights are available here.
- (2026-02) Assets from the 2025-11 release (now deprecated) have been finalized and updated. We apologize for any inconvenience to researchers actively using this repository. This should be our last incremental update to HLIP. We have released four HLIP variants in the Hugging Face collection. The model released in 2025-11 is also included in this collection, listed as hlip-2025_10_08. Technical details are provided in this blog, and the implementation is based on this code branch.
- (2026-02) HLIP is accepted at TMLR!
- (2025-11) We released our updated model, along with a new code branch focused on uncurated 3D medical datasets. The technical details are described in this blog.
- (2025-06) Completed the initial setup of the HLIP repository.
- (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.
```shell
python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

```shell
git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training
```

| Data | Objective | Patch Size | Attention | Model |
|---|---|---|---|---|
| CT-RATE (20K) | CLIP | 8, 24, 24 | slice + scan | ViT Base |
| BrainMRI (220K) | CLIP | 8, 16, 16 | scan + study | ViT Base |
| HeadCT (240K) | CLIP | 8, 16, 16 | scan + study | ViT Base |
Future models will be released on HuggingFace and announced in the Updates section.
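The token counts in the model names follow from the patch grid. The input shapes below are assumptions, inferred from the patch sizes in the table and the `--num-slices 48` flag used in the brain MRI example; they are chosen only because they are consistent with the released names:

```python
def num_tokens(input_shape, patch_size):
    # Patch tokens per volume: product of per-axis grid sizes.
    n = 1
    for dim, patch in zip(input_shape, patch_size):
        assert dim % patch == 0, "input must be divisible by patch size"
        n *= dim // patch
    return n

# Hypothetical input shapes, consistent with the model names:
num_tokens((112, 336, 336), (8, 24, 24))  # 14*14*14 = 2744 -> token2744
num_tokens((48, 224, 224), (8, 16, 16))   # 6*14*14  = 1176 -> token1176
```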
Chest CT: an example from the external Rad-ChestCT dataset.
```shell
python inference_radchestct.py \
    --model clip_vit_base_slice_scan_token2744 \
    --use-cxr-bert \
    --resume /path/to/clip_vit_base_slice_scan_token2744.pt \
    --data ../../docs/tst32751/tst32751.pt
```

Brain MRI: an example from the external BraTS23 dataset.
```shell
python inference_pubbrain5.py \
    --model clip_vit_base_scan_study_token1176 \
    --resume /path/to/clip_vit_base_scan_study_token1176.pt \
    --patch-size 8 16 16 \
    --num-slices 48 \
    --data ../../docs/BraTS-GLI-00459-000
```

Visualize the activation with `--interpret`. Increase `--num-slices` for better visualization quality.
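Zero-shot evaluation reduces to comparing the study embedding against encoded label prompts by cosine similarity. The embeddings below are made-up placeholders; this sketches the mechanism, not the repository's pipeline:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings: one study vs. two label prompts.
image_emb = [0.9, 0.1, 0.0]
text_embs = {"hemorrhage": [1.0, 0.0, 0.0], "no hemorrhage": [0.0, 1.0, 0.0]}
scores = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}
pred = max(scores, key=scores.get)  # -> "hemorrhage"
```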
CT-RATE
```shell
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 zeroshot_ctrate.py \
    --model clip_vit_base_slice_scan_token2744 \
    --resume /path/to/clip_vit_base_slice_scan_token2744.pt \
    --data-root /data/ct_rate/valid/ \
    --input-file ../../data/ct_rate/metafiles/valid_labels.csv \
    --zeroshot-template volume
```

Rad-ChestCT
```shell
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 zeroshot_radchestct.py \
    --model clip_vit_base_slice_scan_token2744 \
    --resume /path/to/clip_vit_base_slice_scan_token2744.pt \
    --data-root /data/rad_chestct/ \
    --input-file ../../data/rad_chestct/files/rad_chestct_labels.csv \
    --zeroshot-template volume
```

Pub-Brain-5
```shell
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 8 zeroshot_pubbrain5.py \
    --model clip_vit_base_scan_study_token1176 \
    --resume /path/to/vit_base_scan_study_token1176.pt \
    --data-root /data/pub_brain_5/ \
    --input-file ../../data/pub_brain_5/pub_brain_5.csv
```

RSNA
```shell
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 8 zeroshot_rsna.py \
    --model clip_vit_base_scan_study_token1176 \
    --resume /path/to/vit_base_scan_study_token1176.pt \
    --data-root /data/rsna/ \
    --input-file ../../data/rsna/rsna.csv
```

Below, we provide a training script to reproduce our results on CT-RATE using the original reports as supervision. Training for 20 epochs takes approximately 6 hours on a single node with 4 A40 GPUs.
```shell
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
    --benchmark-type ct-rate \
    --logs-dir /path/to/logs/ \
    --zeroshot-frequency 1 \
    --zeroshot-template volume \
    --save-frequency 1 \
    --train-data /path/to/ct_rate/train/ \
    --train-file ../../data/ct_rate/files/raw_annotation.json \
    --image-process-cfg -1150 350 crop \
    --text-process-cfg report \
    --ct-rate data_root='"/path/to/ct_rate/valid/"' input_file='"../../data/ct_rate/metafiles/valid_labels.csv"' \
    --rad-chestct data_root='"/path/to/rad_chestct/"' input_file='"../../data/rad_chestct/files/rad_chestct_labels.csv"' \
    --report-to wandb \
    --wandb-project-name hlip \
    --warmup 47 \
    --batch-size 32 \
    --accum-batch 4 \
    --lr=8e-5 \
    --wd=0.2 \
    --force-patch-dropout 0.0 \
    --epochs=20 \
    --precision amp \
    --workers 4 \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --model clip_vit_base_slice_scan_token2744 \
    --use-cxr-bert \
    --lock-text \
    --dist-url "env://localhost:29500"
```

Our training implementation is closely aligned with open-clip, allowing us to leverage features such as patch dropout and SigLIP.
For patch dropout, try the following flags:

```shell
--force-patch-dropout 0.5 \
--beta2 0.95
```

For SigLIP, try the following flags, but make sure to modify the model configuration beforehand:

```shell
--beta2 0.95 \
--siglip
```

If you find this repository helpful, please consider citing:
```bibtex
@article{zhao2026towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Chenhui Zhao and Yiwei Lyu and Asadur Zaman Chowdury and Edward S Harake and Akhil Kondepudi and Akshay T Rao and Xinhai Hou and Honglak Lee and Todd C Hollon},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=WxHf4EcBWA}
}
```