Psypeal/hlip

 
 


HLIP

Official PyTorch implementation of the following paper:
Towards Scalable Language-Image Pre-training for 3D Medical Imaging
University of Michigan
arXiv: 2505.21862

Overview

We propose Hierarchical attention for Language-Image Pre-training (HLIP), inspired by the natural hierarchy of radiology data: slice, scan, and study. With this lightweight attention mechanism, HLIP can be trained directly on uncurated clinical datasets, enabling scalable language-image pre-training in 3D medical imaging. For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
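
To make the idea of hierarchical attention concrete, the sketch below counts how many query-key pairs self-attention would score if tokens only attended within their own slice, scan, or study. It is an illustration under stated assumptions, not the HLIP implementation: the function name `attention_pairs`, the grouping rule, and the example study shape (4 scans, 6 slice groups per scan, 196 tokens per slice group) are all hypothetical.

```python
def attention_pairs(num_scans: int, slices_per_scan: int,
                    tokens_per_slice: int, level: str) -> int:
    """Count query-key pairs when attention is restricted to one level.

    Assumption (not taken from the paper): tokens attend only to tokens
    in the same slice, scan, or study, depending on `level`.
    """
    if level == "slice":
        groups, group_len = num_scans * slices_per_scan, tokens_per_slice
    elif level == "scan":
        groups, group_len = num_scans, slices_per_scan * tokens_per_slice
    elif level == "study":
        groups, group_len = 1, num_scans * slices_per_scan * tokens_per_slice
    else:
        raise ValueError(f"unknown level: {level!r}")
    # Full self-attention inside a group scores group_len ** 2 pairs.
    return groups * group_len ** 2


# Hypothetical study: 4 scans, 6 slice groups per scan, 196 tokens each.
for level in ("slice", "scan", "study"):
    print(level, attention_pairs(4, 6, 196, level))
# slice 921984
# scan 5531904
# study 22127616
```

Under these assumptions, study-level attention scores 4x as many pairs as scan-level attention and 24x as many as slice-level attention, which illustrates how restricting some layers to the lower levels can keep the cost down.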

Updates

  • (2025-06) Completed the initial setup of the HLIP repository.
  • (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.

Getting Started

Install

open-clip

python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training

Model Card

Data          | Attention            | Patch Size | Model
CT-RATE-20K   | slice + scan         | 8, 24, 24  | ViT-Base
BrainMRI220K  | scan + study         | 16, 16, 16 | ViT-Base
BrainMRI220K  | scan + study         | 8, 16, 16  | ViT-Base
BrainMRI220K  | slice + scan + study | 8, 16, 16  | ViT-Base
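
The patch sizes above determine how many tokens each scan contributes. The sketch below works through that arithmetic, assuming non-overlapping patches and the 48, 224, 224 input size mentioned in the Evaluation section; the helper name `tokens_per_scan` is hypothetical. The result for a patch size of 8, 16, 16 is 1176, which matches the `token1176` suffix in the model names used below, although this README does not state that this is what the suffix means.

```python
def tokens_per_scan(input_size, patch_size):
    """Number of non-overlapping patches in a (depth, height, width) volume."""
    total = 1
    for dim, patch in zip(input_size, patch_size):
        if dim % patch != 0:
            raise ValueError(f"{dim} is not divisible by patch size {patch}")
        total *= dim // patch
    return total


print(tokens_per_scan((48, 224, 224), (8, 16, 16)))    # 6 * 14 * 14 = 1176
print(tokens_per_scan((48, 224, 224), (16, 16, 16)))   # 3 * 14 * 14 = 588
print(tokens_per_scan((144, 224, 224), (8, 16, 16)))   # 18 * 14 * 14 = 3528
```

The last line corresponds to evaluating with --num-slices 144, which triples the token count per scan under these assumptions.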

Demo

Chest CT: an example from the external Rad-ChestCT dataset.

python inference_rad_chestct.py \
  --model vit_base_singlescan_h2_token1176 \
  --resume /path/to/vit_base_chestct_h2_token1176.pt \
  --data /docs/tst32751/tst32751.pt

Brain MRI: an example from the external BraTS23 dataset.

python inference_pub_brain_5.py \
  --model vit_base_multiscan_h2_token1176 \
  --resume /path/to/vit_base_brainmri_h2_token1176.pt \
  --patch-size 8 16 16 \
  --num-slices 72 \
  --data /docs/BraTS-GLI-00459-000/

Add --interpret to visualize the activations.

Evaluation

CT-RATE

python zeroshot_ct_rate.py \
  --model vit_base_singlescan_h2_token2744 \
  --resume /path/to/vit_base_chestct_h2_token2744.pt \
  --ct-rate-root /data/ct_rate/valid/ \
  --zeroshot-template volume

Rad-ChestCT

python zeroshot_rad_chestct.py \
  --model vit_base_singlescan_h2_token2744 \
  --resume /path/to/vit_base_chestct_h2_token2744.pt \
  --rad-chestct-root /data/rad_chestct/ \
  --zeroshot-template volume

Brain MRI

# Step 1: extract an embedding for each study.
python pub_brain_5_embed.py \
  --model vit_base_multiscan_h2_token1176 \
  --resume /path/to/vit_base_brainmri_h2_token1176.pt \
  --num-slices 144

# Step 2: run zero-shot classification.
python zeroshot_pub_brain_5.py \
  --model vit_base_multiscan_h2_token1176 \
  --resume /path/to/vit_base_brainmri_h2_token1176.pt \
  --num-slices 144 \
  --zeroshot_prompt prompt \
  --zeroshot_template template

The Pub-Brain-5 dataset contains ~18K studies, so evaluation may take ~30 minutes. We first extract an embedding for each study and then run zero-shot classification on those embeddings, which makes it easier to evaluate different prompts. Although we use a fixed input size of 48, 224, 224 during training, --num-slices is set to 144 during evaluation, as we found that HLIP transfers directly to, and benefits from, higher-resolution inputs at test time.
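
The two-step procedure can be sketched as follows: study embeddings are computed once and cached, and each new set of prompts only requires re-embedding the text and recomputing similarities. This is a minimal pure-Python sketch, not the repository's code; the function names, the toy 2-D embeddings, and the class names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zeroshot_classify(study_embeddings, prompt_embeddings):
    """Assign each study to the class whose prompt embedding is most similar.

    study_embeddings:  list of vectors, computed once and cached.
    prompt_embeddings: dict mapping class name -> text embedding.
    """
    prompts = {name: normalize(vec) for name, vec in prompt_embeddings.items()}
    predictions = []
    for emb in study_embeddings:
        emb = normalize(emb)
        # Cosine similarity reduces to a dot product after normalization.
        scores = {name: sum(a * b for a, b in zip(emb, vec))
                  for name, vec in prompts.items()}
        predictions.append(max(scores, key=scores.get))
    return predictions


# Toy 2-D embeddings standing in for cached study features (hypothetical).
cached = [[0.9, 0.1], [0.2, 0.8]]
prompts_v1 = {"tumor": [1.0, 0.0], "healthy": [0.0, 1.0]}
print(zeroshot_classify(cached, prompts_v1))  # ['tumor', 'healthy']
```

Trying a different prompt set only means calling `zeroshot_classify` again with new prompt embeddings; the cached study embeddings are reused.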

Training

Our training implementation is closely aligned with open-clip, allowing us to leverage features such as patch dropout and SigLIP. Below is an example training command for chest CT. Training on CT-RATE for 20 epochs takes ~6 hours on a node with 4 A40 GPUs.

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
  --json-root ../../data/ct_rate/files/ --data-root /path/to/data/ct_rate/ \
  --train-data raw_annotation --input-info -1150 350 crop \
  --zeroshot-ct-rate ../../data/ct_rate/metafiles/valid_labels.csv --zeroshot-template volume \
  --zeroshot-frequency 1 \
  --save-frequency 1 \
  --report-to wandb \
  --wandb-project-name chest_ct \
  --warmup 377 \
  --batch-size 16 \
  --accum-batch 1 \
  --lr=1e-5 \
  --wd=0.2 \
  --epochs=20 \
  --precision amp \
  --workers 4 \
  --grad-checkpointing \
  --model vit_base_singlescan_h2_token2744 \
  --use-cxr-bert \
  --lock-text
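
For reference, the batch-related flags above combine as sketched below. This assumes `--batch-size` is per GPU, as in open-clip, and that `--accum-batch` is a gradient-accumulation factor; neither is stated in this README, the helper names are hypothetical, and the dataset size in the example is a placeholder.

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int, accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * accum_steps


def steps_per_epoch(num_samples: int, batch: int) -> int:
    """Optimizer steps per epoch, assuming the incomplete final batch is dropped."""
    return num_samples // batch


# Values from the command above: --batch-size 16, 4 GPUs, --accum-batch 1.
batch = global_batch_size(per_gpu_batch=16, num_gpus=4, accum_steps=1)
print(batch)  # 64

# Placeholder dataset size; substitute the number of training samples you have.
print(steps_per_epoch(10_000, batch))  # 156
```

Comparing `steps_per_epoch` for your dataset against `--warmup 377` shows how many epochs the warmup spans.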

To enable patch dropout, add the following flags to the training command:

  --force-patch-dropout 0.5 \
  --beta2 0.95
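
As a rough picture of what patch dropout does, the sketch below keeps a random subset of patch tokens so that later layers process a shorter sequence. It is a simplified pure-Python illustration, not the open-clip implementation; the function name `patch_dropout` and the token list are hypothetical.

```python
import random


def patch_dropout(tokens, drop_prob, rng=None):
    """Return a random subset of `tokens`, dropping roughly `drop_prob` of them.

    At least one token is always kept, and the original order is preserved.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    num_keep = max(1, int(len(tokens) * (1.0 - drop_prob)))
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx]


tokens = list(range(1176))  # stand-in for 1176 patch tokens
kept = patch_dropout(tokens, drop_prob=0.5, rng=random.Random(0))
print(len(kept))  # 588
```

In this sketch, a rate of 0.5 halves the number of patch tokens seen in each training step.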

To train with the SigLIP loss, add the following flag:

  --siglip
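
For readers unfamiliar with it, SigLIP replaces the softmax-based contrastive loss with a pairwise sigmoid loss: every image-text pair in the batch is treated as an independent binary classification, positive on the diagonal and negative elsewhere. The sketch below is a minimal pure-Python version for illustration, not the code used in this repository; the function names are hypothetical and the temperature and bias values in the example are placeholders.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(similarity, temperature, bias):
    """Pairwise sigmoid loss over a square image-text similarity matrix.

    similarity[i][j] is the cosine similarity of image i and text j;
    matching pairs sit on the diagonal.
    """
    n = len(similarity)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * similarity[i][j] + bias
            total -= log_sigmoid(label * logit)
    # Sum over all pairs, averaged over the batch size.
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]  # matching pairs are the most similar
swapped = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs are the most similar
print(siglip_loss(aligned, temperature=10.0, bias=-5.0)
      < siglip_loss(swapped, temperature=10.0, bias=-5.0))  # True
```

Because each pair is scored independently, the loss does not need a softmax normalization across the whole batch.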

Citation

If you find this repository helpful, please consider citing:

@article{zhao2025towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Zhao, Chenhui and Lyu, Yiwei and Chowdury, Asadur and Harake, Edward and Kondepudi, Akhil and Rao, Akshay and Hou, Xinhai and Lee, Honglak and Hollon, Todd},
  journal={arXiv preprint arXiv:2505.21862},
  year={2025}
}
