A Dynamic Visual Token Pruning Framework for Large Vision-Language Models
GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.
GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). Through fast training on a small amount of data (e.g., less than 1 hour on 20K GQA data), GlimpsePrune enables Qwen2.5-VL-7B to prune an average of 92.6% of visual tokens before generating a response, while maintaining performance comparable to the original model.
For more technical details, please refer to our paper.
If you find our work inspiring or helpful, please give us a star ⭐. Thank you for your attention and support.
- ✨ Key Features
- 🚀 News
- 🖼️ Framework Overview
- 📊 Performance Results
- ✅ Roadmap
- 🛠️ Installation
- 📦 Models and Data
- ▶️ How to Use
- 🙏 Acknowledgements
- 🖊️ Citation
- 📧 Contact Us
- High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
- Robust Performance: Stable performance when processing high-resolution images and handling complex free-form VQA tasks.
- Lightweight Training: Only a few extra parameters (the glimpse token and VIP) need to be trained, which can be completed in less than 1 hour on a single A100 GPU.
- Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.
The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual token Importance Predictor (VIP) that quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant tokens (a schematic sketch is given after the file list below).
The core code implementation is located in:
- Qwen2.5-VL: `transformers_gp/models/qwen2_5_vl/model_gp.py`
- LLaVA-1.5: `llava_gp/model/language_model/llava_llama.py`
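To make the selection step concrete, here is a minimal, hypothetical PyTorch sketch of the idea, not the actual VIP in `model_gp.py` (which is a learned predictor rather than a raw dot product): a glimpse vector scores every visual token, and only the highest-scoring fraction is kept.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        glimpse: torch.Tensor,
                        keep_ratio: float = 0.1) -> torch.Tensor:
    """Toy selection: keep the visual tokens most similar to the glimpse vector.
    visual_tokens: (N, D); glimpse: (D,). Returns the kept tokens in original order."""
    scores = visual_tokens @ glimpse                      # one relevance score per token
    k = max(1, int(round(keep_ratio * visual_tokens.size(0))))
    keep_idx = scores.topk(k).indices.sort().values       # preserve spatial order
    return visual_tokens[keep_idx]

# Example: 1024 visual tokens of width 64, keep ~7.4% (i.e. ~92.6% pruned)
tokens = torch.randn(1024, 64)
glimpse = torch.randn(64)   # stand-in for the learned glimpse token's hidden state
print(prune_visual_tokens(tokens, glimpse, keep_ratio=0.074).shape)  # torch.Size([76, 64])
```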
We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.
- Support for Qwen2.5-VL
- Support for single-image input
- Support for multi-image input
- Provide a local Gradio Demo
- Support for LLaVA-1.5
- Provide evaluation scripts for various visual token compression methods (PyramidDrop, VisionZip, etc.) on the free-form VQA benchmark
- Support for batch input (Batch Inference)
- Support for video input
- Support for LLaVA-NeXt
- Provide an online Demo
- Clone the repository:

```bash
git clone https://github.com/HVision-NKU/GlimpsePrune.git
cd GlimpsePrune
```
- Create an environment and install the dependencies. We recommend creating a separate virtual environment for each model.

For Qwen2.5-VL:

```bash
# python=3.10, torch==2.7.0, flash-attn==2.7.4.post1
pip install -r qwen_requirements.txt
pip install qwen-vl-utils[decord]
```
For LLaVA-1.5 (Optional):
```bash
# python=3.10, torch==2.1.2, flash-attn==2.7.3
pip install -r llava_requirements.txt
```
Additional dependencies for Evaluation and Demo (Optional):
```bash
# Evaluation
pip install lmms-eval==0.3.5 vllm==0.9.0.1

# Demo
pip install gradio==5.39.0
```
All models can be automatically downloaded from the Hugging Face Hub. If you encounter network issues, you can download them manually to a local directory. The `<new_module>` checkpoints contain the weights of the extra glimpse token and VIP modules that we trained.
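If automatic downloading is blocked, a `huggingface_hub` call along these lines should work; the `checkpoints/` paths below are only an illustration, so point the scripts (and `--new_modules_dir`) at whatever location you choose.

```python
from huggingface_hub import snapshot_download

# Base model (used by the demo command below)
snapshot_download("Qwen/Qwen2.5-VL-7B-Instruct",
                  local_dir="checkpoints/Qwen2.5-VL-7B-Instruct")

# Trained glimpse token + VIP weights
snapshot_download("ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct",
                  local_dir="checkpoints/GlimpsePrune_Qwen2.5-VL-7B-Instruct")
```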
Training and Free-form VQA evaluation use the Visual-CoT dataset.
```bash
# Download the dataset (approx. 128GB)
huggingface-cli download --repo-type dataset --local-dir datas deepcs233/Visual-CoT cot_images_tar_split

# Extract
cd datas/cot_images_tar_split
cat cot_images_* | tar -xvf - -C ../cot
cd ../..  # Return to the project root directory
```

After extraction, the `datas` directory structure should be as follows:
```
GlimpsePrune/
├── datas/
│   └── cot/
│       ├── cub/
│       ├── gqa/
│       └── ...
└── ...
```
We provide a Gradio Demo to intuitively experience the effects of GlimpsePrune.
```bash
python demo_gp.py \
    --base_model Qwen/Qwen2.5-VL-7B-Instruct \
    --new_modules_dir ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct
```

For a detailed example of how to load the model and perform inference, please refer to the Jupyter Notebook:
➡️ notebook/gp_qwen_tutorial.ipynb
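As a quick orientation before opening the notebook, the sketch below shows the standard Qwen2.5-VL inference flow with `transformers` and `qwen_vl_utils`. It does not load GlimpsePrune itself, which replaces the model class with the one in `transformers_gp/models/qwen2_5_vl/model_gp.py` and attaches the extra glimpse-token/VIP weights, so please follow the notebook for the exact GlimpsePrune API; the image path here is a placeholder.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your/image.jpg"},  # placeholder image path
        {"type": "text", "text": "What is the man on the left holding?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```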
We provide convenient evaluation scripts.
Free-form VQA evaluation (Visual-CoT):

```bash
# Default settings (no retention rate limit)
BASE_MODEL=<base_model> bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

# Set a maximum retention rate (e.g., 11.1%)
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>
```

Standard VQA benchmark evaluation:

```bash
# Default settings
BASE_MODEL=<base_model> bash scripts/eval_qwen_gp.sh <new_modules_dir>

# Set a maximum retention rate
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/eval_qwen_gp.sh <new_modules_dir>
```

Training on Qwen2.5-VL-3B-Instruct requires at least two 24GB GPUs (e.g., RTX 3090) and takes about 1 hour.
```bash
# Train Qwen2.5-VL
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/train_qwen_gp.sh

# Train LLaVA-1.5
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/train_llava_gp.sh
```

Training on Qwen2.5-VL-7B-Instruct requires four 80GB A100 GPUs, plus an additional 48GB of VRAM to run the reward model, and takes about 24 hours.
```bash
# 1. Deploy the reward model
bash scripts/vllm_serve.sh

# 2. Test the API
python test_api.py

# 3. Start training
CUDA_VISIBLE_DEVICES=0,1,2,3 \
bash scripts/train_qwen_gp_plus.sh
```

This project is based on the following excellent open-source work, and we express our sincere gratitude:
- Qwen2.5-VL / LLaVA: Powerful Large Vision-Language Models.
- Visual-CoT: A VQA dataset with rich domains, diverse object sizes, and bounding box annotations.
- PyramidDrop, VisionZip, DivPrune, CDPruner, VScan: Other exploratory works in the field of visual token compression.
- lmms_eval: An evaluation toolkit for large multimodal models.
If you find our work helpful, please consider citing our paper:
```bibtex
@misc{zeng2025glimpseprune,
      title={A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models},
      author={Quan-Sheng Zeng and Yunheng Li and Qilong Wang and Peng-Tao Jiang and Zuxuan Wu and Ming-Ming Cheng and Qibin Hou},
      year={2025},
      eprint={2508.01548},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.01548},
}
```

For any technical questions or academic collaborations, feel free to contact us via email: qszeng[AT]mail.nankai.edu.cn


