A Dynamic Visual Token Pruning Framework for Large Vision-Language Models
GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.
GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). Through fast training on a small amount of data (e.g., less than 1 hour on 20K GQA data), GlimpsePrune enables Qwen2.5-VL-7B to prune an average of 92.6% of visual tokens before generating a response, while maintaining performance comparable to the original model.
For more technical details, please refer to our paper.
If you find our work inspiring or helpful, please give us a star ⭐. Thank you for your attention and support.
- ✨ Key Features
- 🚀 News
- 🖼️ Framework Overview
- 📊 Performance Results
- ✅ Roadmap
- 🛠️ Installation
- 📦 Models and Data
- ▶️ How to Use
- 🙏 Acknowledgements
- 🖊️ Citation
- 📧 Contact Us
- High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
- Robust Performance: Stable performance when processing high-resolution images and handling complex free-form VQA tasks.
- Lightweight Training: Only a few extra parameters (the glimpse token and VIP) need to be trained, which can be completed in less than 1 hour on a single A100 GPU.
- Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.
The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual token Importance Predictor (VIP) that quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant tokens (a schematic sketch is given after the file list below).
The core code implementation is located in:
- Qwen2.5-VL: `transformers_gp/models/qwen2_5_vl/model_gp.py`
- LLaVA-1.5: `llava_gp/model/language_model/llava_llama.py`
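To make the selection step concrete, here is a minimal, hypothetical PyTorch sketch of the idea, not the actual VIP in `model_gp.py` (which is a learned predictor rather than a raw dot product): a glimpse vector scores every visual token, and only the highest-scoring fraction is kept.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        glimpse: torch.Tensor,
                        keep_ratio: float = 0.1) -> torch.Tensor:
    """Toy selection: keep the visual tokens most similar to the glimpse vector.
    visual_tokens: (N, D); glimpse: (D,). Returns the kept tokens in original order."""
    scores = visual_tokens @ glimpse                      # one relevance score per token
    k = max(1, int(round(keep_ratio * visual_tokens.size(0))))
    keep_idx = scores.topk(k).indices.sort().values       # preserve spatial order
    return visual_tokens[keep_idx]

# Example: 1024 visual tokens of width 64, keep ~7.4% (i.e. ~92.6% pruned)
tokens = torch.randn(1024, 64)
glimpse = torch.randn(64)   # stand-in for the learned glimpse token's hidden state
print(prune_visual_tokens(tokens, glimpse, keep_ratio=0.074).shape)  # torch.Size([76, 64])
```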
We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.
- Support for Qwen2.5-VL
- Support for single-image input
- Support for multi-image input
- Provide a local Gradio Demo
- Support for LLaVA-1.5
- Provide evaluation scripts for various visual token compression methods (PyramidDrop, VisionZip, etc.) on the free-form VQA benchmark
- Support for batch input (Batch Inference)
- Support for video input
- Support for LLaVA-NeXt
- Provide an online Demo
- Clone the repository:

```bash
git clone https://github.com/HVision-NKU/GlimpsePrune.git
cd GlimpsePrune
```
- Create an environment and install the dependencies. We recommend creating a separate virtual environment for each model.

For Qwen2.5-VL:

```bash
# python=3.10, torch==2.7.0, flash-attn==2.7.4.post1
pip install -r qwen_requirements.txt
pip install qwen-vl-utils[decord]
```
For LLaVA-1.5 (Optional):
```bash
# python=3.10, torch==2.1.2, flash-attn==2.7.3
pip install -r llava_requirements.txt
```
Additional dependencies for Evaluation and Demo (Optional):
```bash
# Evaluation
pip install lmms-eval==0.3.5 vllm==0.9.0.1

# Demo
pip install gradio==5.39.0
```
All models can be automatically downloaded from the Hugging Face Hub. If you encounter network issues, you can download them manually to a local directory. The `<new_module>` checkpoints contain the weights of the extra glimpse token and VIP modules that we trained.
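If automatic downloading is blocked, a `huggingface_hub` call along these lines should work; the `checkpoints/` paths below are only an illustration, so point the scripts (and `--new_modules_dir`) at whatever location you choose.

```python
from huggingface_hub import snapshot_download

# Base model (used by the demo command below)
snapshot_download("Qwen/Qwen2.5-VL-7B-Instruct",
                  local_dir="checkpoints/Qwen2.5-VL-7B-Instruct")

# Trained glimpse token + VIP weights
snapshot_download("ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct",
                  local_dir="checkpoints/GlimpsePrune_Qwen2.5-VL-7B-Instruct")
```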
Training and Free-form VQA evaluation use the Visual-CoT dataset.
```bash
# Download the dataset (approx. 128GB)
huggingface-cli download --repo-type dataset --local-dir datas deepcs233/Visual-CoT cot_images_tar_split

# Extract
cd datas/cot_images_tar_split
cat cot_images_* | tar -xvf - -C ../cot
cd ../..  # Return to the project root directory
```

After extraction, the `datas` directory structure should be as follows:
```
GlimpsePrune/
├── datas/
│   └── cot/
│       ├── cub/
│       ├── gqa/
│       └── ...
└── ...
```
We provide a Gradio Demo to intuitively experience the effects of GlimpsePrune.
```bash
python demo_gp.py \
    --base_model Qwen/Qwen2.5-VL-7B-Instruct \
    --new_modules_dir ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct
```

For a detailed example of how to load the model and perform inference, please refer to the Jupyter Notebook:
➡️ notebook/gp_qwen_tutorial.ipynb
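As a quick orientation before opening the notebook, the sketch below shows the standard Qwen2.5-VL inference flow with `transformers` and `qwen_vl_utils`. It does not load GlimpsePrune itself, which replaces the model class with the one in `transformers_gp/models/qwen2_5_vl/model_gp.py` and attaches the extra glimpse-token/VIP weights, so please follow the notebook for the exact GlimpsePrune API; the image path here is a placeholder.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your/image.jpg"},  # placeholder image path
        {"type": "text", "text": "What is the man on the left holding?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```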
We provide convenient evaluation scripts.
Free-form VQA evaluation (Visual-CoT):

```bash
# Default settings (no retention rate limit)
BASE_MODEL=<base_model> bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

# Set a maximum retention rate (e.g., 11.1%)
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>
```

Standard VQA benchmark evaluation:

```bash
# Default settings
BASE_MODEL=<base_model> bash scripts/eval_qwen_gp.sh <new_modules_dir>

# Set a maximum retention rate
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/eval_qwen_gp.sh <new_modules_dir>
```

Training on Qwen2.5-VL-3B-Instruct requires at least two 24GB GPUs (e.g., RTX 3090) and takes about 1 hour.
```bash
# Train Qwen2.5-VL
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/train_qwen_gp.sh

# Train LLaVA-1.5
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/train_llava_gp.sh
```

Training on Qwen2.5-VL-7B-Instruct requires four 80GB A100 GPUs, plus an additional 48GB of VRAM to run the reward model, and takes about 24 hours.
```bash
# 1. Deploy the reward model
bash scripts/vllm_serve.sh

# 2. Test the API
python test_api.py

# 3. Start training
CUDA_VISIBLE_DEVICES=0,1,2,3 \
bash scripts/train_qwen_gp_plus.sh
```

This project is based on the following excellent open-source work, and we express our sincere gratitude:
- Qwen2.5-VL / LLaVA: Powerful Large Vision-Language Models.
- Visual-CoT: A VQA dataset with rich domains, diverse object sizes, and bounding box annotations.
- PyramidDrop, VisionZip, DivPrune, CDPruner, VScan: Other exploratory works in the field of visual token compression.
- lmms_eval: An evaluation toolkit for large multimodal models.
If you find our work helpful, please consider citing our paper:
```bibtex
@misc{zeng2025glimpseprune,
      title={A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models},
      author={Quan-Sheng Zeng and Yunheng Li and Qilong Wang and Peng-Tao Jiang and Zuxuan Wu and Ming-Ming Cheng and Qibin Hou},
      year={2025},
      eprint={2508.01548},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.01548},
}
```

For any technical questions or academic collaborations, feel free to contact us via email: qszeng[AT]mail.nankai.edu.cn


