
GlimpsePrune

English | 简体中文

A Dynamic Visual Token Pruning Framework for Large Vision-Language Models



GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.

GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). Through fast training on a small amount of data (e.g., less than 1 hour on 20K GQA samples), GlimpsePrune enables Qwen2.5-VL-7B to prune an average of 92.6% of visual tokens before generating a response while maintaining performance comparable to the original model.

For more technical details, please refer to our paper.

If you find our work inspiring or helpful, please give us a star ⭐. Thank you for your attention and support:

Stargazers repo roster for @HVision-NKU/GlimpsePrune

Table of Contents

  • ✨ Key Features
  • 🚀 News
  • 🖼️ Framework Overview
  • 📊 Performance Results
  • ✅ Roadmap
  • 🛠️ Installation
  • 📦 Models and Data
  • ▶️ How to Use
  • 🙏 Acknowledgements
  • 🖊️ Citation
  • 📧 Contact Us

✨ Key Features

  • High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
  • Robust Performance: Stable performance when processing high-resolution images and handling complex free-form VQA tasks.
  • Lightweight Training: Only a few extra parameters (the glimpse token and VIP) need to be trained; training finishes in under 1 hour on a single A100 GPU.
  • Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.

🚀 News

  • 2025.08.05: Paper is publicly released!
  • 2025.08.03: Code and Models are publicly released!

🖼️ Framework Overview

The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual tokens Important Predictor (VIP) that can quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant information.

The core implementation of the glimpse token and the VIP module is included in this repository.
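
As a rough mental model, the VIP can be pictured as a lightweight scorer that keeps only the visual tokens most relevant to the glimpse. The following is a minimal, hypothetical PyTorch sketch; the class name, shapes, and scoring rule are illustrative assumptions, not the repository's actual code.

# Hypothetical sketch of a VIP-style importance predictor -- NOT the repository's code.
import torch
import torch.nn as nn

class ToyVisualTokenPruner(nn.Module):
    """Score visual tokens against a glimpse vector and keep only the top fraction."""
    def __init__(self, hidden_dim: int, keep_ratio: float = 0.1):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # lightweight importance predictor
        self.keep_ratio = keep_ratio

    def forward(self, visual_tokens: torch.Tensor, glimpse: torch.Tensor):
        # visual_tokens: (num_tokens, hidden_dim); glimpse: (hidden_dim,)
        fused = visual_tokens * glimpse.unsqueeze(0)             # toy text-vision interaction
        scores = self.scorer(fused).squeeze(-1)                  # one score per visual token
        num_keep = max(1, int(self.keep_ratio * visual_tokens.size(0)))
        keep_idx = scores.topk(num_keep).indices.sort().values   # keep original token order
        return visual_tokens[keep_idx], keep_idx

pruner = ToyVisualTokenPruner(hidden_dim=64, keep_ratio=0.1)
kept, idx = pruner(torch.randn(400, 64), torch.randn(64))
print(kept.shape)  # roughly 10% of the 400 tokens remain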

📊 Performance Results

We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.

Free-form VQA Benchmarks

Short-form VQA Benchmarks

✅ Roadmap

  • Support for Qwen2.5-VL
  • Support for single-image input
  • Support for multi-image input
  • Provide a local Gradio Demo
  • Support for LLaVA-1.5
  • Provide evaluation scripts for various visual token compression methods (PyramidDrop, VisionZip, etc.) on free-form VQA benchmarks
  • Support for batch input (Batch Inference)
  • Support for video input
  • Support for LLaVA-NeXt
  • Provide an online Demo

🛠️ Installation

  1. Clone the repository

    git clone https://github.com/HVision-NKU/GlimpsePrune.git
    cd GlimpsePrune
  2. Create an environment and install dependencies. We recommend creating a separate virtual environment for each model:

    For Qwen2.5-VL:

    For LLaVA-1.5 (Optional):


    Additional dependencies for Evaluation and Demo (Optional):

    # Evaluation
    pip install lmms-eval==0.3.5 vllm==0.9.0.1
    # Demo
    pip install gradio==5.39.0

📦 Models and Data

Model Download

All models can be automatically downloaded from the Hugging Face Hub. If you encounter network issues, you can download them manually to a local directory. <new_module> refers to the weights of the extra glimpse token and VIP modules that we trained.

<base_model> <new_module>
Qwen/Qwen2.5-VL-3B-Instruct ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct
Qwen/Qwen2.5-VL-7B-Instruct ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct
liuhaotian/llava-v1.5-7b ashun989/GlimpsePrune_LLaVA-1.5-7B
liuhaotian/llava-v1.5-13b ashun989/GlimpsePrune_LLaVA-1.5-13B
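
If automatic download from the Hub is unreliable, the checkpoints can be fetched ahead of time with the huggingface_hub Python API. This is only a minimal sketch; the local directory names below are arbitrary examples, not paths required by the scripts.

# Pre-download one base model and its GlimpsePrune modules to local folders.
# Directory names are arbitrary examples.
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-VL-7B-Instruct",
                  local_dir="checkpoints/Qwen2.5-VL-7B-Instruct")
snapshot_download("ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct",
                  local_dir="checkpoints/GlimpsePrune_Qwen2.5-VL-7B-Instruct")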

Data Preparation

Training and Free-form VQA evaluation use the Visual-CoT dataset.

# Download the dataset (approx. 128GB)
huggingface-cli download --repo-type dataset --local-dir datas deepcs233/Visual-CoT cot_images_tar_split

# Extract
cd datas/cot_images_tar_split
cat cot_images_* | tar -xvf - -C ../cot
cd ../.. # Return to the project root directory

After extraction, the datas directory structure should be as follows:

GlimpsePrune/
├── datas/
│   └── cot/
│       ├── cub/
│       ├── gqa/
│       └── ...
└── ...
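
A quick way to confirm the extraction worked is to list the subset folders. The snippet below is a minimal Python sketch based only on the layout shown above.

# Sanity-check that the Visual-CoT images landed under datas/cot as expected.
from pathlib import Path

cot_root = Path("datas/cot")
assert cot_root.is_dir(), f"{cot_root} is missing; re-run the extraction step."
subsets = sorted(p.name for p in cot_root.iterdir() if p.is_dir())
print(f"Found {len(subsets)} image subsets, e.g.: {subsets[:5]}")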

▶️ How to Use

Local Demo

We provide a Gradio Demo to intuitively experience the effects of GlimpsePrune.

python demo_gp.py \
    --base_model Qwen/Qwen2.5-VL-7B-Instruct \
    --new_modules_dir ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct

Inference

For a detailed example of how to load the model and perform inference, please refer to the Jupyter Notebook: ➡️ notebook/gp_qwen_tutorial.ipynb
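
For context, the snippet below shows only the vanilla Qwen2.5-VL generation path using the standard transformers and qwen_vl_utils APIs; attaching the trained glimpse token and VIP modules on top of it is specific to this repository and is walked through in the notebook. Treat it as a minimal sketch, with the image path and prompt as placeholders.

# Plain Qwen2.5-VL inference with the standard transformers API.
# Loading the GlimpsePrune modules on top of this is covered in the notebook.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/your_image.jpg"},   # placeholder path
    {"type": "text", "text": "What is in this image?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])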

Evaluation

We provide convenient evaluation scripts.

Free-form VQA

# Default settings (no retention rate limit)
BASE_MODEL=<base_model> bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

# Set a maximum retention rate (e.g., 11.1%)
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

Short-form VQA

# Default settings
BASE_MODEL=<base_model> bash scripts/eval_qwen_gp.sh <new_modules_dir>

# Set a maximum retention rate
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/eval_qwen_gp.sh <new_modules_dir>

Training

Train GlimpsePrune

Training on Qwen2.5-VL-3B-Instruct requires at least two 24GB GPUs (e.g., RTX 3090) and takes about 1 hour.

# Train Qwen2.5-VL
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/train_qwen_gp.sh

# Train LLaVA-1.5
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/train_llava_gp.sh

Train GlimpsePrune+ (Optional)

Training on Qwen2.5-VL-7B-Instruct requires four 80GB A100 GPUs, plus an additional 48GB of VRAM to run the reward model, and takes about 24 hours.

# 1. Deploy the reward model
bash scripts/vllm_serve.sh
# 2. Test the API
python test_api.py
# 3. Start training
CUDA_VISIBLE_DEVICES=0,1,2,3 \
bash scripts/train_qwen_gp_plus.sh

🙏 Acknowledgements

This project is based on the following excellent open-source work, and we express our sincere gratitude:

🖊️ Citation

If you find our work helpful, please consider citing our paper:

@misc{zeng2025glimpseprune,
      title={A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models}, 
      author={Quan-Sheng Zeng and Yunheng Li and Qilong Wang and Peng-Tao Jiang and Zuxuan Wu and Ming-Ming Cheng and Qibin Hou},
      year={2025},
      eprint={2508.01548},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.01548}, 
}

📧 Contact Us

For any technical questions or academic collaborations, feel free to contact us via email: qszeng[AT]mail.nankai.edu.cn
