VideoITG is an innovative approach to video understanding, designed to enhance the performance of Video Large Language Models (Video-LLMs) through informed frame selection. It tackles the complexities of real-world video scenarios by aligning frame sampling with user instructions. VideoITG employs a comprehensive pipeline that includes detailed clip-level description generation, question-guided clip retrieval, and task-specific frame selection. This results in a robust dataset of 40K videos and 480K annotations. The plug-and-play model leverages visual language alignment and reasoning, achieving superior results across multimodal benchmarks, particularly in tasks requiring precise temporal grounding.
- [2025/09/30] Benchmark results of VideoITG released. See results for the released JSONL files.
- [2025/07/25] Code and checkpoints released.
- [2025/07/18] Technical report released. [arXiv]
- Models & Performance
- Visual Examples
- Install
- Training Data
- Checkpoint Preparation
- Training
- Evaluation
Here is the model trained on our organized 1.8M supervised fine-tuning data.
| Model | VideoLLM | Frames | LongVideoBench | MLVU | VideoMME | CG-Bench |
|---|---|---|---|---|---|---|
| VideoITG-7B | InternVL2.5-8B | 32 | 61.9 (+2.9%) | 75.0 (+7.8%) | 67.3 (+4.0%) | 46.7 (+7.0%) |
| VideoITG-7B | InternVL2.5-26B | 32 | 63.0 (+1.0%) | 78.9 (+6.1%) | 69.9 (+2.5%) | 48.7 (+6.0%) |
| VideoITG-7B | LLaVA-Video-7B | 32 | 61.6 (+3.6%) | 74.6 (+8.6%) | 66.1 (+3.0%) | 42.8 (+9.0%) |
| VideoITG-7B | LLaVA-Video-7B | 64 | 60.9 (+7.4%) | 76.3 (+7.6%) | 66.4 (+1.9%) | 42.9 (+8.1%) |
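As a rough illustration of the plug-and-play usage pattern behind the pairings above, the sketch below shows how an instructed frame selector could sit in front of an off-the-shelf VideoLLM. The names (`selector`, `video_llm`, and their methods) are hypothetical placeholders for illustration, not the repository's actual API.

```python
# Illustrative sketch only: `selector` / `video_llm` and their methods are
# hypothetical placeholders, not the actual API exposed by this repository.
def answer_with_instructed_sampling(video_path, question, selector, video_llm, num_frames=32):
    """Select instruction-relevant frames first, then query the Video-LLM."""
    # 1. Densely decode candidate frames from the video (e.g., 1 fps).
    candidates = selector.decode_candidates(video_path)
    # 2. Instructed temporal grounding: score candidates against the question
    #    and keep the top-k most informative frames.
    keep = selector.select(candidates, question, top_k=num_frames)
    # 3. Feed only the selected frames to the downstream Video-LLM.
    return video_llm.generate(frames=[candidates[i] for i in keep], prompt=question)
```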
Please follow the guide below to prepare the environment on Linux.
- Clone this repository
```bash
git clone https://github.com/NVlabs/VideoITG.git
cd VideoITG
```
- Create environment and install package
```bash
conda create -n videoitg python=3.12 -y
conda activate videoitg
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
```
- Install additional packages for training cases
```bash
pip install flash-attn==2.4.2 --no-build-isolation
```
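As an optional sanity check after installation, a short snippet like the following can confirm the environment is usable (this assumes requirements.txt installs a CUDA-enabled PyTorch build; flash-attn is only required for training):

```python
# Optional post-install sanity check (assumes a CUDA-enabled PyTorch build was
# installed via requirements.txt; flash-attn is only needed for training).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (fine for inference-only setups)")
```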
For VideoLLM training, we use the same data and strategy as LLaVA-Video, including the Pretraining Data, OV SFT Data, and LLaVA-Video Data.
We recommend using the VideoLLM checkpoints we provided here to reproduce our results.
You can launch training with the following command:
```bash
bash scripts/videoitg/finetune-uni-64frame-qwen2-7b-grounding.sh finetune 16
```
By default, we use 128 NVIDIA A100 80GB GPUs for training. Please adjust `per_device_train_batch_size` and `gradient_accumulation_steps` if you are using a different number of GPUs. Training VideoITG takes about 4 hours.
If you have limited GPU resources or memory, please consider the following (see the batch-size sketch after this list):
- use gradient accumulation and reduce the per-device batch size
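Since the effective global batch size is `per_device_train_batch_size × gradient_accumulation_steps × num_gpus`, a small helper like the sketch below can work out the accumulation steps for your hardware. The example numbers are placeholders for illustration; read the actual defaults from the training script.

```python
# Helper for keeping the effective (global) batch size constant when changing
# the GPU count: global_batch = per_device_batch * grad_accum_steps * num_gpus.
# The example numbers below are placeholders, not the training script's defaults.
def grad_accum_steps(global_batch_size: int, per_device_batch_size: int, num_gpus: int) -> int:
    """gradient_accumulation_steps needed to reach the target global batch size."""
    per_step = per_device_batch_size * num_gpus
    assert global_batch_size % per_step == 0, "global batch must be divisible by per_device_batch * num_gpus"
    return global_batch_size // per_step

# e.g., a hypothetical global batch of 128 on 16 GPUs with per-device batch 2
# requires accumulating gradients over 4 steps.
print(grad_accum_steps(global_batch_size=128, per_device_batch_size=2, num_gpus=16))
```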
For evaluation, we use VideoMME as an example. First, use the following command to run the VideoITG model and obtain the instructed grounding results (a sketch for consuming the resulting JSONL files follows below).
```bash
bash scripts/eval_lmms_eval/videomme_grounding.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE
```
In our paper, we report the results of CG-Bench mini, which includes 3,000 QA pairs.
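The grounding outputs are stored as JSONL files. The sketch below is one way they might be consumed to pick frames for a downstream VideoLLM; the field names (`question_id`, `frame_indices`) and the example path are assumptions, so please check the released JSONL files for the actual schema.

```python
# Minimal sketch for reading instructed-grounding results from a JSONL file.
# Field names ("question_id", "frame_indices") and the path are assumed for
# illustration; inspect the released JSONL files for the real schema.
import json

def load_grounding_results(jsonl_path: str) -> dict:
    """Return a mapping from question id to the selected frame indices."""
    results = {}
    with open(jsonl_path, "r") as f:
        for line in f:
            record = json.loads(line)
            results[record["question_id"]] = record["frame_indices"]
    return results

# Example usage (path is a placeholder):
# selected = load_grounding_results("results/videomme_grounding.jsonl")
# frames = [decoded_frames[i] for i in selected["some_question_id"]]
```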
If you find this project useful, please cite our work:
```bibtex
@article{wang2025videoitg,
  title   = {VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
  author  = {Shihao Wang and Guo Chen and De-An Huang and Zhiqi Li and Minghan Li and Guilin Liu and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
  journal = {arXiv preprint arXiv:2507.13353},
  year    = {2025}
}
```
- The code is released under the Apache 2.0 License.
- Portions of the code under lmms-eval are reused and subject to their original licenses. Some files have been modified, with appropriate attribution and additional license headers added where applicable.
- The pretrained model weights are released under the NVIDIA License. The model is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- Model License of Qwen2-7B-Instruct: Apache 2.0.
- Model License of SigLIP: Apache 2.0.
- For code contributions to VideoITG, please refer to the Contribution Guide.
- Users are reminded to ensure that their use of the dataset and model weights is in compliance with all applicable laws and regulations.
- Eagle: the codebase we built upon.
- LMMs-Eval: many thanks to the LMMs-Lab for the easy-to-use evaluation tools.
- LLaVA-OneVision and LLaVA-Video: we train our models with the data from these great open-source projects.