
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding



Introduction

VideoITG is an innovative approach to video understanding, designed to enhance the performance of Video Large Language Models (Video-LLMs) through informed frame selection. It tackles the complexities of real-world video scenarios by aligning frame sampling with user instructions. VideoITG employs a comprehensive pipeline that includes detailed clip-level description generation, question-guided clip retrieval, and task-specific frame selection. This results in a robust dataset of 40K videos and 480K annotations. The plug-and-play model leverages visual language alignment and reasoning, achieving superior results across multimodal benchmarks, particularly in tasks requiring precise temporal grounding.

Updates

  • [2025/09/30] Benchmark results of VideoITG released. See results for the released JSONL files.
  • [2025/07/25] Code and checkpoints released.
  • [2025/07/18] Technical report released. [arXiv]


Models & Performance

Here is the model trained on our curated 1.8M supervised fine-tuning data.

| Model | VideoLLM | Frames | LongVideoBench | MLVU | VideoMME | CG-Bench |
|-------|----------|--------|----------------|------|----------|----------|
| VideoITG-7B | InternVL2.5-8B | 32 | 61.9 (+2.9%) | 75.0 (+7.8%) | 67.3 (+4.0%) | 46.7 (+7.0%) |
| VideoITG-7B | InternVL2.5-26B | 32 | 63.0 (+1.0%) | 78.9 (+6.1%) | 69.9 (+2.5%) | 48.7 (+6.0%) |
| VideoITG-7B | LLaVA-Video-7B | 32 | 61.6 (+3.6%) | 74.6 (+8.6%) | 66.1 (+3.0%) | 42.8 (+9.0%) |
| VideoITG-7B | LLaVA-Video-7B | 64 | 60.9 (+7.4%) | 76.3 (+7.6%) | 66.4 (+1.9%) | 42.9 (+8.1%) |

Visual Examples



Install

Please follow the guide here to prepare the environment on a Linux OS.

  1. Clone this repository
git clone https://github.com/NVlabs/VideoITG.git
cd VideoITG
  2. Create the environment and install packages
conda create -n videoitg python=3.12 -y
conda activate videoitg
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
  3. Install additional packages for training
pip install flash-attn==2.4.2 --no-build-isolation
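
Optionally, you can run a quick sanity check to confirm that the core dependencies were installed correctly. The commands below are a minimal sketch and assume a CUDA-capable GPU is visible to PyTorch.

# Optional sanity check (assumes a CUDA-capable GPU is available)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"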

Training Data

VideoLLM Data

For VideoLLM training, we use the same data and training strategy as LLaVA-Video, including the Pretraining Data, OV SFT Data, and LLaVA-Video Data.

VideoITG Data

Checkpoint Preparation

We recommend using the VideoLLM checkpoints we provide here to reproduce our results.

Training

You can train the model with:

bash scripts/videoitg/finetune-uni-64frame-qwen2-7b-grounding.sh finetune 16

By default we use 128 NVIDIA A100 80GB GPUs for training. Please adjust per_device_train_batch_size and gradient_accumulation_steps if you are using a different number of GPUs. Training VideoITG takes about 4 hours.

Notes

If you have limited GPU resources or memory, please consider the following:

  • use gradient accumulation and reduce the per-device batch size (see the sketch after this list)
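
As a rough guide, the effective global batch size is the product of the number of GPUs, per_device_train_batch_size, and gradient_accumulation_steps, so it can be kept constant when scaling down from the default 128-GPU setup. The snippet below is only a sketch of this arithmetic; the default values shown are assumptions, so check finetune-uni-64frame-qwen2-7b-grounding.sh for the actual defaults.

# Sketch: keep the effective batch size constant on fewer GPUs.
# The default values below are assumptions; verify them against the training script.
DEFAULT_GPUS=128
DEFAULT_PER_DEVICE_BS=1   # assumed default per_device_train_batch_size
DEFAULT_ACCUM=1           # assumed default gradient_accumulation_steps
GLOBAL_BS=$((DEFAULT_GPUS * DEFAULT_PER_DEVICE_BS * DEFAULT_ACCUM))

NUM_GPUS=16               # e.g. two 8-GPU nodes
PER_DEVICE_BS=1           # lower this first if you run out of memory
ACCUM=$((GLOBAL_BS / (NUM_GPUS * PER_DEVICE_BS)))   # -> 8
echo "per_device_train_batch_size=${PER_DEVICE_BS} gradient_accumulation_steps=${ACCUM}"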

Evaluation

Evaluation with LMMs-Eval

For evaluation, we use VideoMME as an example. First, use this command to run the VideoITG model and obtain the instructed grounding results.

bash scripts/eval_lmms_eval/videomme_grounding.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE
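
For instance, an invocation could look like the following; the checkpoint path, model name, and conversation mode shown here are hypothetical placeholder values, not ones confirmed by this repository, so substitute your own.

# Hypothetical placeholder arguments; replace with your checkpoint path, model name, and conversation mode.
bash scripts/eval_lmms_eval/videomme_grounding.sh /path/to/VideoITG-7B videoitg_7b qwen_2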

Notes

In our paper, we report the results of CG-Bench mini, which includes 3,000 QA pairs.

Citation

If you find this project useful, please cite our work:

@article{wang2025videoitg,
  title     = {VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
  author    = {Shihao Wang and Guo Chen and De-An Huang and Zhiqi Li and Minghan Li and Guilin Liu and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
  journal   = {arXiv preprint arXiv:2507.13353},
  year      = {2025}
}

License/Terms of Use

  • The code is released under the Apache 2.0 License.
  • Portions of the code under lmms-eval are reused and subject to their original licenses. Some files have been modified, with appropriate attribution and additional license headers added where applicable.
  • The pretrained model weights are released under the NVIDIA License. The model is a research preview intended for non-commercial use only and is subject to the corresponding licenses and terms.
  • For code contributions to VideoITG, please refer to the Contribution Guide.
  • Users are reminded to ensure that their use of the dataset and model weights is in compliance with all applicable laws and regulations.

Acknowledgement

  • Eagle: the codebase we built upon.
  • LMMs-Eval: many thanks to the LMMs-Lab for the easy-to-use evaluation tools.
  • LLaVA-OneVision and LLaVA-Video: we train our models with the data from these great open-source projects.
