Temporal Preference Optimization (TPO) is a self-training framework designed to enhance long-form video understanding in video Large Multimodal Models (video-LMMs). Our approach improves video comprehension by modeling temporal preferences at two complementary levels of granularity.
- 🎯 Localized TPO: Generates queries focused on short video segments, contrasting responses produced when the target segment is retained versus excluded from the input
- 🌐 Comprehensive TPO: Generates queries requiring holistic video understanding, contrasting responses conditioned on the intact video versus a sparsely downsampled version (see the data sketch after this list)
- 🔧 Intelligent Post-filtering: Ensures high-quality contrast response pairs through multi-dimensional filtering mechanisms
- 🚀 Self-training Pipeline: Complete end-to-end framework for temporal preference optimization
- 📊 Significant Performance Gains: Achieves substantial improvements across multiple video understanding benchmarks
- 📚 Comprehensive Pipeline: Complete toolkit from data curation to model training
- 🔬 Reproducible Research: Full codebase and datasets for research reproducibility
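As a concrete illustration of the two preference granularities above, here is what a single contrastive training record might look like. This is a hypothetical sketch only: the field names and answers are illustrative and do not reflect the exact schema of the released data.

```python
# Illustrative-only sketch of TPO preference records.
# Field names and contents are hypothetical, not the released dataset schema.

localized_example = {
    "video": "example.mp4",
    "query": "What does the chef add to the pan around the two-minute mark?",
    # Preferred: response generated while the queried segment is kept in the input.
    "chosen": "The chef adds minced garlic and a pinch of salt.",
    # Dis-preferred: response generated after the target segment is excluded.
    "rejected": "The chef stirs the sauce without adding anything new.",
}

comprehensive_example = {
    "video": "example.mp4",
    "query": "Summarize how the dish is prepared from start to finish.",
    # Preferred: response conditioned on the intact, densely sampled video.
    "chosen": "The chef preps the vegetables, sears the meat, and plates the dish.",
    # Dis-preferred: response conditioned on a sparsely downsampled video.
    "rejected": "The chef cooks a dish with several ingredients.",
}
```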
We provide high-performance model weights trained with TPO:
| Model | Base Architecture | HuggingFace Link | Description |
|---|---|---|---|
| LongVA-7B-TPO | LongVA-7B | 🤗 Download | Optimized for long-form video understanding |
| LLaVA-Video-7B-TPO | LLaVA-Video-7B | 🤗 Download | General-purpose video comprehension model |
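If you prefer to fetch the weights programmatically rather than through the links above, a minimal sketch using huggingface_hub follows. The repository ID shown is a placeholder; substitute the actual ID from the corresponding Download link.

```python
# Sketch: download TPO weights with huggingface_hub.
# NOTE: "ORG/LongVA-7B-TPO" is a placeholder -- use the repository ID
# behind the Download link in the table above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ORG/LongVA-7B-TPO",          # placeholder repo ID
    local_dir="checkpoints/LongVA-7B-TPO",
)
print(f"Weights downloaded to {local_path}")
```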
```bash
# Clone the repository
git clone https://github.com/ruili33/TPO
cd TPO
# Create conda environment for LongVA-TPO
conda create -n TPOLongVA python=3.10
conda activate TPOLongVA
# Install dependencies
conda install ffmpeg
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "longva/.[train]"
pip install packaging ffmpeg-python ninja
pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
pip install -r requirements_longva.txt
```

```bash
# Create conda environment for LLaVA-Video-TPO
conda create -n TPOllava python=3.10 -y
conda activate TPOllava
# Install dependencies
conda install ffmpeg
pip install --upgrade pip
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "LLaVA/.[train]"
pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
pip install ffmpeg-python
```

For LongVA-TPO, please follow the inference demo in longva/inference_longva.py.
For LLaVA-Video-TPO, please follow the inference demo in LLaVA/inference_llava.py.
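Both demo scripts handle video loading, prompting, and generation end to end. As a rough sketch of the frame-sampling step that long-video inference relies on, the snippet below uniformly samples frames with decord; the demos may use a different loader or frame count, so treat this only as an illustration of the preprocessing.

```python
# Sketch: uniformly sample frames from a long video with decord.
# decord may need to be installed separately (pip install decord);
# the path and frame count below are illustrative only.
import numpy as np
from decord import VideoReader, cpu

video_path = "assets/example.mp4"   # replace with your own video
num_frames = 64                     # illustrative; follow the demo scripts

vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array
print(frames.shape)

# Pass the sampled frames to the model as shown in the corresponding demo script.
```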
We use the lmms-eval framework for standardized evaluation, ensuring consistency with prior work.
```bash
# LongVA-TPO evaluation
bash longva/eval.sh

# LLaVA-Video-TPO evaluation
bash LLaVA/eval.sh
```

| Dataset | Description | Link |
|---|---|---|
| LongVA-TPO-10k | TPO training dataset for LongVA | 🤗 Dataset |
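The training data can also be fetched programmatically; the sketch below mirrors the model-download example above. The repository ID is again a placeholder for the one behind the Dataset link.

```python
# Sketch: download the LongVA-TPO-10k training data from the Hugging Face Hub.
# NOTE: "ORG/LongVA-TPO-10k" is a placeholder repository ID.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORG/LongVA-TPO-10k",   # placeholder; use the ID from the table
    repo_type="dataset",
    local_dir="data/LongVA-TPO-10k",
)
```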
Experience our TPO model (LLaVA-Video-7B-TPO) with an interactive web interface:
```bash
conda activate TPOllava
python local_demo/multimodal_chat.py
```

Visit the local server URL to start interactive video question-answering.
```bash
# Run the TPO training script for LLaVA-Video-TPO
bash LLaVA/tpo_video.sh

# Run the TPO training script for LongVA-TPO
bash longva/longva/source/TPO.sh
```

Detailed implementation scripts are available in the data/ directory.
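The training scripts above optimize a DPO-style preference objective over the curated contrastive response pairs. As a reference point, here is a minimal, self-contained sketch of that loss; it is not the project's training code, and the actual hyperparameters (e.g., beta) are set in the scripts.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over (chosen, rejected) response pairs.

    Each argument is a tensor of summed log-probabilities of a response
    under either the trained policy or the frozen reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to widen the chosen-vs-rejected margin
    # relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities:
lp = torch.randn(4)
loss = dpo_loss(lp + 0.5, lp - 0.5, lp, lp)
print(loss.item())
```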
If you find this repository useful in your research or work, please consider citing our paper:
```bibtex
@article{li2025temporal,
  title={Temporal Preference Optimization for Long-Form Video Understanding},
  author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Wang, Zeyu and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2501.13919},
  year={2025}
}
```

This work builds upon the excellent open-source projects LongVA and LLaVA-Video. We extend our sincere gratitude to the maintainers and contributors of these repositories for their outstanding work, which greatly facilitated the development of our project.
⭐ If you find this project helpful, please give us a star!
