Temporal Preference Optimization (TPO) is a self-training framework designed to enhance long-form video understanding in video Large Multimodal Models (video-LMMs). Our approach improves video comprehension by modeling temporal preferences at two complementary levels of granularity.
- 🎯 Localized TPO: Generates queries focused on short video segments, contrasting responses produced when the target segment is retained versus excluded from the input
- 🌐 Comprehensive TPO: Generates queries requiring holistic video understanding, contrasting responses conditioned on the intact video versus a sparsely downsampled version (see the data sketch after this list)
- 🔧 Intelligent Post-filtering: Ensures high-quality contrast response pairs through multi-dimensional filtering mechanisms
- 🚀 Self-training Pipeline: Complete end-to-end framework for temporal preference optimization
- 📊 Significant Performance Gains: Achieves substantial improvements across multiple video understanding benchmarks
- 📚 Comprehensive Pipeline: Complete toolkit from data curation to model training
- 🔬 Reproducible Research: Full codebase and datasets for research reproducibility
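As a concrete illustration of the two preference granularities above, here is what a single contrastive training record might look like. This is a hypothetical sketch only: the field names and answers are illustrative and do not reflect the exact schema of the released data.

```python
# Illustrative-only sketch of TPO preference records.
# Field names and contents are hypothetical, not the released dataset schema.

localized_example = {
    "video": "example.mp4",
    "query": "What does the chef add to the pan around the two-minute mark?",
    # Preferred: response generated while the queried segment is kept in the input.
    "chosen": "The chef adds minced garlic and a pinch of salt.",
    # Dis-preferred: response generated after the target segment is excluded.
    "rejected": "The chef stirs the sauce without adding anything new.",
}

comprehensive_example = {
    "video": "example.mp4",
    "query": "Summarize how the dish is prepared from start to finish.",
    # Preferred: response conditioned on the intact, densely sampled video.
    "chosen": "The chef preps the vegetables, sears the meat, and plates the dish.",
    # Dis-preferred: response conditioned on a sparsely downsampled video.
    "rejected": "The chef cooks a dish with several ingredients.",
}
```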
We provide high-performance model weights trained with TPO:
| Model | Base Architecture | HuggingFace Link | Description |
|---|---|---|---|
| LongVA-7B-TPO | LongVA-7B | 🤗 Download | Optimized for long-form video understanding |
| LLaVA-Video-7B-TPO | LLaVA-Video-7B | 🤗 Download | General-purpose video comprehension model |
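If you prefer to fetch the weights programmatically rather than through the links above, a minimal sketch using huggingface_hub follows. The repository ID shown is a placeholder; substitute the actual ID from the corresponding Download link.

```python
# Sketch: download TPO weights with huggingface_hub.
# NOTE: "ORG/LongVA-7B-TPO" is a placeholder -- use the repository ID
# behind the Download link in the table above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ORG/LongVA-7B-TPO",          # placeholder repo ID
    local_dir="checkpoints/LongVA-7B-TPO",
)
print(f"Weights downloaded to {local_path}")
```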
```bash
# Clone the repository
git clone https://github.com/ruili33/TPO
cd TPO
# Create conda environment for LongVA-TPO
conda create -n TPOLongVA python=3.10
conda activate TPOLongVA
# Install dependencies
conda install ffmpeg
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "longva/.[train]"
pip install packaging ffmpeg-python ninja
pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
pip install -r requirements_longva.txt
```

```bash
# Create conda environment for LLaVA-Video-TPO
conda create -n TPOllava python=3.10 -y
conda activate TPOllava
# Install dependencies
conda install ffmpeg
pip install --upgrade pip
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "LLaVA/.[train]"
pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
pip install ffmpeg-python
```

For LongVA-TPO, please follow the inference demo in longva/inference_longva.py.
For LLaVA-Video-TPO, please follow the inference demo in LLaVA/inference_llava.py.
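Both demo scripts handle video loading, prompting, and generation end to end. As a rough sketch of the frame-sampling step that long-video inference relies on, the snippet below uniformly samples frames with decord; the demos may use a different loader or frame count, so treat this only as an illustration of the preprocessing.

```python
# Sketch: uniformly sample frames from a long video with decord.
# decord may need to be installed separately (pip install decord);
# the path and frame count below are illustrative only.
import numpy as np
from decord import VideoReader, cpu

video_path = "assets/example.mp4"   # replace with your own video
num_frames = 64                     # illustrative; follow the demo scripts

vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array
print(frames.shape)

# Pass the sampled frames to the model as shown in the corresponding demo script.
```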
We use the lmms-eval framework for standardized evaluation, ensuring consistency with prior work.
```bash
# LongVA-TPO evaluation
bash longva/eval.sh

# LLaVA-Video-TPO evaluation
bash LLaVA/eval.sh
```

| Dataset | Description | Link |
|---|---|---|
| LongVA-TPO-10k | TPO training dataset for LongVA | 🤗 Dataset |
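The training data can also be fetched programmatically; the sketch below mirrors the model-download example above. The repository ID is again a placeholder for the one behind the Dataset link.

```python
# Sketch: download the LongVA-TPO-10k training data from the Hugging Face Hub.
# NOTE: "ORG/LongVA-TPO-10k" is a placeholder repository ID.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORG/LongVA-TPO-10k",   # placeholder; use the ID from the table
    repo_type="dataset",
    local_dir="data/LongVA-TPO-10k",
)
```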
Experience our TPO model (LLaVA-Video-7B-TPO) with an interactive web interface:
```bash
conda activate TPOllava
python local_demo/multimodal_chat.py
```

Visit the local server URL to start interactive video question-answering.
```bash
# Run the TPO training script for LLaVA-Video-TPO
bash LLaVA/tpo_video.sh

# Run the TPO training script for LongVA-TPO
bash longva/longva/source/TPO.sh
```

Detailed implementation scripts are available in the data/ directory.
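The training scripts above optimize a DPO-style preference objective over the curated contrastive response pairs. As a reference point, here is a minimal, self-contained sketch of that loss; it is not the project's training code, and the actual hyperparameters (e.g., beta) are set in the scripts.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over (chosen, rejected) response pairs.

    Each argument is a tensor of summed log-probabilities of a response
    under either the trained policy or the frozen reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to widen the chosen-vs-rejected margin
    # relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities:
lp = torch.randn(4)
loss = dpo_loss(lp + 0.5, lp - 0.5, lp, lp)
print(loss.item())
```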
If you find this repository useful in your research or work, please consider citing our paper:
```bibtex
@article{li2025temporal,
  title={Temporal Preference Optimization for Long-Form Video Understanding},
  author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Wang, Zeyu and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2501.13919},
  year={2025}
}
```

This work builds upon the excellent open-source projects LongVA and LLaVA-Video. We extend our sincere gratitude to the maintainers and contributors of these repositories for their outstanding work, which greatly facilitated the development of our project.
⭐ If you find this project helpful, please give us a star!
