Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

🎊 News

[2025.11.07] Our paper "Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm" has been released on arXiv! 📄 [Paper]

📜 Brief Introduction

Moving beyond the traditional paradigms of "Thinking with Text" (e.g., Chain-of-Thought) and "Thinking with Images", we propose "Thinking with Video"—a new paradigm that unifies visual and textual reasoning through video generation models. It naturally enables human-like dynamic reasoning through video generation, such as drawing and imagination.

💡 A New Unified Reasoning Paradigm: "Thinking with Video" leverages video generation models to visualize dynamic processes, represent temporal evolution, and embed text within video frames. This approach achieves unified multimodal understanding and generation, overcoming the static constraints of image-based reasoning and the modality separation in traditional approaches.

📊 VideoThinkBench, a Comprehensive Benchmark: We developed VideoThinkBench, the first reasoning benchmark specifically designed for evaluating video generation models. It comprises vision-centric tasks (eyeballing puzzles, visual puzzles, ARC-AGI-2, mazes) that leverage dynamic visual reasoning, and text-centric tasks adapted from established benchmarks (MATH, GSM8K, MMLU, MMMU, etc.) that test text-based reasoning capabilities within generated videos.

🚀 Surpassing VLMs on Several Tasks: Our evaluation shows that Sora-2 demonstrates competitive reasoning capabilities across both categories. Notably, Sora-2 surpasses state-of-the-art vision-language models on several vision-centric tasks, showcasing the unique advantages of dynamic visual reasoning. On text-centric tasks, Sora-2 achieves strong performance including 98.9% on GSM8K, 94.0% on MATH, and 75.5% on MMMU, demonstrating the potential of "Thinking with Video" as a unified multimodal reasoning paradigm.

Installation

Clone this repository and navigate to the Thinking-with-Video folder

git clone https://github.com/YOUR_USERNAME/Thinking-with-Video.git
cd Thinking-with-Video

Install dependencies

pip install -r requirements.txt

(Coming soon)

VideoThinkBench

VideoThinkBench is a comprehensive benchmark for evaluating video generation models' reasoning capabilities, consisting of two main categories:

Vision-Centric Tasks

Eyeballing Puzzles: Spatial reasoning tasks requiring visual estimation and drawing
Visual Puzzles: Pattern recognition and visual logic problems
ARC-AGI-2: Abstract reasoning tasks requiring few-shot learning
Mazes: Path-finding and navigation challenges

Text-Centric Tasks

Adapted from established benchmarks including:

Mathematical Reasoning: MATH, GSM8K, AIME, MathVista, MathVision
Multimodal Understanding: MMMU, MMBench
General Knowledge: MMLU, MMLU-Pro
Scientific Reasoning: GPQA-diamond, SuperGPQA

The dataset is available on Hugging Face.

Benchmark Results

Performance Comparison Across All Tasks

The table below summarizes the accuracy (%) of Sora-2 compared with state-of-the-art vision-language models across all second-level tasks in VideoThinkBench:

Category	Task	Sora-2	Gemini 2.5 Pro	GPT5 high	Claude Sonnet 4.5
Vision-Centric	Eyeballing-Point	44.7	27.8	33.6	36.2
	Eyeballing-Line	38.0	21.0	24.0	26.3
	Eyeballing-Shape	34.5	34.5	32.5	50.5
	Visual-Color	67.0	73.9	79.6	85.6
	Visual-Shape	64.9	92.9	97.5	68.6
	ARC-AGI-2	1.3	1.9	0.5	5.3
	Average	41.7	42.0	44.6	45.4
Text-Centric	Text-Only Math	53.6	94.8	97.2	90.0
	Text-Only General Knowledge	63.1	84.5	85.2	86.3
	Multimodal Math	56.3	66.7	69.6	65.6
	Multimodal General Knowledge	49.4	83.0	80.6	82.3
	Average	55.6	82.3	83.2	81.1
Overall Average		47.3	58.1	60.0	59.7

Note: For Sora-2: Eyeballing Puzzles use Major Frame evaluation; Visual Puzzles show the average of Color-Filling and Shape-Drawing tasks; Text-Centric Reasoning tasks use Video evaluation results.
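
The "Average" and "Overall Average" rows can be reproduced from the per-task numbers. The sketch below assumes unweighted means, with the overall average taken over all ten task rows rather than over the two category averages; this assumption matches the table to within rounding. The variable and function names are illustrative and are not part of this repository.

```python
# Per-task accuracy (%) copied from the table above.
# Column order: Sora-2, Gemini 2.5 Pro, GPT5 high, Claude Sonnet 4.5.
MODELS = ["Sora-2", "Gemini 2.5 Pro", "GPT5 high", "Claude Sonnet 4.5"]

VISION_CENTRIC = {
    "Eyeballing-Point": [44.7, 27.8, 33.6, 36.2],
    "Eyeballing-Line": [38.0, 21.0, 24.0, 26.3],
    "Eyeballing-Shape": [34.5, 34.5, 32.5, 50.5],
    "Visual-Color": [67.0, 73.9, 79.6, 85.6],
    "Visual-Shape": [64.9, 92.9, 97.5, 68.6],
    "ARC-AGI-2": [1.3, 1.9, 0.5, 5.3],
}

TEXT_CENTRIC = {
    "Text-Only Math": [53.6, 94.8, 97.2, 90.0],
    "Text-Only General Knowledge": [63.1, 84.5, 85.2, 86.3],
    "Multimodal Math": [56.3, 66.7, 69.6, 65.6],
    "Multimodal General Knowledge": [49.4, 83.0, 80.6, 82.3],
}


def column_means(tasks):
    """Return the unweighted mean of each model's column over the given tasks."""
    rows = list(tasks.values())
    return [sum(column) / len(rows) for column in zip(*rows)]


vision_avg = column_means(VISION_CENTRIC)
text_avg = column_means(TEXT_CENTRIC)
# Overall average: mean over all ten tasks (assumption, see the text above).
overall_avg = column_means({**VISION_CENTRIC, **TEXT_CENTRIC})

for name, v, t, o in zip(MODELS, vision_avg, text_avg, overall_avg):
    print(f"{name:18s} vision={v:.2f}  text={t:.2f}  overall={o:.2f}")
```

Averaging over tasks gives the six vision-centric rows more weight in the overall score than the four text-centric rows, which is worth keeping in mind when comparing the overall averages.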

Evaluation

Benchmark Evaluation Scripts

# Vision-centric tasks evaluation
python eval_vision_centric.py --task eyeballing

# Text-centric tasks evaluation
python eval_text_centric.py --benchmark MATH

Coming Soon: We are preparing comprehensive evaluation scripts for all tasks in VideoThinkBench. Stay tuned!

Takeaways

Our systematic evaluation on VideoThinkBench reveals seven key findings:

Surpassing VLMs on Eyeballing Puzzles: Sora-2 generally surpasses SOTA VLMs on eyeballing puzzles, exhibiting strong geometric and physical reasoning abilities. It can simulate the extension and reflection of rays and manipulate geometric elements (e.g., points and lines) to support spatial reasoning.
Inductive Reasoning on Visual Puzzles: Sora-2's performance is comparable to Claude Sonnet 4.5 on Shape-Drawing puzzles, demonstrating inductive reasoning capabilities. Sora-2 can recognize and apply patterns of color, shape, and size, solving visual puzzles involving symmetry, gradients, and compositionality.
Few-Shot Learning Capabilities: Sora-2 is a few-shot learner. ARC-AGI-2 requires finding patterns in input-output pairs, and SOTA VLMs reach only about 5% accuracy at best on it (see the table above). Sora-2 can often make reasonable predictions, although they do not strictly match the dataset annotations.
Unified Multimodal Reasoning: On text-centric tasks, Sora-2 shows surprising performance on text and multimodal reasoning benchmarks. The video generation model can embed text within video frames, enabling unified multimodal understanding and generation. This demonstrates that "Thinking with Video" is potentially a unified multimodal reasoning paradigm.
Improved In-Context Learning with More Examples: Sora-2's in-context learning improves when it is given more examples. Experiments show that Sora-2 performs better when provided with all examples than when provided with only one, revealing an underexplored direction for analyzing and improving the in-context learning abilities of video generation models.
Test-Time Scaling with Self-Consistency: Self-consistency can improve Sora-2's performance on verifiable video generation reasoning tasks. This reveals an underexplored direction: test-time scaling in video generation reasoning tasks.
Analysis of Capability Source: We systematically analyzed the source of Sora-2's capabilities. Sora-2 maintains performance comparable to the original test set on adapted math problems, reducing the likelihood of test set leakage. However, Sora-2 struggles to generate coherent reasoning processes in videos, even when it provides correct final answers. Through comparative experiments with Wan 2.5, we speculate that Sora-2's text-centric reasoning ability originates from its prompt rewriter model.
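
The self-consistency finding above can be illustrated with a minimal majority-vote sketch. It assumes that a final answer has already been extracted from each independently generated video; the extraction step is not shown, and the function name and sample answers are hypothetical rather than taken from this repository's evaluation code.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent non-empty answer and its share of valid votes.

    `answers` holds one extracted answer per generated video; empty strings
    or None mark videos from which no answer could be extracted.
    """
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)


# Hypothetical answers extracted from five videos generated for one question.
samples = ["B", "B", "C", "B", ""]
answer, share = self_consistency_vote(samples)
print(answer, share)
```

Only verifiable tasks (those with a single checkable answer) fit this scheme, since the vote needs answers that can be compared for equality.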

Licenses

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find our work helpful, please consider citing our paper 📝 and starring us ⭐️!

@article{tong2025thinkingwithvideo,
    title={Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm},
    author={Jingqi Tong and Yurong Mou and Hangcheng Li and Mingzhe Li and Yongzhuo Yang and Ming Zhang and Qiguang Chen and Tianyi Liang and Xiaomeng Hu and Yining Zheng and Xinchi Chen and Jun Zhao and Xuanjing Huang and Xipeng Qiu},
    journal={arXiv preprint arXiv:2511.04570},
    year={2025}
}

Made with ❤️ for advancing multimodal reasoning research
