
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

🌐 Homepage | 📑 Paper | 🤗 Dataset | 📖 arXiv

🔔 News

🔥 [2025-12]: We released our paper, benchmark, and evaluation code.

Features of MMSI-Video-Bench

MMSI-Video-Bench is a fully human-annotated benchmark for evaluating video-based spatial intelligence in MLLMs. It implements a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning) through 1,106 questions grounded in 1,278 video clips sourced from 25 public datasets as well as in-house recordings.

  • High quality. All data are manually annotated by 11 domain experts in 3D vision, following a rigorous review and acceptance process to ensure annotation accuracy and reliability.

  • Challenging. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: even the best reasoning model trails human performance by nearly 60%.

  • Diverse Video Sources. MMSI-Video-Bench includes videos from 25 public datasets and 1 in-house collection, spanning tabletop recordings, indoor and multi-floor environments, outdoor scenes, sports, and movie footage.

  • Comprehensive and Holistic Task Coverage. The benchmark covers spatial layout reasoning, motion understanding, decision-making, and cross-video reasoning, providing a more holistic evaluation of video-based spatial intelligence.


Example

The questions in MMSI-Video-Bench span 5 major categories:

  1. Spatial Construction: spatial attributes of instances and scenes, as well as spatial relationships among instances, scenes, and cameras (six subtypes in total).

  2. Motion Understanding: camera motion, instance motion, and interactive motion between instances.

  3. Planning: planning based on spatiotemporal video information.

  4. Prediction: assessing a model's ability to predict, anticipate, or imagine future states based on the observed video.

  5. Cross-Video Reasoning: memory updating across temporally separated video segments and multi-view integration across videos captured from different viewpoints.


🚀 Getting Started

Installation

  1. Clone the GitHub repo.

  2. Install requirements.

    conda activate your_env_name
    pip install -r requirements.txt

    Note: If you want to evaluate open-source models, you need to set up their corresponding environments.

Data Preparation

Download the MMSI-Video-Bench data from Hugging Face. The dataset includes:

(1) Annotations: mmsivideo.json.

(2) Reference images for questions: ref_images.zip.

(3) Video frames: frames.zip.

(Optional) Original video files: videos.zip.

After downloading, unzip the files and organize them as follows:

|-data/
|-- mmsivideo.json
|-- frames/
|-- ref_images/
|-- videos/
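
If helpful, the following is a minimal download-and-extract sketch using huggingface_hub. The dataset repo id below is an assumption; check the 🤗 Dataset link above for the actual id.

    # Minimal sketch: download the dataset files and unzip them into data/.
    # The repo id below is an assumption -- replace it with the id from the
    # Hugging Face dataset link above.
    import zipfile
    from pathlib import Path

    from huggingface_hub import snapshot_download

    DATA_DIR = Path("data")
    snapshot_download(
        repo_id="InternRobotics/MMSI-Video-Bench",  # assumed id
        repo_type="dataset",
        local_dir=DATA_DIR,
    )

    # frames.zip and ref_images.zip are required; videos.zip is optional.
    for name in ["frames.zip", "ref_images.zip", "videos.zip"]:
        archive = DATA_DIR / name
        if archive.exists():
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(DATA_DIR)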

For more details about the JSON-format data, refer to the documentation.
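
As a quick sanity check of the layout above, the sketch below loads the annotation file and verifies the required directories; it is not part of the official tooling and assumes no particular annotation fields.

    # Sanity-check sketch for the data/ layout described above.
    import json
    from pathlib import Path

    DATA_DIR = Path("data")

    # Load the annotation file and report how many entries it contains.
    with open(DATA_DIR / "mmsivideo.json") as f:
        annotations = json.load(f)
    print(f"Loaded {len(annotations)} annotation entries")

    # frames/ and ref_images/ are required by the layout above; videos/ is optional.
    for sub in ("frames", "ref_images"):
        assert (DATA_DIR / sub).is_dir(), f"missing directory: {DATA_DIR / sub}"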

👓 Evaluation

Please note: the Sufficient-Coverage setting ensures that all video information is fully preserved, so we recommend using it for evaluation. Evaluation under the Uniform setting may miss critical information; the Uniform-50 setting is provided only because of current input-length limitations in some models.

  1. Run inference

    For proprietary (API-based) models, change the OpenAI-compatible base_url and api_key to your own in utils/openai_api.py (see the configuration sketch after this list). For open-source models, modify the load_model function in inference.py to use the corresponding model path. Run the following command to perform inference for a specific model under a particular setting:

    python inference.py --model_name {model_name} --setting Uniform-50/Sufficient-Coverage
  2. Run evaluation

    Run the following command to obtain scores for a specific benchmark. The default is the main benchmark.

     python evaluation.py --eval_dir {path/to/results} --bench main/robot_bench/ground_bench/indoor_perception_bench/easy2hard_bench
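
The exact contents of utils/openai_api.py and load_model are repo-specific; as a rough orientation only, the sketch below shows what an OpenAI-compatible client configuration typically looks like. The environment variable names, endpoint, and model name are placeholders, not values taken from this repository.

    # Orientation-only sketch of an OpenAI-compatible client; base_url, api_key,
    # and the model name are placeholders -- adapt them to utils/openai_api.py.
    import os

    from openai import OpenAI

    client = OpenAI(
        base_url=os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        api_key=os.environ["OPENAI_API_KEY"],
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": "Describe the camera motion in the clip."}],
    )
    print(response.choices[0].message.content)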

🏆 Leaderboard

📦 Uniform-50 Setting

| Model | Avg. (%) | Type |
|---|---|---|
| Human | 96.40 | Baseline |
| 🥇 Gemini 3 Pro | 37.97 | Proprietary |
| 🥈 O3 | 36.98 | Proprietary |
| 🥉 GPT-5 | 36.80 | Proprietary |
| Gemini 2.5 Flash | 35.44 | Proprietary |
| Gemini 2.5 Flash (Thinking) | 35.17 | Proprietary |
| Seed-1.6-vision | 34.87 | Proprietary |
| Claude-haiku-4.5 | 34.27 | Proprietary |
| O4-mini | 34.18 | Proprietary |
| QwenVL2.5-72B | 32.73 | Open-Source |
| InternVL3-78B | 32.55 | Open-Source |
| Doubao-1.5-thinking | 31.65 | Proprietary |
| GPT-4o | 31.56 | Proprietary |
| InternVL2.5-78B | 31.37 | Open-Source |
| InternVL2.5-38B | 31.01 | Open-Source |
| QwenVL3-30B (Thinking) | 30.83 | Open-Source |
| LLaVA-Video-72B | 30.38 | Open-Source |
| InternVL3-8B | 30.38 | Open-Source |
| QwenVL2.5-VL-7B-Instruct | 29.66 | Open-Source |
| InternVL2.5-8B | 29.11 | Open-Source |
| InternVL3-38B | 28.84 | Open-Source |
| QwenVL3-30B | 28.75 | Open-Source |
| QwenVL2.5-32B | 28.57 | Open-Source |
| LLaVA-Video-7B | 28.48 | Open-Source |
| QwenVL3-8B | 27.58 | Open-Source |
| InternVideo2.5-8B | 27.40 | Open-Source |
| Random Guessing | 24.10 | Baseline |

📦 Sufficient-Coverage Setting

| Model | Avg. (%) | Type |
|---|---|---|
| Human | 96.40 | Baseline |
| 🥇 O3 | 37.34 | Proprietary |
| 🥈 Gemini 2.5 Flash (Thinking) | 36.71 | Proprietary |
| 🥉 Gemini 2.5 Flash | 36.62 | Proprietary |
| O4-mini | 35.08 | Proprietary |
| QwenVL2.5-32B | 32.37 | Open-Source |
| QwenVL2.5-72B | 31.83 | Open-Source |
| InternVL3-8B | 29.57 | Open-Source |
| QwenVL3-30B | 29.11 | Open-Source |
| QwenVL3-8B | 29.09 | Open-Source |
| QwenVL2.5-7B | 28.84 | Open-Source |
| InternVL2.5-8B | 28.66 | Open-Source |
| GPT-4o | 28.12 | Proprietary |
| QwenVL3-30B (Thinking) | 28.03 | Open-Source |
| InternVideo2.5-8B | 26.85 | Open-Source |
| Random Guessing | 24.10 | Baseline |

🤖 Robot Sub-bench

| Model | Avg. (%) | Type |
|---|---|---|
| 🥇 Gemini 3 Pro | 40.20 | Proprietary |
| 🥈 Gemini 2.5 Flash (Thinking) | 39.71 | Proprietary |
| 🥉 Seed-1.6-vision | 39.34 | Proprietary |
| O3 | 39.22 | Proprietary |
| QwenVL2.5-72B | 37.75 | Open-Source |
| InternVL3-8B | 37.75 | Open-Source |
| GPT-5 | 37.75 | Proprietary |
| InternVL2.5-38B | 36.27 | Open-Source |
| Doubao-1.5-thinking | 36.07 | Proprietary |
| Gemini 2.5 Flash | 35.78 | Proprietary |
| O4-mini | 35.29 | Proprietary |
| QwenVL2.5-7B | 34.80 | Open-Source |
| InternVL2.5-78B | 34.80 | Open-Source |
| Claude-haiku-4.5 | 34.80 | Proprietary |
| InternVL3-78B | 34.31 | Open-Source |
| LLaVA-Video-72B | 34.31 | Open-Source |
| QwenVL3-30B | 32.84 | Open-Source |
| QwenVL2.5-32B | 32.84 | Open-Source |
| QwenVL3-8B | 32.12 | Open-Source |
| InternVideo2.5-8B | 29.90 | Open-Source |
| GPT-4o | 29.90 | Proprietary |
| InternVL2.5-8B | 28.43 | Open-Source |
| InternVL3-38B | 27.94 | Open-Source |
| QwenVL3-30B (Thinking) | 27.94 | Open-Source |
| LLaVA-Video-7B | 24.51 | Open-Source |

🏠 Indoor Scene Perception Sub-bench

| Model | Avg. (%) | Type |
|---|---|---|
| 🥇 GPT-5 | 41.68 | Proprietary |
| 🥈 O3 | 40.73 | Proprietary |
| 🥉 Gemini 2.5 Flash | 39.39 | Proprietary |
| Gemini 3 Pro | 39.39 | Proprietary |
| Gemini 2.5 Flash (Thinking) | 37.86 | Proprietary |
| O4-mini | 37.48 | Proprietary |
| Seed-1.6-vision | 34.20 | Proprietary |
| Claude-haiku-4.5 | 33.46 | Proprietary |
| Doubao-1.5-thinking | 33.04 | Proprietary |
| InternVL3-78B | 32.50 | Open-Source |
| QwenVL3-30B (Thinking) | 32.31 | Open-Source |
| GPT-4o | 31.74 | Proprietary |
| QwenVL2.5-72B | 30.78 | Open-Source |
| InternVL2.5-78B | 30.40 | Open-Source |
| QwenVL3-30B | 30.02 | Open-Source |
| QwenVL2.5-32B | 29.64 | Open-Source |
| InternVL2.5-8B | 29.45 | Open-Source |
| InternVL3-38B | 29.06 | Open-Source |
| QwenVL3-8B | 28.68 | Open-Source |
| InternVL2.5-38B | 28.30 | Open-Source |
| LLaVA-Video-72B | 28.11 | Open-Source |
| InternVL3-8B | 27.72 | Open-Source |
| LLaVA-Video-7B | 27.53 | Open-Source |
| QwenVL2.5-7B | 27.15 | Open-Source |
| InternVideo2.5-8B | 26.77 | Open-Source |

📍 Grounding Sub-bench

| Model | Avg. (%) | Type |
|---|---|---|
| 🥇 Gemini 2.5 Flash | 38.81 | Proprietary |
| 🥈 Gemini 2.5 Flash (Thinking) | 38.21 | Proprietary |
| 🥉 O3 | 37.61 | Proprietary |
| Doubao-1.5-thinking | 37.05 | Proprietary |
| InternVL3-78B | 35.52 | Open-Source |
| GPT-5 | 35.22 | Proprietary |
| Gemini 3 Pro | 35.22 | Proprietary |
| O4-mini | 34.33 | Proprietary |
| QwenVL2.5-72B | 34.33 | Open-Source |
| Seed-1.6-vision | 33.04 | Proprietary |
| Claude-haiku-4.5 | 32.84 | Proprietary |
| InternVL2.5-38B | 31.94 | Open-Source |
| InternVL3-8B | 31.94 | Open-Source |
| GPT-4o | 31.94 | Proprietary |
| QwenVL3-30B (Thinking) | 31.64 | Open-Source |
| QwenVL2.5-32B | 31.04 | Open-Source |
| LLaVA-Video-72B | 31.04 | Open-Source |
| InternVL3-38B | 30.45 | Open-Source |
| InternVL2.5-8B | 30.15 | Open-Source |
| InternVL2.5-78B | 29.85 | Open-Source |
| QwenVL3-30B | 29.25 | Open-Source |
| QwenVL2.5-7B | 28.66 | Open-Source |
| QwenVL3-8B | 28.66 | Open-Source |
| InternVideo2.5-8B | 27.76 | Open-Source |
| LLaVA-Video-7B | 27.16 | Open-Source |

Note: For the three sub-benchmarks, we take the higher score of each model across the two settings for easier presentation.

🔗 Citation

@misc{lin2025mmsivideobenchholisticbenchmarkvideobased,
      title={MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence}, 
      author={Jingli Lin and Runsen Xu and Shaohao Zhu and Sihan Yang and Peizhou Cao and Yunlong Ran and Miao Hu and Chenming Zhu and Yiman Xie and Yilin Long and Wenbo Hu and Dahua Lin and Tai Wang and Jiangmiao Pang},
      year={2025},
      eprint={2512.10863},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10863}, 
}

📄 License


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Acknowledgment

MMSI-Video-Bench utilizes data from the following open-source datasets: Roomtour3d, ScanNet, ScanNet++, 3RScan, ARKitScenes, RealEstate10k, DL3DV, Waymo, NuScenes, OVIS, TrackingNet, LaSOT, UAV123, Ego4D, EPIC-KITCHENS, EgoExoLearn, MultiSports, Charades, LEMMA, TF2023, CVMHT, AVA, DROID, RH20T, DTU. We sincerely thank the respective teams for their valuable contributions to the research community.

Contact
