This directory contains code for generating and testing visual puzzle tasks from the VideoThinkBench benchmark.
Visual puzzles assess pattern recognition, inductive reasoning, and visual logic capabilities through tasks involving color filling and shape drawing.
Note: In the latest version, we categorize the tasks into symmetry, gradient and compositionality tasks, as shown below:
- Hexagonal Color Pattern Matching (color_hexagon): Fill hexagonal grids with color patterns
- Grid Color Pattern Matching (color_grid): Complete color grids following pattern rules
- Grid Size Pattern Matching (size_grid): Draw circles in grids based on size patterns
- Reflection Recognition & Application (shape_reflect): Draw reflected shapes
- Color Gradient Perception & Application (color_size): Fill colors based on object size patterns
- Cycle Size Pattern Matching (size_cycle): Draw circles in cycle structures based on size patterns
- Shape Color Pattern Matching (polygon_sides_color): Color polygons based on number of sides
- Rectangle Height Color Matching (rectangle_height_color): Color rectangles based on their heights
- Color Mixing Perception & Application (color_overlap_squares): Determine colors for overlapping squares
- Grid Shape & Size Pattern Matching (shape_size_grid): Combine shape and size patterns in grids
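To make the flavor of these rules concrete, here is an illustrative (hypothetical) rule in the style of polygon_sides_color, mapping a polygon's side count to a fill color; the actual palettes and rules live in gen_data/data_generation.py:

```python
# Illustrative only: a made-up side-count-to-color rule. The real mapping is
# defined by the data generation code, not by this sketch.
SIDES_TO_COLOR = {3: "red", 4: "green", 5: "blue", 6: "yellow"}

def fill_color(num_sides: int) -> str:
    """Return the color a solver should fill for a polygon with num_sides sides."""
    return SIDES_TO_COLOR.get(num_sides, "unknown")
```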
# 1. Navigate to the visual_puzzles directory
cd visual_puzzles
# 2. Prepare benchmark data
mkdir -p data
# [Note] you can choose to use the minitest version for evaluation
# cp -r ../VideoThinkBench/Vision-Centric_Reasoning/visual_puzzles/* data/
cp -r ../VideoThinkBench/minitest_Vision-Centric_Reasoning/visual_puzzles/* data/
# 3. Configure your API key in scripts/run.sh
# Edit the file and replace YOUR_API_KEY_HERE with your actual API key
# 4. Run inference with Sora-2
bash scripts/run.sh
# 5. Extract best frames for evaluation
bash scripts/extract_best_frame.sh

visual_puzzles/
├── README.md # This file
├── data/ # Benchmark dataset (after preparation)
├── example_data/ # Example data for testing
│ ├── color_size/ # Example color-size puzzles
│ ├── color_grid/ # Example color-grid puzzles
│ └── ... # Other puzzle types
├── eval/ # Evaluation scripts
│ └── find_best_frame.py # Extract optimal frames from videos
├── gen_data/ # Data generation scripts
│ └── data_generation.py # Generate new puzzle instances
├── infer/ # Inference scripts
│ └── request_videos.py # Request video generation from models
├── scripts/ # Utility scripts
│ ├── run.sh # Run inference pipeline
│ ├── extract_best_frame.sh # Extract best frames
│ └── generate_data.sh # Generate new data
└── fonts/ # Font files for rendering
Before running experiments, download and prepare the benchmark data:
# From the Thinking-with-Video root directory
# 1. Download VideoThinkBench dataset (see main README.md)
hf download --repo-type dataset OpenMOSS-Team/VideoThinkBench --local-dir VideoThinkBench
# 2. Extract the visual puzzles data
cd VideoThinkBench
# bash unzip_dir.sh Vision-Centric_Reasoning
# [Note] you can choose to use the minitest version for evaluation
bash unzip_dir.sh minitest_Vision-Centric_Reasoning
# 3. Copy to visual_puzzles directory
cd ..
mkdir -p visual_puzzles/data
# cp -r VideoThinkBench/Vision-Centric_Reasoning/visual_puzzles/* visual_puzzles/data/
cp -r VideoThinkBench/minitest_Vision-Centric_Reasoning/visual_puzzles/* visual_puzzles/data/
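After copying, you can optionally sanity-check that each task directory contains the data.json file the inference and VLM scripts expect. This is a convenience sketch, not part of the repo (check_task_dirs is a hypothetical helper):

```python
import json
from pathlib import Path

def check_task_dirs(data_root: str, tasks: list[str]) -> list[str]:
    """Return the tasks under data_root missing a readable data.json.

    Hypothetical helper: each task folder is expected to contain data.json
    (the same layout infer/test_VLM.py assumes).
    """
    missing = []
    for task in tasks:
        path = Path(data_root) / task / "data.json"
        try:
            json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            missing.append(task)
    return missing
```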
cd visual_puzzles

Run inference on all visual puzzle tasks using Sora-2 or other video generation models:
bash scripts/run.sh

Configuration Options (edit scripts/run.sh):
- `--model`: Model identifier (default: `sora_video2-landscape`)
- `--tasks`: Space-separated list of tasks to evaluate
- `--data_root`: Path to the input data directory
- `--base_url`: API endpoint URL
- `--api_key`: Your API key for the video generation service
- `--output_root`: Directory to save generated videos
- `--threads`: Number of parallel threads (default: 16)
- `--max_request_attempts`: Maximum retry attempts (default: 5)
- `--request_attempt_delay`: Delay between retries in seconds (default: 2)
- `--request_mode`: Choose `chat` (existing OpenAI chat completions flow) or `direct` (custom REST pipeline)
- `--direct_request_timeout`, `--direct_poll_interval`, `--direct_max_poll_attempts`: Optional tunables used only when `--request_mode=direct`
Recent updates extend infer/request_videos.py with a second generation pathway. Specifically:
- Pass `--request_mode direct` to send handcrafted JSON payloads to `/video/create` immediately.
- The script captures the returned `video_id` (task id), then polls `/video/query?id=<video_id>` until `status == "completed"`, and finally downloads `video_url`.
- All dataset/image handling, retries, and download bookkeeping remain identical to the chat flow, so downstream evaluation scripts continue to work without changes.
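The create-then-poll loop described above can be sketched as a small helper. This is a sketch of the protocol, not the script's actual implementation; `query_fn` is a stand-in for whatever wraps the HTTP GET to `/video/query?id=<video_id>`:

```python
import time

def poll_until_complete(query_fn, video_id, poll_interval=2.0, max_attempts=60):
    """Poll the query endpoint until the generation task completes.

    query_fn is any callable taking a video_id and returning the decoded JSON
    response (a dict with at least a "status" key). Returns video_url once
    status == "completed"; raises on failure or timeout.
    """
    for _ in range(max_attempts):
        resp = query_fn(video_id)
        if resp.get("status") == "completed":
            return resp["video_url"]
        if resp.get("status") == "failed":
            raise RuntimeError(f"generation failed for {video_id}")
        time.sleep(poll_interval)
    raise TimeoutError(f"gave up polling for {video_id}")
```

Injecting `query_fn` keeps the polling logic testable without a live endpoint.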
Example run (direct mode):
python infer/request_videos.py \
--model veo_3_1-landscape \
--base_url https://jyapi.ai-wx.cn/v1 \
--request_mode direct \
--tasks color_size color_grid \
--data_root data \
--output_root outputs/direct_run \
--threads 8

After generating videos, extract the frame that best matches the solution for each task:
bash scripts/extract_best_frame.sh

This script evaluates each frame in the generated videos and selects the one closest to the ground-truth solution.
Two comparison modes:
- Color-Filling Tasks: Uses RGB Euclidean distance to compare pixel-wise color similarity
- Shape-Drawing Tasks: Uses coverage difference after binarization to compare shape accuracy
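The two metrics can be sketched in a few lines. These are rough stand-ins for eval/find_best_frame.py (which operates on real video frames), using flat pixel lists for clarity:

```python
def rgb_distance(frame_a, frame_b):
    """Mean per-pixel RGB Euclidean distance between two same-sized frames.

    Frames are lists of (r, g, b) tuples; lower means more similar.
    Stand-in for the color-filling comparison.
    """
    assert len(frame_a) == len(frame_b)
    total = 0.0
    for (r1, g1, b1), (r2, g2, b2) in zip(frame_a, frame_b):
        total += ((r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2) ** 0.5
    return total / len(frame_a)

def coverage_difference(frame, target, threshold=128):
    """Fraction of pixels whose binarized (ink vs. background) value disagrees.

    Pixels are grayscale ints in [0, 255]; stand-in for the shape-drawing
    comparison after binarization.
    """
    binarize = lambda px: px < threshold  # True = ink
    diff = sum(binarize(a) != binarize(b) for a, b in zip(frame, target))
    return diff / len(target)
```

For both metrics the frame with the lowest score is selected as the best match.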
We now provide infer/test_VLM.py to batch-query VLMs on the visual puzzle tasks.
The companion script scripts/run_VLM.sh runs two batches: one that sends the baseline tasks and another that appends the available options lists when requested (via --provide_options). Each batch saves results under vlm_output/<mode>/<model_timestamp>/result.json.
Key CLI flags for test_VLM.py:
- `--model`: Name of the VLM endpoint to call
- `--base_url`: Inference service URL
- `--tasks`: List of puzzle tasks to evaluate (each task folder must contain `data.json`)
- `--data_root`: Path containing the task directories (defaults to `data`)
- `--output_root`: Directory to hold structured output (metadata + entries in `result.json`)
- `--threads`, `--max_request_attempts`, `--request_attempt_delay`: Control concurrency and retry timing (mirrors the video script defaults)
- `--provide_options`: Use this flag to append `Options: ...` text for question answering
The runner logs per-task accuracy into each result.json and includes an is_correct flag on every entry, making it easy to aggregate VLM performance.
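With an is_correct flag on every entry, aggregation reduces to counting. A minimal sketch (field names `task` and `is_correct` assumed from the description above; per_task_accuracy is not part of the repo):

```python
def per_task_accuracy(entries):
    """Aggregate per-entry is_correct flags into per-task accuracy.

    entries: list of dicts, each with a "task" name and a boolean
    "is_correct" (names assumed from the result.json description).
    """
    totals, correct = {}, {}
    for e in entries:
        task = e["task"]
        totals[task] = totals.get(task, 0) + 1
        correct[task] = correct.get(task, 0) + bool(e["is_correct"])
    return {task: correct[task] / totals[task] for task in totals}
```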
Generate new puzzle instances to build your own dataset.
bash scripts/generate_data.sh

Generate specific puzzle types:
# Generate a single puzzle type
python gen_data/data_generation.py create_data color_size example_data --limit 10 --seed 42
# Generate multiple types with custom resolution
for pattern in color_size size_grid color_grid; do
python gen_data/data_generation.py create_data $pattern custom_data \
--limit 5 \
--seed 17 \
--target_size "(1280, 704)"
done

Parameters:
- `create_data`: Command to generate puzzle data
- `pattern`: Puzzle type (see the task categories listed above)
- `output_dir`: Directory to save generated puzzles
- `--limit`: Number of instances to generate
- `--seed`: Random seed for reproducibility
- `--target_size`: Output resolution as "(width, height)"
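One way a `--target_size` string like "(1280, 704)" could be parsed safely is with `ast.literal_eval`; this is a sketch, and the actual handling lives in gen_data/data_generation.py:

```python
import ast

def parse_target_size(value: str) -> tuple:
    """Parse a --target_size string such as "(1280, 704)" into (width, height).

    ast.literal_eval evaluates the tuple literal without executing code.
    """
    size = ast.literal_eval(value)
    if (not isinstance(size, tuple) or len(size) != 2
            or not all(isinstance(v, int) and v > 0 for v in size)):
        raise ValueError(f"expected '(width, height)', got {value!r}")
    return size
```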
The data generation code is adapted from the PuzzleVQA project.
