This repository contains the code and data for our paper *VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs*.
The suite includes three main task categories:
- Object Re-Identification: Tests comparative perception by asking if a transformed object reappears in a second image.
- Visual Scavenger Hunt (Chain Trial): Tests saccadic search by requiring models to follow a chain of visual cues across a grid.
- Circuit Connections: Tests smooth visual search by asking models to trace wires on a circuit board.
This code allows you to verify our results, generate new datasets, run models against those datasets, and analyze the results.
## Requirements

- Python 3.8+
- Dependencies: Install the packages listed in `requirements.txt` (e.g., `pip install -r requirements.txt`). The list includes Pillow, numpy, pandas, statsmodels, scipy, scikit-learn, openai, requests, torch, transformers, and tabulate.
- API Keys: If using OpenAI or OpenRouter, set your API keys as environment variables:

```bash
export OPENAI_API_KEY="your_openai_key"
export OPENROUTER_API_KEY="your_openrouter_key"
```
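To confirm the keys are visible before launching a run (a quick sanity check, not part of the repo's tooling):

```bash
# Prints True/False for each key's presence in the environment
python -c "import os; print('OPENAI_API_KEY' in os.environ, 'OPENROUTER_API_KEY' in os.environ)"
```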
## Usage

All functionality is exposed through `main.py`.
Note: If you're using a conda environment, make sure to activate it first:
```bash
conda activate svlm
```

Usage pattern:

```bash
python main.py [COMMON_OPTIONS] <TASK_NAME> [TASK_SPECIFIC_OPTIONS]
```

Example:

```bash
python main.py --num-examples 20 --verbose objreid
```

### Common options

- `--models BACKEND:MODEL_ID …` – evaluate one or more models. Examples: `openai:o4-mini-2025-04-16`, `openrouter:google/gemini-2.5-pro-preview`, `local:Qwen/Qwen2.5-VL-32B-Instruct`
- `--num-examples INT` – number of examples to run
- `--few-shot` – enable few-shot evaluation (uses the canonical demos made during `--make-dataset`)
- `--all-settings` – sweep over few-shot/non-few-shot and other settings (e.g. distractors)
- `--verbose` – print per-example details and raw model outputs
- `--local-model-path PATH` – path to a local model (if not given inline in `--models`)
- `--make-dataset DIR` – generate a new dataset in DIR
- `--load-dataset DIR [DIR …]` – load existing dataset(s) and evaluate
- `--test-image` – generate one sample image for the chosen task and exit
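Putting the common options together, a typical two-step workflow is to build a dataset once and then evaluate models against it. A sketch with illustrative paths; passing two entries to `--models` is an assumption based on the `…` in its synopsis:

```bash
# 1. Generate a fixed 100-example dataset for the scavenger-hunt task
python main.py --make-dataset ./va_data --num-examples 100 visual_attention

# 2. Evaluate two backends on the saved dataset, printing raw outputs
python main.py \
--load-dataset ./va_data \
--models openai:o4-mini-2025-04-16 openrouter:google/gemini-2.5-pro-preview \
--verbose \
visual_attention
```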
Only certain HF models are supported; check `inference/infclient.py`.
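None of the task examples below use `--local-model-path`. Per its description above, it supplies the on-disk path when the `--models` entry does not inline one; the exact pairing below is an assumption, so verify against `inference/infclient.py`:

```bash
# Assumption: the HF model ID goes in --models and the weights path
# in --local-model-path (both values illustrative)
python main.py \
--models local:Qwen/Qwen2.5-VL-32B-Instruct \
--local-model-path /path/to/Qwen2.5-VL-32B-Instruct \
--num-examples 5 \
--verbose \
objreid
```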
Below are the extra flags and typical commands for each task family.
### Object Re-Identification (`objreid`)

Generation-only flags:

| Flag | Meaning | Default |
|---|---|---|
| `--objreid-canvas-size INT` | Output image resolution | 512 |
| `--objreid-trials 1 3 9 …` | Which trial types to generate/evaluate | `1 3 9` |
| `--objreid-no-distractors` | Disable distractor objects | off |
| `--objreid-allow-distractor-overlap` | Allow distractors to overlap the main object | off |
Trial 9 corresponds to the 'Standard' variant from the paper; trial 1 corresponds to the 'Unconnected' variant; and trial 3 corresponds to the 'Pixel Perfect' variant.

#### Example commands
Create a single test image for trial 9:
```bash
python main.py --test-image objreid --objreid-trials 9
```

Make a 50-example dataset for trial 1, few-shot, no distractors:

```bash
python main.py \
--make-dataset ./my_objreid_T1_fs_nodist \
--num-examples 50 \
--few-shot \
objreid \
--objreid-trials 1 \
--objreid-no-distractors
```

Evaluate GPT o4-mini on the `fs1_nd0` split of trial 9 in an existing dataset:

```bash
python main.py \
--load-dataset ./objreid-data \
--models openai:o4-mini-2025-04-16 \
--few-shot \
--verbose \
objreid \
--objreid-trials 9
```

Live inference with a local model (trial 3, 5 examples):

```bash
python main.py \
--models local:/hfpath/to/your/local_model \
--num-examples 5 \
--verbose \
objreid \
--objreid-trials 3
```
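`--all-settings` is documented above but not shown in the examples; a sketch (dataset path illustrative) that sweeps the few-shot and distractor settings for trial 9:

```bash
# Sweeps FS/non-FS and other settings, per the --all-settings description
python main.py \
--load-dataset ./objreid-data \
--models openai:o4-mini-2025-04-16 \
--all-settings \
objreid \
--objreid-trials 9
```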
### Visual Scavenger Hunt (`visual_attention`)

Generation-only flags:

| Flag | Meaning | Default |
|---|---|---|
| `--va-grid-size INT` | Grid dimension (e.g. 3 → 3×3) | 3 |
| `--va-cell-size INT` | Cell height/width in pixels | 100 |
| `--va-chain-length INT` | Number of clue hops | 2 |
#### Example commands
One test image with a 3-step chain:
```bash
python main.py --test-image visual_attention --va-chain-length 3
```

50-example few-shot dataset:

```bash
python main.py \
--make-dataset ./my_va_dataset_fs1 \
--num-examples 50 \
--few-shot \
visual_attention
```

Evaluate Gemini 2.5 Pro on the non-few-shot split:

```bash
python main.py \
--load-dataset ./my_va_dataset \
--models openrouter:google/gemini-2.5-pro-preview \
--num-examples 30 \
--verbose \
visual_attention
```
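The examples above only vary `--va-chain-length`; the grid flags from the table combine the same way. A sketch with illustrative values:

```bash
# 4×4 grid of 120 px cells with a 3-hop clue chain (values illustrative)
python main.py --test-image visual_attention \
--va-grid-size 4 --va-cell-size 120 --va-chain-length 3
```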
### Circuit Connections (`circuits`)

Generation-only flags:

| Flag | Description |
|---|---|
| `--min-components / --max-components INT` | Limits on the number of components |
| `--min-ports / --max-ports INT` | Ports per component |
| `--min-wires / --max-wires INT` | Total wires |
| `--min-cc-wires / --max-cc-wires INT` | Wires within a single component |
| `--wire-color-mode [default\|single\|unique]` | Wire color scheme |
| `--no-wire-crossing` | Forbid wire crossings |
#### Example commands
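The port and intra-component wire bounds from the table are not exercised by the commands below; a sketch combining them (values illustrative):

```bash
# Constrain ports per component and intra-component wires
python main.py \
--test-image \
circuits \
--min-ports 2 --max-ports 4 \
--min-cc-wires 1 --max-cc-wires 2
```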
Single test image, unique wire colors, no crossings:

```bash
python main.py \
--test-image \
circuits \
--wire-color-mode unique \
--no-wire-crossing
```

50-example dataset, single-color wires:

```bash
python main.py \
--make-dataset ./my_circuits_singlecolor \
--num-examples 50 \
circuits \
--wire-color-mode single
```

Evaluate GPT o4-mini on a few-shot, parameter-controlled live run:

```bash
python main.py \
--models openai:o4-mini-2025-04-16 \
--num-examples 5 \
--few-shot \
--verbose \
circuits \
--min-components 3 --max-components 5 \
--min-wires 4 --max-wires 8
```

## Citation

```bibtex
@article{berman2025vlmtunnel,
title={VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs},
author={Berman, Shmuel and Deng, Jia},
journal={arXiv preprint arXiv:2507.13361},
year={2025}
}
```