ROSE v0.1 · arXiv:2606.19965 · Preprint

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

Reference-conditioned Oddity and Symbolic Execution

Yihao Wang¹ · Zijian He¹ · Jie Ren² · Keze Wang¹,†
¹ Sun Yat-sen University · ² Shaanxi Normal University
Preprint, June 2026 · Corresponding author: kezewang@gmail.com
Overview of the ROSE benchmark
ROSE fixes the visual scene while changing the task context. A model must infer the implicit majority reference, identify the exception cells, and produce different formal outputs under global, region-conditioned, and exclusion-based tasks.

Can a multimodal model turn the same visual evidence into the exact action required by the current task context?

Highlights

1,512 scenes
3,024 rendered images
7,560 task instances
5 fine-grained visual sources
98.8% human PASS
44.5 pts largest counting-to-action drop

Coupled scene design

Global counting, local counting, local clicking, visual-region clicking, and exclusion actions are derived from the same underlying scene.

Exact symbolic execution

Outputs are automatically evaluated as COUNT, coordinate-level CLICK, or click-count SUBMIT actions.

Diagnostic bridge controls

Global-click and matched local count-to-click controls separate output validity, coordinate grounding, and context-conditioned action.

Abstract

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

Benchmark Design

ROSE is designed as a controlled readout problem: the underlying visual evidence stays fixed, while the relevant region and required symbolic output change. The target identity is never named explicitly; it must be inferred from the scene-internal majority relation.
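The readout idea can be illustrated with a toy grid. The following is a minimal sketch, not the benchmark's generator: the label grid, the rectangle encoding `(r0, c0, r1, c1)`, and the helper names `find_exceptions` and `in_rect` are all assumptions made for illustration. It infers the majority reference from the grid itself, then derives a global count, a region-conditioned target set, and an exclusion target set from the same exception cells.

```python
from collections import Counter

def find_exceptions(grid):
    """Return (majority_label, 1-indexed (row, col) cells that differ from it)."""
    counts = Counter(label for row in grid for label in row)
    majority, _ = counts.most_common(1)[0]
    exceptions = [
        (r, c)
        for r, row in enumerate(grid, start=1)
        for c, label in enumerate(row, start=1)
        if label != majority
    ]
    return majority, exceptions

def in_rect(cell, rect):
    """True if a 1-indexed cell lies inside the inclusive rectangle (r0, c0, r1, c1)."""
    r, c = cell
    r0, c0, r1, c1 = rect
    return r0 <= r <= r1 and c0 <= c <= c1

# Toy scene: "A" is the implicit majority reference, "B" marks the exception cells.
grid = [
    ["A", "A", "B", "A"],
    ["A", "A", "A", "A"],
    ["A", "B", "A", "A"],
]
majority, odd = find_exceptions(grid)
region = (1, 1, 2, 4)  # hypothetical region: rows 1-2, all columns

global_count = len(odd)                                         # global counting
local_cells = [cell for cell in odd if in_rect(cell, region)]   # region-conditioned
outside_cells = [cell for cell in odd if not in_rect(cell, region)]  # exclusion
```

The three readouts differ even though the underlying exception set never changes, which is the property the benchmark is built to probe.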

ChineseGlyph

Distinct but visually confusable Chinese characters rendered with the same verified font.

EmojiStyle

The same emoji identity shown with different provider-specific rendering styles.

EmojiContent

Visually related emoji identities rendered in a shared style.

PixelEdit

A localized modification introduced into the same pixel-art source asset.

PixelContent

Related but distinct pixel-art assets with matched visual structure or theme.

| Template | Task | Region / context | Required output |
| --- | --- | --- | --- |
| T1 | Global counting | Whole grid | COUNT(n) |
| T2 | Local counting | Numeric row, column, or rectangle | COUNT(n) |
| T3 | Local clicking | Numeric row, column, or rectangle | CLICK(...); DONE |
| T4 | Visual-region clicking | Highlighted region in the image | CLICK(...); DONE |
| T5 | Exclusion clicking with count submission | Outside a specified excluded region | CLICK(...); SUBMIT(n) |

Formal protocol: COUNT(n) · CLICK(Rr,Cc); ...; DONE · CLICK(Rr,Cc); ...; SUBMIT(n)
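To make the three output forms concrete, here is a minimal parsing and exact-match sketch. It is an illustration under stated assumptions rather than the official evaluator: the function names `parse_action` and `score` are hypothetical, whitespace handling is deliberately strict, clicks are compared as an unordered set, and VALID is read as "the response parses into one of the three forms".

```python
import re

_CLICK = re.compile(r"CLICK\(R(\d+),C(\d+)\)")
_COUNT = re.compile(r"COUNT\((\d+)\)")
_SUBMIT = re.compile(r"SUBMIT\((\d+)\)")

def parse_action(text):
    """Parse a response into (kind, clicks, n); return None if it is malformed."""
    parts = [p.strip() for p in text.strip().split(";")]
    if any(p == "" for p in parts):
        return None  # empty response or stray semicolon
    *body, last = parts
    clicks = []
    for p in body:
        m = _CLICK.fullmatch(p)
        if m is None:
            return None  # only CLICK steps may precede the terminator
        clicks.append((int(m.group(1)), int(m.group(2))))
    m = _COUNT.fullmatch(last)
    if m and not body:
        return ("COUNT", frozenset(), int(m.group(1)))
    if last == "DONE":
        return ("DONE", frozenset(clicks), None)
    m = _SUBMIT.fullmatch(last)
    if m:
        return ("SUBMIT", frozenset(clicks), int(m.group(1)))
    return None

def score(response, expected):
    """Return (valid, passed) under exact matching (assumed scoring rule)."""
    parsed = parse_action(response)
    valid = parsed is not None
    passed = valid and parsed == parse_action(expected)
    return valid, passed
```

For example, `score("CLICK(R3,C2); CLICK(R1,C3); DONE", "CLICK(R1,C3); CLICK(R3,C2); DONE")` passes because click order is ignored in this sketch, while a bare `DONE` parses as a valid empty action, the case where no target lies in the requested region.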

Dataset Composition

ROSE v0.1 uses an official scene-level split. All task variants and both renderings derived from the same scene are assigned to the same split. Dev and Test below report task-instance counts.

| Subset | Controlled visual source | Scenes | Dev | Test |
| --- | --- | --- | --- | --- |
| ChineseGlyph | Confusable characters, same verified font | 412 | 555 | 1,505 |
| EmojiStyle | Same emoji, different rendering providers | 300 | 395 | 1,105 |
| EmojiContent | Related emoji identities, shared rendering style | 300 | 395 | 1,105 |
| PixelEdit | Same pixel-art asset, localized edit | 300 | 395 | 1,105 |
| PixelContent | Related but distinct pixel-art assets | 200 | 260 | 740 |
| Total | Five visual sources and five coupled task templates | 1,512 | 2,000 | 5,560 |
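The composition numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes each scene contributes exactly one task instance per template (five per scene) and two renderings; under that reading, dividing the Dev and Test instance counts by five recovers whole scene counts per split, as a scene-level split requires. The helper name `scene_split` is hypothetical.

```python
# Rows copied from the composition table: (scenes, dev instances, test instances).
SUBSETS = {
    "ChineseGlyph": (412, 555, 1505),
    "EmojiStyle": (300, 395, 1105),
    "EmojiContent": (300, 395, 1105),
    "PixelEdit": (300, 395, 1105),
    "PixelContent": (200, 260, 740),
}
TEMPLATES_PER_SCENE = 5   # T1-T5; assumed one instance per template per scene
RENDERINGS_PER_SCENE = 2  # "both renderings" of each scene

def scene_split(scenes, dev, test):
    """Return (dev_scenes, test_scenes) and check they add up to the scene count."""
    assert dev % TEMPLATES_PER_SCENE == 0 and test % TEMPLATES_PER_SCENE == 0
    dev_scenes = dev // TEMPLATES_PER_SCENE
    test_scenes = test // TEMPLATES_PER_SCENE
    assert dev_scenes + test_scenes == scenes
    return dev_scenes, test_scenes

split = {name: scene_split(*row) for name, row in SUBSETS.items()}
total_scenes = sum(row[0] for row in SUBSETS.values())
total_dev = sum(row[1] for row in SUBSETS.values())
total_test = sum(row[2] for row in SUBSETS.values())
total_images = total_scenes * RENDERINGS_PER_SCENE
```

Under these assumptions the totals agree with the headline figures: 1,512 scenes, 3,024 rendered images, and 2,000 + 5,560 = 7,560 task instances.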

Main Results

ROSE is highly solvable by humans, but current MLLMs exhibit a clear and model-dependent gap between counting-oriented readouts and exact region-conditioned actions.

Counting does not automatically become action.

GPT-5.5 achieves the strongest model result with 92.2% Avg. PASS, followed by Gemini-3.1-Pro with 79.4%. The remaining models range from 14.3% to 50.3%.

The dominant signal is not the model ranking alone but the gap between compact counting readouts and context-sensitive coordinate actions. For example, Qwen3.6-Plus reaches 80.3% on global counting but only 37.7% on visual-region clicking.

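| Model | Avg. PASS | VALID |
| --- | --- | --- |
| Qwen3.6-Plus | 50.3 | 99.9 |
| GLM-5V-Turbo | 33.8 | 99.5 |
| Gemini-3.1-Pro | 79.4 | 93.4 |
| GPT-5.5 | 92.2 | 100.0 |
| Human | 98.8 | |

| Model | G-Cnt | L-Cnt | L-Clk | V-Clk | Excl-CS | Glyph | Emoji | Pixel | Avg. | VALID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-VL-Flash | 47.7 | 21.6 | 1.3 | 0.5 | 0.7 | 15.0 | 14.9 | 13.6 | 14.3 | 86.6 |
| Qwen3-VL-Plus | 66.4 | 30.6 | 4.1 | 5.7 | 3.2 | 21.8 | 24.0 | 20.1 | 22.0 | 95.5 |
| Qwen3.6-Plus | 80.3 | 65.3 | 39.5 | 37.7 | 28.9 | 48.4 | 48.0 | 53.6 | 50.3 | 99.9 |
| Claude-Sonnet-4.6 | 62.1 | 21.6 | 9.8 | 20.6 | 4.5 | 28.5 | 25.0 | 20.1 | 23.7 | 61.3 |
| Claude-Opus-4.8 | 64.0 | 21.2 | 9.8 | 21.4 | 4.9 | 30.2 | 25.2 | 20.4 | 24.3 | 62.7 |
| GLM-4.6V | 60.7 | 30.8 | 5.0 | 4.5 | 2.5 | 19.1 | 22.4 | 19.9 | 20.7 | 98.8 |
| GLM-5V-Turbo | 64.2 | 56.9 | 20.6 | 21.4 | 6.1 | 37.5 | 34.5 | 31.3 | 33.8 | 99.5 |
| Gemini-3.1-Pro | 92.8 | 93.9 | 75.4 | 64.2 | 70.4 | 67.6 | 84.5 | 80.1 | 79.4 | 93.4 |
| GPT-5.5 | 93.8 | 97.0 | 93.6 | 84.3 | 92.5 | 87.4 | 94.8 | 92.2 | 92.2 | 100.0 |
| Human | 99.9 | 100.0 | 98.8 | 97.7 | 95.8 | 99.8 | 97.5 | 99.8 | 98.8 | |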
Perception-to-action gap in ROSE
Counting Avg. averages G-Cnt and L-Cnt, while Action Avg. averages L-Clk, V-Clk, and Excl-CS. The gap remains after conditioning on correct global counting.
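The two averages and the resulting gap can be reproduced directly from the per-template PASS rates in the main results table. The sketch below assumes both are unweighted means of the listed columns; the function name `gap` is illustrative.

```python
from statistics import mean

# Per-template PASS rates copied from the main results table:
# (G-Cnt, L-Cnt, L-Clk, V-Clk, Excl-CS)
PASS = {
    "Qwen3-VL-Flash": (47.7, 21.6, 1.3, 0.5, 0.7),
    "Qwen3-VL-Plus": (66.4, 30.6, 4.1, 5.7, 3.2),
    "Qwen3.6-Plus": (80.3, 65.3, 39.5, 37.7, 28.9),
    "Claude-Sonnet-4.6": (62.1, 21.6, 9.8, 20.6, 4.5),
    "Claude-Opus-4.8": (64.0, 21.2, 9.8, 21.4, 4.9),
    "GLM-4.6V": (60.7, 30.8, 5.0, 4.5, 2.5),
    "GLM-5V-Turbo": (64.2, 56.9, 20.6, 21.4, 6.1),
    "Gemini-3.1-Pro": (92.8, 93.9, 75.4, 64.2, 70.4),
    "GPT-5.5": (93.8, 97.0, 93.6, 84.3, 92.5),
}

def gap(scores):
    """Counting Avg. (G-Cnt, L-Cnt) minus Action Avg. (L-Clk, V-Clk, Excl-CS)."""
    counting = mean(scores[:2])
    action = mean(scores[2:])
    return counting - action

gaps = {model: round(gap(scores), 1) for model, scores in PASS.items()}
largest = max(gaps, key=gaps.get)
```

Under this reading, the largest gap belongs to GLM-5V-Turbo at 44.5 points, which appears consistent with the headline figure, while GPT-5.5 shows the smallest gap at 5.3 points.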

Diagnostic Controls

Global-click bridge

Does the gap reduce to coordinate grounding?

This bridge inserts full-grid coordinate localization between global counting and visual-region clicking.

| Model | G-Cnt | G-Clk | V-Clk |
| --- | --- | --- | --- |
| Qwen3.6-Plus | 80.3 | 67.1 | 37.7 |
| Gemini-3.1-Pro | 92.8 | 86.5 | 64.2 |
| GPT-5.5 | 93.8 | 91.8 | 84.3 |

Coordinate grounding explains part of the loss, but the larger degradation often appears only after a region context is introduced.
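One way to read the bridge is to split the total G-Cnt to V-Clk drop into two steps. The sketch below is an illustrative decomposition, not an analysis taken from the paper: it treats G-Cnt minus G-Clk as the cost of coordinate grounding and G-Clk minus V-Clk as the additional cost of the region context. The name `decompose` is hypothetical.

```python
# PASS rates copied from the global-click bridge table: (G-Cnt, G-Clk, V-Clk).
BRIDGE = {
    "Qwen3.6-Plus": (80.3, 67.1, 37.7),
    "Gemini-3.1-Pro": (92.8, 86.5, 64.2),
    "GPT-5.5": (93.8, 91.8, 84.3),
}

def decompose(g_cnt, g_clk, v_clk):
    """Return (grounding_drop, region_drop) in percentage points."""
    grounding_drop = round(g_cnt - g_clk, 1)  # counting -> full-grid clicking
    region_drop = round(g_clk - v_clk, 1)     # full-grid -> region-conditioned clicking
    return grounding_drop, region_drop

drops = {model: decompose(*row) for model, row in BRIDGE.items()}
for model, (grounding, region) in drops.items():
    print(f"{model}: grounding {grounding} pts, region context {region} pts")
```

For all three models in the table, the region-context step is the larger of the two drops (for example, 29.4 versus 13.2 points for Qwen3.6-Plus).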

Matched local count-to-click

Does correct regional counting support exact action?

Each matched pair uses the same image, numeric region, and regional target set; only the output operation changes from COUNT to CLICK.

| Model | mL-Cnt | L-Clk | L-Clk† | Fail† |
| --- | --- | --- | --- | --- |
| Qwen3.6-Plus | 63.2 | 39.5 | 52.7 | 47.3 |
| GPT-5.5 | 96.7 | 93.6 | 95.8 | 4.2 |

Correct cardinality is highly predictive of exact action for GPT-5.5, but not for Qwen3.6-Plus, especially when the required action is empty.
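The matched comparison can be sketched as a conditional pass rate. The page does not define the dagger columns, so the following rests on an assumption: L-Clk† is read as the click PASS rate restricted to pairs whose matched count was correct, and Fail† as its complement (the two columns sum to 100 in both rows, which is at least consistent with this reading). The records below are toy data, not benchmark results, and `matched_rates` is a hypothetical helper.

```python
def matched_rates(pairs):
    """Summarize matched (count_correct, click_pass) boolean pairs as percentages."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one matched pair is required")
    n = len(pairs)
    # Click outcomes for the pairs whose matched count was correct.
    clicks_given_count = [click for count, click in pairs if count]
    result = {
        "mL-Cnt": 100.0 * len(clicks_given_count) / n,
        "L-Clk": 100.0 * sum(click for _, click in pairs) / n,
    }
    if clicks_given_count:
        cond = 100.0 * sum(clicks_given_count) / len(clicks_given_count)
        result["L-Clk†"] = cond            # assumed: click PASS given correct count
        result["Fail†"] = 100.0 - cond     # assumed: complement of the above
    return result

# Toy records: 5 of 8 counts correct; 3 of those 5 also click correctly.
toy_pairs = [
    (True, True), (True, True), (True, True), (True, False), (True, False),
    (False, True), (False, False), (False, False),
]
rates = matched_rates(toy_pairs)
```

On the toy data this yields mL-Cnt 62.5, L-Clk 50.0, L-Clk† 60.0, and Fail† 40.0, showing how a model can count a region correctly and still miss the exact action on a sizable share of the same items.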

Qualitative Failure Cases

Even when the independently queried global-count response is correct, the model may fail to rebind the same visual evidence to the current action context.

Representative ROSE failure cases
Representative failures include anchoring on the global count, ignoring the specified region, and failing to abstain when no valid target remains.

BibTeX

@misc{wang2026rose,
  title={ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models},
  author={Yihao Wang and Zijian He and Jie Ren and Keze Wang},
  year={2026},
  eprint={2606.19965},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.19965}
}