We've been assuming: if a large model can correctly describe an image, it "understands" that image.
This paper, published May 14, says: not necessarily.
The paper is titled "Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs" and comes from researchers at Nanyang Technological University (Ziwei Liu's team). Its core finding is counterintuitive: omnimodal LLMs (models that process text, images, audio, and video together) show a systematic gap between "representation-level" visual understanding and "action-level" output.
The Discovery
Simply put: a model may genuinely "see" image content (internal representations are correct), but when answering questions or performing tasks, its output doesn't match that understanding.
This isn't a "hallucination" problem. Hallucination is when a model fabricates non-existent information. This is weirder: the model "knows" the correct answer, in the sense that it can be extracted from its internal representations, yet it outputs a different one.
The paper uses the "Senses Wide Shut" metaphor — like someone looking at things with eyes open but reacting as if they didn't see.
Why This Matters
Omnimodal models are a hot direction in 2026. GPT-4o, Gemini, Qwen-VL, and Claude's visual capabilities are all iterating fast, and everyone is racing to "support more modalities."
But this paper makes a more fundamental point: "seeing" ≠ "using."
If a medical AI's internal representations correctly encode a tumor in an X-ray (representation correct) but its diagnostic suggestion is "no abnormalities found" (action wrong), the model's clinical value is zero, or even negative.
Technical Details
Key methodology:
- Probe internal representations — directly read the model's visual representations to confirm "what it saw"
- Compare output behavior — check textual/action output for the same visual input
- Quantify the gap — measure the inconsistency between representation and action
This goes much deeper than traditional "give a model an image, see what it says." Traditional benchmarks only look at output, not what's happening "inside the model's head."
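The three steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the paper's actual pipeline: a nearest-centroid classifier stands in for whatever probe the authors use, the "hidden states" are made-up 2-D vectors, and the names `fit_centroids`, `probe_predict`, `gap_report`, and `GapReport` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GapReport:
    probe_accuracy: float    # how often the probe decodes the right label
    output_accuracy: float   # how often the model's stated answer is right
    knows_but_misses: float  # share of items: probe right, stated answer wrong


def fit_centroids(features, labels):
    """Step 1a: fit a nearest-centroid probe (mean feature vector per label)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        total = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            total[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in total] for y, total in sums.items()}


def probe_predict(centroids, x):
    """Step 1b: decode the label whose centroid is closest to x."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], x))
    return min(centroids, key=sq_dist)


def gap_report(probe_preds, model_outputs, labels):
    """Steps 2 and 3: compare probe and output against labels, then quantify."""
    n = len(labels)
    if n == 0 or len(probe_preds) != n or len(model_outputs) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    probe_hits = sum(p == y for p, y in zip(probe_preds, labels))
    output_hits = sum(o == y for o, y in zip(model_outputs, labels))
    misses = sum(p == y and o != y
                 for p, o, y in zip(probe_preds, model_outputs, labels))
    return GapReport(probe_hits / n, output_hits / n, misses / n)


# Toy stand-ins for hidden states (made-up 2-D vectors, not real activations).
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["cat", "cat", "dog", "dog"]
test_x = [[0.8, 0.2], [0.2, 0.8], [0.7, 0.1], [0.1, 0.7]]
test_y = ["cat", "dog", "cat", "dog"]
model_outputs = ["cat", "cat", "dog", "dog"]  # what the model actually said

centroids = fit_centroids(train_x, train_y)
probe_preds = [probe_predict(centroids, x) for x in test_x]
report = gap_report(probe_preds, model_outputs, test_y)
print(report)
```

On this toy data the probe decodes all four held-out items, while the stated answers are right on only two, so `knows_but_misses` comes out at 0.5. The probe is fit on one set of examples and scored on another, so its accuracy reflects what is decodable from the representation and not memorized training labels.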
My Take
The paper's value isn't "pointing out a problem" — everyone knows multimodal models have issues. Its value is precisely locating the problem at the "representation-action" interface layer.
Direct implications:
- Model evaluation: output-only benchmarks may badly misjudge model capabilities, because they cannot see what a model represents internally but fails to express
- Safety alignment: if a model "knows" but "doesn't say," traditional RLHF, which rewards outputs only, may fail
- Multimodal Agents: agents making decisions based on visual understanding need additional verification layers
Ziwei Liu's team consistently produces high-quality multimodal research. If these findings replicate across more models, the omnimodal development roadmap may need rethinking — not "add more modalities" but "ensure existing modality understanding reliably translates to action."
The next challenge for multimodal isn't "make it see more." It's "make it act on what it sees."
Main sources: