ChaoBro

Omnimodal LLMs' 'Sensory Disconnect': New Paper Reveals Representation-Action Gap

We've been assuming: if a large model can correctly describe an image, it "understands" that image.

This paper, published May 14, says: not necessarily.

The paper is titled "Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs" and comes from researchers at Nanyang Technological University (Ziwei Liu's team). The core finding is counterintuitive: omnimodal LLMs (models that process text, images, audio, and video together) show a systematic gap between "representation-level" visual understanding and "action-level" output.

The Discovery

Simply put: a model may genuinely "see" image content (internal representations are correct), but when answering questions or performing tasks, its output doesn't match that understanding.

This isn't a "hallucination" problem. Hallucination is when a model fabricates non-existent information. This is weirder: the model knows the correct answer (it can be extracted from its internal representations), but gives a different one.

The paper uses the "Senses Wide Shut" metaphor — like someone looking at things with eyes open but reacting as if they didn't see.

Why This Matters

Omnimodal models are a hot direction in 2026. GPT-4o, Gemini, Qwen-VL, and Claude are all iterating fast on visual capabilities. Everyone's racing to "support more modalities."

But this paper makes a more fundamental point: "Seeing" ≠ "Using."

If a medical AI can correctly identify tumors in X-rays (representation correct) but gives a diagnostic suggestion of "no abnormalities found" (action wrong), the model's clinical value is zero — or negative.

Technical Details

Key methodology:

  1. Probe internal representations — directly read the model's visual representations to confirm "what it saw"
  2. Compare output behavior — check textual/action output for the same visual input
  3. Quantify the gap — measure the inconsistency between representation and action

This goes much deeper than traditional "give a model an image, see what it says." Traditional benchmarks only look at output, not what's happening "inside the model's head."
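
The three steps above can be sketched in a few lines. This is only an illustration on synthetic data, not the paper's actual protocol: the least-squares probe, the function names (`fit_linear_probe`, `probe_predict`, `gap_report`), and the "knows_but_wrong" metric are all assumptions made here to show the general idea of comparing what a probe can read from hidden states against what the model outputs.

```python
import numpy as np

def fit_linear_probe(hidden, labels):
    """Fit a least-squares linear probe mapping hidden states (n, d) to binary labels."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)  # regress onto -1/+1
    return w

def probe_predict(w, hidden):
    """Step 1: read out what the representation encodes."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    return (X @ w > 0).astype(int)

def gap_report(probe_preds, output_preds, labels):
    """Steps 2 and 3: compare probe readout with model output and quantify the gap."""
    probe_ok = probe_preds == labels
    output_ok = output_preds == labels
    return {
        "probe_acc": float(probe_ok.mean()),
        "output_acc": float(output_ok.mean()),
        # fraction of cases where the representation is right but the output is wrong
        "knows_but_wrong": float((probe_ok & ~output_ok).mean()),
    }

# Synthetic stand-in for a real model (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
hidden = rng.normal(size=(400, 8))
hidden[:, 0] += 3.0 * (2 * labels - 1)  # the label is linearly encoded in one dimension

w = fit_linear_probe(hidden[:200], labels[:200])  # train the probe on the first half
test_labels = labels[200:]
probe_preds = probe_predict(w, hidden[200:])

# Simulated "action": outputs that disagree with the truth about 30% of the time.
flip = rng.random(200) < 0.3
output_preds = np.where(flip, 1 - test_labels, test_labels)

report = gap_report(probe_preds, output_preds, test_labels)
print(report)
```

With a real model, `hidden` would come from an intermediate layer's activations and `output_preds` from the parsed answers; the probe should also be evaluated on held-out examples, as it is here, so that a high probe accuracy reflects information in the representation rather than probe overfitting.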

My Take

The paper's value isn't "pointing out a problem" — everyone knows multimodal models have issues. Its value is precisely locating the problem at the "representation-action" interface layer.

Direct implications:

  • Model evaluation: output-only benchmarks may severely overestimate or underestimate model capabilities
  • Safety alignment: if a model "knows" but "doesn't say," traditional RLHF may fail
  • Multimodal Agents: agents making decisions based on visual understanding need additional verification layers

Ziwei Liu's team consistently produces high-quality multimodal research. If these findings replicate across more models, the omnimodal development roadmap may need rethinking — not "add more modalities" but "ensure existing modality understanding reliably translates to action."

The next challenge for multimodal isn't "make it see more." It's "make it act on what it sees."


Main sources: