MMSkills: SJTU Wants Visual Agents to Truly "See" and "Act", Not Just Memorize

The field of visual agents is at once broad and narrow.

It's broad because almost every scenario in embodied AI, robotic manipulation, and screen interaction depends on it—an agent must first "understand" the visual scene before it knows "what to do." It's narrow because, so far, most so-called "visual agents" are essentially doing pattern matching: given an image, output an action, with the intermediate "understanding" largely relying on the model guessing based on similar scenes it has seen during training.

MMSkills (Towards Multimodal Skills for General Visual Agents), from Shanghai Jiao Tong University, targets exactly this pain point.

What Are "Multimodal Skills"?

The paper's core argument is straightforward: a truly general-purpose visual agent shouldn't just "map images to actions." It should master "skills"—modular multimodal capabilities that can be transferred across tasks and reused across scenarios.

The key here lies in the distinction between a "skill" and an "action."

An "action" is atomic: clicking, dragging, grasping, moving. A "skill" is structured: it combines multiple actions, adjusts strategies based on visual feedback, and makes different choices depending on the context. For example, "opening an application" is a skill. It may involve a sequence like "find icon → click → wait for loading → confirm window appearance," but the agent doesn't need to relearn this entire workflow every time it executes the task.
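
The distinction can be made concrete with a minimal sketch. It is not the paper's implementation: `ToyScreen` and `open_application` are hypothetical names invented here. The atomic actions each do one thing, while the skill sequences them and branches on what the screen shows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyScreen:
    """Hypothetical stand-in for a GUI environment (an assumption, not from the paper)."""
    icons: List[str]
    open_windows: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    # Atomic actions and observations: each does one thing and knows nothing about goals.
    def find_icon(self, name: str) -> bool:
        self.log.append(f"find:{name}")
        return name in self.icons

    def click(self, name: str) -> None:
        self.log.append(f"click:{name}")
        if name in self.icons and name not in self.open_windows:
            self.open_windows.append(name)

    def window_visible(self, name: str) -> bool:
        self.log.append(f"check:{name}")
        return name in self.open_windows


def open_application(screen: ToyScreen, name: str, max_tries: int = 2) -> bool:
    """A skill: sequences atomic actions and branches on visual feedback."""
    if screen.window_visible(name):      # already open, nothing to do
        return True
    if not screen.find_icon(name):       # icon not on screen, stop early
        return False
    for _ in range(max_tries):
        screen.click(name)
        if screen.window_visible(name):  # confirm the window appeared
            return True
    return False


screen = ToyScreen(icons=["browser", "editor"])
opened = open_application(screen, "browser")    # succeeds
missing = open_application(screen, "terminal")  # fails: no such icon
```

The caller only asks for the outcome ("browser is open"); the find, click, and confirm steps stay inside the skill and can be reused in any task that needs them.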

MMSkills is designed to enable agents to learn these structured multimodal skills, rather than isolated action-observation pairs.

Methodology: Enabling Agents to Learn Skills Like Humans

The paper's methodology features several notable design choices:

Skill Representation. MMSkills encodes skills into a multimodal representation—simultaneously containing visual information and action sequence information. This means that when an agent learns a skill, it doesn't just memorize "if you see A, do B." Instead, it understands "under what visual conditions, executing which action sequence will achieve what outcome."
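
As a rough illustration of "visual condition, action sequence, outcome", the sketch below stores each skill as a small record and picks the skill whose visual condition best matches the current observation. Everything here is an assumption made for illustration: the `Skill` fields, the hand-written three-dimensional vectors standing in for image embeddings, and cosine-similarity selection. The article does not describe how MMSkills actually encodes or retrieves skills.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Skill:
    name: str
    visual_condition: Sequence[float]  # stand-in for an image embedding (assumed)
    actions: Sequence[str]             # action sequence to execute
    outcome: str                       # expected result after execution


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_skill(observation: Sequence[float], library: Sequence[Skill]) -> Skill:
    """Return the skill whose visual condition is most similar to the observation."""
    return max(library, key=lambda s: cosine(observation, s.visual_condition))


library = [
    Skill("open_app", (1.0, 0.0, 0.0),
          ("find_icon", "click", "wait", "confirm_window"), "app window visible"),
    Skill("close_dialog", (0.0, 1.0, 0.0),
          ("find_close_button", "click"), "dialog dismissed"),
]

# A made-up observation embedding that resembles the "open_app" condition.
chosen = select_skill((0.9, 0.1, 0.0), library)
```

The point of the record is that condition, actions, and outcome travel together, so the agent can reason about when a skill applies and what it should achieve, not only which action follows which image.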

Skill Composition. Acquired skills can be combined. This mirrors human learning: you first learn "open the door," then learn "turn on the light," and subsequently you can accomplish the composite task of "enter the room and turn on the light" without having to learn it from scratch.
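
The door-and-light example can be sketched as plain function composition. This is a toy model under stated assumptions: skills are functions from a state dictionary to a new state dictionary, and `enter_room` and `compose` are hypothetical helpers added here to connect the two skills the text names. It says nothing about how MMSkills composes skills internally.

```python
from typing import Callable, Dict

State = Dict[str, bool]
SkillFn = Callable[[State], State]


def open_door(state: State) -> State:
    return {**state, "door_open": True}


def enter_room(state: State) -> State:
    # Entering only works once the door is open.
    if state.get("door_open"):
        return {**state, "in_room": True}
    return state


def turn_on_light(state: State) -> State:
    # The switch is only reachable from inside the room.
    if state.get("in_room"):
        return {**state, "light_on": True}
    return state


def compose(*skills: SkillFn) -> SkillFn:
    """Build a composite skill that runs the given skills in order."""
    def composite(state: State) -> State:
        for skill in skills:
            state = skill(state)
        return state
    return composite


# The composite task is assembled from learned parts, not learned from scratch.
enter_and_light = compose(open_door, enter_room, turn_on_light)
result = enter_and_light({"door_open": False, "in_room": False, "light_on": False})
```

Each skill checks its own precondition, so running `turn_on_light` alone from outside the room leaves the state unchanged, while the composed sequence succeeds.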

Cross-Task Generalization. This is the core capability MMSkills aims to demonstrate: whether learned skills can be applied to tasks unseen during training.
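
One common way to measure this is to hold some tasks out of training entirely and compare success rates on seen and unseen tasks. The sketch below shows only that bookkeeping. The task names and episode outcomes are placeholders invented for illustration; they are not results from the paper, and the paper's actual evaluation protocol is not described in this article.

```python
from typing import Dict, List, Set


def success_rate(results: Dict[str, List[bool]], tasks: Set[str]) -> float:
    """Fraction of successful episodes across the given tasks (0.0 if there are none)."""
    episodes = [ok for task in tasks for ok in results.get(task, [])]
    return sum(episodes) / len(episodes) if episodes else 0.0


# Tasks the agent was trained on (hypothetical names).
train_tasks = {"open_app", "close_dialog"}

# Placeholder per-episode outcomes, NOT numbers from the paper.
results = {
    "open_app": [True, True, True, False],
    "close_dialog": [True, True],
    "rename_file": [True, False],
    "change_setting": [False, True, True, True],
}

# Everything evaluated but never trained on counts as unseen.
unseen_tasks = set(results) - train_tasks

seen_rate = success_rate(results, train_tasks)
unseen_rate = success_rate(results, unseen_tasks)
```

The gap between `seen_rate` and `unseen_rate` is the quantity of interest: a method that generalizes well across tasks keeps that gap small.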

Differences from Existing Approaches

Mainstream approaches for training visual agents currently fall into two broad categories:

The first is end-to-end training, exemplified by models like RT-2 and the VLA series, which directly map images to actions. The advantage of this approach is its simplicity, but its drawbacks include a lack of interpretability and the difficulty of transferring learned capabilities to new tasks.

The second is planning-based: a large model first makes high-level decisions, then invokes low-level controllers to execute them. This approach is flexible, but it depends heavily on the large model's visual understanding, which is currently a weak point. LLMs excel at language tasks but still struggle with fine-grained visual comprehension.

MMSkills charts a third path: introducing the abstraction of "skills" at an intermediate layer. It doesn't pursue the simplicity of end-to-end mapping, nor does it rely solely on the generalization prowess of large models. Instead, it builds the agent's capability foundation through systematic skill learning and composition.

Experiments and Results

The paper evaluates MMSkills on multiple visual manipulation benchmarks. According to the reported results, its main strength is cross-task generalization: on tasks unseen during training, it significantly outperforms both end-to-end and large-model-based approaches.

This supports the paper's core hypothesis: structured skill learning yields better generalization than mere pattern matching.

My Take

MMSkills is heading in the right direction. True generality in visual agents cannot be reached by the brute-force route of "more data + larger models" alone. It requires structured knowledge representations and composable capability units, which is exactly what the "skill" abstraction provides.

However, the results presented so far are primarily confined to academic benchmarks. There is still a massive gap between academic benchmarks and real-world applications. Visual inputs in real-world scenarios are far more complex than the datasets used in the paper, and the definitions and boundaries of skills are nowhere near as clear-cut as in benchmark tasks.

But the direction is correct. When agents move beyond simply "reacting to what they see" and truly master reusable, composable skills, general visual intelligence will have taken a substantial step forward.

Primary Source:

Hugging Face Daily Papers - MMSkills

