MMSkills: SJTU Decomposes Visual Agent Capabilities into a "Skill Pack"—A New Paradigm for Multimodal Agents

Over the past two years, the development trajectory of AI Agents has largely followed this pattern: take a powerful large language model, equip it with tool-calling capabilities, and hope it can handle everything on its own.

The problem with this approach is that a single "universal brain" is not enough once tasks become complex.

Just as you wouldn't ask a general practitioner to perform heart surgery, you shouldn't expect a general-purpose Agent to handle all visual tasks.

SJTU's MMSkills proposes a different approach: decompose the Agent's capabilities into independent "skills" that can be combined on demand and invoked flexibly.

What is a Multimodal "Skill"?

In MMSkills, a "Skill" is not a traditional API call, but a complete perception-decision-execution unit.

Each skill comprises three elements:

Trigger Condition: Under what circumstances this skill should be invoked
Input Modality: What kind of visual input is required (screenshots, icons, page structure, etc.)
Output Behavior: What action to execute (click, type, scroll, etc.)

For example, "locate the search box and enter keywords" is one skill, "recognize and fill in a CAPTCHA" is another, and "extract data from a specific column in a table" is yet another.
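The three elements above can be sketched as a small data structure. This is only an illustration under stated assumptions: the article does not describe how MMSkills actually represents skills, so the `Skill` class, its field names, and the dictionary-based observation and action formats are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical formats, not taken from MMSkills itself.
Observation = Dict[str, object]  # e.g. {"screenshot": ..., "elements": [...]}
Action = Dict[str, object]       # e.g. {"type": "click", "target": "search_box"}


@dataclass
class Skill:
    name: str
    trigger: Callable[[Observation], bool]      # trigger condition
    input_modalities: List[str]                 # required visual inputs
    act: Callable[[Observation], List[Action]]  # output behavior

    def applicable(self, obs: Observation) -> bool:
        # The skill applies only if every required input is present
        # and the trigger condition holds for this observation.
        has_inputs = all(m in obs for m in self.input_modalities)
        return has_inputs and self.trigger(obs)


# "Locate the search box and enter keywords" as one skill.
search_skill = Skill(
    name="search_box_query",
    trigger=lambda obs: "search_box" in obs.get("elements", []),
    input_modalities=["screenshot", "elements"],
    act=lambda obs: [
        {"type": "click", "target": "search_box"},
        {"type": "type", "text": obs.get("query", "")},
    ],
)

obs = {"screenshot": b"", "elements": ["search_box", "logo"], "query": "laptop"}
actions = search_skill.act(obs) if search_skill.applicable(obs) else []
```

Keeping the trigger separate from the behavior means the Agent can check whether a skill applies without executing it.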

These skills can be trained, tested, and updated independently, then assembled by the Agent into a complete workflow when needed.

Why is this approach valuable?

First, composability. Like LEGO bricks, a limited set of skills can be combined into a very large number of workflows. Adding a new task doesn't require retraining the entire model; you just need to combine existing skills or add a new one.

Second, debuggability. When an Agent fails, you can precisely pinpoint which skill went wrong, rather than being helpless in the face of an end-to-end black-box model.
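To make the debuggability point concrete, here is a minimal sketch of a workflow runner that records which skill failed. The runner, the state dictionary, and the three toy skills are assumptions made for illustration; they are not MMSkills code.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
Step = Tuple[str, Callable[[State], State]]  # (skill name, skill function)


def run_workflow(
    steps: List[Step], state: State
) -> Tuple[State, Optional[str], List[str]]:
    """Run skills in order and stop at the first failure.

    Returns the last good state, the name of the failing skill
    (None if every skill succeeded), and a per-skill trace.
    """
    trace: List[str] = []
    for name, fn in steps:
        try:
            state = fn(state)
            trace.append(f"{name}: ok")
        except Exception as exc:
            trace.append(f"{name}: failed ({exc})")
            return state, name, trace
    return state, None, trace


# Toy skills standing in for real perception-decision-execution units.
def locate_search_box(state: State) -> State:
    return {**state, "focused": "search_box"}


def solve_captcha(state: State) -> State:
    raise ValueError("captcha image not found")


def extract_column(state: State) -> State:
    return {**state, "rows": []}


steps = [
    ("locate_search_box", locate_search_box),
    ("solve_captcha", solve_captcha),
    ("extract_column", extract_column),
]
final_state, failed_skill, trace = run_workflow(steps, {})
```

Because each step is named, the failure is attributed to `solve_captcha` rather than to the workflow as a whole.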

Third, transferability. A "search for products" skill trained on an e-commerce site might only need minor adjustments to work on other websites. Skill-level transfer is more flexible and cost-effective than model-level transfer.

Technical Details

The MMSkills architecture features several noteworthy designs:

Skill Registry. This is a structured skill library where each skill has standardized descriptions and metadata. When executing a task, the Agent first retrieves relevant skills from the registry and then combines them as needed.
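A registry of this kind could look roughly like the sketch below. The article does not specify the metadata schema or the retrieval method, so `SkillEntry`, `SkillRegistry`, and the simple word-overlap ranking are assumptions; a real system would more plausibly use embedding-based retrieval.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkillEntry:
    name: str
    description: str  # standardized natural-language description
    modalities: List[str] = field(default_factory=list)  # metadata


class SkillRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, SkillEntry] = {}

    def register(self, entry: SkillEntry) -> None:
        # Reject duplicate names so two skills cannot shadow each other.
        if entry.name in self._entries:
            raise ValueError(f"duplicate skill name: {entry.name}")
        self._entries[entry.name] = entry

    def retrieve(self, task: str, top_k: int = 2) -> List[SkillEntry]:
        # Rank skills by how many words the task shares with the description.
        task_words = set(task.lower().split())
        scored = []
        for entry in self._entries.values():
            overlap = len(task_words & set(entry.description.lower().split()))
            if overlap > 0:
                scored.append((overlap, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:top_k]]


registry = SkillRegistry()
registry.register(SkillEntry(
    "search_box_query", "locate the search box and enter keywords", ["screenshot"]))
registry.register(SkillEntry(
    "captcha_fill", "recognize and fill in a captcha", ["screenshot"]))
registry.register(SkillEntry(
    "table_column_extract", "extract data from a column in a table", ["page_structure"]))

matches = registry.retrieve("enter keywords in the search box")
best_match = matches[0].name
```

Word overlap is noisy (common words such as "in" also count), which is one reason the standardized descriptions the article mentions matter for retrieval quality.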

Multimodal Alignment. Skills need to understand not only visual information but also text instructions. MMSkills establishes a fine-grained alignment mechanism between vision and language, ensuring skills correctly interpret user intent.

Dynamic Skill Selection. Faced with a new task, the Agent doesn't guess randomly. Instead, based on the task description and historical experience, it selects the most appropriate skill combination from the registry. This selection process itself is a learning process.
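One simple way to combine a task description with historical experience is to weight each skill's relevance by its past success rate. The sketch below does that with a Laplace-smoothed estimate. It is an assumption chosen for illustration, not the selection mechanism MMSkills uses, and the `relevance` scores are hypothetical inputs that a retrieval step would supply.

```python
from collections import defaultdict
from typing import Dict


class SkillSelector:
    def __init__(self) -> None:
        self.successes: Dict[str, int] = defaultdict(int)
        self.attempts: Dict[str, int] = defaultdict(int)

    def success_estimate(self, skill: str) -> float:
        # Laplace-smoothed success rate; a skill with no history starts at 0.5.
        return (self.successes[skill] + 1) / (self.attempts[skill] + 2)

    def select(self, relevance: Dict[str, float]) -> str:
        # relevance: task-to-skill match scores in [0, 1] (hypothetical input).
        return max(relevance, key=lambda s: relevance[s] * self.success_estimate(s))

    def record(self, skill: str, succeeded: bool) -> None:
        # Feed the outcome back so later selections reflect experience.
        self.attempts[skill] += 1
        if succeeded:
            self.successes[skill] += 1


selector = SkillSelector()
relevance = {"search_via_box": 0.9, "search_via_menu": 0.7}

first_choice = selector.select(relevance)  # no history: relevance decides

for _ in range(3):
    selector.record("search_via_box", succeeded=False)

second_choice = selector.select(relevance)  # repeated failures shift the choice
```

With no history the more relevant skill wins; after three recorded failures its weighted score drops below the alternative's, so the selector switches.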

Relationship with the Agent Skills Ecosystem

You may have noticed a surge of Agent Skills projects on GitHub recently—academic-research-skills, scientific-agent-skills, tech-leads-club/agent-skills, and more.

MMSkills focuses on skillification in the visual and multimodal domain, while these projects primarily target coding and research. They share the same core philosophy: shifting Agent capabilities from "built into the model" to "externally pluggable."

This is no coincidence. As Agents transition from "experimental novelties" to "production-ready tools," modularity, composability, and maintainability of capabilities become critical.

Challenges

While MMSkills' approach is clear, it also faces several challenges:

Skill Explosion. As application scenarios expand, the number of skills could grow very quickly. How do you manage thousands of skills? How do you avoid conflicts and redundancy between them?

Cross-Skill Coordination. When multiple skills need to work together, how do you ensure accurate and efficient information transfer between them?

Skill Evaluation. How do you measure the quality of a skill? Success rate alone may not be enough—some skills might perform well in most cases but fail in critical edge scenarios.
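A small worked example shows why an aggregate success rate can mislead. The scenario labels and the numbers below are invented for illustration only; they are not results from MMSkills.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rates(results: List[Tuple[str, bool]]) -> Tuple[float, Dict[str, float]]:
    """Compute overall and per-scenario success rates for one skill.

    results: (scenario label, succeeded) pairs.
    """
    if not results:
        raise ValueError("no results to evaluate")
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for scenario, ok in results:
        totals[scenario] += 1
        passes[scenario] += int(ok)
    overall = sum(passes.values()) / sum(totals.values())
    per_scenario = {s: passes[s] / totals[s] for s in totals}
    return overall, per_scenario


# Made-up data: the skill always works on standard pages, never behind a login wall.
results = [("standard_page", True)] * 18 + [("login_wall", False)] * 2
overall, per_scenario = success_rates(results)
```

Here the overall rate is 18 out of 20, which looks healthy, while the per-scenario breakdown shows the skill failing every time in the rarer login-wall case.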

Trend Assessment

The emergence of MMSkills is not an isolated event. Alongside CLI-Anything (making all software natively Agent-driven), agentmemory (persistent Agent memory), and FORGE (self-evolving Agent memory), it points to a broader trend:

Agents are evolving from a "smart large model" into "a system composed of multiple specialized components."

The significance of this shift may be greater than it first appears. As Agent architecture moves from monolithic to modular, its scalability, reliability, and customizability could improve substantially.

This doesn't mean large models are becoming less important—quite the opposite. The large model serves as the "orchestration center" and "glue" for this system. But the orchestration center itself doesn't need to know every detail; it only needs to know how to direct various specialized skills to work in concert.

This might just be the right path for Agents to achieve large-scale adoption.
