There's an old problem in robotics: you teach a robot to do one thing, and it learns. But change the scene, swap the object, or adjust the lighting, and it fails.
This is the dilemma faced by Vision-Language-Action (VLA) models. VLA models have already made significant strides in semantic generalization—they can understand natural language instructions like "put the red cup on the left side of the table" and translate them into actions. But at their core, they learn a reactive mapping: see a certain frame, execute a certain action. They don't care about "what will happen to the world if I do this."
A new survey by the OpenMOSS team gathers the emerging answers to this limitation under one unified paradigm: WorldActionModels (WAMs).
From "See and Act" to "Think Before You Act"
The core idea behind WAMs is straightforward: integrate a world model (a model that predicts environmental dynamics) into the action generation pipeline.
Existing VLA models learn P(action | observation, instruction)—given the current observation and instruction, output an action. WAMs learn P(future state, action | current state, instruction)—not only outputting an action but also predicting how the world will change after that action is executed.
This added predictive capability gives a robot a form of "imagination." Before executing an action, the robot can internally simulate: "If I reach for this cup, how will the cup move, where will my hand end up, and will it bump into nearby objects?"
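The imagine-then-act loop can be sketched in a few lines of Python. Everything here is a hypothetical toy rather than anything taken from the survey: the 1-D table, `toy_world_model`, `grasp_cost`, and the candidate actions are assumptions chosen only to make the loop concrete.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]  # (hand position, cup position) on a 1-D table (toy assumption)


def toy_world_model(state: State, action: float) -> State:
    """Hypothetical dynamics: the hand moves by `action` and pushes the cup if it overshoots."""
    hand, cup = state
    new_hand = hand + action
    new_cup = max(cup, new_hand)  # the cup is shoved along once the hand passes it
    return (new_hand, new_cup)


def imagine_and_choose(
    state: State,
    candidates: Sequence[float],
    world_model: Callable[[State, float], State],
    cost: Callable[[State], float],
) -> Tuple[float, State]:
    """Roll each candidate through the world model and keep the cheapest imagined outcome."""
    scored = [(cost(world_model(state, a)), a) for a in candidates]
    _, best_action = min(scored)
    return best_action, world_model(state, best_action)


start: State = (0.0, 1.0)
cup_start = start[1]


def grasp_cost(predicted: State) -> float:
    hand, cup = predicted
    # Reward reaching the cup; heavily penalise knocking it away from where it started.
    return abs(hand - cup) + 10.0 * abs(cup - cup_start)


action, predicted = imagine_and_choose(start, [0.5, 1.0, 1.5], toy_world_model, grasp_cost)
print(action, predicted)
```

In this toy setup the overshooting candidate is rejected because its imagined outcome moves the cup, which a purely reactive observation-to-action mapping has no way to check before acting.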
Two Architectural Approaches
The survey categorizes existing WAM approaches into two main classes:
Cascaded WAMs. First, a world model predicts the future state, and then a policy model generates actions based on that predicted state. The two modules are independent and can be trained separately. The advantage is clear modularity and ease of debugging; the downside is error accumulation: if the world model's prediction is inaccurate, the policy model, which takes that prediction as its input, carries the error into its actions.
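A minimal sketch of the cascaded layout, assuming a toy 1-D tracking task. The functions `world_model`, `policy`, and `cascaded_wam`, and the `bias` parameter used to inject a prediction error, are all hypothetical names introduced for illustration; they are not from the survey.

```python
def world_model(position: float, velocity: float, bias: float = 0.0) -> float:
    """Stage 1 (hypothetical): predict the object's next position.

    `bias` stands in for whatever prediction error a learned model might have.
    """
    return position + velocity + bias


def policy(hand: float, predicted_position: float) -> float:
    """Stage 2 (hypothetical): move the hand to wherever the object is predicted to be."""
    return predicted_position - hand


def cascaded_wam(hand: float, position: float, velocity: float, bias: float = 0.0) -> float:
    """Run the two independent stages back to back."""
    predicted = world_model(position, velocity, bias)
    return policy(hand, predicted)


exact_action = cascaded_wam(hand=0.0, position=1.0, velocity=0.5)
biased_action = cascaded_wam(hand=0.0, position=1.0, velocity=0.5, bias=0.2)
action_error = biased_action - exact_action
print(exact_action, biased_action, action_error)
```

Because the policy only ever sees the prediction, the injected prediction error shows up unchanged in the action, which is the error-accumulation concern in miniature. The upside is also visible: each stage can be tested or swapped on its own.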
Joint WAMs. The world model and policy model share representations and are trained jointly, targeting the joint distribution of future states and actions. The advantage is that the two modules can correct each other; the downside is more complex training and higher computational costs.
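To make "shared representations, trained jointly" concrete, here is a deliberately tiny sketch with scalar stand-ins: one shared weight `w` plays the role of the shared backbone, and `u` and `v` play the roles of a future-state head and an action head. The parameterisation, the squared-error loss, and the weighting `lam` are assumptions for illustration, not the survey's formulation.

```python
from dataclasses import dataclass


@dataclass
class JointParams:
    w: float  # shared encoder weight (stand-in for a shared backbone)
    u: float  # future-state head
    v: float  # action head


def joint_loss(p: JointParams, obs: float, next_state: float, action: float,
               lam: float = 1.0) -> float:
    """Sum of a future-state prediction loss and an action loss on one shared feature."""
    z = p.w * obs  # shared representation
    state_err = p.u * z - next_state
    action_err = p.v * z - action
    return state_err ** 2 + lam * action_err ** 2


def gradient_step(p: JointParams, obs: float, next_state: float, action: float,
                  lam: float = 1.0, lr: float = 0.01) -> JointParams:
    """One step of plain gradient descent on the joint loss."""
    z = p.w * obs
    state_err = p.u * z - next_state
    action_err = p.v * z - action
    # The shared weight receives gradient from BOTH heads: this is the coupling.
    grad_w = 2 * state_err * p.u * obs + 2 * lam * action_err * p.v * obs
    grad_u = 2 * state_err * z
    grad_v = 2 * lam * action_err * z
    return JointParams(p.w - lr * grad_w, p.u - lr * grad_u, p.v - lr * grad_v)


params = JointParams(w=1.0, u=1.0, v=1.0)
loss_before = joint_loss(params, obs=2.0, next_state=3.0, action=3.0)
params_after = gradient_step(params, obs=2.0, next_state=3.0, action=3.0)
loss_after = joint_loss(params_after, obs=2.0, next_state=3.0, action=3.0)
print(loss_before, loss_after)
```

The `grad_w` line is the point of the sketch: prediction error and action error both shape the shared representation, which is one way to read the claim that the two modules can correct each other. It also hints at the cost, since neither head can be trained or debugged in isolation.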
The survey further refines the classification based on generation modality (generating images vs. generating features), conditioning mechanisms (text-conditioned vs. vision-conditioned), and action decoding strategies (direct output vs. autoregressive generation).
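One way to hold these classification axes in code is a small record type. This is only a reading aid: the enum and class names below are hypothetical, and the example entry is a made-up placeholder, not a method classified in the survey.

```python
from dataclasses import dataclass
from enum import Enum


class Architecture(Enum):
    CASCADED = "cascaded"
    JOINT = "joint"


class GenerationModality(Enum):
    IMAGES = "images"
    FEATURES = "features"


class Conditioning(Enum):
    TEXT = "text-conditioned"
    VISION = "vision-conditioned"


class ActionDecoding(Enum):
    DIRECT = "direct output"
    AUTOREGRESSIVE = "autoregressive generation"


@dataclass(frozen=True)
class WAMProfile:
    """Position of one method along the axes described above."""
    name: str
    architecture: Architecture
    modality: GenerationModality
    conditioning: Conditioning
    decoding: ActionDecoding


# Placeholder entry for illustration only; not a real system.
example = WAMProfile(
    name="hypothetical-wam",
    architecture=Architecture.JOINT,
    modality=GenerationModality.FEATURES,
    conditioning=Conditioning.TEXT,
    decoding=ActionDecoding.AUTOREGRESSIVE,
)
print(example)
```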
Data Ecosystem: From Teleoperation to Internet Videos
The development of WAMs heavily relies on data, and the survey systematically outlines four main data sources:
- Robotic Teleoperation Data: Humans remotely control robots to perform tasks, recording actions and state changes. High quality but limited in scale.
- Portable Human Demonstrations: Human operations are recorded using VR headsets or data gloves, then transferred to robots. Offers better scalability.
- Simulation Data: Generated in simulators like Isaac Sim or MuJoCo. Can be produced at massive scales, but the sim-to-real gap remains a persistent challenge.
- Internet-Scale Egocentric Videos: First-person human videos collected from platforms like YouTube. Largest in scale, but lacks precise action annotations.
Interestingly, the survey highlights several approaches attempting to bridge the gap between these data sources using "latent actions"—learning implicit action representations directly from videos without requiring precise joint-angle annotations.
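The latent-action idea can be illustrated with a toy that labels each transition between consecutive frames with the nearest entry of a small codebook. This is a simplified stand-in, not any specific method from the survey: real approaches learn both the frame features and the codebook, whereas here `frames` are hand-written 2-D features and `codebook` is fixed by assumption.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def squared_distance(a: Vec, b: Vec) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def infer_latent_actions(frames: Sequence[Vec], codebook: Sequence[Vec]) -> List[int]:
    """Assign each frame-to-frame change the index of its closest codebook entry.

    No joint angles or motor commands are needed: the "action" is whatever
    best explains how one frame turned into the next.
    """
    labels = []
    for prev, nxt in zip(frames, frames[1:]):
        delta = (nxt[0] - prev[0], nxt[1] - prev[1])
        best = min(range(len(codebook)), key=lambda k: squared_distance(delta, codebook[k]))
        labels.append(best)
    return labels


# Hypothetical frame features (e.g. a tracked hand position) from an unlabelled video.
frames = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# Hypothetical codebook: right, up, left, down.
codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

latent_actions = infer_latent_actions(frames, codebook)
print(latent_actions)
```

The resulting indices act as pseudo-labels, so unlabelled video can in principle be used alongside teleoperation data that does carry real actions.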
Evaluation: Three Dimensions
Evaluation protocols for WAMs are also gradually taking shape, with the survey summarizing three core dimensions:
- Visual Fidelity: How closely the predicted future frames match the real ones.
- Physical Commonsense: Whether the predictions adhere to physical laws (e.g., objects don't clip through each other, gravity points in the correct direction).
- Action Plausibility: Whether the generated actions are effective for the target task.
These three dimensions correspond respectively to whether the world model "sees accurately," "thinks correctly," and whether the policy model "acts effectively."
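As a rough illustration, each dimension can be paired with a simple proxy metric. These choices are assumptions for the sake of example, not metrics prescribed by the survey: PSNR for visual fidelity, a bounding-box interpenetration check for physical commonsense, and task success rate for action plausibility.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def psnr(predicted: Sequence[float], real: Sequence[float], max_val: float = 1.0) -> float:
    """Visual fidelity proxy: peak signal-to-noise ratio over flattened pixel values."""
    mse = sum((p - r) ** 2 for p, r in zip(predicted, real)) / len(predicted)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def has_interpenetration(boxes: Sequence[Box]) -> bool:
    """Physical commonsense proxy: do any two predicted objects clip through each other?"""
    return any(
        boxes_overlap(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )


def success_rate(outcomes: Sequence[bool]) -> float:
    """Action plausibility proxy: fraction of rollouts that completed the task."""
    return sum(outcomes) / len(outcomes)


fidelity = psnr([0.0, 1.0], [0.1, 0.9])
clipping = has_interpenetration([(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)])
plausibility = success_rate([True, False, True, True])
print(fidelity, clipping, plausibility)
```

The three numbers are independent by design: a model can render sharp frames that violate physics, or predict physically sound futures while choosing actions that fail the task.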
Why This Survey Arrives at the Perfect Time
WAMs are not a brand-new invention, but the approach has reached a stage where it needs to be formally recognized. Over the past two years, Google's RT series, Figure AI's Figure 01, and solutions from various robotics companies have all been moving in the "VLA + World Model" direction, yet each uses its own terminology and architecture.
The significance of OpenMOSS's survey lies in this: it provides a unified name and classification system for this emerging paradigm. For researchers newly entering the field, this saves a tremendous amount of time spent sorting through literature; for those already working on it, it offers a coordinate system to position their own work.
Embodied AI is transitioning from "imitation learning" to a closed loop of "understand-predict-act." WAMs represent a critical milestone in this shift.
HuggingFace Paper Page: WorldActionModels on HF Papers