The “Efficiency Race” for World Models
World models are among the most exciting directions in AI: models that learn the dynamics of the physical world and generate future video frames conditioned on actions.
Yet prior world models suffered from two key issues: size and cost. Parameter counts routinely reached several billion; training required thousands of GPUs over weeks or even months; and inference demanded multiple top-tier GPUs.
SANA-WM takes a different stance: world models can be smaller, faster, and cheaper without sacrificing performance.
2.6B Parameters—Competing with Industrial-Scale Models
SANA-WM has only 2.6B parameters. For comparison, industrial-scale baseline models such as LingBot-World and HY-WorldPlay typically have parameter counts several times larger.
Yet the paper reports that SANA-WM achieves visual quality on par with these larger models, which is an ambitious claim.
Key metrics:
- 720p resolution, one-minute video generation
- Precise camera control (6-DoF trajectory tracking)
- Training efficiency: trained on only ~213K publicly available video clips, using 64 H100 GPUs for 15 days
- Inference efficiency: generates 60-second videos on a single GPU; the distilled + NVFP4-quantized version denoises a 60-second, 720p video in just 34 seconds on a single RTX 5090
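To put the training and inference figures above in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in the list; the variable names are illustrative, and the 34-second figure is taken as quoted, that is, as the denoising time for the distilled, NVFP4-quantized model.

```python
# Figures quoted in the list above.
NUM_GPUS = 64          # H100 GPUs used for training
TRAIN_DAYS = 15        # training duration in days
VIDEO_SECONDS = 60     # length of the generated video
DENOISE_SECONDS = 34   # quoted denoising time on a single RTX 5090

# Total training budget in GPU-hours.
gpu_hours = NUM_GPUS * TRAIN_DAYS * 24

# Seconds of video produced per second of denoising compute.
realtime_factor = VIDEO_SECONDS / DENOISE_SECONDS

print(gpu_hours)                   # 23040
print(round(realtime_factor, 2))   # 1.76
```

In other words, the quoted training run comes to roughly 23K H100 GPU-hours, and the quantized model denoises faster than the video plays back.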
Four Core Design Innovations
Hybrid Linear Attention
This is the cornerstone of SANA-WM’s efficiency. It combines inter-frame Gated DeltaNet (GDN) with softmax attention—preserving long-context modeling capability while drastically reducing memory consumption.
Intuitively: GDN handles temporal dependencies between frames (memory-efficient), while softmax attention captures fine-grained spatial details within each frame (high-fidelity). The two mechanisms complement each other.
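To make the memory argument concrete, here is a minimal sketch of a gated delta rule update in its generic recurrent form from the Gated DeltaNet literature. It is not SANA-WM's actual layer: the function name, the list-based matrices, and the toy inputs are assumptions for illustration only. The point it shows is that the memory `S` has a fixed size no matter how many steps have been processed, whereas a softmax attention cache grows with sequence length.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a generic gated delta rule (illustrative, not SANA-WM's code).

    S     : d_v x d_k memory matrix, stored as a list of rows
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : decay gate in [0, 1]; smaller values forget old memory faster
    beta  : write strength in [0, 1]
    """
    d_v, d_k = len(v), len(k)
    # What the memory currently returns for key k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query.
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


# Toy usage: write a value under a key, then overwrite it with a new value.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, out1 = gated_delta_step(S, key, key, [2.0, 3.0], alpha=1.0, beta=1.0)
S, out2 = gated_delta_step(S, key, key, [5.0, 7.0], alpha=1.0, beta=1.0)
print(out1)  # [2.0, 3.0]
print(out2)  # [5.0, 7.0]
```

With `alpha = 1` and `beta = 1`, the second write fully replaces the value stored under the same key, which is the "delta" behavior; lowering `alpha` would additionally decay everything already in memory.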
Dual-Branch Camera Control
This design keeps generated videos strictly on the input 6-DoF camera trajectory. One branch handles spatial localization and the other ensures temporal smoothness, with the two working in concert.
Two-Stage Generation Pipeline
The first stage produces a foundational video sequence; the second stage refines it using a long-video refiner applied to the first-stage output. This design mirrors the “draft–refine” paradigm common in image generation—but is significantly more complex in video, requiring explicit guarantees of temporal consistency.
Robust Annotation Pipeline
This pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to serve as action labels. The fidelity of this step directly determines how accurately the model learns physical laws.
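A 6-DoF pose has three rotational and three translational degrees of freedom. As a rough illustration of how per-frame poses could be turned into a frame-to-frame motion signal, the sketch below computes the relative pose between two cameras. The camera-to-world convention, the yaw-only rotation helper, and all function names are assumptions made for this example; the source does not specify how SANA-WM parameterizes its action labels.

```python
import math


def rot_y(theta):
    """Rotation matrix for a yaw of theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]


def relative_pose(R0, t0, R1, t1):
    """Pose of camera 1 expressed in camera 0's frame.

    Assumes camera-to-world poses (R, t), with t in metric units.
    """
    R0_T = transpose(R0)
    R_rel = matmul(R0_T, R1)
    t_rel = matvec(R0_T, [b - a for a, b in zip(t0, t1)])
    return R_rel, t_rel


# Toy usage: the camera yaws by 0.2 rad and moves 2 m along the world z axis.
R0, t0 = rot_y(0.1), [1.0, 0.0, 0.0]
R1, t1 = rot_y(0.3), [1.0, 0.0, 2.0]
R_rel, t_rel = relative_pose(R0, t0, R1, t1)
```

Here `R_rel` equals a pure yaw of 0.2 rad, and `t_rel` is the 2 m displacement rotated into the first camera's frame. Because the translation is in metric units, errors in the estimated scale would carry straight through to the labels.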
The Significance of Open Sourcing
SANA-WM’s open release is a major catalyst for the world model community. Until now, high-quality world models were almost entirely closed-source—researchers could only observe results via papers and demo videos.
Now, a 2.6B-parameter, open-source world model deployable on consumer-grade GPUs (e.g., RTX 5090) empowers independent researchers and small teams to conduct experiments and develop applications grounded in world modeling.
Potential Applications
Minute-scale world models unlock diverse use cases, including:
- Dynamic scene generation for games and virtual environments
- Autonomous driving simulation (generating road scenes under varying camera viewpoints and agent actions)
- Pre-visualization (pre-vis) in film and television production
- Training environment generation for embodied AI
Primary Sources:
- SANA-WM, arXiv:2605.15178
- Authors: Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie
- Affiliation: NVIDIA
- Project Page