SANA-WM: A 2.6B-Parameter, Minute-Scale World Model from NVIDIA—Trainable on 64 H100s in 15 Days, Deployable on a Single GPU

The “Efficiency Race” for World Models

World models are among the most exciting directions in AI: models that learn the dynamics of the physical world and generate future video frames conditioned on actions.

Yet prior world models suffered from two key issues: size and cost. Parameter counts routinely reached several billion; training required thousands of GPUs over weeks or even months; and inference demanded multiple top-tier GPUs.

SANA-WM takes a different stance: world models can be smaller, faster, and cheaper without sacrificing performance.

2.6B Parameters—Competing with Industrial-Scale Models

SANA-WM has only 2.6B parameters. For comparison, industrial-scale baseline models such as LingBot-World and HY-WorldPlay typically have parameter counts several times larger.

Yet the paper claims SANA-WM achieves visual quality on par with these large models—an ambitious claim.

Key figures and capabilities:

720p resolution, one-minute video generation
Precise camera control (6-DoF trajectory tracking)
Training efficiency: trained on only ~213K publicly available video clips, using 64 H100 GPUs for 15 days
Inference efficiency: generates 60-second videos on a single GPU; the distilled + NVFP4-quantized version denoises a 60-second, 720p video in just 34 seconds on a single RTX 5090
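
The figures above can be turned into two quick back-of-the-envelope numbers. This is a sketch that uses only the values quoted in this article. The variable names are illustrative, and the 34-second figure covers the denoising stage as stated, so end-to-end latency may be higher.

```python
# Training budget: 64 H100 GPUs running for 15 days (figures from the article).
GPUS = 64
DAYS = 15
gpu_hours = GPUS * DAYS * 24

# Inference: 60 seconds of 720p video denoised in 34 seconds on one RTX 5090
# (distilled + NVFP4-quantized variant, per the article).
VIDEO_SECONDS = 60
DENOISE_SECONDS = 34
realtime_factor = VIDEO_SECONDS / DENOISE_SECONDS

print(f"Training budget: {gpu_hours} H100 GPU-hours")
print(f"Denoising runs at about {realtime_factor:.2f}x real time")
```

In other words, the stated training budget is about 23,000 H100 GPU-hours, and the quantized model denoises video faster than the video's own playback duration.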

Four Core Design Innovations

Hybrid Linear Attention

This is the cornerstone of SANA-WM’s efficiency. It combines inter-frame Gated DeltaNet (GDN) with softmax attention—preserving long-context modeling capability while drastically reducing memory consumption.

Intuitively: GDN handles temporal dependencies between frames (memory-efficient), while softmax attention captures fine-grained spatial details within each frame (high-fidelity). The two mechanisms complement each other.
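
To see why a delta-rule memory is cheap, here is a minimal pure-Python sketch of the gated delta rule as it is commonly written in the Gated DeltaNet literature. It is an assumption that SANA-WM's layer follows this generic form. The function names, shapes, and scalar gates are illustrative and are not taken from the paper.

```python
# Generic gated delta rule on a fixed-size memory matrix S (d_v x d_k).
# Assumption: SANA-WM's actual GDN layer (shapes, gating, normalization)
# may differ from this textbook form.

def gated_delta_step(S, k, v, alpha, beta):
    """One update: S_new = alpha * S + beta * (v - alpha * S @ k) k^T.

    alpha in [0, 1] decays the old memory; beta in [0, 1] is the write strength.
    """
    decayed = [[alpha * s for s in row] for row in S]
    # What the decayed memory currently returns for key k.
    predicted = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]
    # Delta rule: write only the difference between target and prediction.
    error = [vi - pi for vi, pi in zip(v, predicted)]
    return [[s + beta * e * kj for s, kj in zip(row, k)]
            for row, e in zip(decayed, error)]

def read(S, q):
    """Query the memory: returns S @ q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]

d_k, d_v = 2, 2
S = [[0.0] * d_k for _ in range(d_v)]
key = [1.0, 0.0]  # unit-norm key

S = gated_delta_step(S, key, [2.0, 3.0], alpha=1.0, beta=1.0)
first_read = read(S, key)

# Writing a new value under the same key overwrites the old association.
S = gated_delta_step(S, key, [5.0, 7.0], alpha=1.0, beta=1.0)
second_read = read(S, key)

print(first_read, second_read)
```

The memory stays a d_v x d_k matrix no matter how many steps are processed, whereas a softmax attention cache grows with sequence length. That constant-size state is the source of the memory savings described above.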

Dual-Branch Camera Control

The dual-branch camera control keeps generated videos on the input 6-DoF camera trajectory. One branch handles spatial localization and the other ensures temporal smoothness, with the two working in concert.

Two-Stage Generation Pipeline

The first stage produces a base video sequence; the second stage applies a long-video refiner to that output. The design mirrors the “draft, then refine” paradigm common in image generation, but it is significantly more complex in video because it requires explicit guarantees of temporal consistency.

Robust Annotation Pipeline

The annotation pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to serve as action labels. The fidelity of this step directly determines how accurately the model learns physical laws.
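
One common way to turn per-frame camera poses into action-like labels is to take the relative motion between consecutive frames. The sketch below illustrates that idea only. It assumes poses are 4x4 camera-to-world rigid transforms with translations in meters, and the helper names are hypothetical. The paper's actual label format is not described in this article.

```python
# Sketch: per-frame relative camera motion from absolute poses.
# Assumptions: each pose is a 4x4 camera-to-world rigid transform [R t; 0 1],
# translations are in meters (metric scale). Helper names are hypothetical.

def invert_rigid(T):
    """Invert a rigid transform: inverse of [R t; 0 1] is [R^T, -R^T t; 0 1]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # rotation transpose
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul4(A, B):
    """Plain 4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def relative_motions(poses):
    """Motion of frame i+1 expressed in the coordinates of frame i."""
    return [matmul4(invert_rigid(a), b) for a, b in zip(poses, poses[1:])]

def translation_pose(x, y, z):
    """Pose with identity rotation at position (x, y, z)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

# A camera moving forward along z, then drifting slightly along x.
poses = [translation_pose(0.0, 0.0, 0.0),
         translation_pose(0.0, 0.0, 0.5),
         translation_pose(0.2, 0.0, 1.0)]

steps = relative_motions(poses)
step_translations = [[T[i][3] for i in range(3)] for T in steps]
print(step_translations)
```

Because the poses are metric-scale, each relative translation is a physical distance per frame. Labels of that kind give an action-conditioned model a consistent notion of speed across clips.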

The Significance of Open Sourcing

SANA-WM’s open release is a major catalyst for the world model community. Until now, high-quality world models were almost entirely closed-source—researchers could only observe results via papers and demo videos.

Now, a 2.6B-parameter, open-source world model deployable on consumer-grade GPUs (e.g., RTX 5090) empowers independent researchers and small teams to conduct experiments and develop applications grounded in world modeling.

Potential Applications

Minute-scale world models unlock diverse use cases, including:

Dynamic scene generation for games and virtual environments
Autonomous driving simulation (generating road scenes under varying camera viewpoints and agent actions)
Pre-visualization (pre-vis) in film and television production
Training environment generation for embodied AI

Primary Sources:

arXiv:2605.15178 SANA-WM
Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie
NVIDIA
Project Page
