Video generation just got another major release, and this time it is not from a closed-source startup. It comes from NVIDIA, and it is open source.
SANA-WM is a 2.6B-parameter world model that generates controllable 720p videos up to one minute long from a single image plus a camera trajectory, running on a single GPU. On Hacker News, it scored 312 points and drew 128 comments, a rare level of engagement for an AI video-generation topic on the HN front page.
What the Numbers Mean
Let’s look at several key figures:
- 2.6B parameters: This qualifies as “lightweight” among video generation models. For comparison, some industrial-grade video models have parameter counts reaching 10B or more.
- Trained for 15 days on 64 H100s: Training cost remains within a manageable range—unlike projects requiring thousands of accelerator cards.
- Inference on a single H100: Generating a one-minute 720p video requires only one GPU.
- 34 seconds on an RTX 5090: With distillation + NVFP4 quantization, even a consumer-tier flagship GPU can denoise a full 60-second 720p video in just 34 seconds.
Together, these numbers send a clear message: high-quality video generation is shifting from “cloud-only” to “locally runnable.”
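As a rough back-of-the-envelope check, the figures above can be turned into two derived numbers: total training compute in GPU-hours, and how much faster than real time the RTX 5090 path runs. The names below are illustrative, the inputs are taken from the list above, and round-the-clock GPU utilization is an assumption.

```python
# Figures quoted above; everything derived from them is simple arithmetic.
NUM_GPUS = 64            # H100s used for training
TRAINING_DAYS = 15
VIDEO_SECONDS = 60       # length of the generated clip
GENERATION_SECONDS = 34  # reported time on an RTX 5090 (distilled + NVFP4)


def training_gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours, assuming the GPUs run around the clock."""
    return num_gpus * days * 24


def realtime_factor(video_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock generation time."""
    return video_seconds / generation_seconds


gpu_hours = training_gpu_hours(NUM_GPUS, TRAINING_DAYS)
speedup = realtime_factor(VIDEO_SECONDS, GENERATION_SECONDS)
print(f"Training compute: {gpu_hours:,} H100-hours")
print(f"Generation speed: {speedup:.2f}x real time")
```

Under these assumptions the training run comes to about 23,040 H100-hours, and the quoted RTX 5090 figure corresponds to roughly 1.76 seconds of video per second of compute.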
Architectural Innovation: Hybrid Linear Attention
SANA-WM achieves this capability thanks to its architectural design.
Traditional Transformers rely on full softmax attention, whose memory and compute requirements scale quadratically with sequence length. For a one-minute video (e.g., 30 fps → 1800 frames), full attention becomes infeasible—NVIDIA explicitly states in its paper that the all-softmax approach runs out of memory (OOM) at the 60-second mark.
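To make the quadratic growth concrete, the sketch below counts the query-key pairs that one full softmax layer must score as clip length grows. The tokens-per-frame value is a made-up placeholder, not a figure from the paper, and real video models usually compress frames into fewer latent tokens. The ratio between clip lengths does not depend on that choice.

```python
FPS = 30                 # frame rate used in the example above
TOKENS_PER_FRAME = 100   # hypothetical placeholder, not from the paper


def num_tokens(seconds: int) -> int:
    """Sequence length for a clip of the given duration."""
    return seconds * FPS * TOKENS_PER_FRAME


def attention_pairs(seconds: int) -> int:
    """Query-key pairs scored by one full softmax attention layer."""
    n = num_tokens(seconds)
    return n * n


short_clip, long_clip = 5, 60
length_ratio = num_tokens(long_clip) // num_tokens(short_clip)
cost_ratio = attention_pairs(long_clip) // attention_pairs(short_clip)
print(f"{length_ratio}x longer clip -> {cost_ratio}x more attention pairs")
```

Going from a 5-second clip to a 60-second clip makes the sequence 12 times longer but the pairwise attention work 144 times larger, which is why full attention stops being practical at this length.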
SANA-WM’s solution is called Hybrid Linear Attention: it combines frame-wise Gated DeltaNet with periodic softmax attention. Gated DeltaNet efficiently maintains long-term state, while periodic softmax performs fine-grained attention computation at critical moments.
The result is that memory consumption scales linearly, not quadratically, with sequence length. That is why SANA-WM can handle one-minute videos, while the all-softmax design runs out of memory at that length.
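Why the recurrent state stays a fixed size can be seen in a minimal sketch of the gated delta rule, the recurrence behind Gated DeltaNet-style layers. This is the generic form written in plain Python for readability, not NVIDIA's implementation. The function names, the tiny 2×2 state, and the gate values are all illustrative assumptions.

```python
Matrix = list[list[float]]
Vector = list[float]


def gated_delta_step(S: Matrix, k: Vector, v: Vector,
                     alpha: float, beta: float) -> Matrix:
    """S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T.

    alpha in [0, 1] decays old memory; beta in [0, 1] is the write strength.
    """
    # What the current state would return for key k, i.e. S k.
    pred = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]
    return [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S: Matrix, q: Vector) -> Vector:
    """Query the state: o = S q."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, q)) for row in S]


# The state is d_v x d_k no matter how many tokens have been processed.
state: Matrix = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state = gated_delta_step(state, key, [3.0, 4.0], alpha=1.0, beta=1.0)
first_read = read(state, key)
# Writing again under the same key replaces the old value instead of adding.
state = gated_delta_step(state, key, [5.0, 6.0], alpha=1.0, beta=1.0)
second_read = read(state, key)
print(first_read, second_read)
```

Each step only updates this fixed-size matrix, so the recurrent state does not grow with the number of tokens processed. That is where the memory savings over full attention come from.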
Precise Camera Control
Generating video alone isn’t enough—SANA-WM’s key differentiator is controllability.
It implements a dual-branch camera control system: a coarse-grained global pose branch governs overall camera motion, while a fine-grained pixel-aligned geometric branch ensures local precision. Together, they enable accurate 6-DoF (six degrees of freedom) camera trajectory tracking.
In short: if the supplied trajectory moves the camera from left to right and then tilts it upward, the generated video follows that exact path, with no improvisation.
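As a concrete picture of what a 6-DoF trajectory contains, the sketch below builds a left-to-right move followed by an upward tilt as a list of per-frame poses. The `CameraPose` layout, the units, and the `build_trajectory` helper are hypothetical illustrations; SANA-WM's actual trajectory input format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical 6-DoF pose: three translations plus three rotations."""
    x: float      # left/right position (arbitrary scene units)
    y: float      # up/down position
    z: float      # forward/backward position
    yaw: float    # rotation about the vertical axis, in degrees
    pitch: float  # tilt up/down, in degrees
    roll: float   # rotation about the viewing axis, in degrees


def lerp(start: float, end: float, t: float) -> float:
    """Linear interpolation between start and end for t in [0, 1]."""
    return start + (end - start) * t


def build_trajectory(frames_per_segment: int) -> list[CameraPose]:
    """Move from x=-1 to x=+1, then tilt the camera up by 20 degrees."""
    if frames_per_segment < 2:
        raise ValueError("need at least 2 frames per segment")
    last = frames_per_segment - 1
    poses = [
        CameraPose(lerp(-1.0, 1.0, i / last), 0.0, 0.0, 0.0, 0.0, 0.0)
        for i in range(frames_per_segment)
    ]
    # Start at 1 so the frame shared by both segments is not duplicated.
    poses += [
        CameraPose(1.0, 0.0, 0.0, 0.0, lerp(0.0, 20.0, i / last), 0.0)
        for i in range(1, frames_per_segment)
    ]
    return poses


trajectory = build_trajectory(frames_per_segment=5)
print(len(trajectory), trajectory[0], trajectory[-1])
```

All six values are specified for every frame, which is what distinguishes a full 6-DoF trajectory from a coarse instruction such as "pan right".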
Two-Stage Generation Pipeline
SANA-WM’s generation process consists of two stages:
- Stage One: The main 2.6B model produces a foundational video, ensuring content coherence and accurate camera control.
- Stage Two: A 17B long-video refiner enhances the Stage One output—boosting texture fidelity, motion quality, and temporal consistency.
This “generate-then-refine” paradigm is common in image generation (e.g., SDXL), but rarely applied to long-form video generation. SANA-WM brings it to the long-video domain—with demonstrably strong results.
What Open Source Means
SANA-WM’s greatest value may lie less in its technical specs—and more in its decision to go open source.
Today’s video generation landscape is dominated by closed-source commercial products such as Runway, Pika, Luma, and Kling. Researchers and small teams have had few high-quality open-source baseline models to explore and build on.
SANA-WM aims to fill that gap. The model weights are currently labeled “SOON” (not yet released); once they are available, SANA-WM is poised to become a new starting point for the open-source video generation community.
Competitive Landscape
The paper benchmarks SANA-WM against industrial-grade baselines, including LingBot-World and HY-WorldPlay. SANA-WM matches them in visual quality while delivering 36× higher throughput.
This comparison is telling. It suggests that in video generation, parameter count and compute budget do not translate directly into performance: thoughtful architectural design can deliver comparable quality in significantly smaller models.
Conclusion
The release of SANA-WM is a landmark move by NVIDIA in the open-source AI space. It demonstrates that industrial-grade video generation capabilities can be delivered in a lightweight, open, and locally executable form.
For research teams exploring video generation, SANA-WM lowers the entry barrier. For developers aiming to run video generation locally, the ability to produce a full minute of 720p video in just 34 seconds on an RTX 5090 is already highly practical.
The era of open-source world models may arrive sooner than we think.
Paper: arXiv | Project Page: nvlabs.github.io/Sana/WM