NVIDIA SANA-WM: An Open-Source World Model with 2.6B Parameters That Generates Up-to-One-Minute 720p Videos on a Single GPU

The video generation field just got another major release, and this time it isn’t from a closed-source startup. It’s from NVIDIA, and it’s open source.

SANA-WM is a 2.6B-parameter world model that generates controllable 720p videos up to one minute long on a single GPU, using only a single image plus a camera trajectory as input. On Hacker News, it scored 312 points and drew 128 comments, a rare level of engagement for an AI video-generation topic on the HN front page.

What the Numbers Mean

Let’s look at several key figures:

2.6B parameters: This qualifies as “lightweight” among video generation models. For comparison, some industrial-grade video models have parameter counts reaching 10B or more.
Trained for 15 days on 64 H100s: Training cost remains within a manageable range—unlike projects requiring thousands of accelerator cards.
Inference on a single H100: Generating a one-minute 720p video requires only one GPU.
34 seconds on an RTX 5090: With distillation + NVFP4 quantization, even a consumer-tier flagship GPU can denoise a full 60-second 720p video in just 34 seconds.

Together, these numbers send a clear message: high-quality video generation is shifting from “cloud-only” to “locally runnable.”
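To put the figures above in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this article; the function names are made up for illustration, and the 34-second figure covers denoising as quoted, so end-to-end wall time may differ.

```python
# Back-of-the-envelope arithmetic using only the figures quoted above.
GPUS = 64                  # H100s used for training
TRAINING_DAYS = 15
VIDEO_SECONDS = 60         # one-minute 720p clip
DENOISE_SECONDS_5090 = 34  # distilled + NVFP4 on an RTX 5090 (denoising time)


def training_gpu_hours(gpus: int, days: int) -> int:
    """Total GPU-hours if every GPU runs for the whole training period."""
    return gpus * days * 24


def realtime_factor(video_seconds: float, wall_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    return video_seconds / wall_seconds


gpu_hours = training_gpu_hours(GPUS, TRAINING_DAYS)
speedup = realtime_factor(VIDEO_SECONDS, DENOISE_SECONDS_5090)

print(f"Training budget: {gpu_hours} H100 GPU-hours")
print(f"RTX 5090 denoising speed: {speedup:.2f}x real time")
```

In other words, the training run comes to roughly 23,000 H100 GPU-hours, and the distilled model denoises faster than the video plays back.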

Architectural Innovation: Hybrid Linear Attention

What makes this possible is SANA-WM’s attention design.

Traditional Transformers rely on full softmax attention, whose memory and compute requirements scale quadratically with sequence length. For a one-minute video (e.g., 30 fps → 1800 frames), full attention becomes infeasible—NVIDIA explicitly states in its paper that the all-softmax approach runs out of memory (OOM) at the 60-second mark.
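The quadratic growth is easy to see with a quick count. The sketch below tallies the pairwise attention scores for one head in one layer at two clip lengths. TOKENS_PER_FRAME is a hypothetical placeholder, not a figure from the paper, and real models typically attend over compressed latent frames rather than raw 30 fps frames, so treat the absolute numbers as illustrative. The ratio between the two lengths does not depend on that placeholder.

```python
FPS = 30                # the example frame rate used above
TOKENS_PER_FRAME = 256  # hypothetical placeholder, not from the paper


def frames(seconds: int, fps: int = FPS) -> int:
    """Number of frames in a clip of the given length."""
    return seconds * fps


def pairwise_scores(seq_len: int) -> int:
    """Entries in a full attention score matrix (one head, one layer)."""
    return seq_len * seq_len


short_len = frames(5) * TOKENS_PER_FRAME
long_len = frames(60) * TOKENS_PER_FRAME

# 12x more frames means 12 * 12 = 144x more pairwise scores.
ratio = pairwise_scores(long_len) // pairwise_scores(short_len)

print(f"5 s clip:  {short_len} tokens, {pairwise_scores(short_len)} scores")
print(f"60 s clip: {long_len} tokens, {pairwise_scores(long_len)} scores")
print(f"Growth from 5 s to 60 s: {ratio}x")
```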

SANA-WM’s solution is called Hybrid Linear Attention: it combines frame-wise Gated DeltaNet with periodic softmax attention. Gated DeltaNet efficiently maintains long-term state, while periodic softmax performs fine-grained attention computation at critical moments.
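To make "maintains long-term state" concrete, here is a minimal pure-Python sketch of the gated delta rule as it is usually written for Gated DeltaNet: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T. This is an illustration of the general recurrence, not SANA-WM's actual implementation; how the model applies it frame-wise and where it interleaves softmax layers is not covered here, and the function names are invented for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gated_delta_step(S: Matrix, k: Vector, v: Vector,
                     alpha: float, beta: float) -> Matrix:
    """One gated delta-rule update of a (d_v x d_k) state matrix.

    alpha in [0, 1] decays the old state (the gate); beta in [0, 1]
    controls how strongly the value stored under key k is overwritten.
    """
    d_k = len(k)
    new_S = []
    for i, row in enumerate(S):
        # What the current state would return for key k in output dim i.
        pred = sum(row[j] * k[j] for j in range(d_k))
        new_S.append([
            alpha * (row[j] - beta * pred * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ])
    return new_S


def read_state(S: Matrix, q: Vector) -> Vector:
    """Query the state: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# Write a value under a key, read it back, then overwrite it.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S = gated_delta_step(S, key, [2.0, 3.0], alpha=1.0, beta=1.0)
first_read = read_state(S, key)
S = gated_delta_step(S, key, [5.0, 7.0], alpha=1.0, beta=1.0)
second_read = read_state(S, key)  # the second write replaces the first
print(first_read, second_read)
```

The point to notice is that S stays the same size no matter how many steps are processed, which is where the memory savings over a growing attention matrix come from.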

The result? Memory consumption scales linearly, not quadratically, with sequence length. That’s why SANA-WM handles one-minute videos smoothly, while the all-softmax variant runs out of GPU memory at that length.

Precise Camera Control

Generating video alone isn’t enough—SANA-WM’s key differentiator is controllability.

It implements a dual-branch camera control system: a coarse-grained global pose branch governs overall camera motion, while a fine-grained pixel-aligned geometric branch ensures local precision. Together, they enable accurate 6-DoF (six degrees of freedom) camera trajectory tracking.

In short: if you instruct the model, “Move the camera from left to right, then tilt upward,” the generated video will follow that exact trajectory—no improvisation.
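As an illustration of what a 6-DoF trajectory like that could look like as data, the sketch below builds a per-frame list of poses for "move left to right, then tilt upward". The CameraPose type, the axis conventions, and the pan_then_tilt helper are all assumptions made for this example; the input format SANA-WM actually expects (for instance, extrinsic matrices) is not described in this article.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CameraPose:
    """One 6-DoF pose: three translations plus three rotations (radians)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def pan_then_tilt(num_frames: int, pan_distance: float,
                  tilt_degrees: float) -> List[CameraPose]:
    """First half: slide along +x. Second half: pitch upward in place."""
    if num_frames < 4:
        raise ValueError("need at least 4 frames for both segments")
    half = num_frames // 2
    max_pitch = math.radians(tilt_degrees)
    poses = []
    for i in range(num_frames):
        if i < half:
            t = i / (half - 1)
            x, pitch = lerp(0.0, pan_distance, t), 0.0
        else:
            t = (i - half + 1) / (num_frames - half)
            x, pitch = pan_distance, lerp(0.0, max_pitch, t)
        poses.append(CameraPose(x, 0.0, 0.0, 0.0, pitch, 0.0))
    return poses


# One minute at 30 fps: pan 2 units to the right, then tilt up 15 degrees.
poses = pan_then_tilt(num_frames=1800, pan_distance=2.0, tilt_degrees=15.0)
print(poses[0], poses[899], poses[-1], sep="\n")
```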

Two-Stage Generation Pipeline

SANA-WM’s generation process consists of two stages:

Stage One: The main 2.6B model produces a foundational video, ensuring content coherence and accurate camera control.
Stage Two: A 17B long-video refiner enhances the Stage One output—boosting texture fidelity, motion quality, and temporal consistency.

This “generate-then-refine” paradigm is common in image generation (e.g., SDXL), but rarely applied to long-form video generation. SANA-WM brings it to the long-video domain—with demonstrably strong results.
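Structurally, the two stages compose like the sketch below. Everything here is a stand-in: the type aliases, the generate_then_refine function, and the toy models are hypothetical and only show the shape of the pipeline, assuming the refiner takes the Stage One video and returns a video of the same length. The real interfaces may differ.

```python
from typing import Callable, List, Sequence, Tuple

Pose = Tuple[float, ...]   # stand-in for one camera pose
Frame = List[float]        # stand-in for one image
Video = List[Frame]


def generate_then_refine(
    image: Frame,
    trajectory: Sequence[Pose],
    base_model: Callable[[Frame, Sequence[Pose]], Video],
    refiner: Callable[[Video], Video],
) -> Video:
    """Stage One drafts the video; Stage Two polishes it frame for frame."""
    draft = base_model(image, trajectory)
    if len(draft) != len(trajectory):
        raise ValueError("base model must return one frame per pose")
    refined = refiner(draft)
    if len(refined) != len(draft):
        raise ValueError("refiner must preserve the number of frames")
    return refined


def toy_base(image: Frame, trajectory: Sequence[Pose]) -> Video:
    # Toy stand-in: shift pixel values by each pose's first coordinate.
    return [[p + pose[0] for p in image] for pose in trajectory]


def toy_refiner(video: Video) -> Video:
    # Toy stand-in: clamp pixel values into [0, 1].
    return [[min(1.0, max(0.0, p)) for p in frame] for frame in video]


video = generate_then_refine([0.25, 0.75], [(0.0,), (0.5,)],
                             toy_base, toy_refiner)
print(video)
```

Keeping the stages behind plain callables like this makes it easy to run Stage One alone when a quick draft is enough.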

What Open Source Means

SANA-WM’s greatest value may lie less in its technical specs—and more in its decision to go open source.

Today’s video generation landscape is dominated by closed-source commercial products: Runway, Pika, Luma, Kling, etc. Researchers and small teams lack access to high-quality open-source baseline models for exploration and innovation.

SANA-WM fills that gap. The model weights are currently labeled “SOON” (not yet released), but once they are available, SANA-WM is poised to become a new starting point for the open-source video generation community.

Competitive Landscape

The paper benchmarks SANA-WM against industrial-grade baselines, including LingBot-World and HY-WorldPlay. SANA-WM matches them in visual quality while delivering 36× higher throughput.

This comparison is telling. It suggests that in video generation, parameter count and compute budget do not directly equate to performance. Thoughtful architectural design can deliver comparable quality, even in significantly smaller models.

Conclusion

The release of SANA-WM marks a landmark move by NVIDIA in the open-source AI space. It demonstrates that industrial-grade video generation capabilities can be delivered in a lightweight, open, and locally executable form.

For research teams exploring video generation, SANA-WM lowers the entry barrier. For developers aiming to run video generation locally, the ability to produce a full minute of 720p video in just 34 seconds on an RTX 5090 is already highly practical.

The era of open-source world models may arrive sooner than we think.

Paper: arXiv | Project Page: nvlabs.github.io/Sana/WM
