Reinforcement learning alignment for video generation models has long been a technical challenge.
Aligning text models with human preferences using RLHF or GRPO is by now well established: sample a few responses from the model, score them with a reward model trained on human preferences, and optimize against the reward signal. Video generation is different. The computational cost of generating a single video frame is tens of thousands of times higher than that of generating a single token, which makes it impractical to sample and evaluate at the scale used for text.
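To make the "sample, score, optimize" loop concrete, here is a minimal sketch of the group-relative advantage step commonly used in GRPO-style training. The scores are made-up numbers standing in for reward-model outputs, and the function name is illustrative rather than taken from any particular library.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical reward-model scores for four responses to the same prompt.
scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(scores)

# Responses scoring above the group mean get a positive advantage and are
# reinforced; those below the mean get a negative advantage.
print(advantages)
```

Every extra sample in the group means one more full generation, which is cheap for a text response and very expensive for a video.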
Moreover, the underlying mathematics is different. Mainstream autoregressive video generators are built on flow matching, which samples by integrating a deterministic ODE (Ordinary Differential Equation), while existing RL methods mostly rely on SDE (Stochastic Differential Equation) sampling and noise-based exploration. The two do not fit together directly.
Tsinghua's KVPO is designed to resolve this incompatibility.
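The ODE/SDE mismatch can be seen in a toy sampler. The sketch below uses a hand-written velocity field as a stand-in for a learned model (an assumption for illustration only) and compares plain Euler integration of an ODE with an Euler-Maruyama SDE step that injects Gaussian noise.

```python
import numpy as np

def velocity(x, t):
    # Toy stand-in for a learned velocity field: pull samples toward 1.0.
    return 1.0 - x

def sample_ode(x0, steps=10):
    """Deterministic Euler integration: same start, same result."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + velocity(x, i * dt) * dt
    return x

def sample_sde(x0, rng, steps=10, sigma=0.3):
    """Euler-Maruyama integration: fresh Gaussian noise at every step."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        noise = rng.standard_normal(x.shape)
        x = x + velocity(x, i * dt) * dt + sigma * np.sqrt(dt) * noise
    return x

x0 = np.random.default_rng(0).standard_normal(4)
ode_a, ode_b = sample_ode(x0), sample_ode(x0)

rng = np.random.default_rng(1)
sde_a, sde_b = sample_sde(x0, rng), sample_sde(x0, rng)

print(np.array_equal(ode_a, ode_b))  # the two ODE runs coincide
print(np.allclose(sde_a, sde_b))     # the two SDE runs do not
```

With the ODE sampler, repeated rollouts from the same starting point are identical, so there is no per-step randomness for a policy-gradient method to assign probabilities to. That randomness is exactly what SDE-based RL methods depend on.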
Where the Problem Lies
The process of autoregressive video generation can be understood as: starting from the first frame, generating frame by frame, with each step depending on information from all previous frames. When existing RL methods perform policy optimization, they add noise to "explore" different generation paths.
But here's the catch: noise perturbation changes pixel-level details—color shades, texture granularity—rather than semantic-level content—story progression, object motion trajectories. If you want the model to learn to "generate more narrative-driven videos," what it actually explores is just "make this pixel slightly brighter."
It's like asking someone to learn creative writing but only allowing them to change punctuation marks.
KVPO's Core Innovation: Finding Semantic Shifts in the KV Cache
KVPO takes a highly imaginative approach: shifting the source of exploration from random noise to the historical KV Cache.
In autoregressive generation, the KV Cache stores the key-value pairs of all historical tokens, essentially serving as the model's "memory." By applying random routing to historical entries in the KV Cache, KVPO constructs semantically distinct generation branches—because different combinations of historical memories naturally lead to different storylines and visual content.
Even better, this exploration stays on the data manifold: because every variation originates from the representation space the model has already learned, the generated content avoids absurd OOD (out-of-distribution) results.
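The routing idea can be sketched with a single attention head. This is only an illustration under a strong assumption: that "random routing" means each branch attends to a different random subset of cached entries. The source does not spell out KVPO's actual routing rule, and `route_cache`, `keep_ratio`, and the toy sizes are hypothetical.

```python
import numpy as np

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    logits = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values

def route_cache(keys, values, keep_ratio, rng):
    """Hypothetical routing rule: keep a random subset of cached entries."""
    n = keys.shape[0]
    n_keep = max(1, int(round(keep_ratio * n)))
    idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return keys[idx], values[idx], idx

rng = np.random.default_rng(0)
T, d = 16, 8  # cached entries and head dimension (toy sizes)
keys = rng.standard_normal((T, d))
values = rng.standard_normal((T, d))
query = rng.standard_normal(d)

# Every branch shares the same query but sees a different slice of history.
branches = []
for _ in range(4):
    k, v, idx = route_cache(keys, values, keep_ratio=0.5, rng=rng)
    branches.append((idx, attend(query, k, v)))

for idx, out in branches:
    print(idx, np.round(out[:3], 3))
```

No new noise is injected anywhere: each branch output is a weighted average of value vectors the model already produced, which is the intuition behind exploration that stays within the learned representation space.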
ODE-Native Policy Modeling
Having solved the exploration problem, KVPO also redesigns the policy modeling approach.
Traditional RL policies are "outsiders" in the video generation context: SDE-based surrogate policies do not match the deterministic dynamics of ODE sampling. KVPO instead introduces a velocity field surrogate policy based on Trajectory Velocity Energy (TVE), which:
- Quantifies the "likelihood" of different generation branches within the velocity space of flow matching
- Constructs a reward-weighted contrastive objective that is fully consistent with the native ODE formulation
- Requires no SDE approximations or surrogate conversions
This ODE-native design naturally aligns RL signals with the mathematical foundation of video generation, avoiding the theoretical inconsistencies present in previous methods.
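The article does not give the TVE formula, so the following is one plausible reading rather than the paper's definition. It assumes the "energy" of a branch is the mean squared gap between the model's predicted velocities and the velocities actually followed along that branch, and that the contrastive objective is a softmax over negative energies weighted by group-normalized rewards. All names, shapes, and numbers are hypothetical.

```python
import numpy as np

def trajectory_energy(pred_velocities, branch_velocities):
    """Assumed energy: mean over steps of the squared velocity gap."""
    diff = pred_velocities - branch_velocities
    return np.mean(np.sum(diff ** 2, axis=-1), axis=-1)

def contrastive_loss(energies, rewards, temperature=1.0, eps=1e-8):
    """Assumed objective: reward-weighted log-softmax over negative energies."""
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    logits = -energies / temperature
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(adv * log_probs)

rng = np.random.default_rng(0)
G, K, d = 4, 5, 8  # branches, ODE steps, latent dimension (toy sizes)
branch_v = rng.standard_normal((G, K, d))
pred_v = branch_v + 0.1 * rng.standard_normal((G, K, d))
rewards = np.array([0.1, 0.8, 0.4, 0.3])  # made-up reward scores

energies = trajectory_energy(pred_v, branch_v)
loss = contrastive_loss(energies, rewards)
print(energies, loss)
```

Under these assumptions, lowering the energy of a high-reward branch (making the model's velocity field agree with it more closely) lowers the loss, and no step requires a Gaussian transition probability from an SDE.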
Experimental Results
According to the authors, tests across multiple distilled autoregressive video generators show:
- Improved Visual Quality: Better image details, color grading, and composition
- Improved Motion Quality: Enhanced coherence and naturalness of object movements
- Improved Text-Video Consistency: Higher alignment between generated content and prompts
- Broad Applicability: Benefits both single-prompt short videos and multi-prompt long videos
Deeper Implications
KVPO's technical approach hints at a broader trend: alignment methods for video generation must be designed specifically for video rather than borrowed wholesale from text models. Videos have temporal, spatial, and semantic narrative dimensions, and each requires its own exploration strategy and alignment objective.
This also means that RL alignment for video generation is only just beginning. KVPO provides a viable starting point, but there is still a long road ahead before video generation truly satisfies human viewers.
Main Sources:
- KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
- https://richard-zhang-ai.github.io/KVPO-Project/
- https://github.com/Richard-Zhang-AI/KVPO