Tsinghua KVPO: Bringing GRPO into Video Generation, Using KV Cache for Semantic Exploration to Make AI-Generated Videos Better Align with Human Aesthetics

Reinforcement learning alignment for video generation models has long been a technical challenge.

Aligning text models with human preferences using RLHF or GRPO is by now well established: sample a few responses from the model, score them with a human preference model, and optimize against the reward signal. Video generation is different. Generating a single video frame costs tens of thousands of times more compute than generating a single token, so sampling and evaluating at the scale used for text is impractical.
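The sample, score, optimize loop described above can be made concrete with the group-relative advantage that GRPO is named for: rewards for samples drawn from the same prompt are normalized against their own group. This is a minimal sketch, not KVPO's code. The reward values are hypothetical, and the function name and the use of the population standard deviation are illustrative choices.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical preference-model scores for four samples from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Samples that score above the group mean get a positive advantage and are reinforced; samples below it get a negative one. Every sample in the group costs a full generation, which is why this loop is cheap for text and expensive for video.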

Moreover, the underlying mathematics differs. Mainstream autoregressive video generators are built on flow matching, which samples by integrating an ODE (ordinary differential equation), while existing RL methods mostly rely on SDEs (stochastic differential equations) and noise-based exploration. The two do not fit together directly.
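A toy one-dimensional example illustrates the gap. The velocity field below is a hypothetical stand-in for a learned flow-matching model, and the SDE variant simply adds Gaussian noise to the same drift. It is not the marginal-preserving ODE-to-SDE conversion used in practice, only a sketch of where the randomness comes from.

```python
import math
import random

def velocity(x, t):
    """Toy stand-in for a learned velocity field: pulls x toward 1.0."""
    return 1.0 - x

def sample_ode(x0, steps=50):
    """Deterministic Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += velocity(x, i * dt) * dt
    return x

def sample_sde(x0, rng, sigma=0.5, steps=50):
    """Euler-Maruyama integration: same drift plus Gaussian noise at every step."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x += velocity(x, i * dt) * dt + noise
    return x

rng = random.Random(0)
x0 = -0.3  # shared starting point
ode_runs = [sample_ode(x0) for _ in range(4)]
sde_runs = [sample_sde(x0, rng) for _ in range(4)]

print("ODE:", ode_runs)  # four identical values
print("SDE:", sde_runs)  # four different values
```

From a fixed starting point the ODE sampler returns the same result every time, so there is nothing for a policy-gradient method to compare. The SDE sampler does produce a group of different outcomes, but only by injecting noise that the ODE-based generator was never trained to use.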

Tsinghua's KVPO is designed to resolve this mismatch.

Where the Problem Lies

The process of autoregressive video generation can be understood as: starting from the first frame, generating frame by frame, with each step depending on information from all previous frames. When existing RL methods perform policy optimization, they add noise to "explore" different generation paths.

But here's the catch: noise perturbation changes pixel-level details—color shades, texture granularity—rather than semantic-level content—story progression, object motion trajectories. If you want the model to learn to "generate more narrative-driven videos," what it actually explores is just "make this pixel slightly brighter."

It's like asking someone to learn creative writing but only allowing them to change punctuation marks.

KVPO's Core Innovation: Finding Semantic Shifts in the KV Cache

KVPO takes a highly imaginative approach: shifting the source of exploration from random noise to the historical KV Cache.

In autoregressive generation, the KV Cache stores the key-value pairs of all historical tokens, essentially serving as the model's "memory." By applying random routing to historical entries in the KV Cache, KVPO constructs semantically distinct generation branches—because different combinations of historical memories naturally lead to different storylines and visual content.

Even better, this exploration remains strictly on the data manifold—the generated content avoids absurd OOD (out-of-distribution) results, since all variations originate from the representation space the model has already learned.
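The article does not spell out the routing rule, so the following is only one possible reading, sketched in plain Python: each branch attends over a different random subset of the cached history. The `route_cache` rule, the subset size, and the toy single-head attention are all assumptions made for illustration and are not taken from the KVPO code.

```python
import math
import random

def attend(query, keys, values):
    """Single-head softmax attention over lists of key and value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def route_cache(cache, keep, rng):
    """Hypothetical routing rule: keep a random subset of history, in order."""
    idx = sorted(rng.sample(range(len(cache)), keep))
    return idx, [cache[i] for i in idx]

rng = random.Random(0)
dim, history = 4, 8
# Toy cache: one (key, value) pair per previously generated chunk.
cache = [([rng.gauss(0, 1) for _ in range(dim)],
          [rng.gauss(0, 1) for _ in range(dim)]) for _ in range(history)]
query = [rng.gauss(0, 1) for _ in range(dim)]

branches = []
for _ in range(4):
    idx, routed = route_cache(cache, keep=5, rng=rng)
    keys, values = zip(*routed)
    branches.append((idx, attend(query, keys, values)))

for idx, out in branches:
    print(idx, [round(x, 3) for x in out])
```

In this toy setting every branch output is a weighted average of value vectors already in the cache, which gives some intuition for the on-manifold claim: the variation comes from recombining existing memory rather than from injected noise.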

ODE-Native Policy Modeling

Having solved the exploration problem, KVPO also redesigns the policy modeling approach.

Traditional RL policies are "outsiders" in the video generation context—SDE-based surrogate policies mismatch the dynamic characteristics of ODEs. KVPO introduces a velocity field surrogate policy based on Trajectory Velocity Energy (TVE):

Quantifies the "likelihood" of different generation branches within the velocity space of flow matching
Constructs a reward-weighted contrastive objective that is fully consistent with the native ODE formulation
Requires no SDE approximations or surrogate conversions

This ODE-native design naturally aligns RL signals with the mathematical foundation of video generation, avoiding the theoretical inconsistencies present in previous methods.
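The article does not give the TVE formula or the exact objective, so the sketch below is an assumption-laden illustration rather than KVPO's method. It assumes that a branch's energy is the mean squared gap between predicted velocities and the velocities along that branch, and that the loss weights each energy by a group-normalized reward. Both function names are hypothetical.

```python
import statistics

def trajectory_energy(pred_velocities, branch_velocities):
    """Assumed energy: mean squared gap between predicted and branch velocities."""
    total, count = 0.0, 0
    for pred, target in zip(pred_velocities, branch_velocities):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count

def reward_weighted_loss(energies, rewards, eps=1e-8):
    """Assumed objective: minimizing it lowers energy for above-average branches."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return sum(a * e for a, e in zip(advantages, energies)) / len(energies)

# Toy data: two integration steps in 2-D; one shared prediction for simplicity.
pred = [[0.1, 0.2], [0.3, 0.1]]
branch_velocities = [
    [[0.1, 0.2], [0.3, 0.1]],   # matches the prediction exactly
    [[0.5, 0.2], [0.3, 0.4]],
    [[-0.2, 0.0], [0.1, 0.1]],
]
rewards = [0.9, 0.4, 0.2]  # hypothetical preference scores per branch

energies = [trajectory_energy(pred, b) for b in branch_velocities]
loss = reward_weighted_loss(energies, rewards)
print(energies, loss)
```

Under these assumptions, lower energy plays the role of higher likelihood. Minimizing the loss pulls the velocity field toward high-reward branches and away from low-reward ones, without any SDE step in between.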

Experimental Results

KVPO was tested across multiple distilled autoregressive video generators, with the following reported gains:

Improved Visual Quality: Better image details, color grading, and composition
Improved Motion Quality: Enhanced coherence and naturalness of object movements
Improved Text-Video Consistency: Higher alignment between generated content and prompts
Broad Applicability: Benefits both single-prompt short videos and multi-prompt long videos

Deeper Implications

KVPO's technical roadmap hints at a broader trend: alignment methods for video generation must be specifically designed for video, rather than simply borrowing methodologies from text models. Videos encompass temporal, spatial, and semantic narrative dimensions, each requiring corresponding exploration strategies and alignment objectives.

This also means that RL alignment for video generation is only just beginning. KVPO provides a viable starting point, but there is still a long road ahead before video generation truly satisfies human viewers.

Main Sources:

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
https://richard-zhang-ai.github.io/KVPO-Project/
https://github.com/Richard-Zhang-AI/KVPO

