Causal Forcing++: Tsinghua ML Group's Real-Time Video Generation via Few-Step Diffusion Distillation

The persistent problem in video generation: diffusion models produce high-quality output but are slow, while autoregressive models are fast but sacrifice quality. There seems to be an unavoidable wall between the two.

Tsinghua ML group's Causal Forcing++ aims to tear that wall down — making diffusion models capable of real-time interactive video generation.

Old Problem, New Solution in Diffusion Distillation

Diffusion distillation isn't new. Earlier work such as SDXL Turbo and LCM already showed that distilling a 50-step diffusion process into 1-4 steps is feasible. But video generation is far more complex than image generation: every frame must not only look good individually but also stay temporally coherent with its neighbors. Image distillation never has to handle this causal dependency across time.

The key insight of Causal Forcing++ is in the name: "causal forcing." In video generation, each frame depends on previous frames — frame 30's character position is determined by frame 29, which depends on frame 28. This is a causal chain.
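To make the causal chain concrete, here is a minimal sketch in plain Python. It is not the paper's model: a "frame" is reduced to a single number (a character's x-position), and the names `causal_mask`, `rollout`, and `toy_step` are hypothetical, chosen only for illustration.

```python
def causal_mask(num_frames):
    """mask[t][s] is True when frame t is allowed to depend on frame s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def rollout(first_frame, step_fn, num_frames):
    """Generate frames one at a time; each new frame sees only earlier frames."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(step_fn(frames))
    return frames


def toy_step(history):
    """Toy dynamics (an assumption): the character moves right by 1 per frame."""
    return history[-1] + 1


frames = rollout(0, toy_step, 5)
mask = causal_mask(3)
```

The mask is lower-triangular: frame 2 can look at frames 0, 1, and 2, but frame 0 cannot look ahead. The rollout enforces the same rule procedurally, because a frame that does not exist yet cannot influence the one being generated.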

The distillation challenge: the teacher model generates slowly over 50 steps, while the student model generates quickly in 4, so their intermediate states do not line up one-to-one. Traditional distillation methods match only the final output and ignore the causal structure of the intermediate steps.

Causal Forcing++ forces the student model to maintain the same causal dependency relationships as the teacher, even during fast generation. Not just learning the result — learning the process.
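One loose way to see why matching only the end result is not enough is the toy comparison below. It is an illustration under stated assumptions, not the Causal Forcing++ objective: frames are scalars, and both loss functions (`final_only_loss`, `per_frame_loss`) are hypothetical names invented for this sketch.

```python
def final_only_loss(teacher_frames, student_frames):
    """Squared error on the last frame only."""
    return (teacher_frames[-1] - student_frames[-1]) ** 2


def per_frame_loss(teacher_frames, student_frames):
    """Mean squared error over every frame in the sequence."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_frames, student_frames)]
    return sum(diffs) / len(diffs)


# Teacher moves smoothly; the student ends in the right place
# but jumps around on the way there.
teacher = [0, 1, 2, 3, 4]
student = [0, 3, 1, 2, 4]

loss_final = final_only_loss(teacher, student)
loss_frames = per_frame_loss(teacher, student)
```

Here the final-only loss is zero even though the student's trajectory is visibly wrong, while the per-frame loss is positive. A distillation signal that looks at the whole sequence can penalize broken temporal structure that an end-point comparison misses.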

What "Real-Time Interactive" Actually Means

Real-time interactive video generation isn't just about "fast generation." It means:

Users input text/image prompts and see video in seconds
Conditions can be modified mid-generation (e.g., "make this person walk left"), with instant video response
No waiting minutes for results — the interaction experience resembles chatting with ChatGPT

If this goal is truly achieved, video generation shifts from "offline batch task" to "interactive creative tool."
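A back-of-the-envelope check shows why step count matters for interactivity. The numbers below (10 ms per denoiser call, 16 fps playback) are hypothetical and chosen only for illustration, as is the assumption that each frame costs `steps_per_frame` sequential denoiser calls; they are not measurements from the paper.

```python
def is_real_time(steps_per_frame, ms_per_step, target_fps):
    """True when generating one frame takes no longer than displaying it."""
    ms_per_frame = steps_per_frame * ms_per_step
    frame_budget_ms = 1000 / target_fps
    return ms_per_frame <= frame_budget_ms


MS_PER_STEP = 10   # hypothetical cost of one denoiser call
TARGET_FPS = 16    # hypothetical playback rate, i.e. a 62.5 ms budget per frame

teacher_ok = is_real_time(50, MS_PER_STEP, TARGET_FPS)  # 500 ms per frame
student_ok = is_real_time(4, MS_PER_STEP, TARGET_FPS)   # 40 ms per frame
step_reduction = 50 / 4
```

Under these assumptions the 50-step teacher misses the per-frame budget by a wide margin, while the 4-step student fits inside it. Cutting steps from 50 to 4 is a 12.5x reduction in denoiser calls, which is the kind of gap that separates an offline batch job from an interactive tool.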

Community Response

The paper, from thu-ml (Tsinghua Machine Learning Group), received 72 upvotes on Hugging Face Daily Papers. The group's earlier work, such as DPM-Solver and SageAttention, has established its credibility in the community, so the attention is justified.

Points That Need Verification

Quality loss: How much do video quality and temporal coherence actually degrade after distillation?
Generalization: Distilled models typically perform well on training distribution — but what happens with unseen scenes (new object combinations, novel motion patterns)?
Reproducibility: Distillation is sensitive to hyperparameters — can the community reproduce these results?

My Take

The direction is right. For video generation to truly enter workflows, latency must drop to seconds. If Causal Forcing++ finds an acceptable balance between quality and speed, it could become a standard component in video generation pipelines.

But don't jump to conclusions. The distilled video generation space has seen many "looks great on paper, falls short in practice" projects over the past two years. The key is the quality of open-source code and pretrained models — if they only publish a paper without releasing the model, it's just academic competition.

Main sources:

Hugging Face Daily Papers (2026-05-15)
Tsinghua ML Group (thu-ml)
