The field of video generation is undergoing a subtle but significant shift: from “generating videos that look realistic” to “generating videos that users actually want.” The gap between these two goals is far larger than one might imagine.
This paper, CogOmniControl, comes from the Jianbing Shen team at Beijing Institute of Technology. It tackles a specific challenge, controllable video generation: producing not just any plausible video, but precisely the one the user's creative vision calls for.
Core Idea: Separating “Thinking” from “Drawing”
CogOmniControl’s design philosophy is simple yet effective: decompose controllable video generation into two sequential stages, creative intent cognition (via CogVLM) followed by video generation (via CogOmniDiT).
This may sound like common sense, but most existing video generation models either inject conditions through adapters or embed a generic vision-language model (VLM) directly into the diffusion backbone. As a result, they suffer from a capability mismatch: condition control fidelity and generation quality do not keep pace with each other.
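The two-stage split can be sketched as plain function composition. This is only an illustration of the "think first, then draw" idea: the names `CreativeInputs`, `IntentPlan`, `run_two_stage`, and the toy stages are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CreativeInputs:
    """Sparse user-provided conditions (field names are hypothetical)."""
    prompt: str
    condition_paths: Dict[str, str] = field(default_factory=dict)


@dataclass
class IntentPlan:
    """Structured output of the cognition stage (hypothetical shape)."""
    description: str
    constraints: List[str]


def run_two_stage(
    inputs: CreativeInputs,
    interpret: Callable[[CreativeInputs], IntentPlan],
    generate: Callable[[IntentPlan, CreativeInputs], str],
) -> str:
    """Stage 1 turns sparse cues into a plan; stage 2 renders from that plan."""
    plan = interpret(inputs)       # "thinking"
    return generate(plan, inputs)  # "drawing"


# Stand-in stages so the sketch runs without any model weights.
def toy_interpret(inputs: CreativeInputs) -> IntentPlan:
    constraints = [f"follow the {name}" for name in sorted(inputs.condition_paths)]
    return IntentPlan(description=inputs.prompt.strip(), constraints=constraints)


def toy_generate(plan: IntentPlan, inputs: CreativeInputs) -> str:
    return f"video<{plan.description} | {'; '.join(plan.constraints)}>"
```

Because the generator only sees the plan and the raw conditions, either stage can be swapped or retrained without touching the other, which is the practical appeal of the decomposition.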
CogVLM: A Vision-Language Model That Speaks “Creative Language”
The key innovation lies in CogVLM’s training data: real-world anime production data, not generic image-text pairs.
Why anime? Because professional anime production is full of “abstract condition → concrete frame” transformations: storyboards, clay renders, and concept art are all sparse, high-level, abstract creative inputs that must be turned into finished frames. Training a VLM on such data lets it interpret users’ creative intent more precisely and convert sparse cues into rich, structured reasoning outputs.
CogOmniDiT: In-Context Unified Multi-Condition Control
On the generation side, CogOmniControl adopts CogOmniDiT, which unifies control signals from diverse modalities via in-context generation, and aligns its outputs with CogVLM’s reasoning through reinforcement learning (RL).
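The summary does not spell out how the in-context mechanism is implemented. One common reading of in-context conditioning, assumed here, is that tokens from every condition are concatenated with the noisy video tokens into a single sequence that the transformer attends over jointly. The sketch below shows only that packing step with toy integer tokens; `pack_in_context` and the segment tags are hypothetical names.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (segment tag, token id): a toy stand-in for an embedding


def pack_in_context(
    conditions: Dict[str, List[int]],
    noisy_video: List[int],
) -> Tuple[List[Token], List[bool]]:
    """Concatenate condition tokens and noisy video tokens into one sequence.

    Returns the packed sequence and a mask marking the video positions,
    i.e. the only positions a denoiser would be asked to predict.
    """
    sequence: List[Token] = []
    for name in sorted(conditions):  # fixed order keeps packing deterministic
        sequence.extend((name, tok) for tok in conditions[name])
    is_target = [False] * len(sequence) + [True] * len(noisy_video)
    sequence.extend(("video", tok) for tok in noisy_video)
    return sequence, is_target
```

Under this assumption, adding a new control modality means adding another tagged segment rather than another adapter branch, which is one way a single model could handle many condition types.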
Closed-Loop Architecture
Even more intriguingly, CogOmniControl structures the entire system as a closed-loop “harness-like” architecture:
- CogVLM interprets the user’s creative intent
- CogOmniDiT generates candidate videos
- CogVLM also acts as an evaluator, dynamically planning task-specific assessment criteria
- A Best-of-N selection mechanism picks the optimal output
This enables the system not only to generate, but also to evaluate its own outputs and iteratively improve on them.
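The loop above can be sketched as a Best-of-N selection in which the evaluator first plans weighted criteria and then scores each candidate against them. This is a minimal illustration under assumptions: the weighted-average scoring rule, the function names, and the toy stand-ins are hypothetical and are not the paper's actual procedure.

```python
from typing import Callable, Dict, Tuple

Criteria = Dict[str, float]  # criterion name -> weight


def best_of_n(
    intent: str,
    n: int,
    generate: Callable[[str, int], str],
    plan_criteria: Callable[[str], Criteria],
    score: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Generate n candidates and return the one with the highest weighted score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    criteria = plan_criteria(intent)  # evaluator plans task-specific criteria
    total_weight = sum(criteria.values())
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive value")
    best_video, best_score = "", float("-inf")
    for seed in range(n):
        video = generate(intent, seed)  # one candidate per seed
        weighted = sum(w * score(video, name) for name, w in criteria.items())
        weighted /= total_weight
        if weighted > best_score:  # ties keep the earlier candidate
            best_video, best_score = video, weighted
    return best_video, best_score


# Toy stand-ins for the generator and evaluator (assumed, for illustration).
def fake_generate(intent: str, seed: int) -> str:
    return f"{intent}#{seed}"


def fake_plan(intent: str) -> Criteria:
    return {"motion": 2.0, "style": 1.0}


def fake_score(video: str, criterion: str) -> float:
    seed = int(video.rsplit("#", 1)[1])
    table = {"motion": [0.2, 0.9, 0.5], "style": [0.8, 0.6, 0.9]}
    return table[criterion][seed]


winner, winner_score = best_of_n("chase scene", 3, fake_generate, fake_plan, fake_score)
```

Planning the criteria per request, rather than fixing one global metric, is what lets the same evaluator judge a storyboard-following task differently from a style-transfer task.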
Two New Benchmarks
The paper also introduces two novel, professionally grounded benchmarks: CogReasonBench and CogControlBench. Built from real animation production workflows, they encode authentic creative intent rather than synthetic or simulated intent. According to the paper, CogOmniControl outperforms all existing open-source models on both benchmarks.
Paper link: arXiv:2605.19995