CogOmniControl: Turning “Creative Intent Understanding” into a Reasoning Engine for Video Generation

The field of video generation is undergoing a subtle but significant shift: from “generating videos that look realistic” to “generating videos that users actually want.” The gap between these two goals is far larger than one might imagine.

CogOmniControl comes from the Jianbing Shen team at Beijing Institute of Technology. It tackles a specific challenge, controllable video generation: producing not just any plausible video, but the one that matches the user’s creative vision.

Core Idea: Separating “Thinking” from “Drawing”

CogOmniControl’s design philosophy is simple yet effective: decompose controllable video generation into two sequential stages, creative intent cognition (via CogVLM) followed by video generation (via CogOmniDiT).

This may sound like common sense—but most existing video generation models either inject conditions via adapters or embed generic vision-language models (VLMs) directly into their diffusion backbones. As a result, they suffer from a capability mismatch between condition control fidelity and generation quality.

CogVLM: A Vision-Language Model That Speaks “Creative Language”

The key innovation lies in CogVLM’s training data: real-world anime production data, not generic image-text pairs.

Why anime? Because professional anime production is full of “abstract condition → concrete frame” transformations: storyboards, clay renders, and concept art are all sparse, high-level creative inputs that have to be turned into finished frames. Training a VLM on such data teaches it to read creative intent the way a production professional would, converting sparse cues into rich, structured reasoning outputs.

CogOmniDiT: In-Context Unified Multi-Condition Control

On the generation side, CogOmniControl adopts CogOmniDiT, which unifies control signals from diverse modalities via in-context generation, and aligns its outputs with CogVLM’s reasoning through reinforcement learning (RL).
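
As a rough illustration of what in-context conditioning usually means for a diffusion transformer: rather than one adapter per condition type, tokens from every condition are placed in the same sequence as the noisy video tokens, so a single attention stack sees all of them. The toy sketch below shows only that sequence layout. The `Token` class, `build_context`, `target_mask`, and the modality names are assumptions made for illustration, not CogOmniDiT's actual interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    """One sequence element (hypothetical format, for illustration only)."""
    source: str    # which stream the token came from, e.g. "text" or "video"
    index: int     # position within its own stream
    values: tuple  # stand-in for an embedding vector


def build_context(conditions: Dict[str, List[tuple]],
                  noisy_video: List[tuple]) -> List[Token]:
    """Concatenate all condition streams and the noisy video latents."""
    sequence: List[Token] = []
    # Sort modality names so the layout does not depend on dict order.
    for name in sorted(conditions):
        for i, vec in enumerate(conditions[name]):
            sequence.append(Token(name, i, vec))
    # Target tokens go last; a real model would denoise only these positions.
    for i, vec in enumerate(noisy_video):
        sequence.append(Token("video", i, vec))
    return sequence


def target_mask(sequence: List[Token]) -> List[bool]:
    """True where a token belongs to the video being generated."""
    return [tok.source == "video" for tok in sequence]


context = build_context(
    {"text": [(0.1, 0.2)], "storyboard": [(0.3, 0.4), (0.5, 0.6)]},
    noisy_video=[(0.0, 0.0), (0.0, 0.0)],
)
mask = target_mask(context)
```

The point of the layout is that adding a new condition type only adds another stream to the same sequence, with no new conditioning branch in the network.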

Closed-Loop Architecture

Even more intriguingly, CogOmniControl structures the entire system as a closed-loop “harness-like” architecture:

1. CogVLM interprets the user’s creative intent.
2. CogOmniDiT generates candidate videos.
3. CogVLM also acts as an evaluator, dynamically planning task-specific assessment criteria.
4. A Best-of-N selection mechanism picks the optimal output.

This enables the model not only to generate but also to self-evaluate and iteratively improve.
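
The four steps above can be sketched as a plain Best-of-N loop. This is a minimal sketch under stated assumptions: the four callables (`interpret`, `generate`, `plan_criteria`, `score`) are hypothetical stand-ins for the two models, and summing per-criterion scores is an assumed aggregation rule, not one taken from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces; none of these names come from the paper.
Interpret = Callable[[str], str]               # sparse intent -> reasoning text
Generate = Callable[[str, int], str]           # (reasoning, seed) -> candidate id
PlanCriteria = Callable[[str], Sequence[str]]  # reasoning -> assessment criteria
Score = Callable[[str, str], float]            # (candidate, criterion) -> score


def best_of_n(intent: str, n: int, interpret: Interpret, generate: Generate,
              plan_criteria: PlanCriteria, score: Score) -> Tuple[str, float]:
    """Generate n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    reasoning = interpret(intent)        # step 1: understand the intent
    criteria = plan_criteria(reasoning)  # step 3: plan task-specific criteria
    best: Optional[str] = None
    best_total = float("-inf")
    for seed in range(n):
        candidate = generate(reasoning, seed)  # step 2: one candidate per seed
        # Assumed aggregation: sum the scores over all planned criteria.
        total = sum(score(candidate, c) for c in criteria)
        if total > best_total:           # step 4: keep the best so far
            best, best_total = candidate, total
    assert best is not None
    return best, best_total


# Toy stand-ins so the loop can run without any model.
TOY_QUALITY = {0: 0.2, 1: 0.9, 2: 0.5}


def toy_score(candidate: str, criterion: str) -> float:
    """Look up a fixed quality value from the seed encoded in the candidate id."""
    return TOY_QUALITY[int(candidate.rsplit("seed", 1)[1])]


winner, winner_score = best_of_n(
    "storyboard: hero turns toward the camera", 3,
    interpret=lambda intent: f"plan({intent})",
    generate=lambda reasoning, seed: f"{reasoning}#seed{seed}",
    plan_criteria=lambda reasoning: ["motion", "layout"],
    score=toy_score,
)
```

In this sketch the evaluator plans its criteria once per request, before any candidate exists, so every candidate is judged against the same task-specific rubric.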

Two New Benchmarks

The paper also introduces two novel, professionally grounded benchmarks: CogReasonBench and CogControlBench. Built from real animation production workflows, they encode authentic creative intent—not synthetic or simulated intent. On both benchmarks, CogOmniControl surpasses all existing open-source models.

Paper link: arXiv:2605.19995
