In the multimodal model arena, two distinct approaches are advancing in parallel.
On one side is the "brute-force scaling" approach: making models larger, using more data, and training longer, hoping that cross-modal capabilities will naturally emerge from sheer parameter scale. On the other side is ByteDance's newly released Lance, which explicitly rejects the "capacity scaling" route and instead aims to solve unified multimodal modeling through architectural innovation and a new training paradigm.
Unification ≠ Patchwork
First, let's clarify what "unified multimodal" means: a single model capable of simultaneously understanding (comprehending images and videos), generating (creating images and videos from text), and editing (modifying existing images and videos).
Previous approaches either trained three separate models or relied on a single massive model that switches capabilities through different prompt formats. The former is costly, while the latter is prone to capability interference, because understanding and generation place conflicting demands on the same parameters.
Lance takes a clever approach:
Dual-stream MoE (Mixture of Experts) architecture. The model shares a foundational multimodal sequence representation, but splits into two independent expert pathways at the upper layers—one dedicated to understanding tasks, and the other to generation/editing tasks. Both pathways benefit from shared "in-context learning" (such as understanding image-text alignments), but their respective parameters do not interfere with each other.
This design targets a fundamental tension: understanding requires discriminative, fine-grained analysis, while generation demands expressive, creative synthesis. Forcing both into the same set of parameters typically results in mediocrity across the board.
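The shared-trunk, split-pathway idea described above can be illustrated with a toy sketch. This is not Lance's implementation: the class `DualStreamToy`, the task labels, the single-layer "experts", and the hard task-level routing are all assumptions made for illustration, and a real MoE layer may route differently.

```python
import random

# Assumed task labels for illustration; the paper's own task taxonomy may differ.
GENERATION_TASKS = {"generate", "edit"}


def make_weights(rng, rows, cols):
    """Build a small random weight matrix as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DualStreamToy:
    """Toy model: one shared layer, then one of two task-specific expert layers."""

    def __init__(self, dim=4, seed=0):
        rng = random.Random(seed)
        # Shared representation used by every task.
        self.shared = make_weights(rng, dim, dim)
        # Two independent parameter sets, one per pathway.
        self.experts = {
            "understanding": make_weights(rng, dim, dim),
            "generation": make_weights(rng, dim, dim),
        }

    def route(self, task):
        """Pick the expert pathway from the task label (hard routing, an assumption)."""
        return "generation" if task in GENERATION_TASKS else "understanding"

    def forward(self, x, task):
        hidden = linear(self.shared, x)  # shared trunk
        return linear(self.experts[self.route(task)], hidden)  # task-specific pathway
```

In this sketch, changing `experts["generation"]` leaves the output of an understanding task untouched, which is the kind of parameter isolation the dual-stream design is after, while both pathways still read from the same shared layer.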
The Mechanics of Multi-Task Synergistic Training
Architecture alone isn't enough; the training methodology is where Lance truly differentiates itself.
The paper introduces a phased multi-task training paradigm, centered on a "capability-oriented" philosophy:
- Early Stage: First, teach the model basic cross-modal alignment—image-text matching and inter-frame video relationships
- Mid Stage: Introduce generation and editing tasks, but employ adaptive data scheduling to ensure understanding and generation capabilities grow in tandem
- Late Stage: Conduct focused training on weaker tasks
This training strategy is designed to avoid the "catastrophic forgetting" common in traditional unified models, where learning generation causes the model to forget understanding, or vice versa.
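One way to picture the staged schedule is as a sampler that gives lagging tasks a larger share of each batch. The sketch below is a guess at the general shape, not the paper's algorithm: the stage table, the `sharpness` knob, the 0.05 floor, and the use of a validation score in [0, 1] are all hypothetical.

```python
# Hypothetical stage table: which tasks are sampled, and how strongly
# the sampler favors tasks with low scores.
STAGES = {
    "early": {"tasks": ["alignment"], "sharpness": 1.0},
    "mid": {"tasks": ["alignment", "understanding", "generation", "editing"],
            "sharpness": 1.0},
    "late": {"tasks": ["understanding", "generation", "editing"],
             "sharpness": 3.0},
}


def sampling_weights(stage, scores):
    """Return per-task sampling probabilities for one training stage.

    scores maps task name -> validation score in [0, 1] (an assumed metric).
    Tasks with lower scores receive a larger share of the next batch.
    """
    cfg = STAGES[stage]
    deficits = {}
    for task in cfg["tasks"]:
        # Deficit = distance from a perfect score; the floor keeps every task sampled.
        deficit = max(1.0 - scores.get(task, 0.0), 0.05)
        deficits[task] = deficit ** cfg["sharpness"]
    total = sum(deficits.values())
    return {task: d / total for task, d in deficits.items()}


scores = {"alignment": 0.9, "understanding": 0.8, "generation": 0.5, "editing": 0.4}
for stage in ("early", "mid", "late"):
    print(stage, sampling_weights(stage, scores))
```

With a higher `sharpness` in the late stage, the weakest task takes most of the sampling budget, which loosely mirrors the "focused training on weaker tasks" step, while the mid stage keeps all tasks in play so that none falls far behind.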
The paper also introduces modality-aware Rotary Position Embedding (RoPE), a highly practical innovation. Tokens from different modalities (text tokens, image patch tokens, video frame tokens) have distinct positional encoding requirements, so applying one uniform RoPE to all of them can cause cross-modal interference. Lance's positional encoding identifies the modality type of each token and applies an encoding strategy tailored to it.
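As a rough illustration of what "modality-aware" could mean in practice, the sketch below applies the standard RoPE rotation but picks the frequency base and position scale from a per-modality table. The table `MODALITY_CONFIG`, its values, and the choice of base and scale as the tailored parameters are assumptions for illustration; this article does not describe the actual strategy Lance uses.

```python
import math

# Hypothetical per-modality settings; not taken from the paper.
MODALITY_CONFIG = {
    "text": {"base": 10000.0, "scale": 1.0},
    "image": {"base": 10000.0, "scale": 0.5},
    "video": {"base": 50000.0, "scale": 0.25},
}


def rope_rotate(vec, position, base=10000.0):
    """Apply the standard RoPE rotation to an even-length vector."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair (i, i + 1) rotates at its own frequency.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def modality_aware_rope(vec, position, modality):
    """Rotate vec using the settings registered for the token's modality."""
    cfg = MODALITY_CONFIG[modality]
    return rope_rotate(vec, position * cfg["scale"], cfg["base"])
```

Within a single modality the usual RoPE property still holds: the dot product of a rotated query and a rotated key depends only on the distance between their positions. Tokens from different modalities at the same index, however, receive different rotations.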
Performance
According to the paper, Lance "substantially outperforms existing open-source unified models" on image and video generation tasks while maintaining strong multimodal understanding capabilities.
Specifically, as a "lightweight" model, Lance surpasses larger-parameter competitors in video generation quality. This is attributed to the dual-stream architecture avoiding parameter waste, combined with the synergistic effects brought by multi-task training.
ByteDance's Multimodal Ambitions
Given ByteDance's massive business footprint in short videos and content generation, the release of Lance is far from a purely academic exercise. A unified, lightweight multimodal model can directly serve the content creation toolchains for products like Douyin, TikTok, and CapCut: understanding user intent, automatically generating assets, and intelligently editing videos, all in one seamless pipeline.
Its open-source release under Apache-2.0 (GitHub: bytedance/Lance, 134 Stars at the time of writing) also suggests that ByteDance wants community involvement to iterate on and validate the model quickly.
Key Points to Watch
- How lightweight is it exactly? The paper emphasizes "lightweight" but does not specify the exact parameter count; community benchmarks will be needed
- Long-video capabilities: While Lance supports video generation and editing, the paper lacks detailed benchmarks on maximum duration and resolution
- Open-source progress: Currently at 134 Stars, it's still in early stages; code completeness and usability remain to be seen
Primary Sources:
- Lance: Unified Multimodal Modeling by Multi-Task Synergy
- https://lance-project.github.io/
- https://github.com/bytedance/Lance