ByteDance’s Lance: A From-Scratch Trained Unified Multimodal Model for Understanding, Generation, and Editing

Unified multimodal models (the shift from unimodal to unified) are a hot track in 2026. Yet most approaches follow the old playbook: scale up parameters, or start with image-text and extend from there. Today, ByteDance’s research team launched Lance, which takes a different path: it relies not on parameter scale but on multi-task synergy.

The paper spans 34 pages, includes 14 figures and 10 tables, and its code is already open-sourced. The project homepage is at lance-project.github.io.

Two Core Design Principles

Lance’s design philosophy rests on two pillars:

1. Unified Context Modeling

Lance is trained from scratch on shared, interleaved multimodal sequences using a dual-stream Mixture-of-Experts (MoE) architecture. Understanding and generation share underlying representations, but each has its own dedicated expert pathways. This lets the model learn understanding and generation simultaneously, rather than learning one first and then adapting it to the other.

2. Decoupled Capability Pathways

Understanding tasks and generation tasks have very different requirements: understanding demands fine-grained semantic analysis, whereas generation requires high-fidelity pixel and frame output. Lance addresses this by decoupling the two pathways within the MoE framework, so each can specialize in its own domain while cross-task semantic alignment is still achieved through shared contextual learning.
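To make the two principles concrete, here is a minimal sketch of one dual-stream layer. It is an illustration under stated assumptions, not Lance’s implementation: the stream labels, the mean-based stand-in for shared attention, and the two toy expert functions are all hypothetical, and the article does not say how the paper routes tokens. The sketch only shows the shape of the idea: every token passes through the same context-mixing step, then a dedicated expert is chosen by the token’s role.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Token:
    values: Vector
    stream: str  # hypothetical label: "understand" or "generate"


def shared_context(tokens: List[Token]) -> List[Vector]:
    """Stand-in for shared attention: add the sequence mean to every token."""
    dim = len(tokens[0].values)
    mean = [sum(t.values[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[v + m for v, m in zip(t.values, mean)] for t in tokens]


def understanding_expert(x: Vector) -> Vector:
    """Toy expert for the understanding stream (assumption: scales by 2)."""
    return [2.0 * v for v in x]


def generation_expert(x: Vector) -> Vector:
    """Toy expert for the generation stream (assumption: shifts by 1)."""
    return [v + 1.0 for v in x]


EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "understand": understanding_expert,
    "generate": generation_expert,
}


def dual_stream_layer(tokens: List[Token]) -> List[Vector]:
    """Shared context first, then a stream-specific expert per token."""
    mixed = shared_context(tokens)
    return [EXPERTS[t.stream](h) for t, h in zip(tokens, mixed)]


if __name__ == "__main__":
    sequence = [Token([1.0, 0.0], "understand"), Token([3.0, 2.0], "generate")]
    print(dual_stream_layer(sequence))  # [[6.0, 2.0], [6.0, 4.0]]
```

Both tokens see the same mixed context (the shared part), but their outputs come from different experts (the decoupled part). A real model would replace the mean with attention and the toy functions with learned feed-forward experts.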

Technical Details

Modality-aware Rotary Position Embedding (RoPE): To mitigate interference among visual tokens from different modalities, Lance introduces a modality-aware positional encoding scheme, which significantly improves cross-task alignment quality.
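The article does not describe how the modality-aware scheme works, so the following is only a sketch of one plausible reading. It implements standard RoPE (rotating each pair of features by an angle proportional to the token position) and then adds a per-modality position offset, so that tokens of different modalities occupy separate position ranges. The offset table and the function modality_aware_rope are assumptions made for illustration, not the paper’s method.

```python
import math
from typing import Dict, List

# Hypothetical offsets: each modality gets its own position range.
MODALITY_OFFSET: Dict[str, int] = {"text": 0, "image": 1000, "video": 2000}


def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE: rotate each feature pair by position * base^(-i/dim)."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of features"
    out: List[float] = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_rope(vec: List[float], position: int, modality: str) -> List[float]:
    """Assumed variant: shift the position by a modality-specific offset."""
    return rope(vec, position + MODALITY_OFFSET[modality])


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.0, 0.5, -0.5]
    k = [0.2, 0.3, -0.1, 0.4]
    same = dot(modality_aware_rope(q, 5, "image"), modality_aware_rope(k, 3, "image"))
    cross = dot(modality_aware_rope(q, 5, "image"), modality_aware_rope(k, 3, "text"))
    print("same modality:", same)
    print("cross modality:", cross)
```

Because RoPE attention scores depend only on the difference between positions, a shared offset cancels for two tokens of the same modality, while tokens of different modalities are scored as if they were far apart. That is one simple way to reduce cross-modality interference; the actual scheme in the paper may differ.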

Phased Multi-Task Training: Lance adopts a phased training paradigm in which each phase targets specific capabilities and employs adaptive data scheduling strategies, jointly strengthening both semantic understanding and visual generation.
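As a rough illustration of what phased training with adaptive data scheduling could look like, the sketch below gives each phase a base sampling mixture over task types and then reweights that mixture toward tasks whose recent loss is higher. The phase names, task names, weights, and the loss-proportional rule are all hypothetical; the article does not specify Lance’s actual phases or scheduling rule.

```python
from typing import Dict, List

Weights = Dict[str, float]

# Hypothetical phases: each one emphasizes a different capability.
PHASES: List[Dict[str, object]] = [
    {"name": "alignment", "weights": {"understanding": 0.6, "image_gen": 0.3, "video_gen": 0.1}},
    {"name": "generation", "weights": {"understanding": 0.2, "image_gen": 0.4, "video_gen": 0.4}},
]


def normalize(weights: Weights) -> Weights:
    """Rescale weights so they sum to 1 and can be used as sampling probabilities."""
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


def adapt(weights: Weights, recent_loss: Weights) -> Weights:
    """Assumed adaptive rule: upweight tasks whose recent loss is above average."""
    mean_loss = sum(recent_loss.values()) / len(recent_loss)
    scaled = {task: w * recent_loss[task] / mean_loss for task, w in weights.items()}
    return normalize(scaled)


if __name__ == "__main__":
    losses = {"understanding": 1.0, "image_gen": 2.0, "video_gen": 3.0}
    for phase in PHASES:
        mixture = adapt(phase["weights"], losses)  # type: ignore[arg-type]
        print(phase["name"], mixture)
```

The point of the design is that the phase sets the overall emphasis while the adaptive step nudges sampling toward whichever task is currently lagging, so no single capability is starved during joint training.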

Performance

The paper claims that Lance substantially surpasses existing open-source unified models on both image and video generation tasks while preserving robust multimodal understanding. Concrete benchmark numbers await community replication, but given ByteDance’s proven expertise in video generation (e.g., Dream), this result is unsurprising.

Why It Matters

The core challenge for unified multimodal models is capability isolation: many models lose understanding ability when generation capability improves, or vice versa. Lance’s dual-stream MoE architecture offers a structured solution rather than merely cramming all tasks into a single monolithic model.

If community replication confirms the reported performance, Lance may become the new benchmark for open-source unified multimodal models.

Primary sources:

arXiv:2605.18678 — Lance paper
Project homepage: https://lance-project.github.io
