Core Assessment
Among China’s top LLMs, the last holdout without vision support has finally closed the gap. The speed of DeepSeek V4’s image mode rollout is striking: the 1M-token context feature had barely been absorbed before another new capability dropped.
No press conference, no PR release. In classic DeepSeek style, a researcher posted and then deleted a message, and the feature quietly went live.
What Happened
Around April 30, DeepSeek V4 added an “Image Mode” (识图模式) tab to the official app, appearing alongside “Fast Mode” and “Expert Mode” above the chat input box, with the note “Image understanding feature in internal testing.”
This marks DeepSeek’s official entry into multimodal capabilities.
Real-World Test: True Understanding, Not Just OCR
The article’s author ran a simple but telling test: uploading a photo of Guilin’s Elephant Trunk Hill that contains no text at all.
DeepSeek V4 not only identified the landmark by name but also reasoned about its shape and geographic setting, demonstrating genuine scene understanding rather than mere OCR text extraction.
Test comparison:
- OCR capability: Recognizing text within images (DeepSeek previously supported this)
- Visual understanding: Comprehending scene content, reasoning about meaning (new with Image Mode)
These are two different capability levels. Image Mode fills the latter gap.
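To make the distinction concrete, here is a minimal Python sketch contrasting the two levels. The OCR half uses the real pytesseract library; the scene-understanding half is a stub standing in for a vision-language model call (DeepSeek has not published a visual API), and the filename is hypothetical.

```python
# Level 1 vs. Level 2: text extraction vs. scene understanding.
from PIL import Image
import pytesseract  # real OCR library; wraps the Tesseract engine


def ocr_extract(path: str) -> str:
    # Level 1: pull literal text out of the image. On a text-free
    # landscape photo this returns an (almost) empty string.
    return pytesseract.image_to_string(Image.open(path))


def scene_describe(path: str) -> str:
    # Level 2: reason about what the scene *is*. Stub standing in for a
    # vision-language model call; the prompt is exactly the kind of
    # question OCR cannot touch.
    prompt = "What landmark is this, and what shape and setting identify it?"
    return f"[VLM answer to {prompt!r} for {path}]"


photo = "elephant_trunk_hill.jpg"   # hypothetical filename
print(ocr_extract(photo))     # expect: nothing useful
print(scene_describe(photo))  # expect: landmark name plus reasoning
```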
Why It Matters
1. Closing the Last Gap
In China’s top-tier LLM camp, virtually every competitor (Tongyi Qianwen, ERNIE, Kimi, Zhipu GLM) already supported multimodal input; DeepSeek was the only remaining text-only player. This update closes that gap.
2. Remarkable Iteration Speed
V4 was just released, and the excitement around the 1M context window hadn’t faded before Image Mode arrived. This iteration pace places DeepSeek firmly in the first tier of Chinese LLMs.
3. Staged (Gray-Scale) Rollout
Image Mode is currently in a staged internal test, so some users may not yet see the entry point. The official advice for anyone who doesn’t see the “Image Mode” icon is to upgrade to the latest app version.
Technical Background Analysis
DeepSeek V4 had already demonstrated strong reasoning and ultra-long context handling (1M tokens). The newly added visual understanding is most likely a vision encoder grafted onto the existing architecture, rather than a multimodal model built from scratch.
Advantages of this “incremental multimodal” approach (a minimal architectural sketch follows the list):
- Fast iteration: No need to wait for a full V5 release; the existing architecture can be extended with vision
- Unified user experience: Seamless switching between text and visual tasks within the same model
- Cost-effective: Incremental training is far cheaper than building a multimodal model from scratch
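The pattern hypothesized above is well established in open research (LLaVA-style systems): a pretrained vision encoder produces patch features, and a small trained projector maps them into the frozen LLM’s token-embedding space, so image “tokens” flow through the model exactly like text tokens. Below is a minimal PyTorch sketch of that projector; all dimensions are illustrative, and nothing here claims to describe DeepSeek’s actual internals.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into an LLM's embedding space.

    Illustrative LLaVA-style sketch; DeepSeek has not disclosed V4's
    architecture. Only this projector (and optionally the vision encoder)
    is trained, while the base LLM stays frozen, which is why the
    incremental route is cheap and fast.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP, the common choice in LLaVA-1.5-style setups.
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a ViT.
        # Returns (batch, num_patches, llm_dim): pseudo-token embeddings
        # the frozen LLM consumes exactly like text embeddings.
        return self.net(patch_features)


# Shape check with dummy ViT output: a 24x24 patch grid (576 patches).
projector = VisionProjector()
image_tokens = projector(torch.randn(1, 576, 1024))
print(image_tokens.shape)  # torch.Size([1, 576, 4096])
```

The cost argument in the list above follows directly: at these dimensions the projector is tens of millions of parameters, versus billions for the frozen base model, so the incremental training bill is a small fraction of a from-scratch multimodal run.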
Industry Landscape Update
As of late April 2026, here is how China’s top-tier models compare on multimodal capability:
| Model | Text | Vision | Code | Long Context |
|---|---|---|---|---|
| DeepSeek V4 | ✅ | ✅ (Beta) | ✅ | ✅ (1M) |
| Qwen Series | ✅ | ✅ | ✅ | ✅ |
| ERNIE 5.1 | ✅ | ✅ | ✅ | ✅ |
| Kimi K2.6 | ✅ | ✅ | ✅ | ✅ |
| Zhipu GLM | ✅ | ✅ | ✅ | ✅ |
With the vision gap closed, DeepSeek V4 has largely leveled the playing field with competitors. The next phase of differentiation will center on visual accuracy, agent capabilities, and vertical-scenario optimization.
Action Items
- DeepSeek users: Upgrade to the latest app version and try Image Mode
- Competitor users: Watch for DeepSeek V4 vision capability benchmarks and compare with existing solutions
- Industry watchers: Note whether DeepSeek opens visual API access, a key signal for enterprise services (a hypothetical request shape is sketched below)
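For the API watchers: if DeepSeek does open visual access, the most plausible shape is its existing OpenAI-compatible chat API extended with image content. The sketch below is an assumption, not an announced endpoint; the `deepseek-vision` model name and image support are hypothetical, and only the base URL matches today’s text API.

```python
# Hypothetical vision request against DeepSeek's OpenAI-compatible API.
# The "deepseek-vision" model name and image_url support are assumptions.
import base64

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                     # your DeepSeek API key
    base_url="https://api.deepseek.com",  # real base URL of today's text API
)

# Inline the image as a base64 data URL, the usual OpenAI-style transport.
with open("elephant_trunk_hill.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="deepseek-vision",  # hypothetical model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What landmark is this?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            },
        ],
    }],
)
print(resp.choices[0].message.content)
```

If an endpoint like this ships, pricing and rate limits for image tokens will be the numbers to watch on the enterprise side.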