DeepSeek V4 Pro Matches GPT-5.2 on FoodTruck Bench: US-China Frontier Gap Shrinks to 10 Weeks

Core Signal

DeepSeek V4 Pro has matched GPT-5.2 on the FoodTruck Bench agentic evaluation, making it the first Chinese model to enter the benchmark's frontier tier.

The real kicker? Cost efficiency. DeepSeek V4 Pro costs roughly one-eighth as much as GPT-5.2, and the gap reportedly widens to about 17x once costs are adjusted for equivalent output quality.

What Is FoodTruck Bench?

FoodTruck Bench is an evaluation benchmark focused on agentic capabilities, measuring a model’s ability to autonomously plan, call tools, perform multi-step reasoning, and execute tasks in real-world scenarios. Unlike traditional static Q&A evaluations, it requires the model to complete end-to-end workflows like a real “digital employee.”

The evaluation team stated in their official announcement:

“DeepSeek V4 Pro just matched GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~8× cheaper. First Chinese model in our frontier tier.”

There are three layers of information behind this statement worth unpacking:

Layer one: Capability parity. DeepSeek V4 Pro performs on par with GPT-5.2 on agentic tasks. Given that GPT-5.2 is one of OpenAI’s strongest general-purpose models today, this is a milestone with symbolic significance.

Layer two: The time gap. The evaluators deliberately emphasized the phrase "10 weeks later." The gap between US and Chinese frontier models was previously estimated at about a year; on this benchmark it has now been compressed to under three months.

Layer three: Cost advantage. An 8x price difference means that an enterprise replacing GPT-5.2 with DeepSeek V4 Pro for the same agentic workloads could cut annual API spending from the million-dollar range to the hundred-thousand-dollar range; a $1 million bill, for example, would fall to roughly $125,000.
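The arithmetic behind that estimate can be sketched in a few lines. This is a rough illustration only: the ~8x ratio comes from the announcement, while the $1,000,000 baseline and the assumption of identical token volume after switching are hypothetical.

```python
# Only the ~8x ratio comes from the announcement; everything else is assumed.
PRICE_RATIO = 8  # GPT-5.2 cost divided by DeepSeek V4 Pro cost


def projected_annual_spend(current_spend_usd: float,
                           price_ratio: float = PRICE_RATIO) -> float:
    """Estimate annual spend after switching, assuming the same token volume."""
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return current_spend_usd / price_ratio


baseline = 1_000_000.0  # hypothetical annual GPT-5.2 API bill in USD
after_switch = projected_annual_spend(baseline)
savings = baseline - after_switch

print(f"after switch: ${after_switch:,.0f}, savings: ${savings:,.0f}")
# prints "after switch: $125,000, savings: $875,000"
```

Real savings would also depend on how many tokens each model needs per completed task, which this sketch does not model.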

Independent Verification

This news has been cross-validated by multiple sources:

Analysis from Caisi Evaluations indicates that DeepSeek V4's overall capabilities lag behind US frontier models by approximately 8 months, but that the V4 Pro version has caught up on agentic tasks through optimized reasoning paths and tool-calling strategies.
Multiple independent developers shared their hands-on experience with DeepSeek V4 Pro on X: “Now, a week in… it’s seamless man.” The move from an initial adjustment period to smooth daily use suggests DeepSeek V4 Pro can already replace GPT in certain real workflows.
Notably, DeepSeek V4 Pro can also be used with Claude Code: switching requires setting just three environment variables, giving developers a plug-and-play alternative.
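The article does not name the three variables. As a hedged sketch, the snippet below assumes they are the overrides Claude Code reads for a custom endpoint (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL); the endpoint URL, API key, and model id are placeholders, and `claude_code_env` is a hypothetical helper.

```python
import os


def claude_code_env(base_url: str, api_key: str, model: str,
                    base_env=None) -> dict:
    """Return a copy of the environment with three overrides applied.

    The variable names are an assumption; the article does not list them.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "ANTHROPIC_BASE_URL": base_url,    # where requests are sent
        "ANTHROPIC_AUTH_TOKEN": api_key,   # credential for that endpoint
        "ANTHROPIC_MODEL": model,          # model id requested by default
    })
    return env


env = claude_code_env(
    base_url="https://example.invalid/anthropic",  # placeholder endpoint
    api_key="sk-placeholder",                      # placeholder credential
    model="deepseek-v4-pro",                       # hypothetical model id
    base_env={"PATH": "/usr/bin"},                 # small base env for the demo
)
print(sorted(env))
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when launching the CLI, so the overrides apply to that process only and the shell's own settings stay untouched.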

Practical Implications for Developers

Cost decision window: If you’re running high-frequency agentic workloads (data scraping, code generation, automated reports), now is the time to reassess your model selection. Choosing DeepSeek V4 Pro for agentic tasks no longer means “settling”; on this benchmark it is a genuine alternative.

Multi-model strategy: The risk of single-model dependence is growing in 2026. A sensible approach is to build a model matrix: GPT-5.2 for core tasks requiring the highest reliability, DeepSeek V4 Pro for high-volume, cost-sensitive agentic loops, and the Claude 4 series for scenarios demanding fine-grained reasoning.
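One possible way to express such a model matrix is a small routing function. The sketch below is illustrative only: the `Task` fields, the model labels, the rule order, and the choice of the cheapest model as the default are all assumptions rather than anything the article specifies.

```python
from dataclasses import dataclass

# Placeholder labels; real API model ids may differ.
RELIABLE = "gpt-5.2"
CHEAP = "deepseek-v4-pro"
REASONING = "claude-4"


@dataclass(frozen=True)
class Task:
    name: str
    needs_max_reliability: bool = False
    needs_fine_reasoning: bool = False
    high_volume: bool = False


def pick_model(task: Task, default: str = CHEAP) -> str:
    """Route a task to a model; earlier rules win when several flags are set."""
    if task.needs_max_reliability:
        return RELIABLE
    if task.needs_fine_reasoning:
        return REASONING
    if task.high_volume:
        return CHEAP
    return default  # assumption: fall back to the cheapest option


print(pick_model(Task("invoice-reconciliation", needs_max_reliability=True)))
print(pick_model(Task("nightly-scrape", high_volume=True)))
print(pick_model(Task("contract-review", needs_fine_reasoning=True)))
```

Keeping the rules in one function makes it simple to swap a model or reorder priorities as prices and benchmark results change.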

Open-source ecosystem dividend: The DeepSeek series has a long-standing open-source tradition. While V4 Pro is currently available primarily via API, the transparency of its technical roadmap makes it likely that community adaptation tools will emerge quickly. Open-source projects like deepclaude already illustrate this.

What to Watch Next

Whether FoodTruck Bench will include more Chinese enterprise models (Qwen, Kimi, GLM) in its next evaluation round
Whether DeepSeek V4 Pro’s API pricing will decrease further as scale effects kick in
How OpenAI adjusts GPT-5.2 pricing in response

The competition between US and Chinese frontier models is shifting from a “capability gap” narrative to a “cost-performance race.” DeepSeek V4 Pro’s performance on FoodTruck Bench sends a clear signal: Chinese models are no longer just “cheap alternatives” and are starting to become “the better choice” in certain dimensions.
