LMSYS Three-Year Arena Review: Open Source Models Are Closing the Gap with Proprietary Ones

In early 2023, proprietary models led open source models by 250 points in the Chatbot Arena Text Arena, a gap that looked nearly insurmountable.

By early 2026, that gap had dropped to single digits.

LMSYS published a dataset yesterday spanning three years across three Arenas (Text, Code, Expert Prompt), answering a question many people have been asking: Have open source models caught up?

The answer is basically yes. But not equally across all domains.

Text Arena: From +250 to Single Digits

This is the most intuitive curve. In early 2023, proprietary models led by 250 points. By early 2025, the lead had compressed to "low double digits." Then, in that same early-2025 window, DeepSeek R1 briefly took the top spot, giving open source a historic Arena lead.

That lead didn't last. Proprietary models quickly reclaimed #1, but the gap is no longer an order of magnitude difference.

Code Arena: Compression Even Faster

Code Arena has a shorter history, but the gap closed more aggressively. The proprietary lead peaked at +100, then compressed through spring 2026 to around +40 today.

+40 means proprietary still has a perceptible advantage, but it's no longer "once you try it you can't go back" territory.
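To put these point gaps in more concrete terms, here is a minimal sketch that converts a rating gap into an expected head-to-head win rate. It assumes Arena scores sit on the conventional Elo-style scale (base 10, 400-point scale) and it ignores ties; the function name `expected_win_rate` is only illustrative and does not come from the LMSYS data.

```python
def expected_win_rate(gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated side.

    Assumes the standard Elo-style logistic curve (base 10, 400-point
    scale) and no ties. This is an assumption about how to read the
    scores, not something stated in the LMSYS dataset.
    """
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


# Gaps mentioned in this article; 9 stands in for "single digits".
for gap in (250, 100, 40, 9):
    print(f"gap +{gap}: {expected_win_rate(gap):.1%}")
```

Under these assumptions, a +250 gap corresponds to the stronger side winning roughly 81% of matchups, +100 to about 64%, +40 to about 56%, and a single-digit gap to barely better than a coin flip.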

Expert Prompt: +40 Still Held by Proprietary

Expert Prompt is the hardest Arena. Proprietary models still maintain a +40 lead here.

LMSYS's own words: "Expert prompts are the toughest challenge for open models."

Who's Driving This

DeepSeek R1's overtake in early 2025 wasn't accidental: its MoE architecture and dramatically lower inference costs pushed open source cost-effectiveness to a new level. Qwen 3.6's performance on the Intelligence Index and Kimi K2.6's SWE-bench results are further proof.

My Take

If you're doing model selection, this data says one thing: open source models are on their way to becoming the default option.
