ChaoBro

OpenDeepThink: Using Voting Instead of Judgment to Boost Gemini's Codeforces Elo by 405 Points

Letting LLMs Score Themselves: Is It Really Reliable?

Over the past year, the mainstream approach to enhancing reasoning capabilities has been to think deeper—pushing models further along a single reasoning path. Chain-of-thought in the o1 series and various test-time compute scaling methods are fundamentally built on this trajectory.

But OpenDeepThink asks a more fundamental question: If you can't think deep enough, could you just explore multiple directions and then pick the best one?

The answer is yes. But the real question is—how do you choose?

The Selection Bottleneck: Why Picking the Best Is Harder Than Generating One

When you generate 50 candidate answers in parallel, you need a judge to select the best one.

Intuitively, you might just let the LLM act as its own judge. However, the paper highlights a critical issue: pointwise judging is noisy and biased. When scoring a single answer, the LLM's criteria are unstable, sensitive to phrasing, and easily misled by superficial fluency.

OpenDeepThink's solution employs the Bradley-Terry model, a statistical model for paired comparisons that is widely used in sports ranking. Instead of directly scoring answers, it has the model perform pairwise comparisons: "Between A and B, which is better?" Then all comparison results are aggregated into a global ranking.

Think of it as replacing "judge scoring" with "tournament matches": answers play head-to-head games, and the final standings come from a strength estimate fitted to all the win/loss records rather than from a raw tally of points.
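To make the aggregation step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from a list of (winner, loser) results. It uses the textbook minorization-maximization update, not necessarily the fitting procedure the paper uses. The function names and the `prior` smoothing term are assumptions of this sketch.

```python
def fit_bradley_terry(comparisons, n_items, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Illustrative only: `prior` adds a small virtual win in both directions
    for every pair that was actually compared, so an undefeated candidate
    does not drift toward infinite strength (an assumption of this sketch).
    """
    wins = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    # Smooth only the pairs that were observed at least once.
    for i in range(n_items):
        for j in range(i + 1, n_items):
            if wins[i][j] + wins[j][i] > 0:
                wins[i][j] += prior
                wins[j][i] += prior

    strength = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n_items)
                if j != i
            )
            # Candidates that were never compared keep their current strength.
            updated.append(total_wins / denom if denom > 0 else strength[i])
        # Strengths are only defined up to scale, so renormalize each pass.
        scale = n_items / sum(updated)
        strength = [s * scale for s in updated]
    return strength


def rank(strength):
    """Return candidate indices ordered from strongest to weakest."""
    return sorted(range(len(strength)), key=lambda i: strength[i], reverse=True)
```

Calling `rank(fit_bradley_terry(results, n))` on the judge's pairwise verdicts gives the global ordering. Because every strength is fitted against all records at once, a win over a strong candidate counts for more than a win over a weak one.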

Evolutionary Iteration

Once the ranking is in, the system doesn't simply keep the winners as they are. The top 75% of candidates are "mutated", using the natural-language critiques produced during comparison as modification instructions. The bottom 25% are eliminated.

In the next round, the new candidate set enters the cycle of pairwise comparison, ranking, and mutation again.

This process repeats 8 times, taking approximately 27 minutes of wall-clock time. The result: Gemini 3.1 Pro's Codeforces Elo improved by 405 points over the baseline.
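The rank, cull, mutate cycle can be sketched in a few lines. This is a rough outline under stated assumptions rather than the paper's implementation: `rank_with_critiques` and `mutate` are hypothetical callables standing in for the LLM-backed steps, and the sketch simply lets the pool shrink each round, since how (or whether) the population is refilled is not covered here.

```python
def evolve(candidates, rank_with_critiques, mutate, rounds=8, keep_fraction=0.75):
    """Run the rank -> cull -> mutate cycle for a fixed number of rounds.

    rank_with_critiques(pool) must return (candidate, critique) pairs,
    best first. mutate(candidate, critique) must return a revised candidate.
    Both are hypothetical hooks for the model calls.
    """
    pool = list(candidates)
    for _ in range(rounds):
        ranked = rank_with_critiques(pool)
        # Keep the top fraction (at least one); the rest are eliminated.
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]
        # Each survivor is rewritten using the critique it received.
        pool = [mutate(candidate, critique) for candidate, critique in survivors]
    return pool
```

The returned pool is the output of the last mutation step, so a caller would rank it once more to pick the final answer.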

An Interesting Finding: Works on Objective Problems, Reverses on Subjective Ones

On the HLE (Humanity's Last Exam) multi-domain benchmark, the paper reports a pattern worth noting: gains are concentrated in objectively verifiable domains, while subjective domains can even show a reversal.

This points to the core dependency of Bradley-Terry comparison: the comparison itself needs an objective standard. If answers lack clear criteria for "good" versus "bad", pairwise comparison introduces noise instead of signal.
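One way to see why judge reliability matters is the two-candidate case. If the judge prefers the truly better answer with probability p, the Bradley-Terry log-strength gap it implies is log(p / (1 - p)). The helper below is only an illustration of that identity; the function name is made up for this sketch, and it is not an analysis from the paper.

```python
import math


def implied_log_strength_gap(p_correct):
    """Log-strength gap (better minus worse) implied by judge accuracy p.

    In a two-item Bradley-Terry model, P(better wins) = s_b / (s_b + s_w),
    so log(s_b / s_w) = log(p / (1 - p)).
    """
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must be strictly between 0 and 1")
    return math.log(p_correct / (1.0 - p_correct))
```

At p = 0.5 the gap is exactly zero, so the ranking carries no information. Below 0.5 the sign flips and the ranking actively prefers the worse answer, which is at least consistent with the reversal seen in subjective domains.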

The CF-73 Dataset

The paper also released a carefully curated Codeforces evaluation set: 73 problems, each annotated by International Grandmasters, with a 99% agreement rate between local evaluation and official verdicts.

For people working on reasoning benchmarks, this dataset is more reliable than most public benchmarks—because the annotators are people who can actually solve these problems.

Cross-Model Transfer

One highlight of OpenDeepThink is that the pipeline transfers across models of different capability levels without re-tuning. This means it's not a trick specific to one model, but a general reasoning framework.

Assessment

OpenDeepThink's core contribution isn't a specific technical breakthrough, but a shift in perspective: when "thinking deeper" hits a bottleneck, "thinking wider" + "choosing better" may be a more cost-effective path.

The idea of replacing pointwise judgment with Bradley-Terry comparison has implications for any scenario requiring LLM self-evaluation—from code generation to paper review, from option selection to dialogue quality control.


Primary sources: