Qwen3.6 Self-Correction Trap: Why More "Thinking" Leads to Worse Results

Core Conclusion

The Qwen3.5/3.6 series exhibits a counterintuitive phenomenon in reasoning mode: more “thinking” does not mean better results. Specifically, during the Self-Correction phase the model’s thinking token count explodes by 4-6x, yet final conclusion quality barely improves; sometimes the model even self-doubts its way out of a correct answer.

This isn’t unique to Qwen, but the effect is particularly pronounced in Qwen models. For users paying per token, it translates directly into wasted money.

Problem Description

Typical Scenario

One developer’s observation:

“Qwen3.5/3.6’s transition thinking is basically all in the Self-Correction phase. The initial reasoning conclusion is already quite solid, but once self-correction kicks in, the model starts frantically looking for angles to question whether it misunderstood — resulting in several times more thinking content with almost no improvement to the conclusion.”

Data Comparison

| Phase | Token Consumption | Conclusion Quality | Typical Behavior |
|---|---|---|---|
| Initial reasoning | ~500 tokens | 85-90/100 | Directly gives a reasonable answer |
| Self-Correction | ~2000-3000 tokens | 85-92/100 | Repeatedly questions itself, barely improves the conclusion |

Key finding: Self-Correction phase consumes 4-6x the tokens of initial reasoning, but conclusion quality improvement is typically under 5%.
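The economics of this can be made concrete with some back-of-envelope arithmetic using the numbers above. The per-token price below is a hypothetical placeholder, not a real Qwen price:

```python
# Back-of-envelope check of the marginal-return claim, using the numbers
# from the table above. PRICE_PER_1K_TOKENS is a hypothetical placeholder.
PRICE_PER_1K_TOKENS = 0.002  # hypothetical $ per 1K thinking tokens

def cost_per_quality_point(tokens, quality_points):
    """Dollars spent per quality point delivered."""
    return (tokens / 1000) * PRICE_PER_1K_TOKENS / quality_points

# Initial reasoning: ~500 tokens buys ~87.5/100 quality.
initial = cost_per_quality_point(500, 87.5)

# Self-Correction: ~2500 extra tokens buys ~3.5 extra points at best.
correction = cost_per_quality_point(2500, 3.5)

print(f"Self-Correction is {correction / initial:.0f}x less cost-effective")
```

Under these (illustrative) prices, each quality point bought during Self-Correction costs two orders of magnitude more than a point bought during initial reasoning.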

Why Does This Happen?

Qwen’s self-correction mechanism has a design flaw:

  1. Over-doubt tendency: The model is trained to “always double-check” but lacks the ability to judge “whether checking is actually needed”
  2. No confidence assessment: The model doesn’t know its initial conclusion is already good enough, so it mechanically enters the correction process
  3. Correction ≠ improvement: Often “correction” just repeats already-correct reasoning steps or introduces unnecessary complexity

Test Cases

Case 1: Math Problem

Prompt: “Calculate 1234 × 5678”

| Phase | Content | Tokens |
|---|---|---|
| Initial reasoning | Correctly calculates, arrives at 7,006,652 | ~200 |
| Self-Correction | “Wait, let me re-verify each digit’s multiplication… hmm, the first digit is… the second digit… (repeats the verification process)… oh no, maybe I misunderstood the question…” | ~1500 |
| Final conclusion | Still 7,006,652 | — |

Conclusion change: None. The initial answer was correct, but Self-Correction wasted 7x the tokens.
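For the record, the initial answer really did need no correction:

```python
# One-line sanity check: the model's first answer was already correct.
result = 1234 * 5678
print(result)  # 7006652, matching the initial conclusion
```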

Case 2: Code Generation

Prompt: “Write a Python function to filter even numbers from a list”

| Phase | Content | Tokens |
|---|---|---|
| Initial reasoning | Gives `[x for x in lst if x % 2 == 0]` | ~300 |
| Self-Correction | “Is this approach optimal? Should I consider performance? What if the list is very large? Should I use `filter`? But `filter` is less readable than a list comprehension…” | ~2000 |
| Final conclusion | Still the list comprehension | — |

Conclusion change: None. The code was already optimal, but the model fell into “over-optimization anxiety.”
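Written out as a complete function (the name `filter_evens` is ours, not the model's), the answer the model produced and then agonized over is simply:

```python
def filter_evens(lst):
    """Return the even numbers from lst, preserving order."""
    return [x for x in lst if x % 2 == 0]

print(filter_evens([1, 2, 3, 4, 5, 6]))  # prints [2, 4, 6]
```

The list comprehension is the idiomatic choice here; `filter` or a generator only becomes worth discussing for very large inputs, which the prompt never mentioned. There was genuinely nothing to correct.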

This Isn’t Just Qwen’s Problem

In fact, this is a common issue across current reasoning models:

| Model | Self-Correction Issue | Severity |
|---|---|---|
| Qwen3.6 | Over-reflection, 4-6x token inflation | 🔴 Severe |
| GPT-5.5 | Occasional over-reasoning, 2-3x token inflation | 🟡 Moderate |
| Claude Opus 4.7 | Relatively restrained, but still has redundancy | 🟡 Moderate |
| DeepSeek V4 | High correction efficiency, less redundancy | 🟢 Mild |

Qwen’s problem is more severe, possibly related to its training data containing large amounts of “repeatedly double-checking” human reasoning patterns.

Action Recommendations

For Qwen Users

  1. Turn off reasoning mode: For simple tasks (classification, extraction, translation), use non-reasoning mode directly — costs can drop 80%
  2. Manual truncation: If you see the model “frantically self-questioning,” manually truncate and adopt the initial conclusion
  3. Use Qwen3.6-Plus: The Plus version has better reasoning efficiency than Max — more cost-effective for tasks not requiring extreme reasoning

For Developers

If you’re using Qwen’s API, you can control this:

```python
# Assumes an OpenAI-compatible client pointed at Qwen's endpoint.
# `thinking_budget` is passed as in the original snippet; some endpoints
# expect it inside `extra_body` instead, so check your provider's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # example endpoint
)

messages = [{"role": "user", "content": "Classify this support ticket: ..."}]  # placeholder

# Turn off reasoning mode (if deep reasoning isn't needed)
response = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=messages,
    thinking_budget=0,  # disable chain of thought entirely
)

# Or cap the thinking budget instead of disabling it
response = client.chat.completions.create(
    model="qwen3.6-max",
    messages=messages,
    thinking_budget=512,  # hard cap on thinking tokens
)
```

For the Tongyi Team (if you’re reading this)

Suggestions for optimizing the Self-Correction trigger mechanism:

  • Add confidence threshold: Skip or simplify Self-Correction when initial reasoning confidence exceeds 90%
  • Introduce early termination: Stop immediately when corrected conclusion matches the initial one
  • Distinguish task complexity: Don’t trigger deep correction for simple tasks
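The three rules above amount to a small gating function in the reasoning loop. The sketch below is purely illustrative; all names and thresholds are ours, and the real fix would live inside the model’s decoding logic, not in user code:

```python
# Illustrative sketch of the three proposed trigger rules as a gate that
# decides whether another Self-Correction round is worth running.
def should_keep_correcting(confidence, initial_answer, corrected_answer,
                           task_is_simple, rounds):
    if task_is_simple:                      # rule 3: no deep correction for simple tasks
        return False
    if confidence >= 0.90:                  # rule 1: confident enough, skip correction
        return False
    if corrected_answer == initial_answer:  # rule 2: correction converged, stop early
        return False
    return rounds < 3                       # safety cap on correction rounds
```

Applied to Case 1 above, rule 2 alone would have stopped the loop after one round: the “corrected” answer matched the initial 7,006,652.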

Landscape Judgment

This problem reflects a core challenge facing reasoning models in 2026: how to make models “know when to stop.”

Current reasoning models all assume “more thinking is better,” but this doesn’t hold economically — each additional thinking token has a cost, and when marginal returns drop below zero, continued thinking is waste.

The competitive focus of next-generation reasoning models may shift from “how deep can it think” to “knowing when to stop thinking.” In this regard, DeepSeek V4’s performance already hints at a better direction.