Core Conclusion
The Qwen3.5/3.6 series exhibits a counterintuitive phenomenon in reasoning mode: more “thinking” does not mean better results. Specifically, during the Self-Correction phase the model’s thinking token count balloons 4-6x, yet the final conclusion barely improves; sometimes the model even doubts itself away from a correct answer.
This isn’t unique to Qwen, but the effect is particularly pronounced there. For users paying per token, it translates directly into wasted spend.
Problem Description
Typical Scenario
One developer’s observation:
“Qwen3.5/3.6’s excess thinking is basically all in the Self-Correction phase. The initial reasoning conclusion is already quite solid, but once self-correction kicks in, the model starts frantically looking for angles from which to question whether it misunderstood, producing several times more thinking content with almost no improvement to the conclusion.”
Data Comparison
| Phase | Token Consumption | Conclusion Quality | Typical Behavior |
|---|---|---|---|
| Initial reasoning | ~500 tokens | 85-90/100 | Directly gives reasonable answer |
| Self-Correction | ~2000-3000 tokens | 85-92/100 | Repeatedly questions itself, barely improves conclusion |
Key finding: the Self-Correction phase consumes 4-6x the tokens of initial reasoning, but the quality gain is typically under 5 points on the 100-point scale.
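To put the table in cost terms, here is a back-of-the-envelope calculation. The per-token price is a placeholder; substitute your model’s actual output-token rate:

```python
# Back-of-the-envelope cost of Self-Correction, using the table above.
PRICE_PER_1K = 0.002          # USD per 1K output tokens; placeholder rate
initial_tokens = 500
correction_tokens = 2500      # midpoint of the 2000-3000 range
quality_gain = 2              # points, roughly 90 -> 92 at best

ratio = correction_tokens / initial_tokens
cost = correction_tokens / 1000 * PRICE_PER_1K
print(f"Correction uses {ratio:.0f}x the tokens of initial reasoning")
print(f"Each extra quality point costs ${cost / quality_gain:.4f}")
# -> 5x the tokens; at this rate, $0.0025 per quality point gained
```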
Why Does This Happen?
Qwen’s self-correction mechanism has a design flaw:
- Over-doubt tendency: The model is trained to “always double-check” but lacks the ability to judge “whether checking is actually needed”
- No confidence assessment: The model doesn’t know its initial conclusion is already good enough, so it mechanically enters the correction process
- Correction ≠ improvement: Often “correction” just repeats already-correct reasoning steps or introduces unnecessary complexity
Test Cases
Case 1: Math Problem
Prompt: “Calculate 1234 × 5678”
| Phase | Content | Tokens |
|---|---|---|
| Initial reasoning | Correctly calculates, arrives at 7,006,652 | ~200 |
| Self-Correction | “Wait, let me re-verify each digit’s multiplication… hmm, the first digit is… the second digit… (repeats the verification process)… oh no, maybe I misunderstood the question…” | ~1500 |
| Final conclusion | Still 7,006,652 | - |
Conclusion change: none. The initial answer was correct, yet Self-Correction burned ~7.5x the tokens (1500 vs. 200).
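The arithmetic can be verified in one line; the pre-correction answer was already right:

```python
assert 1234 * 5678 == 7_006_652  # the initial answer holds; correction added nothing
```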
Case 2: Code Generation
Prompt: “Write a Python function to filter even numbers from a list”
| Phase | Content | Tokens |
|---|---|---|
| Initial reasoning | Gives `[x for x in lst if x % 2 == 0]` | ~300 |
| Self-Correction | “Is this approach optimal? Should I consider performance? What if the list is very large? Should I use `filter`? But `filter` is less readable than a list comprehension…” | ~2000 |
| Final conclusion | Still the list comprehension | - |
Conclusion change: none. The code was already idiomatic and correct, but the model fell into “over-optimization anxiety.”
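For reference, here is the initial answer wrapped as a complete, runnable function; there was nothing left for Self-Correction to improve:

```python
def filter_evens(lst: list[int]) -> list[int]:
    """Return the even numbers from lst, preserving order."""
    return [x for x in lst if x % 2 == 0]

assert filter_evens([1, 2, 3, 4, 5, 6]) == [2, 4, 6]
assert filter_evens([]) == []
```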
This Isn’t Just Qwen’s Problem
In fact, this is a common issue across current reasoning models:
| Model | Self-Correction Issue | Severity |
|---|---|---|
| Qwen3.6 | Over-reflection, token inflation 4-6x | 🔴 Severe |
| GPT-5.5 | Occasional over-reasoning, token inflation 2-3x | 🟡 Moderate |
| Claude Opus 4.7 | Relatively restrained, but still has redundancy | 🟡 Moderate |
| DeepSeek V4 | High correction efficiency, less redundancy | 🟢 Mild |
Qwen’s problem is more severe, possibly because its training data contains a large amount of human reasoning that repeatedly double-checks itself.
Action Recommendations
For Qwen Users
- Turn off reasoning mode: For simple tasks (classification, extraction, translation), use non-reasoning mode directly; costs can drop by roughly 80%
- Manual truncation: If you see the model “frantically self-questioning,” truncate manually and adopt the initial conclusion (see the streaming sketch in the developer section below)
- Use Qwen3.6-Plus: The Plus version has better reasoning efficiency than Max — more cost-effective for tasks not requiring extreme reasoning
For Developers
If you’re using Qwen’s API, you can control this:
```python
# A minimal sketch using an OpenAI-compatible client; the client setup
# and base_url are illustrative placeholders for your actual endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
messages = [{"role": "user", "content": "Classify this ticket: ..."}]

# Turn off reasoning mode (if deep reasoning isn't needed). Vendor
# extensions go through extra_body; the exact parameter name may vary
# by provider and API version.
response = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=messages,
    extra_body={"thinking_budget": 0},  # disable chain of thought
)

# Or limit the thinking budget
response = client.chat.completions.create(
    model="qwen3.6-max",
    messages=messages,
    extra_body={"thinking_budget": 512},  # cap thinking tokens
)
```
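For the “manual truncation” recommendation above, a client-side version is possible when streaming. This sketch assumes the endpoint streams the thinking trace as `delta.reasoning_content`, which some OpenAI-compatible providers do; check your provider’s field name before relying on it:

```python
# Client-side truncation of runaway thinking, reusing `client` and
# `messages` from above. If the thinking phase exceeds the budget we
# abort the stream; in practice you would then retry the request with
# a hard thinking_budget cap rather than wait out the correction loop.
MAX_THINKING_CHARS = 2000  # hypothetical budget

stream = client.chat.completions.create(
    model="qwen3.6-max",
    messages=messages,
    stream=True,
)

thinking_chars = 0
answer_parts = []
for chunk in stream:
    if not chunk.choices:
        continue  # skip keep-alive/usage chunks
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)  # provider-specific field
    if reasoning:
        thinking_chars += len(reasoning)
        if thinking_chars > MAX_THINKING_CHARS:
            stream.close()  # cut off the Self-Correction spiral
            break
    elif delta.content:
        answer_parts.append(delta.content)

print("".join(answer_parts))
```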
For the Tongyi Team (if you’re reading this)
Suggestions for optimizing the Self-Correction trigger mechanism (a minimal sketch follows the list):
- Add confidence threshold: Skip or simplify Self-Correction when initial reasoning confidence exceeds 90%
- Introduce early termination: Stop immediately when corrected conclusion matches the initial one
- Distinguish task complexity: Don’t trigger deep correction for simple tasks
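To make the three gates concrete, here is a sketch at the orchestration level. All names, thresholds, and the confidence signal are hypothetical; a real fix would live inside the model’s decoding policy:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InitialAnswer:
    conclusion: str
    confidence: float  # hypothetical signal, e.g. calibrated from logprobs

CONFIDENCE_THRESHOLD = 0.90  # suggestion 1: skip correction above this

def self_correct_with_gates(
    initial: InitialAnswer,
    task_is_simple: bool,
    correct_once: Callable[[str], str],
) -> str:
    # Suggestion 3: simple tasks never trigger deep correction.
    if task_is_simple:
        return initial.conclusion
    # Suggestion 1: confident initial answers skip correction entirely.
    if initial.confidence >= CONFIDENCE_THRESHOLD:
        return initial.conclusion
    # Suggestion 2: terminate as soon as a correction pass agrees with
    # the current conclusion, instead of looping indefinitely.
    current = initial.conclusion
    for _ in range(3):  # hard cap on correction rounds
        corrected = correct_once(current)
        if corrected == current:
            break  # early termination: correction confirmed the answer
        current = corrected
    return current
```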
Landscape Judgment
This problem reflects a core challenge facing reasoning models in 2026: how to make models “know when to stop.”
Current reasoning models all assume that “more thinking is better,” but this doesn’t hold economically: each additional thinking token has a price, and once the marginal quality gain falls below that price, continued thinking is pure waste.
The competitive focus of next-generation reasoning models may shift from “how deep can it think” to “knowing when to stop thinking.” In this regard, DeepSeek V4’s performance already hints at a better direction.