Core Conclusion
The Qwen3.5/3.6 series exhibits a counterintuitive phenomenon in reasoning mode: more “thinking” does not mean better results. Specifically, during the Self-Correction phase the model’s thinking token count balloons 4-6x, yet the final conclusion barely improves; sometimes the model even doubts itself away from a correct answer.
This isn’t unique to Qwen, but the effect is particularly pronounced there. For users paying per token, it translates directly into wasted spend.
Problem Description
Typical Scenario
One developer’s observation:
“Qwen3.5/3.6’s excess thinking is basically all in the Self-Correction phase. The initial reasoning conclusion is already quite solid, but once self-correction kicks in, the model starts frantically looking for angles from which to question whether it misunderstood, producing several times more thinking content with almost no improvement to the conclusion.”
Data Comparison
| Phase | Token Consumption | Conclusion Quality | Typical Behavior |
|---|---|---|---|
| Initial reasoning | ~500 tokens | 85-90/100 | Directly gives reasonable answer |
| Self-Correction | ~2000-3000 tokens | 85-92/100 | Repeatedly questions itself, barely improves conclusion |
Key finding: the Self-Correction phase consumes 4-6x the tokens of initial reasoning, but the quality gain is typically under 5 points on the 100-point scale.
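To put the table in cost terms, here is a back-of-the-envelope calculation. The per-token price is a placeholder; substitute your model’s actual output-token rate:

```python
# Back-of-the-envelope cost of Self-Correction, using the table above.
PRICE_PER_1K = 0.002          # USD per 1K output tokens; placeholder rate
initial_tokens = 500
correction_tokens = 2500      # midpoint of the 2000-3000 range
quality_gain = 2              # points, roughly 90 -> 92 at best

ratio = correction_tokens / initial_tokens
cost = correction_tokens / 1000 * PRICE_PER_1K
print(f"Correction uses {ratio:.0f}x the tokens of initial reasoning")
print(f"Each extra quality point costs ${cost / quality_gain:.4f}")
# -> 5x the tokens; at this rate, $0.0025 per quality point gained
```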
Why Does This Happen?
Qwen’s self-correction mechanism has a design flaw:
- Over-doubt tendency: The model is trained to “always double-check” but lacks the ability to judge “whether checking is actually needed”
- No confidence assessment: The model doesn’t know its initial conclusion is already good enough, so it mechanically enters the correction process
- Correction ≠ improvement: Often “correction” just repeats already-correct reasoning steps or introduces unnecessary complexity
Test Cases
Case 1: Math Problem
Prompt: “Calculate 1234 × 5678”
| Phase | Content | Tokens |
|---|---|---|
| Initial reasoning | Correctly calculates, arrives at 7,006,652 | ~200 |
| Self-Correction | “Wait, let me re-verify each digit’s multiplication… hmm, the first digit is… the second digit… (repeats the verification process)… oh no, maybe I misunderstood the question…” | ~1500 |
| Final conclusion | Still 7,006,652 | - |
Conclusion change: none. The initial answer was correct, yet Self-Correction burned ~7.5x the tokens (1500 vs. 200).
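The arithmetic can be verified in one line; the pre-correction answer was already right:

```python
assert 1234 * 5678 == 7_006_652  # the initial answer holds; correction added nothing
```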
Case 2: Code Generation
Prompt: “Write a Python function to filter even numbers from a list”
| Phase | Content | Tokens |
|---|---|---|
| Initial reasoning | Gives `[x for x in lst if x % 2 == 0]` | ~300 |
| Self-Correction | “Is this approach optimal? Should I consider performance? What if the list is very large? Should I use `filter`? But `filter` is less readable than a list comprehension…” | ~2000 |
| Final conclusion | Still the list comprehension | - |
Conclusion change: none. The code was already idiomatic and correct, but the model fell into “over-optimization anxiety.”
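For reference, here is the initial answer wrapped as a complete, runnable function; there was nothing left for Self-Correction to improve:

```python
def filter_evens(lst: list[int]) -> list[int]:
    """Return the even numbers from lst, preserving order."""
    return [x for x in lst if x % 2 == 0]

assert filter_evens([1, 2, 3, 4, 5, 6]) == [2, 4, 6]
assert filter_evens([]) == []
```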
This Isn’t Just Qwen’s Problem
In fact, this is a common issue across current reasoning models:
| Model | Self-Correction Issue | Severity |
|---|---|---|
| Qwen3.6 | Over-reflection, token inflation 4-6x | 🔴 Severe |
| GPT-5.5 | Occasional over-reasoning, token inflation 2-3x | 🟡 Moderate |
| Claude Opus 4.7 | Relatively restrained, but still has redundancy | 🟡 Moderate |
| DeepSeek V4 | High correction efficiency, less redundancy | 🟢 Mild |
Qwen’s problem is more severe, possibly because its training data contains a large amount of human reasoning that repeatedly double-checks itself.
Action Recommendations
For Qwen Users
- Turn off reasoning mode: For simple tasks (classification, extraction, translation), use non-reasoning mode directly; costs can drop by roughly 80%
- Manual truncation: If you see the model “frantically self-questioning,” truncate manually and adopt the initial conclusion (see the streaming sketch in the developer section below)
- Use Qwen3.6-Plus: The Plus version has better reasoning efficiency than Max — more cost-effective for tasks not requiring extreme reasoning
For Developers
If you’re using Qwen’s API, you can control this:
```python
# A minimal sketch using an OpenAI-compatible client; the client setup
# and base_url are illustrative placeholders for your actual endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
messages = [{"role": "user", "content": "Classify this ticket: ..."}]

# Turn off reasoning mode (if deep reasoning isn't needed). Vendor
# extensions go through extra_body; the exact parameter name may vary
# by provider and API version.
response = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=messages,
    extra_body={"thinking_budget": 0},  # disable chain of thought
)

# Or limit the thinking budget
response = client.chat.completions.create(
    model="qwen3.6-max",
    messages=messages,
    extra_body={"thinking_budget": 512},  # cap thinking tokens
)
```
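For the “manual truncation” recommendation above, a client-side version is possible when streaming. This sketch assumes the endpoint streams the thinking trace as `delta.reasoning_content`, which some OpenAI-compatible providers do; check your provider’s field name before relying on it:

```python
# Client-side truncation of runaway thinking, reusing `client` and
# `messages` from above. If the thinking phase exceeds the budget we
# abort the stream; in practice you would then retry the request with
# a hard thinking_budget cap rather than wait out the correction loop.
MAX_THINKING_CHARS = 2000  # hypothetical budget

stream = client.chat.completions.create(
    model="qwen3.6-max",
    messages=messages,
    stream=True,
)

thinking_chars = 0
answer_parts = []
for chunk in stream:
    if not chunk.choices:
        continue  # skip keep-alive/usage chunks
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)  # provider-specific field
    if reasoning:
        thinking_chars += len(reasoning)
        if thinking_chars > MAX_THINKING_CHARS:
            stream.close()  # cut off the Self-Correction spiral
            break
    elif delta.content:
        answer_parts.append(delta.content)

print("".join(answer_parts))
```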
For the Tongyi Team (if you’re reading this)
Suggestions for optimizing the Self-Correction trigger mechanism (a minimal sketch follows the list):
- Add confidence threshold: Skip or simplify Self-Correction when initial reasoning confidence exceeds 90%
- Introduce early termination: Stop immediately when corrected conclusion matches the initial one
- Distinguish task complexity: Don’t trigger deep correction for simple tasks
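To make the three gates concrete, here is a sketch at the orchestration level. All names, thresholds, and the confidence signal are hypothetical; a real fix would live inside the model’s decoding policy:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InitialAnswer:
    conclusion: str
    confidence: float  # hypothetical signal, e.g. calibrated from logprobs

CONFIDENCE_THRESHOLD = 0.90  # suggestion 1: skip correction above this

def self_correct_with_gates(
    initial: InitialAnswer,
    task_is_simple: bool,
    correct_once: Callable[[str], str],
) -> str:
    # Suggestion 3: simple tasks never trigger deep correction.
    if task_is_simple:
        return initial.conclusion
    # Suggestion 1: confident initial answers skip correction entirely.
    if initial.confidence >= CONFIDENCE_THRESHOLD:
        return initial.conclusion
    # Suggestion 2: terminate as soon as a correction pass agrees with
    # the current conclusion, instead of looping indefinitely.
    current = initial.conclusion
    for _ in range(3):  # hard cap on correction rounds
        corrected = correct_once(current)
        if corrected == current:
            break  # early termination: correction confirmed the answer
        current = corrected
    return current
```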
Landscape Judgment
This problem reflects a core challenge facing reasoning models in 2026: how to make models “know when to stop.”
Current reasoning models all assume that “more thinking is better,” but this doesn’t hold economically: each additional thinking token has a price, and once the marginal quality gain falls below that price, continued thinking is pure waste.
The competitive focus of next-generation reasoning models may shift from “how deep can it think” to “knowing when to stop thinking.” In this regard, DeepSeek V4’s performance already hints at a better direction.