Bottom Line
Qwen3.6 Heretic 35B is the hottest community fine-tune right now. Based on Alibaba’s Qwen3.6-35B, it significantly reduces safety refusal rates while maintaining the original model’s intelligence level. Quantized versions run on consumer-grade RTX 3090/4090 GPUs with a 260K-token context window for Agent tasks.
What Happened
In late April, the community released Qwen3.6 Heretic 35B, a targeted fine-tune of the Qwen3.6-35B base model. Key specs:
| Dimension | Qwen3.6-35B Original | Qwen3.6 Heretic 35B |
|---|---|---|
| Intelligence | Baseline | Maintained |
| Safety Refusal Rate | High | Significantly reduced |
| Max Context | 260K tokens | 260K tokens |
| Hardware | Multi-GPU/A100 | RTX 3090/4090 (quantized) |
| Agent Tool Use | Supported | Smoother |
| License | Open | Open |
On the DGX-Spark leaderboard, quantized builds of Qwen3.6-35B hit 95, 92, and 73 tokens/s (Q4_K_M, Q5_K_M, and Q6_K respectively), outperforming gpt-oss-120B and gemma4-26B.
Why “Fewer Refusals” Matters
For developers, the original Qwen3.6 triggers excessive safety refusals on edge cases, which is fatal in Agent workflows:
- Code Generation: System-level or network request code gets refused
- Data Processing: Data cleaning tasks with sensitive field names get blocked
- Agent Tool Calling: Certain MCP tool parameter combinations trigger safety filters
Heretic dramatically reduces these “false positives” through community fine-tuning, without degrading core capabilities:
- More stable Agent workflows: Fewer task interruptions from refusals
- Better debugging: No need to rewrite prompts to bypass safety filters
- Local deployment friendly: Consumer GPUs suffice, no cloud API needed
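Task interruptions from refusals are usually handled with a detect-and-retry wrapper around each model call. A minimal sketch: the refusal markers are an illustrative assumption (not an official taxonomy), and `generate` stands in for whatever callable your agent framework uses to get a completion.

```python
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "against my guidelines")


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does a response read like a safety refusal?

    The marker list above is illustrative and incomplete.
    """
    text = response.lower()
    return any(m in text for m in REFUSAL_MARKERS)


def call_with_retry(generate, prompt: str, max_tries: int = 3) -> str:
    """Retry a model call when the output looks like a refusal.

    `generate` is any callable mapping a prompt string to a response
    string; real agent frameworks attach richer context per retry.
    """
    last = ""
    for attempt in range(max_tries):
        last = generate(prompt)
        if not looks_like_refusal(last):
            return last
        prompt = f"(attempt {attempt + 2}) {prompt}"  # naive rephrase
    return last
```

With a lower base refusal rate, this loop fires far less often, which is where the "more stable Agent workflows" claim comes from.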
Deployment Guide
Quantization Options
| Format | VRAM | Speed | Precision Loss |
|---|---|---|---|
| Q4_K_M | ~20GB | 95 tps | Minimal |
| Q5_K_M | ~22GB | 92 tps | Negligible |
| Q6_K | ~26GB | 73 tps | Almost none |
Both the RTX 4090 (24GB) and RTX 3090 (24GB) fit Q4_K_M or Q5_K_M; Q6_K’s ~26GB footprint exceeds 24GB of VRAM.
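The VRAM column roughly tracks bits per weight. A back-of-the-envelope sketch for estimating weight memory at each quant level, assuming approximate bits-per-weight constants for GGUF quants (rough community figures; real file sizes, and the table above, will differ somewhat) and ignoring KV-cache and runtime overhead:

```python
# Approximate bits per weight for common GGUF quants (rough assumption;
# exact values vary by tensor mix and llama.cpp version).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6}


def est_vram_gb(params_b: float, quant: str) -> float:
    """Estimate model-weight VRAM in GB for `params_b` billion parameters.

    Ignores KV cache and runtime buffers, so real usage is higher.
    """
    bits = BITS_PER_WEIGHT[quant]
    return params_b * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB


for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{est_vram_gb(35, q):.0f} GB weights")
```

Long contexts (the 260K window) add KV-cache memory on top of this, which is why a 24GB card can still run out of headroom at Q5_K_M.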
Recommended Stack
- LM Studio: Auto-discovers models, zero-config loading
- Ollama: One command: `ollama run qwen3.6-heretic-35b`
- vLLM: Production deployment, high concurrency
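Whichever backend you pick, both Ollama and vLLM expose an OpenAI-compatible `/v1/chat/completions` endpoint, so client code is interchangeable. A minimal stdlib-only sketch; the model name and port are assumptions (11434 is Ollama’s default, vLLM typically serves on 8000):

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "qwen3.6-heretic-35b",
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request for a local server.

    Model tag and base URL are assumptions; adjust to your deployment.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# With a server running:
# resp = urllib.request.urlopen(build_chat_request("Summarize this log file"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```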
Landscape Assessment
Qwen3.6 Heretic reflects two trends:
- Community fine-tune ecosystem maturing: The last mile from “usable” to “great” is filled by the community
- Consumer GPU inference going mainstream: 35B-class models now run smoothly on single consumer GPUs
Compared to peers:
- Kimi K2.6 (1T MoE, 32B active) focuses on Agent swarm capabilities
- DeepSeek-V4-Pro wins on API cost-effectiveness
- Qwen3.6 Heretic differentiates on local deployment + low refusal rate
Action Items
- RTX 3090/4090 owners: Deploy now, replace your existing Qwen3.6 base
- Agent developers: Heretic is more stable in tool-calling scenarios
- Enterprise users: Note Heretic is a community fine-tune with adjusted safety policies — assess compliance risk
- A/B test: Compare with original Qwen3.6-35B in your specific use cases
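For the A/B test, the metric that matters here is refusal rate over your own prompt set. A minimal scoring sketch; the marker list is an illustrative assumption, and `run_prompts` is a hypothetical helper standing in for however you batch calls to each model:

```python
def refusal_rate(responses: list[str],
                 markers: tuple[str, ...] = ("i can't", "i cannot",
                                             "unable to")) -> float:
    """Fraction of responses matching crude refusal markers (illustrative list)."""
    if not responses:
        return 0.0
    hits = sum(any(m in r.lower() for m in markers) for r in responses)
    return hits / len(responses)


# Run the same prompt set through both models, then compare:
# base_rate = refusal_rate(run_prompts(base_model, prompts))      # hypothetical helper
# tuned_rate = refusal_rate(run_prompts(heretic_model, prompts))  # hypothetical helper
# print(f"base {base_rate:.1%} -> heretic {tuned_rate:.1%}")
```

Pair the refusal-rate delta with a capability benchmark on the same prompts, so a drop in refusals isn’t masking a drop in answer quality.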