Qwen Team Open-Sources FlashQLA: Linear Attention Kernels Deliver 2-3× Inference Speedup

The Qwen team has open-sourced FlashQLA, a low-profile but potentially impactful piece of infrastructure: a set of high-performance linear attention kernels built on TileLang.

Core Metrics

| Metric | Improvement |
| --- | --- |
| Forward inference | 2-3× speedup |
| Backward training | 2× speedup |
| Target hardware | Consumer GPUs / personal devices |
| Target scenario | Agent AI on-device deployment |

Technical Highlights

  1. Gate-driven automatic intra-card CP: the gating mechanism automatically partitions the computation for parallel execution within a single card, reducing manual tuning (a sketch of the underlying gated recurrence follows this list)
  2. Hardware-friendly algebraic restructuring: the attention math is reorganized specifically for consumer GPU memory hierarchies
  3. Built on TileLang: leverages TileLang’s abstraction layer for cross-hardware portability
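
For orientation, here is a minimal sketch of the kind of computation such kernels accelerate: a gated linear attention recurrence, where a per-step gate decays a fixed-size state matrix instead of appending to a growing KV cache. The function, shapes, and the specific (elementwise sigmoid) gating form are illustrative assumptions, not FlashQLA’s actual API or Qwen’s exact formulation.

```python
# Minimal reference sketch of a gated linear attention recurrence.
# Hypothetical formulation for illustration; not FlashQLA's API.
import torch

def gated_linear_attention(q, k, v, g):
    """Sequential reference: O(1) state per step instead of a growing KV cache.

    q, k: (T, d_k) queries / keys
    v:    (T, d_v) values
    g:    (T, d_k) per-step decay gates in (0, 1)
    """
    d_k, d_v = k.shape[1], v.shape[1]
    S = torch.zeros(d_k, d_v)              # fixed-size recurrent state
    out = torch.empty(v.shape[0], d_v)
    for t in range(q.shape[0]):
        # Decay the old state with the gate, then add the new key-value outer product.
        S = g[t].unsqueeze(1) * S + torch.outer(k[t], v[t])
        # Readout: the query attends to the compressed state.
        out[t] = q[t] @ S
    return out

# Toy usage
T, d_k, d_v = 16, 8, 8
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
g = torch.sigmoid(torch.randn(T, d_k))     # gates in (0, 1)
print(gated_linear_attention(q, k, v, g).shape)  # torch.Size([16, 8])
```

A production kernel evaluates the same recurrence in a chunked, parallel form on the GPU rather than step by step; the sequential loop above only pins down the reference semantics.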

Why It Matters

FlashQLA isn’t another “benchmark-chasing” model. It’s pure infrastructure-level optimization that plugs directly into inference engines:

  • Once the kernels are integrated into vLLM, llama.cpp, SGLang, and other mainstream inference frameworks, inference costs for Qwen’s linear attention models could fall by 2-3×
  • For on-device Agent scenarios (phones, laptops, edge devices), this speedup means models that previously couldn’t run locally now can
  • Linear attention natively supports effectively unbounded context; paired with these acceleration kernels, long-context Agents on consumer hardware become significantly more practical (see the memory sketch after this list)
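
To make the unbounded-context point concrete, here is a back-of-envelope comparison under hypothetical per-layer dimensions (d_k, d_v, and heads are illustrative assumptions, not Qwen’s actual configuration): a softmax-attention KV cache grows linearly with sequence length, while the linear attention state from the sketch above stays fixed.

```python
# Illustrative memory arithmetic with assumed dimensions; not measured numbers.
d_k, d_v, heads = 128, 128, 32
bytes_per = 2  # fp16

def kv_cache_bytes(seq_len):
    # Softmax attention: cache one key and one value vector per token per head.
    return seq_len * heads * (d_k + d_v) * bytes_per

def linear_state_bytes():
    # Linear attention: one fixed d_k x d_v state matrix per head, any length.
    return heads * d_k * d_v * bytes_per

for T in (4_096, 131_072, 1_048_576):
    print(f"T={T:>9,}: KV cache {kv_cache_bytes(T) / 2**20:>10.1f} MiB"
          f"  vs  linear state {linear_state_bytes() / 2**20:.1f} MiB")
```

At a million tokens, this hypothetical per-layer KV cache reaches 16 GiB while the linear attention state stays at 1 MiB, which is why long-context Agents on consumer hardware start to look feasible.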

Comparison with Similar Solutions

| Solution | Optimization Target | Speedup | Scope |
| --- | --- | --- | --- |
| FlashQLA | Linear attention kernels | 2-3× | Qwen linear attention models |
| FlashAttention-3 | Standard attention kernels | 1.5-2× | All Transformers |
| TensorRT-LLM | Inference engine | 1.5-3× | NVIDIA GPUs |

FlashQLA’s unique value lies in its deep optimization for linear attention, the core component of next-generation long-context models.

Action Recommendations

  • On-device Agent developers: once FlashQLA is integrated into llama.cpp, try running Qwen 3.6 locally
  • API users: Short-term impact is limited, but Qwen API prices may drop further as costs decrease
  • Model trainers: 2× backward speedup means more fine-tuning experiments on the same budget

Sources: Qwen GitHub, X/Twitter