Can Models Get Stronger Without Training? Darwin Family Uses Evolutionary Merging to Push LLM Reasoning to GPQA Diamond 86.9%

A Counterintuitive Question

Most approaches to improving LLM performance revolve around one premise: you need more training—more data, more compute, more iterations.

Darwin Family asks a different question: If models already possess certain capabilities individually, can we “stitch” them together—without any further training?

The answer sounds implausible, but the paper reports evidence that it can work.

Three Core Innovations

1. 14-Dimensional Adaptive Merging Genome

Traditional model merging (e.g., simple averaging, task arithmetic) typically operates at the full-model level. Darwin refines the granularity down to component- and block-level: the merging weight for each layer is an independent, optimizable parameter. A 14-dimensional “genome” enables fine-grained recombination across modules.
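To make the idea of a per-block merging genome concrete, here is a minimal sketch in plain Python. It rests on several assumptions that do not come from the paper: checkpoints are represented as dictionaries of flat weight lists, the merge operator is simple linear interpolation, and the hypothetical helper gene_for assigns each parameter to one of the 14 genes by layer index. The article does not say what the 14 dimensions actually encode, so treat the mapping as illustrative only.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values

GENOME_SIZE = 14  # dimension reported in the article; gene meanings are assumed


def gene_for(name: str, num_layers: int) -> int:
    """Hypothetical assignment of a parameter to one of the 14 genes.

    Assumes names look like 'layers.<i>.<rest>'; anything else uses gene 0.
    """
    parts = name.split(".")
    if len(parts) >= 2 and parts[0] == "layers" and parts[1].isdigit():
        layer = int(parts[1])
        # Spread the layers evenly over genes 1..13.
        return 1 + min(GENOME_SIZE - 2, layer * (GENOME_SIZE - 1) // num_layers)
    return 0


def merge(parent_a: Weights, parent_b: Weights,
          genome: List[float], num_layers: int) -> Weights:
    """Interpolate two checkpoints, using a different ratio per gene."""
    if len(genome) != GENOME_SIZE:
        raise ValueError("genome must have exactly 14 entries")
    merged: Weights = {}
    for name, a_vals in parent_a.items():
        b_vals = parent_b[name]
        g = genome[gene_for(name, num_layers)]  # 0.0 keeps A, 1.0 keeps B
        merged[name] = [(1.0 - g) * a + g * b for a, b in zip(a_vals, b_vals)]
    return merged
```

A full-model merge is the special case where all 14 genes share one value; letting them differ is what allows the search to favor one parent in some blocks and the other parent elsewhere.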

2. MRI-Trust Fusion

The name sounds academic, but the core idea is intuitive. Two signals determine how each layer should be merged: a diagnostic layer-importance signal and an evolutionary search signal. A learnable trust parameter balances the two dynamically.

In short: first diagnose how important each layer is for reasoning capability; then use evolutionary search to explore optimal merging configurations; finally, the trust parameter determines how much weight to assign to the diagnostic result.
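One simple way to picture the trust parameter is as a convex combination of the two signals. The sketch below assumes exactly that form, which the article does not confirm. The function name fuse_ratios and the idea that both signals are expressed as per-layer merge ratios in [0, 1] are assumptions made for illustration.

```python
from typing import List


def fuse_ratios(diagnostic: List[float], evolved: List[float],
                trust: float) -> List[float]:
    """Blend two per-layer merge-ratio proposals.

    trust = 1.0 follows the diagnostic signal only,
    trust = 0.0 follows the evolutionary search only.
    The convex-combination form is an assumption, not the paper's formula.
    """
    if len(diagnostic) != len(evolved):
        raise ValueError("both signals must cover the same layers")
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    return [trust * d + (1.0 - trust) * e for d, e in zip(diagnostic, evolved)]
```

Here trust is passed in as a plain number. If it were made one more searchable gene, the search could decide for itself how far to rely on the diagnosis.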

3. Architecture Mapper

This is the most radical part: Darwin supports cross-architecture “hybridization.” Components from Transformer and Mamba architectures can be merged together. This is not merely mixing two architectures within a single model—it maps checkpoints from disparate architectures into a shared weight space via a dedicated mapper, then performs evolutionary merging.

Let the Numbers Speak

Darwin-27B-Opus achieves 86.9% on GPQA Diamond.

What does this mean? It ranks 6th among 1,252 evaluated models, and crucially, it underwent no gradient-based training whatsoever. All performance gains stem from weight-space recombination of pre-existing checkpoints.

Even more strikingly, Darwin models consistently outperform their respective “parent models” across scales from 4B to 35B. The method also supports recursive multi-generation evolution: merged models can serve as starting points for the next evolutionary generation.
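The recursive part can be illustrated with a toy example in which each generation's winner becomes a parent in the next. The sketch makes strong simplifying assumptions: a "checkpoint" is just a short list of numbers, the score function is a stand-in for a benchmark, and a coarse grid search over a single merge ratio replaces the evolutionary search. All names here are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Model = List[float]  # toy "checkpoint": a flat list of weights


def lerp(a: Model, b: Model, t: float) -> Model:
    """Linear interpolation between two checkpoints."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def best_child(a: Model, b: Model, score: Callable[[Model], float],
               steps: int = 11) -> Model:
    """Grid-search one merge ratio (a stand-in for evolutionary search)."""
    candidates = [lerp(a, b, i / (steps - 1)) for i in range(steps)]
    return max(candidates, key=score)


def evolve_generations(seed: Model, partners: Sequence[Model],
                       score: Callable[[Model], float]) -> Tuple[Model, List[float]]:
    """Merge the current champion with one new partner per generation."""
    champion = seed
    history = [score(champion)]
    for partner in partners:
        champion = best_child(champion, partner, score)  # child becomes parent
        history.append(score(champion))
    return champion, history


def toy_score(model: Model) -> float:
    """Stand-in for a benchmark: closeness to an arbitrary target."""
    target = [1.0, 2.0]
    return -sum((w - t) ** 2 for w, t in zip(model, target))
```

Because the ratio 0 is always among the candidates, each generation can fall back to its own parent, so the score used for selection never decreases in this toy setting. That guarantee applies only to the score being optimized, not to held-out benchmarks.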

Why This Matters

Training cost remains one of the biggest bottlenecks in the LLM field. If training-free methods can deliver performance comparable to—or even exceeding—that of post-training, the implications are profound for resource-constrained research teams and small companies.

Darwin’s contribution is not a single new merging algorithm (methods like SLERP, TIES merging, and DARE already exist), but rather the demonstration of a complete, systematic pipeline: diagnostic evaluation → evolutionary search → cross-architecture mapping, which yields consistent, scalable performance gains.

Limitations

The paper candidly acknowledges limitations: the performance ceiling of evolutionary merging is bounded by the quality and capability distribution of the constituent models. If all parent models lack competence in a given capability dimension, merging cannot conjure that ability from nothing. Moreover, search-space complexity grows exponentially with model scale, so the evolutionary strategy must be highly efficient to keep the search tractable.
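To see why an evolutionary strategy helps here, consider a purely hypothetical discretization: if each of 14 genes could take only 10 values, exhaustive search would already face 10^14 configurations. An evolutionary loop instead spends a fixed evaluation budget, on the order of population size times generations. The sketch below is a generic elitist loop with Gaussian mutation, not the strategy used in the paper. The toy_fitness function is a placeholder for what would in practice be a benchmark evaluation of the merged model.

```python
import random
from typing import Callable, List, Tuple

Genome = List[float]


def evolve(
    fitness: Callable[[Genome], float],
    genome_size: int = 14,
    population: int = 16,
    generations: int = 20,
    sigma: float = 0.1,
    seed: int = 0,
) -> Tuple[Genome, float]:
    """Elitist search: keep the top quarter, refill with mutated copies."""
    if population < 4:
        raise ValueError("population must be at least 4")
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(genome_size)] for _ in range(population)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: population // 4]
        children: List[Genome] = []
        while len(elite) + len(children) < population:
            parent = rng.choice(elite)
            # Gaussian mutation, clipped so every gene stays in [0, 1].
            children.append(
                [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in parent]
            )
        pop = elite + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def toy_fitness(genome: Genome) -> float:
    """Placeholder for a benchmark score: closeness to an arbitrary 0.3."""
    return -sum((g - 0.3) ** 2 for g in genome)
```

Calling evolve(toy_fitness) returns the best genome found and its score. With a real model, every fitness call would mean merging and evaluating a checkpoint, which is why the evaluation budget is the quantity that matters.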

Relationship to Model Soups

This work can be viewed as a continuation and extension of the Model Soups line of work, which boosts model performance via weight-space composition rather than data-space composition. Yet Darwin makes substantive advances in granularity (block-level), cross-architecture support, and evolutionary strategy.
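For contrast, the simplest form of weight-space composition in the Model Soups line is a uniform average of several checkpoints that share one architecture. The sketch below shows that baseline using the same toy representation of a checkpoint as a dictionary of flat weight lists. The representation and the function name uniform_soup are illustrative assumptions.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def uniform_soup(checkpoints: Sequence[Weights]) -> Weights:
    """Average every parameter across checkpoints with equal weight."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }
```

A uniform soup applies one global rule to the whole model. The block-level approach described above replaces that single rule with searchable per-block ratios.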

Primary Sources:

arXiv:2605.14386 Darwin Family
Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim
NeurIPS 2026 submission
