Why do production LLM Agents break?
The usual answers: the model isn't good enough, the prompt isn't right, or the tool calling has bugs.
This paper offers a different angle: the problem may lie at the boundary between stochastic model outputs and deterministic systems—a boundary that has never been treated as a formal architectural object.
SDB: Stochastic-Deterministic Boundary
The authors name this boundary the Stochastic-Deterministic Boundary (SDB).
They define it as a four-part contract:
- Proposer: the LLM generates a candidate output
- Verifier: checks whether the output meets the constraints
- Commit step: turns the verified output into a system action
- Reject signal: what happens when verification fails
The paper argues that SDB is the load-bearing primitive of production agent runtimes.
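To make the four parts concrete, here is a minimal sketch of how I read the contract. It is not code from the paper. The names `Boundary`, `Rejection`, `fake_llm`, and `verify_refund`, the retry limit, and the refund limit of 100 are all my own assumptions, and a canned list of strings stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar, Union

T = TypeVar("T")


@dataclass
class Rejection:
    """Reject signal: returned when no candidate passed verification."""
    reason: str
    attempts: int


@dataclass
class Boundary(Generic[T]):
    """Hypothetical SDB: proposer, verifier, commit step, reject signal."""
    propose: Callable[[str], T]            # stochastic side (stand-in for an LLM call)
    verify: Callable[[T], Optional[str]]   # returns None if valid, else a reason
    commit: Callable[[T], None]            # deterministic side effect
    max_attempts: int = 3                  # assumed retry policy, not from the paper

    def run(self, task: str) -> Union[T, Rejection]:
        reason: Optional[str] = "no attempts made"
        for _ in range(self.max_attempts):
            candidate = self.propose(task)
            reason = self.verify(candidate)
            if reason is None:
                # Only verified output ever reaches the system.
                self.commit(candidate)
                return candidate
        return Rejection(reason=reason or "unknown", attempts=self.max_attempts)


# Usage: the first canned "LLM" output violates the limit, the second passes.
committed = []
canned_outputs = iter(["refund 9999", "refund 20"])


def fake_llm(task: str) -> str:
    return next(canned_outputs)


def verify_refund(text: str) -> Optional[str]:
    amount = int(text.split()[1])
    return None if amount <= 100 else f"amount {amount} exceeds limit"


boundary = Boundary(propose=fake_llm, verify=verify_refund, commit=committed.append)
result = boundary.run("refund order 42")
```

The point of the shape is that `commit` is reachable only through `verify`, and the failure path returns an explicit value instead of letting unverified text leak into the workflow.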
Six runtime patterns
Around SDB, the authors organize agent runtime design into three concerns: Coordination, State, and Control.
Then they borrow six patterns from distributed systems, each suited to a different scenario:
- Hierarchical Delegation: conversational agents
- Scatter-Gather + Saga: parallel sub-tasks that need aggregation
- Event-Driven Sequencing: async task flows
- Shared State Machine: multi-agent collaboration
- Supervisor + Gate: autonomous agents
- Human-in-the-Loop: critical decisions needing human review
Each pattern traces back to classical distributed systems concepts, but the paper identifies what changes when the worker becomes stochastic (an LLM).
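As an illustration of one of the six, here is a rough sketch of how Scatter-Gather + Saga might look when the workers are stochastic. This is my own reading of the pattern name, not the paper's design. `Step`, `scatter_gather_saga`, `make_step`, and the "output must start with OK" check are hypothetical, and fixed strings stand in for sub-agent outputs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    propose: Callable[[], str]       # stand-in for an LLM sub-agent call
    verify: Callable[[str], bool]    # deterministic check on the candidate
    commit: Callable[[str], None]    # applies the verified result
    compensate: Callable[[], None]   # undoes a previous commit


def scatter_gather_saga(steps: List[Step]) -> bool:
    # Scatter: run the stochastic proposers in parallel (map preserves order).
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda step: step.propose(), steps))

    # Gather: verify and commit in order; on failure, undo earlier commits.
    done: List[Step] = []
    for step, candidate in zip(steps, candidates):
        if not step.verify(candidate):
            for finished in reversed(done):
                finished.compensate()
            return False
        step.commit(candidate)
        done.append(step)
    return True


# Usage: a shared ledger records which bookings are currently committed.
ledger: List[str] = []


def make_step(name: str, output: str) -> Step:
    return Step(
        name=name,
        propose=lambda: output,
        verify=lambda text: text.startswith("OK"),
        commit=lambda text: ledger.append(name),
        compensate=lambda: ledger.remove(name),
    )


failed = scatter_gather_saga([
    make_step("flight", "OK AA123"),
    make_step("hotel", "I could not find a hotel"),
])
ledger_after_failure = list(ledger)

succeeded = scatter_gather_saga([
    make_step("flight", "OK AA123"),
    make_step("hotel", "OK H9"),
])
```

In the first run the hotel sub-agent returns free text that fails verification, so the flight commit is compensated and the ledger ends up empty. In the second run both candidates pass and both are committed.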
A key failure mode: Replay Divergence
The paper names a failure mode I hadn't seen named before: Replay Divergence.
Scenario: you record all agent inputs and outputs in a deterministic event log. Later, you change the model version or the prompt and replay the same log, and the downstream outputs differ.
In a traditional distributed system built from deterministic components, replaying the same log reproduces the same result. With LLM agents that guarantee is gone and divergence is hard to avoid: LLMs are stochastic, so even the same input can produce different outputs.
Naming this matters for debugging and auditing, because both rely on replaying a log to reconstruct what the agent did.
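A small sketch of what checking for this could look like, assuming each log entry stores the prompt alongside the output that was actually produced. The `Event` record, the function names, and the toy `model_v2` classifier are hypothetical and only illustrate the idea. The check also compares each recorded prompt independently, so it does not model how one divergent output would change later prompts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    step: int
    prompt: str
    output: str   # what the model returned when the event was recorded


def replay_recorded(log: List[Event]) -> List[str]:
    """Deterministic replay: reuse stored outputs, never call the model."""
    return [event.output for event in log]


def replay_regenerated(log: List[Event], model: Callable[[str], str]) -> List[str]:
    """Non-deterministic replay: ask the (possibly changed) model again."""
    return [model(event.prompt) for event in log]


def divergent_steps(log: List[Event], model: Callable[[str], str]) -> List[int]:
    """Steps whose regenerated output no longer matches the recorded one."""
    regenerated = replay_regenerated(log, model)
    return [event.step for event, new in zip(log, regenerated) if new != event.output]


# Usage: a toy "new model version" that labels password resets differently.
log = [
    Event(step=0, prompt="classify: refund request", output="billing"),
    Event(step=1, prompt="classify: password reset", output="account"),
]


def model_v2(prompt: str) -> str:
    return "billing" if "refund" in prompt else "security"


diverged = divergent_steps(log, model_v2)
recorded = replay_recorded(log)
```

Here step 1 diverges under `model_v2`, while `replay_recorded` still returns exactly what was logged. Keeping the two replay modes separate is one way to keep audits reproducible while still measuring how much a model or prompt change shifts behavior.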
Practical takeaways
If you're running agents in production:
- Define your SDB explicitly. Don't pipe LLM output directly into system workflows. Define who proposes, who verifies, how a verified output is committed, and what happens when verification fails.
- Pattern choice matters more than model choice. As model variance decreases, pattern choice and SDB strength become the more important levers for long-run reliability.
- There's a framework for failure diagnosis. The paper provides a five-step methodology that maps production failures to pattern weaknesses.