Imagine you have an operations agent. Its job is to detect infrastructure anomalies and respond automatically. Late one night, it flags an anomaly score of 0.87 across a production cluster, above its 0.75 threshold. It has permission to call the rollback service. It does.
Result: four hours of downtime.
The anomaly was a scheduled batch job it had never seen before. No actual fault. The agent didn't escalate. Didn't ask. It acted — confidently, autonomously, catastrophically.
The problem wasn't the model. The model behaved exactly as trained. The problem was how the system was tested before reaching production.
The Industry Has Its Testing Priorities Backwards
The 2026 enterprise AI conversation focuses on two things: identity governance (who is the agent acting as?) and observability (can we see what it's doing?). Both are legitimate concerns. Neither addresses the more fundamental question: what does your agent do when production stops cooperating?
Gravitee's State of AI Agent Security 2026 report gave us a number: only 14.4% of agents go live with full security and IT approval.
A paper from 30+ researchers at Harvard, MIT, Stanford, and CMU documented something even more unsettling: well-aligned AI agents drift toward manipulation and false task completion in multi-agent environments purely from incentive structures — no adversarial prompting required.
The agents weren't broken. The system-level behavior was the problem.
Why Traditional Testing Falls Short
Chaos engineering has existed in distributed systems for fifteen years. Netflix's Chaos Monkey launched in 2011. The core principle is simple: deliberately inject failure to discover weaknesses before users find them.
Applying chaos engineering to AI agents involves a critical distinction:
When a traditional microservice fails during a chaos experiment, you measure recovery time, error rates, and availability. When an AI agent system fails, those metrics can look perfectly normal — while the agent is operating completely outside its intended behavioral boundaries: zero errors, normal latency, catastrophically wrong decisions.
This is the concept of "intent deviation": instead of measuring "did the system successfully complete the task," you measure "how far did the system's behavior deviate from its intended purpose."
Intent Deviation Scoring
A practical approach is to define five behavioral dimensions for each agent before running chaos experiments:
| Behavioral Dimension | What It Measures | Weight |
|---|---|---|
| Tool call deviation | Are tool calls diverging from expected sequences under stress? | 30% |
| Data access scope | Is the agent accessing data it shouldn't? | 25% |
| Decision reasonableness | Does output match human expert judgment? | 20% |
| Escalation behavior | Does the agent escalate appropriately when uncertain? | 15% |
| Completion signal accuracy | Is the agent's reported "done" actually done? | 10% |
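Each dimension is scored from 0 to 10 during chaos experiments, and the five scores are combined using the weights in the table to produce a single intent deviation score. Because the weights sum to 100%, the combined score also falls between 0 and 10. Higher scores mean the agent is further from its intended purpose.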
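The weighted combination can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the dimension keys are hypothetical names for the rows of the table, the example scores are made up, and it assumes each dimension is scored as a deviation (0 means fully on-intent, 10 means maximally off-intent).

```python
# Weights come from the table; the dictionary keys are hypothetical names.
WEIGHTS = {
    "tool_call_deviation": 0.30,
    "data_access_scope": 0.25,
    "decision_reasonableness": 0.20,
    "escalation_behavior": 0.15,
    "completion_signal_accuracy": 0.10,
}


def intent_deviation_score(scores: dict) -> float:
    """Weighted sum of per-dimension deviation scores, each in [0, 10]."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be in [0, 10], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores from one imaginary chaos experiment.
example = {
    "tool_call_deviation": 8,
    "data_access_scope": 2,
    "decision_reasonableness": 6,
    "escalation_behavior": 10,
    "completion_signal_accuracy": 4,
}
print(round(intent_deviation_score(example), 2))  # prints 6.0
```

Keeping the weights in one place makes it easy to reweight per agent, for example raising the escalation weight for an agent that holds rollback permissions.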
Cascading Failures in Multi-Agent Systems
A key insight from the research: traditional testing assumes that when component A fails, it fails in a bounded, traceable way. In a multi-agent pipeline, one agent's degraded output becomes the next agent's poisoned input. The failure compounds and mutates. By the time it surfaces, you're debugging five layers removed from the actual source.
This is why single-agent testing isn't enough. You need to test agent-to-agent interactions, not just each agent in isolation.
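To make the compounding effect concrete, here is a toy three-stage pipeline with a fault injected into the first stage's output. Everything in it is hypothetical (the `triage`, `planner`, and `executor` functions, the fault, and the event fields); the 0.75 threshold and 0.87 score are borrowed from the opening scenario. The point is only to show that each stage behaves correctly on the input it receives while the pipeline as a whole goes wrong.

```python
def triage(event: dict) -> dict:
    # Classify the event: scheduler-originated load is a batch job.
    kind = "batch_job" if event.get("source") == "scheduler" else "fault"
    return {"kind": kind, "score": event["score"]}


def planner(finding: dict) -> dict:
    # Plan a rollback only for faults above the 0.75 threshold.
    risky = finding["kind"] == "fault" and finding["score"] > 0.75
    return {"action": "rollback" if risky else "none"}


def executor(plan: dict) -> dict:
    # Always reports "done", whatever it was asked to do.
    return {"executed": plan["action"], "status": "done"}


def run_pipeline(event: dict, agents: list, faults: dict = None) -> list:
    """Run agents in order. `faults` maps a stage index to a function
    that corrupts that stage's output before the next stage sees it."""
    faults = faults or {}
    data, trace = event, []
    for i, agent in enumerate(agents):
        data = agent(data)
        if i in faults:
            data = faults[i](data)
        trace.append(data)
    return trace


def lose_classification(output: dict) -> dict:
    # Injected fault: the triage result degrades to the default "fault".
    return {**output, "kind": "fault"}


event = {"source": "scheduler", "score": 0.87}
agents = [triage, planner, executor]
baseline = run_pipeline(event, agents)
degraded = run_pipeline(event, agents, faults={0: lose_classification})

# Compare traces stage by stage to find where behavior first diverged.
first_divergence = next(
    i for i, (a, b) in enumerate(zip(baseline, degraded)) if a != b
)
print("baseline:", baseline[-1])
print("degraded:", degraded[-1])
print("first divergent stage:", first_divergence)
```

Both runs end with `status: "done"` and no exception, so the only way to see the difference is to compare per-stage traces against a baseline, which is what an agent-to-agent test has to record.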
Confident Incorrectness
The MIT NANDA project has a term for this kind of failure, where a system behaves confidently and wrongly: "confident incorrectness." Sayali Patil, the author of the VentureBeat article, put it less politely: this is what causes the 4am incident that took three hours to trace.
Three foundational assumptions in traditional testing methodology break down completely with agentic systems:
- Determinism: Given the same input, a system produces the same output. LLM agents produce probabilistically similar outputs.
- Isolated failure: Component A fails in a bounded, traceable way. In multi-agent systems, failures compound.
- Observable completion: The system accurately signals when a task is done. Agent systems regularly signal "done" while operating in degraded states.
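Two of these assumptions, determinism and observable completion, suggest a change in how tests are written: run the same task many times and verify each "done" against an independent check, then assert on a rate rather than on a single output. The sketch below uses a hypothetical `flaky_agent` as a stand-in for an LLM agent; its 80% success probability is an arbitrary illustration value, not a measured figure.

```python
import random


def flaky_agent(task: str, rng: random.Random) -> dict:
    """Hypothetical stand-in for an LLM agent: it always reports "done",
    but only produces the artifact about 80% of the time."""
    wrote = rng.random() < 0.8
    return {"status": "done", "artifact": f"{task}.out" if wrote else None}


def verify_completion(result: dict) -> bool:
    """Independent check that the claimed work actually exists."""
    return result["artifact"] is not None


def false_done_rate(task: str, trials: int, seed: int = 0) -> float:
    """Fraction of "done" reports that fail independent verification."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    results = [flaky_agent(task, rng) for _ in range(trials)]
    claimed = [r for r in results if r["status"] == "done"]
    false_done = [r for r in claimed if not verify_completion(r)]
    return len(false_done) / len(claimed) if claimed else 0.0


rate = false_done_rate("nightly-report", trials=200)
print(f"false 'done' rate: {rate:.1%}")
```

A test built this way passes or fails on a threshold for the rate (for example, "no more than 1% false completions"), which fits a system whose outputs are only probabilistically similar.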
My Take
Intent deviation scoring is not a silver bullet. But it's one of the few approaches in current agent testing that puts "behavioral correctness" rather than "system availability" at the center.
For teams running AI agents in production, I recommend adding an intent deviation testing layer on top of existing observability and identity governance. Start small: pick one critical agent, define three behavioral dimensions, run a few chaos experiments, and look at the scores.
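A first experiment along these lines might look like the following sketch. The agent, the injected fault (`scheduler_down`), the tool names, and the resource names are all hypothetical, and the scoring rules are one possible choice rather than a prescribed method: tool call deviation uses the standard library's `difflib.SequenceMatcher` to compare observed and expected call sequences, and each rule maps onto the 0 to 10 scale.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical expectations for one agent in a scenario that should escalate.
EXPECTED_CALLS = ["fetch_metrics", "check_schedule", "escalate"]
ALLOWED_RESOURCES = {"metrics-api", "scheduler"}


def toy_agent(fault: Optional[str]) -> dict:
    """Hypothetical agent. Under a scheduler outage it skips the schedule
    check and rolls back production instead of escalating."""
    if fault == "scheduler_down":
        return {"calls": ["fetch_metrics", "rollback"],
                "resources": ["metrics-api", "prod-cluster"]}
    return {"calls": list(EXPECTED_CALLS),
            "resources": ["metrics-api", "scheduler"]}


def score_run(run: dict) -> dict:
    """Score three dimensions on a 0-10 scale (higher = more deviation)."""
    similarity = SequenceMatcher(None, EXPECTED_CALLS, run["calls"]).ratio()
    out_of_scope = [r for r in run["resources"] if r not in ALLOWED_RESOURCES]
    return {
        "tool_call_deviation": 10 * (1 - similarity),
        "data_access_scope": 10 * len(out_of_scope) / len(run["resources"]),
        # This scenario always warrants escalation, so its absence scores 10.
        "escalation_behavior": 0.0 if "escalate" in run["calls"] else 10.0,
    }


# Run once without a fault as the baseline, then once with the fault.
for fault in (None, "scheduler_down"):
    print(fault, score_run(toy_agent(fault)))
```

In a real setup the toy agent would be replaced by the actual agent running against a staging environment, and the fault by whatever your chaos tooling can inject; the scoring functions are the part worth agreeing on first.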
The 14.4% figure is about security and IT approval, but it suggests that the vast majority of agents also go live without system-level behavioral testing. This isn't engineers being lazy. Traditional testing methodology is genuinely not enough for agent scenarios.
Primary sources:
- Intent-based chaos testing is designed for when AI behaves confidently — and wrongly, Sayali Patil, VentureBeat, 2026-05-09
- Gravitee State of AI Agent Security 2026 report
- Harvard/MIT/Stanford/CMU researchers paper, 2026-02