Frontier models will quietly destroy your documents in long workflows. Not occasionally. Systematically.
Philippe Laban (Salesforce Research), Tobias Schnabel, and Jennifer Neville published DELEGATE-52 on arXiv — a large-scale benchmark across 52 professional domains and 19 LLMs. The findings are not encouraging: even models at the level of Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows.
What "Corruption" Means
"Corruption" here does not mean the model returns garbled text. It means that after receiving a delegated task (editing a document, modifying code, updating a report), the model introduces errors or deletes correct content without being told to do so.
Sparse, but lethal.
DELEGATE-52 simulates real delegated workflows: you give an LLM a document, ask it to make a series of edits, then check the results. The 52 domains span coding, crystallography, music notation, and more.
The results:
- Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4): ~25% content corrupted by the end of long workflows
- Other models: Higher failure rates; some models become nearly unusable in later stages
- Agentic tool use does not help: Adding tool-calling capabilities did not improve results on DELEGATE-52
- Larger documents = more corruption: Document size and interaction length correlate positively with error rates
- Distractor files make it worse: Including unrelated files in the working directory increases the model's error rate
The Core Problem
The paper's explanation: current LLMs are unreliable delegates. They introduce "sparse but severe errors" that compound over long interactions.
This is not a "switch to a better model" problem. The models tested are already the strongest available. As I read it, the root cause is that LLM output is probabilistic, not deterministic. In a short conversation, that uncertainty is tolerable. In a long workflow, every call is another dice roll, and if you roll enough times, something will go wrong.
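The dice-roll intuition can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each step fails independently with the same probability, which is a simplification of mine and not something the paper states. The 2% per-step error rate is an illustrative number, not a measured one, and the function name is hypothetical.

```python
def p_at_least_one_error(p_step: float, n_steps: int) -> float:
    """Probability that at least one of n_steps independent steps goes wrong.

    Assumption (not from the paper): every step fails independently
    with the same probability p_step.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    # P(no error in n steps) = (1 - p)^n, so P(at least one) is the complement.
    return 1.0 - (1.0 - p_step) ** n_steps


if __name__ == "__main__":
    # Illustrative per-step error rate of 2%; not a figure from DELEGATE-52.
    for n in (1, 5, 20, 50):
        print(f"{n:>3} steps: {p_at_least_one_error(0.02, n):.1%}")
```

Under these assumptions, a 2% per-step error rate grows to roughly a one-in-three chance of at least one error after 20 steps, and closer to two in three after 50. Real errors are unlikely to be independent, so treat this as intuition only.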
What makes this worse: the errors are silent. The model won't tell you "I changed the formula in paragraph three." It just changes it and hands back the result with full confidence.
What This Means for You
If you're using LLMs for document editing, code refactoring, or report updates:
Short tasks are fine. Editing a few lines of code or adjusting a paragraph — frontier models handle this reliably.
Long workflows need human checkpoints. Letting an LLM continuously edit a 50-page document without intermediate checks will almost certainly produce unwanted changes.
Distractor files are a trap. Mixing unrelated files in your working directory increases error rates. Keeping your workspace clean is not just a code style concern — it's a safety concern.
Tool calling is not a silver bullet. The paper reports that adding tool-calling capabilities does not improve delegated task performance. Don't assume that equipping an agent with file read/write tools automatically solves document reliability.
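One way to build a human checkpoint into a workflow is to diff the document after every edit and flag any change outside the region the edit was supposed to touch. The following is a minimal sketch using Python's standard difflib. The function name `unexpected_changes` and the idea of passing an allowed line range are my own assumptions for illustration; they are not part of DELEGATE-52.

```python
import difflib
from typing import List, Tuple

Change = Tuple[str, List[str], List[str]]


def unexpected_changes(before: str, after: str, allowed: Tuple[int, int]) -> List[Change]:
    """Return changes that fall outside the allowed line range of `before`.

    `allowed` is a half-open range (start, end) of 0-indexed lines in
    `before` that the requested edit was expected to touch (hypothetical
    convention for this sketch).
    """
    old = before.splitlines()
    new = after.splitlines()
    start, end = allowed
    flagged: List[Change] = []

    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A change is expected only if it lies fully inside the allowed range.
        # Changes that straddle the boundary are flagged, which is conservative.
        if start <= i1 and i2 <= end:
            continue
        flagged.append((tag, old[i1:i2], new[j1:j2]))
    return flagged


if __name__ == "__main__":
    before = (
        "Title\n"
        "Revenue: 120\n"
        "Costs: 80\n"
        "Margin = (Revenue - Costs) / Revenue\n"
    )
    # The request was to update the revenue line only (line index 1),
    # but the formula on the last line was silently altered too.
    after = (
        "Title\n"
        "Revenue: 125\n"
        "Costs: 80\n"
        "Margin = (Revenue - Costs) / Costs\n"
    )
    for tag, removed, added in unexpected_changes(before, after, allowed=(1, 2)):
        print(tag, removed, "->", added)
```

A line-level diff like this only shows that something changed outside the requested region. It does not judge whether the change is wrong, so a person still has to review whatever gets flagged.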
My Take
This paper's value is that it uses large-scale data to puncture a delusion: "frontier models are reliable enough to trust with your documents."
The reality is that for delegated tasks requiring accuracy, the current best practice remains "LLM generates + human reviews." Not because LLMs are weak — because their probabilistic nature makes them unsuitable for work that requires 100% certainty.
DELEGATE-52's significance is not in telling us LLMs are bad. It provides a quantifiable baseline. A 25% corruption rate is a starting point, not an endpoint. When the next model is released, we can use this same benchmark to track progress.
Until then, don't hand your important documents to an LLM and walk away.
Primary sources:
- LLMs Corrupt Your Documents When You Delegate, Philippe Laban et al., arXiv:2604.15597
- Hacker News discussion