LLMs Quietly Corrupt 25% of Your Document Content in Delegated Workflows

Frontier models will quietly destroy your documents in long workflows. Not occasionally. Systematically.

Philippe Laban (Salesforce Research), Tobias Schnabel, and Jennifer Neville published DELEGATE-52 on arXiv — a large-scale benchmark across 52 professional domains and 19 LLMs. The findings are not encouraging: even models at the level of Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows.

What "Corruption" Means

"Corruption" here does not mean the model returns garbled text. It means that after receiving a delegated task (editing a document, modifying code, updating a report), the model introduces errors or deletes correct content without being told to do so.

The errors are sparse, but lethal.

DELEGATE-52 simulates real delegated workflows: you give an LLM a document, ask it to make a series of edits, then check the results. The 52 domains span coding, crystallography, music notation, and more.

The results:

Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4): ~25% content corrupted by the end of long workflows
Other models: Higher failure rates; some models become nearly unusable in later stages
Agentic tool use does not improve performance: Adding tool-calling capabilities did not improve results on DELEGATE-52
Larger documents = more corruption: Document size and interaction length correlate positively with error rates
Distractor files make it worse: Including unrelated files in the working directory increases the model's error rate

The Core Problem

The paper's explanation: current LLMs are unreliable delegates. They introduce "sparse but severe errors" that compound over long interactions.

This is not a "switch to a better model" problem. The models tested are already among the strongest available. The root cause lies in how LLMs work: their output is probabilistic, not deterministic. In a short conversation, that uncertainty is acceptable. In a long workflow, every call is a dice roll, and even a small per-call chance of an unrequested change compounds. Roll enough times and something will go wrong.
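To put rough numbers on the dice-roll intuition, here is a minimal sketch. It assumes each edit fails independently with the same probability, which is a simplification of mine and not a model taken from the paper. The function name and the 2% rate are illustrative only.

```python
def p_at_least_one_error(per_step_error: float, steps: int) -> float:
    """Chance that at least one of `steps` independent edits goes wrong.

    Assumes every step has the same, independent error probability.
    Real workflows are messier; this only illustrates compounding.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # P(no error in any step) = (1 - e) ** n, so take the complement.
    return 1.0 - (1.0 - per_step_error) ** steps


if __name__ == "__main__":
    # The 2% per-step rate is an assumption, not a DELEGATE-52 figure.
    for steps in (1, 5, 20, 50):
        print(steps, round(p_at_least_one_error(0.02, steps), 3))
```

The point is the shape of the curve rather than the specific values: a per-step rate that looks negligible in a short exchange stops being negligible once the step count grows.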

What makes this worse: the errors are silent. The model won't tell you "I changed the formula in paragraph three." It just changes it and hands back the result with full confidence.
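Since the model won't report what it touched, you can ask a diff instead. The sketch below uses Python's standard difflib to list which original lines changed and flags any that fall outside the lines you asked the model to edit. The function names and the "allowed lines" idea are my own illustration, not something DELEGATE-52 prescribes.

```python
import difflib


def changed_original_lines(before: str, after: str) -> set[int]:
    """Return 1-based line numbers of `before` that were replaced or deleted.

    A pure insertion is reported at the line it follows
    (0 means "inserted before line 1").
    """
    a = before.splitlines()
    b = after.splitlines()
    touched: set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            touched.add(i1)  # insertion point: after original line i1
        else:  # "replace" or "delete"
            touched.update(range(i1 + 1, i2 + 1))
    return touched


def unexpected_changes(before: str, after: str, allowed: set[int]) -> list[int]:
    """Lines the model changed that were not in the allowed set."""
    return sorted(changed_original_lines(before, after) - set(allowed))


if __name__ == "__main__":
    before = "title\nx = 1\ny = 2\nend"
    after = "title\nx = 10\ny = 3\nend"
    # We only asked for line 2 to change, so line 3 gets flagged.
    print(unexpected_changes(before, after, allowed={2}))
```

A line-level diff will not catch everything (a moved paragraph shows up as a delete plus an insert), but it turns a silent change into something a reviewer can see.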

What This Means for You

If you're using LLMs for document editing, code refactoring, or report updates:

Short tasks are fine. Editing a few lines of code or adjusting a paragraph — frontier models handle this reliably.

Long workflows need human checkpoints. Letting an LLM continuously edit a 50-page document without intermediate checks will almost certainly produce unwanted changes.

Distractor files are a trap. Mixing unrelated files in your working directory increases error rates. Keeping your workspace clean is not just a code style concern — it's a safety concern.

Tool calling is not a silver bullet. The paper explicitly shows that adding tool-calling capabilities does not improve delegated task performance. Don't assume that equipping an agent with file read/write tools automatically solves document reliability.
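The distractor-file advice above is easy to turn into a habit. One possible approach, sketched here with a hypothetical helper, is to copy only the files a task needs into a fresh temporary directory and point the model at that directory. Whether this recovers the accuracy lost to distractors is not something the paper measures; it simply removes them from view.

```python
import shutil
import tempfile
from pathlib import Path


def make_clean_workspace(files: list[Path]) -> Path:
    """Copy only the named files into a new, empty temporary directory.

    Hypothetical helper: the caller decides which files are relevant.
    """
    workspace = Path(tempfile.mkdtemp(prefix="llm-task-"))
    for src in files:
        src = Path(src)
        if not src.is_file():
            raise FileNotFoundError(src)
        target = workspace / src.name
        if target.exists():
            # Two inputs share a file name; refuse rather than overwrite.
            raise FileExistsError(target)
        shutil.copy2(src, target)
    return workspace
```

Once the model is done, compare the workspace copies against the originals before copying anything back, so the clean workspace also doubles as a review checkpoint.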

My Take

This paper's value is that it uses large-scale data to puncture a delusion: "frontier models are reliable enough to trust with your documents."

The reality is that for delegated tasks requiring accuracy, the current best practice remains "LLM generates + human reviews." Not because LLMs are weak — because their probabilistic nature makes them unsuitable for work that requires 100% certainty.

DELEGATE-52's significance is not in telling us LLMs are bad. It provides a quantifiable baseline. A 25% corruption rate is a starting point, not an endpoint. When the next model is released, we can use this same benchmark to track progress.

Until then, don't hand your important documents to an LLM and walk away.

Primary sources:

LLMs Corrupt Your Documents When You Delegate, Philippe Laban et al., arXiv:2604.15597
Hacker News discussion
