Asking an LLM to write a scheduling algorithm, a bin-packing solver, or path-planning code sounds like a natural use case. But a new paper from the University of Pennsylvania (Dan Roth's team) offers a warning: don't rush to let it optimize, because you might actually make things worse.
Three Approaches: Which One Works Best?
The paper introduces a benchmark called CP-SynC-XL, comprising 100 combinatorial optimization problems and 4,577 instances. LLMs were tasked with generating solvers using three different paradigms:
- Pure Python Algorithmic Search: The LLM directly writes Python code for the search algorithm
- Python + OR-Tools Constraint Modeling: The LLM uses the OR-Tools Python API for modeling and calls the solver
- MiniZinc + OR-Tools Declarative Modeling: The LLM models using the declarative MiniZinc language, which is then solved by OR-Tools
The results are quite interesting.
Python + OR-Tools achieved the highest win rate—delivering the highest correctness across multiple LLMs. MiniZinc + OR-Tools, despite using the same OR-Tools backend, had lower coverage—the abstraction of the declarative language actually made it easier for the LLM to make mistakes. Pure Python was most prone to producing solutions that "pass schema validation but fail in actual testing."
This finding overturns a common intuition, namely that a higher-level abstraction (MiniZinc) yields better results. In practice, the LLMs were most reliable with API-level constraint modeling (Python + OR-Tools).
The Heuristic Trap: The Urge to Optimize Backfires on the Model
Next comes the paper's central finding.
Researchers asked the LLMs to incorporate search optimizations (such as pruning strategies, heuristic functions, or bound settings) when generating solvers. The results?
- A median speedup of only 1.03x to 1.12x, which is almost no effect
- A strong bimodal distribution—many instances actually became slower
- A sharp drop in correctness on long-tail problems
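To see how a median speedup close to 1x can coexist with many slowdowns, here is a minimal sketch. The per-instance speedups are made-up numbers chosen to form two clusters; they are not data from the paper, and the variable names are purely illustrative.

```python
from statistics import median

# Hypothetical per-instance speedups (baseline time / "optimized" time).
# Invented for illustration only; these are NOT the paper's measurements.
speedups = [0.3, 0.4, 0.5, 0.6, 1.05, 1.8, 2.0, 2.5, 3.0]

# The median sits in the gap between the two clusters.
median_speedup = median(speedups)

# Instances where the "optimized" solver is slower than the baseline.
slower = [s for s in speedups if s < 1.0]
fraction_slower = len(slower) / len(speedups)

print(f"median speedup: {median_speedup:.2f}x")
print(f"instances that got slower: {len(slower)} of {len(speedups)} ({fraction_slower:.0%})")
```

In a distribution shaped like this, the median alone looks harmless, which is why looking at the share of slower instances matters as well.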
Why? The researchers reviewed the generated code and found a recurring pattern, which they call the heuristic trap:
- In the pure Python path, the LLM might substitute a local approximation for a complete search
- In the Python + OR-Tools path, the LLM might inject unverified bound constraints
- In the MiniZinc + OR-Tools path, the LLM might add redundant declarative structures that actually overload the solver
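The first pattern above, a local approximation replacing a complete search, can be sketched on a tiny 0/1 knapsack instance. This is an illustrative example under our own assumptions, not code from the paper: the instance, `exhaustive_best`, and `greedy_by_density` are all hypothetical.

```python
from itertools import combinations

# Hypothetical 0/1 knapsack instance: (weight, value) pairs and a capacity.
ITEMS = [(10, 60), (20, 100), (30, 120)]
CAPACITY = 50


def exhaustive_best(items, capacity):
    """Complete search: try every subset and keep the best feasible value."""
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for w, _ in subset)
            if weight <= capacity:
                best = max(best, sum(v for _, v in subset))
    return best


def greedy_by_density(items, capacity):
    """'Optimized' shortcut: take items by value/weight ratio while they fit.

    This is fast, but it is a heuristic and no longer guarantees the optimum.
    """
    total_value = 0
    remaining = capacity
    for w, v in sorted(items, key=lambda it: it[1] / it[0], reverse=True):
        if w <= remaining:
            remaining -= w
            total_value += v
    return total_value


print("exhaustive:", exhaustive_best(ITEMS, CAPACITY))
print("greedy:    ", greedy_by_density(ITEMS, CAPACITY))
```

Both functions return a feasible packing, so a format check alone would not flag the greedy version. Only a comparison against the complete search shows that it silently solves a different, easier problem.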
In other words, the LLM's understanding of "optimization" is often flawed. It thinks it's speeding things up, but in reality, it's introducing bugs or altering the problem's semantics.
"Formalize, Don't Optimize"
Based on these findings, the paper proposes a conservative but practical design principle:
Let the LLM primarily handle formalization—defining variables, constraints, and objective functions—and leave the actual solving to verified solvers. Any search optimization written by the LLM must be independently reviewed before use.
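One way to picture this division of labor is a sketch in which the model description and the search are kept strictly apart. Everything here is an assumption for illustration: the `Model` class and the toy problem are hypothetical, and the brute-force `solve` function merely stands in for a verified solver such as the ones the paper hands the model to.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Dict[str, int]


@dataclass
class Model:
    """What the LLM is asked to produce: a problem statement, no search logic."""
    domains: Dict[str, List[int]]
    constraints: List[Callable[[Assignment], bool]]
    objective: Callable[[Assignment], int]  # value to maximize


def solve(model: Model) -> Optional[Tuple[Assignment, int]]:
    """Stand-in for a trusted solver: plain enumeration, suitable for tiny models only."""
    names = list(model.domains)
    best: Optional[Tuple[Assignment, int]] = None
    for values in product(*(model.domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in model.constraints):
            score = model.objective(assignment)
            if best is None or score > best[1]:
                best = (assignment, score)
    return best


# Hypothetical toy model: x, y in 0..5, x + y <= 6, x != y, maximize 3x + 2y.
toy = Model(
    domains={"x": list(range(6)), "y": list(range(6))},
    constraints=[
        lambda a: a["x"] + a["y"] <= 6,
        lambda a: a["x"] != a["y"],
    ],
    objective=lambda a: 3 * a["x"] + 2 * a["y"],
)

result = solve(toy)
print(result)
```

The point of the design is that the LLM-written part (`toy`) contains no pruning, ordering, or bounds. If a formalization is wrong, the error is in a constraint that a reviewer can read directly, not buried in search code.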
This is a "step back" strategy. It acknowledges that LLMs are useful for modeling (stating what the problem is) but unreliable for algorithmic optimization (deciding how to search). By separating these two tasks, we let the LLM do what it does best and hand the optimization over to mature solving engines.
Implications for AI Agent Development
This finding has direct value for developers building AI coding agents.
Many agent tools (Cursor, Claude Code, Copilot) rely on LLMs to generate and modify code. If an LLM is writing an algorithm that involves search or optimization, the agent should follow one rule: do not accept the LLM's "optimization suggestions" unless an independent validation mechanism is in place.
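As a rough idea of what such a validation mechanism could look like, the sketch below compares a candidate solver against a slow but trusted reference on small instances and reports the first disagreement. It is a minimal example under our own assumptions, not the paper's method: `pruned_solver` is a hypothetical "optimization" that injects an unverified bound (at most two items are ever chosen), and all names and instances are illustrative.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Item = Tuple[int, int]              # (weight, value)
Instance = Tuple[List[Item], int]   # (items, capacity)
Solver = Callable[[Sequence[Item], int], int]


def best_value(items: Sequence[Item], capacity: int, max_items: int) -> int:
    """Best knapsack value over all subsets containing at most max_items items."""
    best = 0
    for r in range(max_items + 1):
        for subset in combinations(items, r):
            if sum(w for w, _ in subset) <= capacity:
                best = max(best, sum(v for _, v in subset))
    return best


def reference_solver(items: Sequence[Item], capacity: int) -> int:
    """Trusted baseline: considers subsets of every size."""
    return best_value(items, capacity, max_items=len(items))


def pruned_solver(items: Sequence[Item], capacity: int) -> int:
    """Hypothetical 'optimized' solver with an unverified bound: at most 2 items."""
    return best_value(items, capacity, max_items=min(2, len(items)))


def find_counterexample(
    candidate: Solver, reference: Solver, instances: Iterable[Instance]
) -> Optional[Instance]:
    """Return the first instance where candidate and reference disagree, else None."""
    for items, capacity in instances:
        if candidate(items, capacity) != reference(items, capacity):
            return items, capacity
    return None


SMALL_INSTANCES: List[Instance] = [
    ([(10, 60), (20, 100), (30, 120)], 50),  # two items are optimal here
    ([(1, 5), (1, 5), (1, 5)], 3),           # all three items fit
]

print(find_counterexample(pruned_solver, reference_solver, SMALL_INSTANCES))
```

An agent could run a check like this before accepting a rewritten solver. A returned instance is concrete evidence that the "optimization" changed the problem's semantics, while `None` only means no disagreement was found on the instances tried.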
The paper also hints at a broader pattern: LLMs are more reliable at "describing problem structures" than at "designing solution strategies." This distinction likely holds across many other domains—such as database query optimization, compiler optimization, or even network routing strategies.
Paper Link: arXiv:2605.12421