A research paper generated end to end by AI can cost as little as $15.
This isn’t science fiction. It’s a figure reported in the new paper “AI for Auto-Research: Roadmap & User Guide,” published today on arXiv. Authors include Ziwei Liu, Tat-Seng Chua, Wei Tsang Ooi, and other scholars from the National University of Singapore.
But the paper’s core message isn’t “AI can now write papers.” It is closer to: “The problems with AI-written papers are more worrisome than the capabilities they demonstrate.”
Analysis Across Four Epistemic Stages
The paper divides the full research lifecycle into four “epistemic stages”:
1. Creation
- Idea generation
- Literature review
- Coding and experimentation
- Table and figure generation
Conclusion: AI excels at structured, retrieval-supported, tool-mediated tasks. However, ideas it generates often “degrade” upon implementation—sounding promising in theory but failing in practice.
2. Writing
- Paper drafting
Conclusion: This is one of AI’s strongest stages. Language generation and structural coherence are already highly mature.
3. Validation
- Simulated peer review
- Rebuttal and revision
Conclusion: This is the most problematic stage. Even state-of-the-art LLMs still fabricate results, miss hidden errors, and cannot reliably judge scientific novelty.
4. Dissemination
- Posters, slides, videos
- Social media posts, project pages
- Interactive agents
Conclusion: AI is highly capable here—but high “dissemination efficiency” may ironically amplify the reach and impact of low-quality research.
Key Finding: The Boundary Between Automation and Reliability
The paper introduces a critical insight: the boundary between what can be automated and what can be relied on is not fixed. It shifts with the research stage and the type of task.
| Task Type | AI Reliability |
|---|---|
| Structured retrieval tasks | ✅ High |
| Tool-mediated tasks | ✅ High |
| Truly novel ideas | ❌ Fragile |
| Research-grade experiments | ❌ Fragile |
| Scientific judgment | ❌ Fragile |
Even more pointedly: the quality of research-grade code lags far behind what pattern-matching benchmarks suggest. High scores that agents achieve on benchmarks like SWE-bench therefore say little about actual scientific coding capability, and a substantial gap remains.
End-to-End Automation Has Not Yet Reached “Top-Conference Standards”
The paper states bluntly: end-to-end autonomous systems do not yet consistently meet the acceptance standards of top-tier conferences. Higher levels of automation may hide failure modes rather than eliminate them.
Final conclusion: human-governed collaboration is the most trustworthy deployment paradigm.
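One way to picture human-governed collaboration is as a review gate keyed to the reliability ratings in the table above. The sketch below is only an illustration, not something taken from the paper: the task labels, the `review_policy` function, and the rule that "High" tasks may skip mandatory sign-off are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Reliability(Enum):
    HIGH = "high"
    FRAGILE = "fragile"


# Ratings mirror the table in this article; the keys are hypothetical labels.
RELIABILITY_BY_TASK = {
    "structured_retrieval": Reliability.HIGH,
    "tool_mediated": Reliability.HIGH,
    "novel_ideas": Reliability.FRAGILE,
    "research_grade_experiments": Reliability.FRAGILE,
    "scientific_judgment": Reliability.FRAGILE,
}


@dataclass
class Decision:
    task_type: str
    needs_human_review: bool
    reason: str


def review_policy(task_type: str) -> Decision:
    """Decide whether AI output for this task type needs human sign-off.

    Assumption for illustration: fragile and unknown task types always
    require review; high-reliability task types do not.
    """
    rating = RELIABILITY_BY_TASK.get(task_type)
    if rating is None:
        # Unknown task types fail closed: send them to a human.
        return Decision(task_type, True, "unknown task type")
    if rating is Reliability.FRAGILE:
        return Decision(task_type, True, "rated fragile")
    return Decision(task_type, False, "rated high")


if __name__ == "__main__":
    for task in ("structured_retrieval", "scientific_judgment", "poster_design"):
        decision = review_policy(task)
        print(f"{task}: human review = {decision.needs_human_review} ({decision.reason})")
```

Defaulting unknown task types to review reflects the article's warning that more automation can hide failure modes. A real deployment would likely want finer-grained ratings and spot checks even on the "High" rows.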
Value of This Roadmap
The paper delivers cross-stage design principles, curated tool lists, benchmark suites, and a practitioner-oriented “user guide.” For researchers exploring AI-assisted science, this roadmap serves as both a practical toolkit and a timely warning.
In the current AI-research hype cycle, a paper that calmly declares “we’re not there yet” is precisely the one that carries the greatest value.
Primary sources:
- arXiv:2605.18661 — AI for Auto-Research Roadmap Paper
- Project homepage: https://worldbench.github.io/awesome-ai-auto-research