Let LLMs Do Epidemic Forecasting Themselves: Harvard Team Predicts Multi-Pathogen Diseases with Autonomous Tree Search

Which mathematical model should be used to predict the next flu peak?

This is not a simple question. Epidemiological modeling spans SEIR-style compartmental models, time series models, machine learning models, and hybrid models, with countless variants in each category. Choices about which model to use, how to tune its parameters, and how to handle interactions between pathogens directly affect forecasting accuracy.

A recent paper from a team at Harvard University and Massachusetts General Hospital (MGH), "Prospective Multi-Pathogen Disease Forecasting Using Autonomous LLM-Guided Tree Search," proposes an interesting solution: let the LLM itself search for the best modeling strategy.

LLM-Guided Tree Search

The methodological core of the paper is "autonomous LLM-guided tree search."

Imagine the decision space for epidemic modeling as a tree. Each node in the tree represents a modeling choice—which model framework to use, which variables to include, how to handle seasonal factors, and whether to consider competitive relationships between pathogens. A path from the root node to a leaf node constitutes a complete modeling pipeline.

The traditional approach relies on manual exploration of this space—domain experts select models and adjust parameters based on their own experience. This process is time-consuming and susceptible to personal bias.

Instead, the paper employs the LLM as an autonomous search agent to explore this tree. The LLM doesn't choose randomly—it makes informed decisions based on an analysis of the historical performance of various branches. It autonomously determines which paths warrant deeper exploration and which should be pruned.
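To make the idea concrete, here is a minimal sketch of a guided search over a small tree of modeling choices. It is not the paper's implementation: the decision levels, the toy error table, and the names `evaluate`, `guide_score`, and `tree_search` are all assumptions made for illustration, and a deterministic heuristic stands in for the LLM call that would normally rank branches.

```python
from statistics import mean

# Hypothetical decision levels; the paper's actual search space may differ.
LEVELS = [
    ["SEIR", "ARIMA", "gradient_boosting"],  # model framework
    ["no_seasonality", "harmonic"],          # seasonal handling
    ["independent", "competitive"],          # pathogen interaction
]

# Made-up error contributions so the example runs without real data.
TOY_ERROR = {
    "SEIR": 0.30, "ARIMA": 0.40, "gradient_boosting": 0.35,
    "no_seasonality": 0.20, "harmonic": 0.05,
    "independent": 0.15, "competitive": 0.10,
}

def evaluate(pipeline):
    """Stand-in for fitting a complete pipeline and scoring it (lower is better)."""
    return sum(TOY_ERROR[choice] for choice in pipeline)

def guide_score(partial, history):
    """Stand-in for the LLM: rate a branch from past results that share its choices."""
    estimates = []
    for choice in partial:
        seen = [err for pipe, err in history.items() if choice in pipe]
        if seen:
            estimates.append(mean(seen))
    # Untested branches get an optimistic 0.0 so they are tried at least once;
    # ties go to deeper nodes so the search reaches complete pipelines.
    return (mean(estimates) if estimates else 0.0, -len(partial))

def tree_search(budget):
    """Evaluate at most `budget` complete pipelines; return (best pipeline, history)."""
    history = {}     # complete pipeline -> error
    frontier = [()]  # partial pipelines waiting to be expanded
    while frontier and len(history) < budget:
        # Re-rank the frontier each round using everything learned so far.
        frontier.sort(key=lambda node: guide_score(node, history))
        node = frontier.pop(0)
        if len(node) == len(LEVELS):
            history[node] = evaluate(node)
            continue
        frontier.extend(node + (option,) for option in LEVELS[len(node)])
    best = min(history, key=history.get)
    return best, history

best, history = tree_search(budget=12)
print(best, round(history[best], 2))
```

With a budget smaller than the number of leaves, branches that the guide keeps ranking low are simply never evaluated, which is the pruning behavior described above. In a real system, `guide_score` would be replaced by a prompt that shows the LLM the history and asks which branch to expand next.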

The Complexity of Multi-Pathogen Forecasting

The paper focuses on a particularly challenging scenario: simultaneously forecasting the transmission dynamics of multiple pathogens.

Forecasting a single pathogen is already difficult. Forecasting several at once is harder still, because the pathogens interact. For instance, after a person is infected with one respiratory virus, their short-term susceptibility to other viruses changes. Factors like school holidays, changes in climate, and population mobility also affect different pathogens in different ways.
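A toy simulation can show why such interactions matter for a forecast. The sketch below is an illustrative assumption, not a model from the paper: two pathogens each follow a simple discrete-time SIR model, and a hypothetical `cross_protection` parameter lowers the transmission rate of pathogen B in proportion to how many people are currently infected with pathogen A. The interaction is one-directional and all parameter values are made up.

```python
def simulate(days=200, beta=0.5, gamma=0.25, cross_protection=0.0, seed_fraction=0.001):
    """Return the daily infected fractions for pathogen A and pathogen B."""
    s_a, i_a = 1.0 - seed_fraction, seed_fraction
    s_b, i_b = 1.0 - seed_fraction, seed_fraction
    curve_a, curve_b = [i_a], [i_b]
    for _ in range(days):
        # Assumption: people currently infected with A are partly protected from B.
        beta_b = beta * (1.0 - cross_protection * i_a)
        new_a = beta * s_a * i_a
        new_b = beta_b * s_b * i_b
        s_a, i_a = s_a - new_a, i_a + new_a - gamma * i_a
        s_b, i_b = s_b - new_b, i_b + new_b - gamma * i_b
        curve_a.append(i_a)
        curve_b.append(i_b)
    return curve_a, curve_b

_, b_alone = simulate(cross_protection=0.0)
_, b_coupled = simulate(cross_protection=0.8)
print("peak of B without interaction:", round(max(b_alone), 3))
print("peak of B with interaction:   ", round(max(b_coupled), 3))
```

A model that forecasts pathogen B in isolation would miss this dampening, which is one reason the choice of how to handle pathogen interactions belongs in the search space.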

The value of the LLM here isn't that it understands disease transmission better than epidemiologists, but that it can systematically explore a much larger hypothesis space in parallel. While experts may be constrained by the model categories they are familiar with, LLMs can combine and innovate across different categories and methodologies.

Prospective Validation

A key design choice in the paper is prospective validation. Rather than backtesting on historical data, which is prone to overfitting, the team issues forecasts in real time and then waits for real-world data to arrive before scoring them.

This validation approach is crucial in epidemiological research. Good backtesting results do not equate to strong predictive power—you might just be memorizing historical patterns. Only prospective, real-time forecasting can truly test a model's practical value.
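The difference from backtesting comes down to bookkeeping: a forecast has to be locked in before its target date, and it can only be scored once the real observation arrives. A minimal sketch of that discipline follows. The `ForecastLog` class, the dates, and the numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ForecastLog:
    """Hypothetical record of forecasts locked in before the outcome is known."""
    forecasts: dict = field(default_factory=dict)  # target date -> (issue date, prediction)

    def submit(self, issued, target, predicted):
        if issued >= target:
            raise ValueError("a prospective forecast must be issued before its target date")
        if target in self.forecasts:
            raise ValueError("this target already has a locked-in forecast")
        self.forecasts[target] = (issued, predicted)

    def score(self, observations):
        """Mean absolute error over targets whose observed value has arrived."""
        errors = [
            abs(predicted - observations[target])
            for target, (_, predicted) in self.forecasts.items()
            if target in observations
        ]
        return sum(errors) / len(errors) if errors else None

log = ForecastLog()
log.submit(issued=date(2025, 1, 6), target=date(2025, 1, 13), predicted=120.0)
log.submit(issued=date(2025, 1, 13), target=date(2025, 1, 20), predicted=150.0)

# Only the first target week has been observed so far (made-up number).
print(log.score({date(2025, 1, 13): 100.0}))
```

Because a forecast cannot be revised or resubmitted after the fact, there is no opportunity to tune the model against the data it is later scored on.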

Relationship to Other AI for Science Work

Recent applications of AI in the scientific domain have revealed several distinct paradigms:

Substitution Paradigm: Completely replacing traditional scientific methods with AI. For example, using end-to-end models to directly predict outcomes, bypassing physical/biological modeling. This direction is highly controversial due to its lack of interpretability.

Assistance Paradigm: AI serves as a tool to assist scientists. Examples include accelerating computations, automating literature reviews, and generating hypotheses. This direction is relatively mature, but AI's role remains that of a "tool" rather than a "collaborator."

Autonomous Paradigm: AI conducts scientific exploration autonomously. This is the direction represented by ARIS (Shanghai Jiao Tong University's autonomous research agent) and this paper. AI doesn't just execute commands; it proactively searches hypothesis spaces, designs experiments, and makes decisions.

LLM-guided tree search falls under the autonomous paradigm, but it is more focused than ARIS—it doesn't task the LLM with the entire scientific workflow, but rather enables it to autonomously explore within a specific, structured search space.

My Thoughts

This paper showcases a new role for LLMs in scientific modeling: not as a conversational partner or a text generator, but as an autonomous search agent.

This shift in role is significant. When we position an LLM as a "conversational tool," our expectation is "answer my questions." But when it becomes a "search agent," our expectation shifts to "conduct valuable exploration in directions I haven't explicitly specified."

The latter places higher demands on the LLM. It requires sufficient domain understanding to make sound search decisions, self-evaluation capabilities to determine which directions are worth pursuing, and the flexibility to combine and innovate across knowledge domains.

The paper's prospective validation design is also commendable. In the AI for Science field, too much work remains stuck at the backtesting stage. Only prospective, real-time validation can truly build trust in AI's forecasting capabilities.

Of course, this approach also has limitations. The quality of the LLM's search depends heavily on prompt design and on how the search strategy is structured. If the search space is poorly defined, or if the LLM's evaluation criteria are biased, the search may head in the wrong direction.

But within the right framework, letting AI autonomously explore scientific hypothesis spaces is an exciting direction in its own right.

Primary Source:

arXiv:2605.16238 - LLM-Guided Disease Forecasting

