ChaoBro

TIGER-Lab’s New Paper: Stop Obsessing Over Semantic Similarity—Agentic Search Needs “Direct Corpus Interaction”


RAG (Retrieval-Augmented Generation) has been mainstream for three years, and in all that time everyone has been optimizing one thing: embedding-based similarity.

Better embedding models, better vector databases, better chunking strategies: the approaches vary widely, yet the underlying assumption has rarely been questioned. Retrieval is similarity matching.

Now TIGER-Lab (an AI lab at the University of Waterloo) has published a paper, featured on Hugging Face Daily Papers, that directly challenges this foundational premise. With 87 upvotes, it ranked second on its release day.

Paper title: "Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction".

In plain terms: Stop treating retrieval as nothing more than semantic similarity matching. Instead, let search agents interact with the corpus directly.

The Fundamental Limitation of Semantic Similarity Retrieval

The paper identifies a critical weakness: semantic similarity retrieval is inherently inadequate in one specific scenario—when the information needed to answer a user’s query is distributed across multiple locations in a document, rather than concentrated in a single passage semantically close to the query text.

For example:

User asks: “How is this company’s financial health?”

A semantic similarity retriever embeds this query and then searches the corpus for passages most semantically similar to phrases like “company financial health.”

But what if the document contains no sentence explicitly stating “The company’s financial health is…”? What if financial information is scattered across income statements, expense records, cash flow reports, and management discussion sections?

Semantic similarity retrieval will return several fragments that look relevant, but it may miss the truly critical passages, precisely because those passages bear little surface-level or semantic resemblance to the user’s question.
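To make this failure mode concrete, here is a minimal sketch. It assumes a toy bag-of-words vector as a stand-in for a real embedding model, and the four passages are invented for illustration. A real embedding model handles paraphrase far better, but the ranking logic is the same: the passages that echo the question outrank the ones that actually contain the numbers.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, passages: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(
        passages,
        key=lambda name: cosine(q, embed(passages[name])),
        reverse=True,
    )
    return ranked[:k]


# Invented passages, for illustration only.
passages = {
    "marketing": "our company values financial health and wellness for every employee",
    "press": "the company celebrated a year of strong financial health awareness events",
    "income_statement": "revenue fell 12 percent while operating expenses rose",
    "cash_flow": "net cash from operations was negative for the third quarter",
}

hits = top_k("how is this company financial health", passages)
print(hits)  # ['marketing', 'press']
```

The income statement and cash flow passages, which are the ones that actually answer the question, score zero because they share no vocabulary with the query.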

The Direct Corpus Interaction Approach

The paper proposes an alternative: Rather than having a retrieval system perform similarity matching, let the agent itself actively “explore” and “query” the corpus.

To illustrate:

  • Semantic similarity retrieval: Like a librarian who listens to your request and hands you several books that “seem relevant.”
  • Direct Corpus Interaction: Like walking into the library yourself—browsing the catalog, following cross-references, tracing leads.

The latter is more flexible—but also far more complex. It demands that the agent possess:

  • The ability to understand document structure (tables of contents, section hierarchies, cross-references);
  • The ability to dynamically adapt its search strategy (e.g., jumping from one clue to another);
  • The ability to synthesize fragmented information (i.e., assembling a coherent picture from disjointed pieces).

Technical Implementation

According to the paper, Direct Corpus Interaction centers on an agent-driven retrieval pipeline:

  1. Initial Exploration: The agent reads the corpus’s global structure (titles, table of contents, abstracts) to build a mental “map” of the corpus.
  2. Targeted Querying: Based on the user’s question, the agent decides which sections warrant deeper investigation.
  3. Cross-Validation: The agent establishes connections across different sections to verify consistency among retrieved information.
  4. Information Integration: The agent synthesizes findings into a coherent, unified response.

This process requires no embeddings, no vector database, and no manual chunking. Instead, it relies on an agent capable of understanding document structure, planning search paths, and reasoning about information relationships.
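The four steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper’s implementation: `Corpus`, `Section`, `keyword_planner`, and `answer` are hypothetical names, the planner is a hand-written keyword stub standing in for an LLM call, and “cross-validation” is reduced to checking which figures are quoted in more than one section.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Section:
    title: str
    body: str


class Corpus:
    """Hypothetical wrapper exposing the two operations the agent uses."""

    def __init__(self, sections: list[Section]) -> None:
        self._by_title = {s.title: s for s in sections}

    def outline(self) -> list[str]:
        return list(self._by_title)

    def read(self, title: str) -> str:
        return self._by_title[title].body


# A planner maps (question, outline) to the section titles worth opening.
Planner = Callable[[str, list[str]], list[str]]


def keyword_planner(question: str, outline: list[str]) -> list[str]:
    """Stub standing in for the LLM's decision about which sections to open."""
    hints = {"financial": {"Income Statement", "Cash Flow", "Management Discussion"}}
    wanted: set[str] = set()
    for cue, titles in hints.items():
        if cue in question.lower():
            wanted |= titles
    return [title for title in outline if title in wanted]


def answer(question: str, corpus: Corpus, plan: Planner = keyword_planner) -> dict:
    outline = corpus.outline()            # 1. initial exploration
    targets = plan(question, outline)     # 2. targeted querying
    evidence = {title: corpus.read(title) for title in targets}
    # 3. cross-validation (toy): figures quoted in two or more sections
    seen = Counter(
        number
        for body in evidence.values()
        for number in set(re.findall(r"\d+", body))
    )
    corroborated = sorted(n for n, count in seen.items() if count >= 2)
    # 4. information integration (toy): concatenate what was read
    return {
        "read": targets,
        "corroborated_figures": corroborated,
        "summary": " ".join(evidence.values()),
    }


# Invented corpus, for illustration only.
corpus = Corpus([
    Section("About Us", "We build widgets."),
    Section("Income Statement", "Revenue was 88 million, down from 100 million."),
    Section("Cash Flow", "Operating cash flow was negative 5 million."),
    Section("Management Discussion", "Revenue of 88 million reflects weaker demand."),
])
result = answer("How is this company's financial health?", corpus)
print(result["read"])
print(result["corroborated_figures"])
```

In a real system the planner and the integration step would each be an LLM call. The point of the sketch is only the control flow: the agent reads the outline, picks sections, and reads them directly, with no index in between.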

What’s the Cost?

The trade-off is explicit: each retrieval requires LLM-based reasoning rather than a single nearest-neighbor vector lookup.

  • Vector search: Millisecond latency; near-zero cost.
  • Agent-based retrieval: Second-scale latency; per-query LLM token consumption.
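A back-of-envelope model makes the per-query cost easier to reason about. Everything below is an assumption for illustration: the `AgentProfile` fields, the token counts, the prices, and the per-call latency are placeholders, not figures from the paper or from any provider’s price list.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    """Hypothetical description of one agent-based retrieval."""
    llm_calls: int                 # sequential LLM calls per retrieval
    input_tokens_per_call: int
    output_tokens_per_call: int
    usd_per_million_input: float   # placeholder price
    usd_per_million_output: float  # placeholder price
    seconds_per_call: float        # placeholder latency


def cost_usd(p: AgentProfile) -> float:
    """Token cost of one retrieval: calls x (input cost + output cost)."""
    per_call = (
        p.input_tokens_per_call * p.usd_per_million_input
        + p.output_tokens_per_call * p.usd_per_million_output
    ) / 1_000_000
    return p.llm_calls * per_call


def latency_seconds(p: AgentProfile) -> float:
    """Latency of one retrieval, assuming the calls run one after another."""
    return p.llm_calls * p.seconds_per_call


# Placeholder numbers: one call per pipeline step.
profile = AgentProfile(
    llm_calls=4,
    input_tokens_per_call=2_000,
    output_tokens_per_call=200,
    usd_per_million_input=1.0,
    usd_per_million_output=4.0,
    seconds_per_call=1.5,
)
print(f"cost per query: ${cost_usd(profile):.4f}")
print(f"latency per query: {latency_seconds(profile):.1f} s")
```

Plug in your own token counts and prices. The structure of the formula matters more than the placeholder values: cost and latency both scale linearly with the number of LLM calls the agent makes.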

The paper confronts a direct question: Does the gain in accuracy justify this added cost?

For certain domains, the answer is unequivocally yes:

  • Legal consultation: Missing a single clause could reverse an entire judgment.
  • Clinical diagnosis: Clues scattered across disparate lab reports may converge on a critical conclusion.
  • Academic research: Integrating insights across multiple papers is often essential.

Yet for other use cases, semantic similarity retrieval remains the pragmatic choice:

  • FAQ-style Q&A
  • Simple document lookup
  • Latency-sensitive applications

A Deeper Signal

Beyond its technical contribution, this paper signals a broader shift in research focus: from “How do we make retrieval better?” to “What should retrieval be?”

Over the past three years, most RAG research has optimized within the existing retrieval paradigm—improving embeddings, refining vector indexes, enhancing re-ranking. TIGER-Lab’s work asks a more fundamental question: If retrieval isn’t just about “finding similar text,” but about “navigating knowledge space,” how should the entire architecture be reimagined?

This resonates with PageIndex’s “vectorless RAG” direction (previously covered), but TIGER-Lab’s framing is more precise—it emphasizes agentic interaction, not merely the technical absence of vectors.

Assessment

Direct Corpus Interaction is a promising direction—especially for complex document understanding and multi-hop reasoning. Yet it won’t replace semantic similarity retrieval; their roles are complementary, not competitive.

The more likely future is coexistence: simple queries handled by vector search, complex ones delegated to agent exploration. More ambitiously, agents could autonomously select the best retrieval method for each query.
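A minimal sketch of such routing, assuming a hand-written cue list as the complexity signal. The names `route`, `looks_complex`, and `MULTI_HOP_CUES` are hypothetical, and the two retrievers are stubs. A production router would more plausibly use a classifier or let the agent itself decide.

```python
import re
from typing import Callable

Retriever = Callable[[str], list[str]]

# Hypothetical cue words suggesting the answer is spread across sections.
MULTI_HOP_CUES = {"compare", "overall", "across", "trend", "health", "why"}


def looks_complex(query: str) -> bool:
    """Crude heuristic: does the query contain any multi-hop cue word?"""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return bool(tokens & MULTI_HOP_CUES)


def route(
    query: str,
    vector_search: Retriever,
    agent_explore: Retriever,
) -> tuple[str, list[str]]:
    """Send simple queries to vector search and complex ones to the agent."""
    if looks_complex(query):
        return "agent", agent_explore(query)
    return "vector", vector_search(query)


# Stub retrievers, for illustration only.
def fake_vector_search(query: str) -> list[str]:
    return ["nearest passage"]


def fake_agent_explore(query: str) -> list[str]:
    return ["section A", "section B"]


print(route("What is the refund policy?", fake_vector_search, fake_agent_explore))
print(route("How is this company's financial health?", fake_vector_search, fake_agent_explore))
```

The first query goes to vector search and the second to the agent, so the cheap path stays the default and the expensive path is reserved for questions that look multi-hop.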
