ChaoBro

PageIndex: RAG Without Vector Search, the Technology Bet Behind 31,000 Stars

Vector embeddings have been the default for RAG for nearly three years. Now someone comes along and says: no vectors, use an index.

VectifyAI / PageIndex has 31,302 stars on GitHub. That is not top-tier among AI open-source projects, but the project's claim is bold: better document retrieval than traditional vector RAG, without relying on vector embeddings.

Core idea

Traditional RAG workflow: documents → chunking → vectorization → store in vector database → compute similarity on query → return most relevant chunks.
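The pipeline above can be sketched in a few lines. This is a toy illustration only: the `embed` function is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample document and every function name are made up for the example.

```python
import math
import re
from collections import Counter

def chunk(text, size=8):
    """Split text into fixed-size chunks of `size` words (a naive chunking strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: word counts. A real pipeline would call an embedding model here."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_store(document, size=8):
    """Chunk the document and keep (chunk, vector) pairs; stands in for a vector database."""
    return [(c, embed(c)) for c in chunk(document, size)]

def retrieve(store, query, k=1):
    """Rank chunks by similarity to the query and return the top k."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Invented sample document for illustration.
document = (
    "Refunds are issued within 14 days of purchase. "
    "Shipping takes three to five business days. "
    "Support is available by email on weekdays."
)
store = build_store(document, size=8)
top = retrieve(store, "refunds purchase days", k=1)
print(top)
```

Note that retrieval here never looks at where a chunk sits in the document. It only compares numbers, which is exactly the property the rest of this post is about.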

PageIndex approach: documents → generate structured index → locate relevant sections through the index at query time → filter with LLM reasoning → return results.
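To make the idea concrete, here is a minimal sketch of index-guided retrieval. It is not PageIndex's actual API or implementation: the `Node` tree, the `search` function, and the sample report are all hypothetical, and `stub_llm_judge` is a keyword-overlap placeholder standing in for a real LLM call so the example runs offline.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entry in a hypothetical structured index: a section with a short summary."""
    title: str
    summary: str
    text: str = ""
    children: list[Node] = field(default_factory=list)

def stub_llm_judge(query, node):
    """Placeholder for an LLM relevance judgment (assumption, not a real model call).

    A real system would prompt a model with the query plus the node's title and
    summary and parse its answer. Here we only check for shared words.
    """
    q = set(re.findall(r"\w+", query.lower()))
    n = set(re.findall(r"\w+", (node.title + " " + node.summary).lower()))
    return bool(q & n)

def search(node, query, judge, path=()):
    """Walk the index top-down, descending only into sections the judge accepts."""
    here = path + (node.title,)
    if not node.children:
        return [(" > ".join(here), node.text)]
    hits = []
    for child in node.children:
        if judge(query, child):
            hits.extend(search(child, query, judge, here))
    return hits

# Invented sample index for illustration.
root = Node("Annual Report", "Company annual report", children=[
    Node("Financials", "Revenue, costs and bank loans", children=[
        Node("Revenue", "Revenue by region", text="Revenue grew 12 percent."),
        Node("Debt", "Bank loans and interest", text="Bank loans total 4 million."),
    ]),
    Node("Operations", "Factories and logistics along the river", children=[
        Node("Sites", "Plant on the river bank", text="The plant sits on the north river bank."),
    ]),
])

results = search(root, "bank loans", stub_llm_judge)
print(results)
```

The judge is called once per candidate section, so the number of model calls grows with how many branches get explored. Results also come back with their path in the document, something a flat list of chunks does not give you.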

The key difference is in how relevance is judged. The traditional approach compresses semantic similarity into a distance in vector space, which is efficient but loses a lot of structural information. PageIndex uses LLM reasoning to judge matches, which is more "expensive" but more "intelligent."

Is this thing legit

31,000 stars does not equal 31,000 production deployments. Stars are sentiment, deployments are reality.

From a technical feasibility perspective, this direction has several valid points:

One limitation of vector embeddings is semantic ambiguity. "Bank" in a financial document and "bank" in a document about a riverside may end up with similar vectors, even though the meanings are completely different. LLM reasoning can disambiguate from context.

Additionally, long-document retrieval has always been a pain point for vector RAG. Once a document exceeds 10 pages, chunking can severely damage contextual coherence. PageIndex's indexing approach can, in theory, preserve the global document structure.
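A tiny example shows how chunking breaks coherence. This sketch assumes naive fixed-size chunking by word count; the contract text and the chunk size are invented for the illustration, and real pipelines often soften the problem with overlap or sentence-aware splitting.

```python
def fixed_chunks(text, size):
    """Split text into consecutive chunks of `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Invented sample clause for illustration.
text = (
    "Section 4 Termination. Either party may terminate with 30 days notice. "
    "This right does not apply during the first year."
)

chunks = fixed_chunks(text, 11)
for i, c in enumerate(chunks):
    print(i, repr(c))
```

The second chunk carries the exception ("This right does not apply...") but no longer says which right or which section it belongs to. Retrieved on its own, it is close to meaningless.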

But the trade-off is clear: cost and speed. Every query against the index requires calling an LLM to reason about relevance, which means each retrieval costs several to dozens of times more than a traditional vector search.
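A back-of-the-envelope model helps when estimating this for your own workload. Every number below is a placeholder, not a measurement or a real price: plug in your own call counts, token counts, and provider pricing.

```python
def llm_retrieval_cost(llm_calls, tokens_per_call, price_per_1k_tokens):
    """Cost of one retrieval when each navigation step is an LLM call."""
    return llm_calls * tokens_per_call / 1000 * price_per_1k_tokens

def cost_ratio(llm_cost, vector_cost):
    """How many times more expensive the LLM-based retrieval is."""
    return llm_cost / vector_cost

# Placeholder inputs (assumptions for illustration only).
llm_cost = llm_retrieval_cost(llm_calls=5, tokens_per_call=1000, price_per_1k_tokens=0.002)
vector_cost = 0.001  # assumed per-query cost of embedding the query plus a vector lookup

ratio = cost_ratio(llm_cost, vector_cost)
print(f"LLM-based retrieval is about {ratio:.0f}x the vector search cost under these assumptions")
```

The ratio is driven mostly by how many LLM calls a single retrieval needs, so a deeper index or a broader search pushes it up quickly.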

Project status

According to GitHub data, the project has 2,669 forks, which suggests some people are actually forking it for secondary development. Activity in the issue tracker is something you need to check for yourself, but the star growth curve shows community attention is rising.

The author team comes from VectifyAI, a startup focused on document-processing AI. This is not a weekend hobby project, but a project with clear commercial goals.

My take

I do not think PageIndex will completely replace vector RAG. But it may become a better choice in specific scenarios:

  • Precise retrieval of high-value documents: contracts, legal documents, medical literature. Accuracy matters far more than speed in these scenarios
  • Long-document scenarios: global retrieval across entire books or entire reports
  • Mixed-language documents: vector embedding support for multiple languages has never been good enough, while LLM reasoning is naturally cross-lingual

For most everyday scenarios, such as FAQ retrieval and knowledge base Q&A, vector RAG remains the most cost-effective choice. But if semantic ambiguity or long documents have been tormenting you in vector RAG, PageIndex is worth half an hour of your time.

One thing to watch: if upcoming iterations bring this approach's latency and cost down to an acceptable range, the debate over RAG technical routes will get interesting. Vector vs. indexing may turn out to be not a replacement relationship, but a complementary one.


Primary sources: