Xiaohongshu’s AI team has released a paper titled "HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents". It was featured on Hugging Face Daily Papers, where it earned 57 upvotes.
Though lengthy, the title encapsulates three core ideas:
- Parallel Multimodal Search: Simultaneously leveraging multiple modalities—text, images, and video—for search
- Dual-Grained: Optimization performed at two distinct granularities
- Efficiency-Aware RL: Reinforcement learning that explicitly accounts for computational efficiency—enabling agents to pursue performance while conserving compute resources
Why Xiaohongshu Built This
Consider how users interact with Xiaohongshu. A query like “How is this restaurant?” returns not only textual reviews but also photos, videos, location data, and pricing information—a truly multimodal experience.
Moreover, search is inherently parallel: the system must process multiple modalities concurrently—not sequentially (e.g., text first, then images, then video). This differs fundamentally from traditional search engines. While Google’s core remains text-centric—with images and video as supplementary features—Xiaohongshu’s user experience is built on seamless multimodal integration.
Thus, Xiaohongshu’s AI team faces a concrete technical challenge: How do we enable search agents to perform efficient, parallel multimodal search—while rigorously controlling computational cost?
What “Dual-Grained” Means
In this paper, “dual-grained” refers to optimization at two complementary levels:
Fine-Grained: Decision-level optimization for individual agents. For example, a text-search agent must decide which query to issue, how many results to retrieve, and when to terminate its search. Here, RL optimizes each agent’s specific behavioral policy.
Coarse-Grained: Coordination-level optimization across multiple agents. The system deploys several parallel agents (e.g., text, image, and video agents), and coarse-grained RL governs resource allocation—e.g., assigning more compute budget to one agent while throttling another.
Both granularities must be optimized jointly. Optimizing only at the fine-grained level risks inefficiency—for instance, all three agents might redundantly explore overlapping semantic spaces. Conversely, optimizing only at the coarse-grained level may overlook internal inefficiencies within individual agents.
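To make the two levels concrete, here is a minimal Python sketch. It is not the paper's method: hand-written heuristics stand in for the learned policies, and every name in it (`allocate_budget`, `run_agent`, the relevance scores, the `min_gain` threshold) is a hypothetical chosen for illustration. The coarse-grained step splits a shared query budget across agents, and the fine-grained step lets each agent decide when to stop within its share.

```python
import math


def allocate_budget(scores: dict[str, float], total_budget: int,
                    temperature: float = 1.0) -> dict[str, int]:
    """Coarse-grained step (illustrative): split a shared query budget
    across agents with a softmax over hypothetical relevance scores."""
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {name: math.exp((s - top) / temperature)
               for name, s in scores.items()}
    total_weight = sum(weights.values())
    # Shares are floored, so a few units of budget may stay unassigned.
    return {name: int(total_budget * w / total_weight)
            for name, w in weights.items()}


def run_agent(gains: list[float], budget: int, min_gain: float = 0.05) -> int:
    """Fine-grained step (illustrative): issue queries one at a time and
    stop once the budget is spent or a query's observed gain falls
    below min_gain. Returns the number of queries issued."""
    used = 0
    for gain in gains[:budget]:
        used += 1  # the query is paid for before its gain is observed
        if gain < min_gain:
            break
    return used


budget = allocate_budget({"text": 2.0, "image": 1.0, "video": 0.0},
                         total_budget=12)
queries_used = run_agent([0.5, 0.3, 0.01, 0.4], budget["text"])
```

In an RL formulation, both the score-to-budget mapping and the stopping rule would be learned rather than fixed. The sketch only shows where each decision sits.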
The “Efficiency-Aware” Design
The most pragmatic contribution of this work lies in embedding efficiency directly into the RL reward function.
Many RL papers optimize solely for effectiveness (e.g., accuracy or recall) while ignoring cost. In industrial settings, however, this is unsustainable: Xiaohongshu’s search system processes massive query volumes daily, and if every query triggered unbounded multimodal model inference, infrastructure costs would become prohibitive.
HyperEyes addresses this by defining reward as a weighted sum of effectiveness and cost:
Reward = α × Search Effectiveness − β × Computational Cost
Agents must therefore trade off performance against cost. When a coarse result is already good enough, the highest-reward action is to stop early, so the agent learns when to stop, not just how to search.
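A small worked example shows how a reward of this shape favors early stopping. This is a sketch under assumptions: the function names, the per-step effectiveness values, the fixed cost per step, and the α and β values are all invented for illustration and do not come from the paper.

```python
def efficiency_aware_reward(effectiveness: float, cost: float,
                            alpha: float = 1.0, beta: float = 0.1) -> float:
    """Reward = alpha * effectiveness - beta * cost, as described above."""
    return alpha * effectiveness - beta * cost


def best_stopping_step(effectiveness_by_step: list[float],
                       cost_per_step: float,
                       alpha: float = 1.0,
                       beta: float = 0.1) -> tuple[int, float]:
    """Return (step, reward) for the 1-indexed step at which stopping
    yields the highest reward. Assumes cost grows linearly with steps."""
    rewards = [
        efficiency_aware_reward(eff, cost_per_step * (i + 1), alpha, beta)
        for i, eff in enumerate(effectiveness_by_step)
    ]
    best = max(range(len(rewards)), key=rewards.__getitem__)
    return best + 1, rewards[best]


# Hypothetical search trajectory with diminishing returns per extra step.
trajectory = [0.50, 0.70, 0.75, 0.76]
step, reward = best_stopping_step(trajectory, cost_per_step=1.0)
```

With these made-up numbers, the rewards for stopping after steps 1 to 4 are roughly 0.40, 0.50, 0.45, and 0.36, so stopping after step 2 is optimal. Setting `beta=0` removes the cost penalty, and the best choice becomes running all four steps.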
Challenges in Industrial Deployment
Bridging the gap from research to production involves several hurdles:
First, reward design is nontrivial. How do we quantify “search effectiveness”? Click-through rate? Dwell time? User satisfaction scores? Different metrics may steer agents toward divergent behaviors. Moreover, tuning the relative weights (α and β) remains largely empirical.
Second, coordinating parallel agents poses significant engineering challenges. With multiple agents running concurrently, infrastructure must robustly support inter-agent synchronization, resource contention management, and failure recovery—issues beyond the scope of any single paper.
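As a rough illustration of the coordination problem, the sketch below runs three stand-in agents concurrently with Python's `asyncio`, applies a per-agent timeout, and degrades to an empty result when an agent fails. The agents, delays, and the `search_all` helper are hypothetical. A production system would also need retries, backpressure, and shared resource limits, none of which are shown here.

```python
import asyncio
from typing import Awaitable


async def fake_agent(name: str, delay: float, fail: bool = False) -> list[str]:
    """Stand-in for a modality-specific search agent (hypothetical)."""
    await asyncio.sleep(delay)  # simulates retrieval or model latency
    if fail:
        raise RuntimeError(f"{name} backend unavailable")
    return [f"{name}-result"]


async def search_all(agents: dict[str, Awaitable[list[str]]],
                     timeout: float) -> dict[str, list[str]]:
    """Run all agents concurrently; a slow or failing agent yields an
    empty result instead of failing the whole query."""

    async def guarded(name: str, job: Awaitable[list[str]]):
        try:
            return name, await asyncio.wait_for(job, timeout)
        except (asyncio.TimeoutError, RuntimeError):
            return name, []

    pairs = await asyncio.gather(*(guarded(n, j) for n, j in agents.items()))
    return dict(pairs)


results = asyncio.run(search_all(
    {
        "text": fake_agent("text", 0.0),
        "image": fake_agent("image", 0.0, fail=True),  # simulated failure
        "video": fake_agent("video", 1.0),             # exceeds the timeout
    },
    timeout=0.2,
))
```

Here only the text agent returns results. The image agent's failure and the video agent's timeout are both absorbed into empty lists, so the query as a whole still completes.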
Third, the long-term value of this optimization framework depends on evolving hardware and model efficiencies. If multimodal model inference costs drop tenfold next year, HyperEyes’ carefully engineered efficiency-aware mechanisms may lose much of their practical relevance.
A Signal: What Are Top-Tier AI Teams Publishing?
From a broader perspective, this paper reflects an emerging trend: China’s leading tech companies are shifting focus—from chasing novel models to building intelligent, production-grade systems.
Two years ago, most industry papers centered on new model architectures, training paradigms, or benchmark proposals. Today, an increasing share focuses on systemic questions: How do we compose existing models into high-efficiency pipelines? How do we engineer principled trade-offs between performance and cost?
When read alongside Tencent’s Hunyuan Listwise Policy Optimization and Google’s Agentic Discovery, HyperEyes reveals a shared objective: designing AI systems that use their own capabilities more intelligently, rather than merely making models larger or stronger.
Assessment
HyperEyes is an engineering-driven research paper. Its academic novelty may be modest compared to recent advances like mean-variance split residuals or listwise policy optimization—but its practical utility is potentially higher, precisely because it targets a real-world industrial pain point head-on.
For teams building search systems, recommendation engines, or any application requiring multimodal retrieval, HyperEyes’ dual-grained optimization strategy and efficiency-aware reward formulation offer actionable insights.
Primary Sources:
- Hugging Face Daily Papers - May 11, 2026
- Xiaohongshu AI, "HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents"