AI Research

Papers, benchmarks, datasets, and experimental advances worth tracking

Research May 26, 2026

APWA: A Distributed Architecture for True Parallelization in Multi-Agent Systems

APWA proposes a distributed architecture designed for parallelizable agent workloads, addressing the inference, coordination, and computational scaling bottlenecks that multi-agent systems face as task scale and complexity increase.

#Multi-Agent #Distributed Architecture #Paper Review

Research May 26, 2026

Dual-Dimensional Consistency: A New Method to Save 10x Tokens During Inference-Time Scaling

DDC proposes a unified inference-time scaling framework that reduces token consumption by over 10x while maintaining or surpassing baseline accuracy across 5 benchmarks, utilizing a confidence-weighted Bayesian protocol and trend-aware stratified pruning.

#Inference Optimization #Token Efficiency #LLM Inference

Research May 26, 2026

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory Capabilities

MemEye introduces a visual-centric evaluation framework for multimodal Agent memory, born from the collaboration of 17 researchers, filling a gap in the evaluation of Agent memory systems.

#Multimodal #Agent Memory #Evaluation Framework

Research May 26, 2026

MemLens: NVIDIA Benchmarks Long-Term Memory for Multimodal Large Models

NVIDIA releases MemLens, the first benchmark targeting the multimodal long-term memory capabilities of Large Vision-Language Models (LVLMs), filling a gap in LVLM memory evaluation.

#NVIDIA #Multimodal #Benchmark

Research May 26, 2026

Microsoft Orchard Framework: An Agent Training Paradigm Distilled from 107,000 Trajectories

Microsoft Research has open-sourced Orchard, a scalable Agent modeling framework. Spanning from Code Agents to GUI Agents and personal assistants, it achieves cross-domain training through a unified, lightweight environment layer. It achieves 67.5% on SWE-bench Verified, while its GUI Agent reaches the strongest open-source performance using only 400 distilled trajectories.

#Microsoft #Open-Source Framework #Agent Training

Research May 23, 2026

CiteVQA: OpenDataLab's Document Intelligence Benchmark Makes Every AI Citation Verifiable

OpenDataLab has released the CiteVQA benchmark, designed to measure how well document intelligence systems attribute their answers to evidence. The paper topped Hugging Face Daily Papers with 143 upvotes, a sign that trustworthy AI is evolving from a slogan into a quantifiable technical metric.

#CiteVQA #OpenDataLab #Document Intelligence

Research May 23, 2026

CLI-Anything Surges by 1,000 Stars in a Week: Making All Software "Agent-Native," A New Approach from the HKU Team

The CLI-Anything project, released by HKU's HKUDS team, topped GitHub Trending with over 36,000 stars. Its core philosophy is to make all software "Agent-Native"—not just a simple tool, but a fundamental shift in software architecture thinking.

#CLI-Anything #Agent-Native #HKU

Research May 23, 2026

MMSkills: SJTU Decomposes Visual Agent Capabilities into a "Skill Pack"—A New Paradigm for Multimodal Agents

Shanghai Jiao Tong University has released the MMSkills framework, which decouples the capabilities of multimodal visual agents into composable, reusable skill units. The paper garnered 99 upvotes to top Hugging Face's trending papers, which suggests that the "skill-based" evolution of agents may be closer to the future than the "model-based" approach.

#MMSkills #Multimodal Agent #Shanghai Jiao Tong University

Research May 23, 2026

Decoding the PhysBrain 1.0 Technical Report: AI Finally Begins to "Understand" the Physical World

DeepCybo has released the PhysBrain 1.0 technical report, building an AI system capable of understanding physical laws. From intuitive physics to video generation verification, this technical route may bring us closer to true "intelligence" than pure language models.

#PhysBrain #Physical Reasoning #DeepCybo

Research May 23, 2026

Tencent Hunyuan's New Paper: How Much Efficiency Can On-Policy Distillation Actually Unlock?

The Tencent Hunyuan team has released a new paper systematically studying the efficiency of On-Policy Distillation in unlocking model potential. The paper reveals the critical impact of distillation strategy selection on model performance, providing empirical evidence for large-scale model training.

#On-Policy Distillation #Tencent Hunyuan #Model Distillation

Research May 20, 2026

TideGS: Training Over 1 Billion 3D Gaussians on a Single 24GB GPU, ICML 2026 Spotlight

TideGS leverages an SSD-CPU-GPU hierarchical storage management system to train 3DGS with over 1 billion Gaussian primitives on a single 24GB GPU. This represents a 10x improvement over previous out-of-core baselines (~100 million) and roughly a 100x increase over in-memory training (~11 million). The paper has been accepted as a Spotlight at ICML 2026.

#TideGS #3D Gaussian Splatting #Out-of-Core

Research Featured May 20, 2026

Anti-Self-Distillation: Inverse Self-Distillation Accelerates Reasoning RL Training by 2–10×

Using pointwise mutual information (PMI) analysis, Anti-SD finds that privileged context suppresses a model's deliberative reasoning tokens, and it proposes “anti-self-distillation”: intentionally increasing, rather than decreasing, the divergence between student and teacher. On mathematical reasoning benchmarks, it matches the accuracy of the GRPO baseline in 2–10× fewer training steps, with final accuracy gains of up to +11.5 points.

#Anti-Self-Distillation #reasoning RL #GRPO
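
For readers unfamiliar with the PMI analysis mentioned in the Anti-SD summary above, the quantity itself can be sketched as follows. This is only the textbook definition of pointwise mutual information, not the paper's estimator, and the probabilities in the example are hypothetical values chosen for illustration.

```python
import math

def pmi(p_token_given_context, p_token):
    """Pointwise mutual information between a token and a context.

    PMI = log( p(token | context) / p(token) ).
    A negative value means the context makes the token less likely.
    """
    return math.log(p_token_given_context / p_token)

# Hypothetical numbers: a deliberation token such as "wait" is assumed to have
# probability 0.08 without privileged context and 0.02 with it.
suppressed = pmi(0.02, 0.08)
print(suppressed < 0)  # True: the context suppresses the token
```

A token whose probability is unchanged by the context has a PMI of exactly zero, so the sign alone indicates whether the context promotes or suppresses it.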

Research May 20, 2026

CogOmniControl: Turning “Creative Intent Understanding” into a Reasoning Engine for Video Generation

CogOmniControl proposes a reasoning-driven framework for controllable video generation, decomposing the generation process into two stages: creative intent cognition and video synthesis. CogVLM—trained on professional anime production data—accurately interprets sparse, abstract conditions; combined with CogOmniDiT and RL alignment, it outperforms existing open-source models on two newly established benchmarks.

#CogOmniControl #video generation #controllable generation

Research Featured May 20, 2026

GoLongRL: An Open-Source Long-Context RL Training Framework—30B Model Matches DeepSeek-R1-0528 Performance

GoLongRL introduces a fully open-source reinforcement learning (RL) post-training framework for long-context language modeling, releasing a 23K-sample RLVR dataset and complete training code. The Qwen3-30B-A3B model achieves performance on long-context tasks comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507.

#GoLongRL #long context #reinforcement learning

Research Featured May 20, 2026

OpenComputer: Building a Verifiable Software World for Computer-Use Agents, 33 Apps, 1000 Tasks

OpenComputer introduces a verifier-based framework to build a verifiable software environment for computer-use agents. Covering 33 desktop applications and 1,000 tasks, experiments show that its hard-coded verifiers align more closely with human judgment than LLM-as-judge approaches.

#OpenComputer #Computer-Use Agent #Verifiable Environment

Research May 20, 2026

AI-Powered Fully Automated Research Roadmap: A Paper Can Be Generated for as Low as $15—but “Reliability” Remains a Major Challenge

A joint AI research roadmap titled *AI for Auto-Research: Roadmap & User Guide*, released by institutions including the National University of Singapore, systematically analyzes the capability boundaries of AI across the full research lifecycle: papers can be auto-generated for just $15, yet LLMs still fabricate results, overlook hidden errors, and cannot reliably assess scientific novelty.

#AI research #Auto-Research #academic integrity

Research May 20, 2026

SkillsVote: Adding a "Voting System" to AI Agent Skills for Self-Evolution Without Model Updates

IAAR-Shanghai and Memtensor Research Group propose SkillsVote, a full-lifecycle governance framework for Agent skills. Offline evolution improves GPT-5.2 by 7.9 percentage points on Terminal-Bench 2.0, while online evolution boosts SWE-Bench Pro by 2.6 percentage points.

#Agent #SkillsVote #Skill Evolution

Research May 19, 2026

ByteDance’s Lance: A From-Scratch Trained Unified Multimodal Model for Understanding, Generation, and Editing

ByteDance has released Lance—a natively unified multimodal model trained from scratch—that supports understanding, generation, and editing of images and videos. Leveraging a dual-stream Mixture-of-Experts (MoE) architecture, Lance significantly outperforms existing open-source unified models in generation quality while retaining strong understanding capabilities.

#ByteDance #Lance #multimodal

Research May 19, 2026

Code as Agent Harness: When Code Is No Longer the Output—but the Operating System of Agents

Hugging Face’s #1 Paper of the Day—a survey paper authored by 42 researchers (including prominent academics and industry scientists)—systematically introduces the “Code as Agent Harness” framework, positioning code as a unified infrastructure layer for agent reasoning, action, and environment modeling.

#Agent #code generation #Agent Harness

Research May 19, 2026

NVIDIA LongLive-2.0: NVFP4 Full-Stack Parallel Infrastructure, Accelerating Long Video Generation Training by 2.15x and Inference to 45.7 FPS

The NVIDIA team has released LongLive-2.0, the first full-stack system for long video generation training and inference based on NVFP4 precision. By introducing sequence-parallel autoregressive training and W4A4 inference, it achieves a 2.15x speedup in training and a 1.84x speedup in inference, with the 5B model reaching 45.7 FPS.

#NVIDIA #LongLive-2.0 #Video Generation

Research May 19, 2026

AI Auto-Research Roadmap: It Can Write a Paper, But the Pitfalls of Scientific Integrity Run Deep

The NUS team has released the *AI for Auto-Research* roadmap, systematically analyzing the reliability boundaries of AI across the entire research lifecycle. It clarifies which stages—from idea generation to publication—AI can handle independently and which still require human oversight.

#AI Research #Automated Research #Paper Generation

Research May 19, 2026

Tsinghua KVPO: Bringing GRPO into Video Generation, Using KV Cache for Semantic Exploration to Make AI-Generated Videos Better Align with Human Aesthetics

The Tsinghua team proposes KVPO, an ODE-Native online GRPO framework that aligns autoregressive video generation models with human preferences by shifting the exploration source from random noise to historical KV Cache, achieving improvements in visual quality, motion quality, and text-video consistency.

#Tsinghua University #KVPO #Video Generation

Research May 19, 2026

Tsinghua ZEDA: Skip Half the Experts in Pre-trained MoE Models via Self-Distillation, Boosting Inference Speed by 1.2x

The Tsinghua team introduces ZEDA, a low-cost framework that converts pre-trained static MoE models into dynamic MoE models. It eliminates over 50% of expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash, achieving an end-to-end inference speedup of approximately 1.2x.

#Tsinghua University #ZEDA #MoE

Research May 19, 2026

ByteDance Lance: Unifying Multimodal Understanding, Generation, and Editing via "Multi-Task Synergy" Instead of Parameter Scaling

ByteDance Research has released Lance, a lightweight native unified multimodal model. Through a dual-stream MoE architecture and multi-task synergistic training, it simultaneously achieves understanding, generation, and editing of images and videos without relying on scaling up model capacity.

#ByteDance #Lance #Multimodal

Research May 19, 2026

NVIDIA LongLive-2.0: Breaking the Compute Wall for Long Video Generation with NVFP4 Parallel Infrastructure

NVIDIA has released LongLive-2.0, a long video generation infrastructure built on NVFP4 quantization and parallel inference. With 1.22k GitHub Stars, it explores how to generate longer video sequences without sacrificing quality.

#NVIDIA #LongLive #Video Generation

Research Featured May 18, 2026

Shanghai Jiao Tong University ARIS: Empowering AI to Conduct Autonomous Research Like Scientists, The Ambition of Adversarial Multi-Agent Collaboration

The ARIS system, released by Shanghai Jiao Tong University, enables multiple AI agents to autonomously complete research tasks through adversarial collaboration. It has received 116 upvotes on Papers with Code and 9.7k GitHub stars, making it one of the most notable AI for Science projects recently.

#Multi-Agent Systems #Autonomous Research #Adversarial Collaboration

Research Featured May 18, 2026

Tsinghua Team Causal Forcing++: Turning Video Generation from "Wait Minutes" into "Real-Time Interaction"

The Tsinghua ML Group's Causal Forcing++ paper proposes a scalable few-step autoregressive diffusion distillation method that turns interactive video generation from a multi-minute wait into a real-time response. What does this mean for gaming, VR, and interactive content creation?

#Video Generation #Diffusion Models #Distillation

Research May 18, 2026

Can Models Get Stronger Without Training? Darwin Family Uses Evolutionary Merging to Push LLM Reasoning to GPQA Diamond 86.9%

Darwin Family introduces a training-free evolutionary merging framework that composes the latent capabilities of existing models via gradient-free weight-space recombination. Its flagship model, Darwin-27B-Opus, achieves 86.9% on GPQA Diamond—ranking 6th among 1,252 evaluated models—without any gradient-based training.

#Darwin Family #model merging #evolutionary merging

Research Featured May 18, 2026

FORGE: Enabling Agent Memory to Self-Evolve Without Updating Weights—A Paper with a Bold Approach

The new arXiv paper FORGE proposes a method that enables Agent memory to self-evolve without updating model weights. Through a population broadcast mechanism, Agents can share experiences and learn from each other, achieving continuous memory evolution. This approach bypasses the traditional fine-tuning pipeline, offering a lightweight path for continuous Agent learning.

#Agent Memory #Self-Evolution #Population Broadcast

Research Featured May 18, 2026

Gold-Medal-Level Olympiad Reasoning: Large Models Achieve It via Simple Scaling, Which Is Unsettling

A new paper shows that a simple, unified scaling strategy is enough for large language models to reach gold-medal-level reasoning at the International Mathematical Olympiad. No flashy new architectures, no complex training tricks: just scaling. The implications may be more thought-provoking than the paper itself.

#LLM Reasoning #Mathematical Olympiad #Scaling Laws

Research May 18, 2026

KAIST New Paper: Proactively Steering RL Training "Out of the Comfort Zone" for More Efficient Strategy-Guided Exploration

A paper from the KAIST AI lab proposes a strategy-guided exploration method that proactively pushes reinforcement learning training "out of its comfort zone," improving learning efficiency without increasing the volume of training data. The paper has gained attention on Hugging Face Daily Papers.

#Reinforcement Learning #RLVR #Exploration Strategy

Research May 18, 2026

Let LLMs Do Epidemic Forecasting Themselves: Harvard Team Predicts Multi-Pathogen Diseases with Autonomous Tree Search

A team from Harvard University and Massachusetts General Hospital proposes a multi-pathogen disease forecasting method based on autonomous LLM-guided tree search. LLMs are no longer just conversational tools; they transform into autonomous search agents that explore complex hypothesis spaces to find optimal forecasting models. This work demonstrates a new role for LLMs in scientific modeling.

#AI for Science #Disease Forecasting #Autonomous Search

Research May 18, 2026

Even LLM Tutors Have Weak Spots: Paper Reveals AI Tutoring Agents Falter Precisely Where Feedback Matters Most

A new paper systematically evaluates the feedback quality of LLM tutoring agents across different scenarios, revealing a counterintuitive finding: AI tutors perform well when confirming correct answers, but tend to provide inaccurate or incomplete responses precisely when students make mistakes and need high-quality feedback the most.

#AI Education #Tutoring Agents #LLM

Research May 18, 2026

NVIDIA Releases MemLens: Multimodal Large Models' "Memory" Finally Gets a Standardized Exam

NVIDIA's MemLens benchmark systematically evaluates the multimodal long-term memory capabilities of large vision-language models for the first time. It reveals the true memory level of current multimodal models and how far they are from 'truly remembering'.

#NVIDIA #Multimodal Large Language Models #Long-Term Memory

Research Featured May 18, 2026

MMSkills: SJTU Wants Visual Agents to Truly "See" and "Act", Not Just Memorize

Shanghai Jiao Tong University's MMSkills proposes a multimodal skill learning framework for general-purpose visual agents. Unlike existing approaches that rely on rote memorization, MMSkills enables agents to truly grasp the multimodal nature of skills—knowing not just 'what they see,' but 'how to act.' The paper received 39 upvotes on Hugging Face Daily Papers.

#Multimodal #Visual Agent #Skill Learning

Research May 18, 2026

OpenDeepThink: Using Voting Instead of Judgment to Boost Gemini's Codeforces Elo by 405 Points

OpenDeepThink proposes a population-based test-time reasoning framework grounded in pairwise Bradley-Terry comparisons. Eight rounds of LLM calls (approximately 27 minutes of wall-clock time) boosted Gemini 3.1 Pro's Codeforces Elo by 405 points. It also open-sources the CF-73 dataset—73 Codeforces problems annotated by International Grandmasters.

#OpenDeepThink #Parallel Reasoning #Bradley-Terry
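
The pairwise Bradley-Terry aggregation mentioned in the OpenDeepThink summary above can be illustrated with a minimal sketch. It fits Bradley-Terry strengths from (winner, loser) pairs using the standard minorization-maximization update. The function name, the pseudo-count smoothing, and the example comparisons are assumptions made for illustration; the paper's actual pipeline is not described here.

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, n_items, iters=200, pseudo=0.1):
    """Estimate Bradley-Terry strengths from a list of (winner, loser) pairs.

    Uses the classic MM update p_i = W_i / sum_j n_ij / (p_i + p_j).
    `pseudo` is a small smoothing count added to every item's wins so that
    strengths stay strictly positive (a heuristic, not part of the model).
    """
    wins = [pseudo] * n_items
    pair_counts = defaultdict(int)  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), n in pair_counts.items():
                if i in (a, b):
                    j = b if i == a else a
                    denom += n / (strengths[i] + strengths[j])
            # Items that were never compared keep their current strength.
            updated.append(wins[i] / denom if denom > 0 else strengths[i])
        total = sum(updated)
        # Rescale so the strengths sum to n_items (the scale is arbitrary).
        strengths = [s * n_items / total for s in updated]
    return strengths

# Hypothetical judgments over three candidate solutions (indices 0, 1, 2).
judgments = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2), (2, 1)]
scores = fit_bradley_terry(judgments, n_items=3)
best = max(range(3), key=lambda i: scores[i])
print(best)
```

In a population-based setting, the candidate with the highest fitted strength would be the one selected; candidate 0 wins every comparison it takes part in here, so it ranks first.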

Research May 18, 2026

SANA-WM: A 2.6B-Parameter, Minute-Scale World Model from NVIDIA—Trainable on 64 H100s in 15 Days, Deployable on a Single GPU

SANA-WM is an open-source world model with 2.6B parameters that natively supports one-minute video generation. It was trained for 15 days on 64 H100 GPUs using ~213K publicly available video clips, and its distilled variant can denoise a 60-second, 720p video in 34 seconds on a single RTX 5090 GPU using NVFP4 quantization.

#SANA-WM #world model #video generation

Research May 18, 2026

SDAR: Solving GRPO’s Stability Issues by Integrating Self-Distillation with Agent Reinforcement Learning

SDAR (Self-Distilled Agentic Reinforcement Learning) introduces on-policy self-distillation as a gated auxiliary objective into RL training for LLM agents, achieving +9.4%, +10.2%, and +7.0% improvements over GRPO on ALFWorld, WebShop, and Search-QA respectively—while avoiding the instability inherent in naive GRPO+OPSD combinations.

#SDAR #self-distillation #agent reinforcement learning

Research Featured May 18, 2026

Self-Distilled Agentic RL: AI Agents No Longer Need Human-Fed Data, Teaching Themselves to Evolve

Self-Distilled Agentic Reinforcement Learning proposes a new paradigm for Agent training: enabling the Agent to learn from its own experiences through self-distillation, rather than relying on human annotations or external reward signals. This could fundamentally change how we train AI Agents.

#Reinforcement Learning #Agentic AI #Self-Distillation

Research May 18, 2026

Solvita: Nanjing University Enhances LLM Competitive Programming Capabilities via "Agent Evolution"

Solvita, developed by NJU-LINK Lab at Nanjing University, proposes an Agent Evolution paradigm to enhance the competitive programming capabilities of large language models. Unlike traditional supervised fine-tuning, Solvita enables Agents to evolve stronger programming reasoning abilities through self-play and continuous iteration.

#Competitive Programming #Agent Evolution #LLM

Research May 18, 2026

SU-01: A 30B Model Achieving Gold-Medal Performance on the IMO and IPhO—What’s the Secret Recipe?

SU-01 is a 30B-A3B Mixture-of-Experts (MoE) model that achieves gold-medal-level performance on the IMO 2025, USAMO 2026, and IPhO 2024/2025 using a simple, unified training recipe. Core pipeline: reverse-perplexity SFT curriculum → two-stage RL (verifiable-reward RL → proof-level RL) → test-time scaling. Supports stable reasoning trajectories exceeding 100K tokens.

#SU-01 #olympiad reasoning #IMO

Research Featured May 15, 2026

Kronos: Predicting the Stock Market with Transformers, A Financial Foundation Model Experiment Behind 24,900 Stars

Kronos is a foundation model for financial markets that treats financial data as a "language" for modeling. The project has garnered 24,946 stars on GitHub, proposing an approach that uses a tokenizer to discretize financial time-series data into token sequences, which are then predicted using a Transformer. Is this path viable?

#Finance #Foundation Model #Time Series
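
The general idea in the Kronos summary above, discretizing a financial time series into tokens that a Transformer can model, can be sketched as below. This is a deliberately simple stand-in that bins clipped log returns into a fixed vocabulary; it is not Kronos's tokenizer, and the function name, bin count, and clipping range are all assumptions made for illustration.

```python
import math

def tokenize_returns(prices, n_bins=8, clip=0.05):
    """Map a price series to integer tokens by uniformly binning log returns.

    Each log return is clipped to [-clip, clip] and then assigned to one of
    `n_bins` equal-width bins, yielding token ids in the range 0..n_bins-1.
    """
    tokens = []
    for prev, cur in zip(prices, prices[1:]):
        r = math.log(cur / prev)
        r = max(-clip, min(clip, r))  # clip extreme moves
        idx = int((r + clip) / (2 * clip) * n_bins)
        tokens.append(min(idx, n_bins - 1))  # the top edge falls in the last bin
    return tokens

# A flat step, a large up move, and a large down move.
print(tokenize_returns([100, 100, 110, 90]))  # [4, 7, 0]
```

The resulting integer sequence can be fed to a sequence model the same way word-piece ids are, which is the sense in which financial data is treated as a "language".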

Research May 15, 2026

Blind Spots in Mental Health AI Safety Evaluation: Why Single-Turn Scoring Fails to Detect Gradual Harm

A recent arXiv paper highlights a fundamental flaw in current mental health AI safety evaluations—they assess isolated responses or final outcomes, while the most clinically dangerous harms often stem from cumulative effects across interaction sequences: escalating dependency, repeatedly reinforced negative patterns, and gradual deterioration across turns. The paper proposes a theoretical framework of "Temporal Safety Non-Identifiability" and the SCOPE-MH evaluation standard.

#AI Safety #Mental Health #Temporal Evaluation

Research May 15, 2026

NVIDIA AnyFlow: A "Step-Agnostic" Experiment in Video Diffusion Models—Can On-Policy Distillation End Inference Step Anxiety?

NVIDIA's newly released AnyFlow paper introduces an "arbitrary-step" video diffusion model—the same model can seamlessly switch between 1 and dozens of steps without requiring separate training for each step count. Its core method, On-Policy Flow Map Distillation, trains the model by randomly sampling step counts and using self-guided distillation, enabling it to maintain stable generation quality across any inference step count.

#Video Generation #Diffusion Models #NVIDIA

Research Featured May 15, 2026

OpenDeepThink: Don't Force a Single Reasoning Chain, Use a "Tournament" to Let LLMs Compete for the Right Answer

The latest arXiv paper OpenDeepThink proposes a population-competition reasoning framework—instead of forcing the model down a single reasoning chain, it pits multiple candidate solutions head-to-head, aggregating judgments via the Bradley-Terry model. Leveraging this, Gemini 3.1 Pro achieved a massive +405 Elo surge on Codeforces, with the entire process taking only about 27 minutes.

#Reasoning Capability #Test-time Compute #LLM