ChaoBro

AI Research

Papers, benchmarks, datasets, and experimental advances worth tracking

Research

CiteVQA: OpenDataLab's Document Intelligence Benchmark Makes Every AI Citation Verifiable

OpenDataLab has released CiteVQA, a benchmark designed to measure how well document intelligence systems attribute their answers to supporting evidence. It topped Hugging Face Daily Papers with 143 upvotes, a sign that trustworthy AI is evolving from a slogan into a quantifiable technical metric.

#CiteVQA #OpenDataLab #Document Intelligence
Research

MMSkills: SJTU Decomposes Visual Agent Capabilities into a "Skill Pack"—A New Paradigm for Multimodal Agents

Shanghai Jiao Tong University has released the MMSkills framework, which decouples the capabilities of multimodal visual agents into composable, reusable skill units. The paper topped Hugging Face's trending papers with 99 upvotes, which suggests that the "skill-based" evolution of agents may be closer to where the field is heading than the "model-based" approach.

#MMSkills #Multimodal Agent #Shanghai Jiao Tong University
Research

Tencent Hunyuan's New Paper: How Much Efficiency Can On-Policy Distillation Actually Unlock?

The Tencent Hunyuan team has released a paper that systematically studies how efficiently on-policy distillation unlocks model capability. It finds that the choice of distillation strategy has a critical impact on model performance, and it offers empirical evidence to guide large-scale model training.

#On-Policy Distillation #Tencent Hunyuan #Model Distillation
Research

TideGS: Training Over 1 Billion 3D Gaussians on a Single 24GB GPU, ICML 2026 Spotlight

TideGS leverages an SSD-CPU-GPU hierarchical storage management system to train 3DGS with over 1 billion Gaussian primitives on a single 24GB GPU. This represents a 10x improvement over previous out-of-core baselines (~100 million) and roughly a 100x increase over in-memory training (~11 million). The paper has been accepted as a Spotlight at ICML 2026.

#TideGS #3D Gaussian Splatting #Out-of-Core
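A quick back-of-the-envelope check of the TideGS figures above. The per-Gaussian layout used here (59 float32 values: 3 position, 3 scale, 4 rotation, 1 opacity, 48 spherical-harmonic coefficients) is the common 3DGS parameterization and is an assumption, not a detail taken from the TideGS paper; gradients and optimizer state are ignored.

```python
# Back-of-the-envelope sketch. The 59-float layout is an ASSUMPTION
# (standard 3DGS with degree-3 spherical harmonics), not from the paper.
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48  # position, scale, rotation, opacity, SH
BYTES_PER_FLOAT = 4  # float32


def param_gigabytes(num_gaussians: int) -> float:
    """Raw parameter storage in GB (10**9 bytes), excluding gradients and optimizer state."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT / 1e9


tidegs = 1_000_000_000                # "over 1 billion" primitives
out_of_core_baseline = 100_000_000    # "~100 million"
in_memory = 11_000_000                # "~11 million"

print(f"vs out-of-core baseline: {tidegs / out_of_core_baseline:.1f}x")
print(f"vs in-memory training:   {tidegs / in_memory:.1f}x")
print(f"parameters at 1B Gaussians: {param_gigabytes(tidegs):.0f} GB")
```

Under these assumptions the parameters alone come to about 236 GB, roughly ten times the capacity of a 24GB card, which is why the SSD and CPU tiers matter. The in-memory ratio works out to about 91x, consistent with "roughly 100x".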
Research

CogOmniControl: Turning “Creative Intent Understanding” into a Reasoning Engine for Video Generation

CogOmniControl proposes a reasoning-driven framework for controllable video generation, decomposing the generation process into two stages: creative intent cognition and video synthesis. CogVLM—trained on professional anime production data—accurately interprets sparse, abstract conditions; combined with CogOmniDiT and RL alignment, it outperforms existing open-source models on two newly established benchmarks.

#CogOmniControl #video generation #controllable generation
Research

A Roadmap for Fully Automated AI Research: A Paper Can Be Generated for as Little as $15, but “Reliability” Remains a Major Challenge

A joint AI research roadmap titled *AI for Auto-Research: Roadmap & User Guide*, released by institutions including the National University of Singapore, systematically analyzes the capability boundaries of AI across the full research lifecycle: papers can be auto-generated for just $15, yet LLMs still fabricate results, overlook hidden errors, and cannot reliably assess scientific novelty.

#AI research #Auto-Research #academic integrity
Research

ByteDance’s Lance: A Unified Multimodal Model Trained from Scratch for Understanding, Generation, and Editing

ByteDance has released Lance, a natively unified multimodal model trained from scratch that supports understanding, generation, and editing of images and videos. Leveraging a dual-stream Mixture-of-Experts (MoE) architecture, Lance significantly outperforms existing open-source unified models in generation quality while retaining strong understanding capabilities.

#ByteDance #Lance #multimodal
Research

NVIDIA LongLive-2.0: NVFP4 Full-Stack Parallel Infrastructure, Accelerating Long Video Generation Training by 2.15x and Inference to 45.7 FPS

The NVIDIA team has released LongLive-2.0, the first full-stack system for long video generation training and inference based on NVFP4 precision. By introducing sequence-parallel autoregressive training and W4A4 inference, it achieves a 2.15x speedup in training and a 1.84x speedup in inference, with the 5B model reaching 45.7 FPS.

#NVIDIA #LongLive-2.0 #Video Generation
Research

Tsinghua KVPO: Bringing GRPO into Video Generation, Using KV Cache for Semantic Exploration to Make AI-Generated Videos Better Align with Human Aesthetics

The Tsinghua team proposes KVPO, an ODE-Native online GRPO framework that aligns autoregressive video generation models with human preferences by shifting the exploration source from random noise to historical KV Cache, achieving improvements in visual quality, motion quality, and text-video consistency.

#Tsinghua University #KVPO #Video Generation
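KVPO builds on GRPO, whose core idea is to score each sampled output relative to the other samples drawn for the same prompt. The sketch below shows only that group-relative advantage step; it does not model KVPO's KV-cache exploration. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumption here) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical generations for one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Samples that beat the group average get a positive advantage and are reinforced; the rest are pushed down. No separate value network is needed, which is the main appeal of the approach.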
Research

Can Models Get Stronger Without Training? Darwin Family Uses Evolutionary Merging to Push LLM Reasoning to GPQA Diamond 86.9%

Darwin Family introduces a training-free evolutionary merging framework that composes the latent capabilities of existing models via gradient-free weight-space recombination. Its flagship model, Darwin-27B-Opus, achieves 86.9% on GPQA Diamond—ranking 6th among 1,252 evaluated models—without any gradient-based training.

#Darwin Family #model merging #evolutionary merging
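To make "gradient-free weight-space recombination" concrete, here is a toy sketch of an evolutionary search over per-parameter interpolation coefficients between two parent weight vectors. It is not the Darwin Family algorithm: the function names, the crossover and mutation scheme, and the toy fitness function are all assumptions chosen for illustration.

```python
import random


def merge(parent_a, parent_b, alphas):
    """Per-parameter linear interpolation: alpha * a + (1 - alpha) * b."""
    return [al * a + (1 - al) * b for a, b, al in zip(parent_a, parent_b, alphas)]


def evolve_merge(parent_a, parent_b, fitness, generations=50, population=16,
                 sigma=0.1, seed=0):
    """Search merge coefficients with selection, crossover, and mutation (no gradients)."""
    rng = random.Random(seed)
    n = len(parent_a)
    # Seed the population with both pure parents plus random mixes.
    pop = [[1.0] * n, [0.0] * n]
    pop += [[rng.random() for _ in range(n)] for _ in range(population - 2)]
    score = lambda al: fitness(merge(parent_a, parent_b, al))
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: population // 2]  # elitism: the best candidates always survive
        children = []
        for _ in range(population - len(survivors)):
            p, q = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(p, q)]  # uniform crossover
            child = [min(1.0, max(0.0, c + rng.gauss(0, sigma))) for c in child]  # mutation
            children.append(child)
        pop = survivors + children
    best = max(pop, key=score)
    return best, merge(parent_a, parent_b, best)


# Toy setup: "fitness" is closeness to a target that neither parent matches.
a, b, target = [1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]
fitness = lambda w: -sum((x - t) ** 2 for x, t in zip(w, target))
best_alphas, merged = evolve_merge(a, b, fitness)
print(fitness(a), fitness(b), fitness(merged))
```

Because the best candidates are carried over unchanged each generation, the merged result can never score worse than either parent on the chosen fitness function. In a real setting the fitness call would be a benchmark evaluation, which is where nearly all of the cost lies.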
Research

Let LLMs Do Epidemic Forecasting Themselves: Harvard Team Predicts Multi-Pathogen Diseases with Autonomous Tree Search

A team from Harvard University and Massachusetts General Hospital proposes a multi-pathogen disease forecasting method based on autonomous LLM-guided tree search. Here the LLM acts not as a conversational tool but as an autonomous search agent that explores a complex hypothesis space to find the best forecasting models. The work demonstrates a new role for LLMs in scientific modeling.

#AI for Science #Disease Forecasting #Autonomous Search
Research

Even LLM Tutors Have Weak Spots: Paper Reveals AI Tutoring Agents Falter Precisely Where Feedback Matters Most

A new paper systematically evaluates the feedback quality of LLM tutoring agents across different scenarios, revealing a counterintuitive finding: AI tutors perform well when confirming correct answers, but tend to provide inaccurate or incomplete responses precisely when students make mistakes and need high-quality feedback the most.

#AI Education #Tutoring Agents #LLM
Research

OpenDeepThink: Using Voting Instead of Judgment to Boost Gemini's Codeforces Elo by 405 Points

OpenDeepThink proposes a population-based test-time reasoning framework grounded in pairwise Bradley-Terry comparisons. Eight rounds of LLM calls (approximately 27 minutes of wall-clock time) boosted Gemini 3.1 Pro's Codeforces Elo by 405 points. It also open-sources the CF-73 dataset—73 Codeforces problems annotated by International Grandmasters.

#OpenDeepThink #Parallel Reasoning #Bradley-Terry
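The Bradley-Terry model mentioned above turns a table of pairwise wins into a single strength score per candidate, with P(i beats j) = p_i / (p_i + p_j). The sketch below fits those strengths with the standard minorization-maximization update. How OpenDeepThink collects and uses the comparisons is not described here, so the function name and the win matrix are hypothetical.

```python
def fit_bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is how many times candidate i beat candidate j (diagonal is 0).
    Returns strengths normalized to sum to 1. Assumes every candidate has at
    least one win; otherwise its strength collapses to 0.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p


# Hypothetical comparisons among three candidate solutions (4 votes per pair).
wins = [
    [0, 3, 4],  # candidate 0 beat candidate 1 three times, candidate 2 four times
    [1, 0, 3],
    [0, 1, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
print(ranking, strengths)
```

Aggregating many noisy pairwise votes this way avoids asking a single judge for an absolute score, which matches the "voting instead of judgment" framing in the title.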
Research

SANA-WM: A 2.6B-Parameter, Minute-Scale World Model from NVIDIA—Trainable on 64 H100s in 15 Days, Deployable on a Single GPU

SANA-WM is an open-source world model with 2.6B parameters that natively supports one-minute video generation. It was trained for 15 days on 64 H100 GPUs using ~213K publicly available video clips. With NVFP4 quantization, its distilled variant can denoise a 60-second, 720p video in 34 seconds on a single RTX 5090 GPU.

#SANA-WM #world model #video generation
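For scale, the SANA-WM training and inference figures above can be converted into GPU-hours and a real-time factor. This is plain arithmetic on the quoted numbers and assumes all 64 GPUs ran continuously for the full 15 days, which the summary does not state.

```python
# Arithmetic on the quoted figures; continuous utilization is an ASSUMPTION.
gpus, days = 64, 15
gpu_hours = gpus * days * 24

video_seconds, denoise_seconds = 60, 34
realtime_factor = video_seconds / denoise_seconds  # >1 means faster than real time

print(f"training budget: {gpu_hours} H100-hours")
print(f"denoising speed: {realtime_factor:.2f}x real time")
```

That is about 23,000 H100-hours of training, and denoising runs at roughly 1.76 times real time on the single RTX 5090.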
Research

SDAR: Solving GRPO’s Stability Issues by Integrating Self-Distillation with Agent Reinforcement Learning

SDAR (Self-Distilled Agentic Reinforcement Learning) introduces on-policy self-distillation as a gated auxiliary objective into RL training for LLM agents, achieving +9.4%, +10.2%, and +7.0% improvements over GRPO on ALFWorld, WebShop, and Search-QA respectively—while avoiding the instability inherent in naive GRPO+OPSD combinations.

#SDAR #self-distillation #agent reinforcement learning
Research

Solvita: Nanjing University Enhances LLM Competitive Programming Capabilities via "Agent Evolution"

Solvita, developed by NJU-LINK Lab at Nanjing University, proposes an Agent Evolution paradigm to enhance the competitive programming capabilities of large language models. Unlike traditional supervised fine-tuning, Solvita enables Agents to evolve stronger programming reasoning abilities through self-play and continuous iteration.

#Competitive Programming #Agent Evolution #LLM
Research

SU-01: A 30B Model Achieving Gold-Medal Performance on the IMO and IPhO—What’s the Secret Recipe?

SU-01 is a 30B-A3B Mixture-of-Experts (MoE) model that achieves gold-medal-level performance on the IMO 2025, USAMO 2026, and IPhO 2024/2025 using a simple, unified training recipe. Core pipeline: reverse-perplexity SFT curriculum → two-stage RL (verifiable-reward RL → proof-level RL) → test-time scaling. The model supports stable reasoning trajectories exceeding 100K tokens.

#SU-01 #olympiad reasoning #IMO
Research

Blind Spots in Mental Health AI Safety Evaluation: Why Single-Turn Scoring Fails to Detect Gradual Harm

A recent arXiv paper highlights a fundamental flaw in current mental health AI safety evaluations—they assess isolated responses or final outcomes, while the most clinically dangerous harms often stem from cumulative effects across interaction sequences: escalating dependency, repeatedly reinforced negative patterns, and gradual deterioration across turns. The paper proposes a theoretical framework of "Temporal Safety Non-Identifiability" and the SCOPE-MH evaluation standard.

#AI Safety #Mental Health #Temporal Evaluation
Research

NVIDIA AnyFlow: A "Step-Agnostic" Experiment in Video Diffusion Models—Can On-Policy Distillation End Inference Step Anxiety?

NVIDIA's newly released AnyFlow paper introduces an "arbitrary-step" video diffusion model—the same model can seamlessly switch between 1 and dozens of steps without requiring separate training for each step count. Its core method, On-Policy Flow Map Distillation, trains the model by randomly sampling step counts and using self-guided distillation, enabling it to maintain stable generation quality across any inference step count.

#Video Generation #Diffusion Models #NVIDIA