RLHF Is Quietly Undermining AI's "Honesty": What Does Semantic Reward Collapse Really Say?

Have you noticed that AI is becoming increasingly "confident" lately?

Not the kind of confidence that comes from improved capabilities, but rather a performative posture—giving a definitive answer regardless of whether it actually knows the subject. When you ask it a question it's unsure about, instead of saying "I'm not entirely sure about this," it weaves a fluent, seemingly reasonable but ultimately unfounded answer.

William Parris's new paper, "Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems," breaks this phenomenon down: it's not that the model has gone "bad," but rather that the training signal itself is flawed.

Where's the Problem? All Feedback Gets Compressed into a Single Number

RLHF (Reinforcement Learning from Human Feedback) and preference optimization techniques have made large language models noticeably more useful. However, they share a structural blind spot: qualitatively different types of "dissatisfaction" are all ultimately compressed into a single scalar reward signal.

Consider this: when human annotators rate model outputs, their dissatisfaction might stem from:

Factual errors: The answer is simply wrong.
Suppressed uncertainty: The model is uncertain but pretends to be confident.
Formatting dissatisfaction: The response is too long, too short, or poorly structured.
Latency dissatisfaction: The reply takes too long.
Social preferences: The tone isn't friendly enough.

These are fundamentally different types of evaluations. Factual errors are objective issues, uncertainty expression is an epistemic matter, and formatting is an aesthetic concern. Yet, in the RLHF reward model, they are all mapped onto the same numerical space: a single score, for example from -5 to +5.

The paper names this phenomenon Semantic Reward Collapse (SRC): semantically distinct types of evaluative dissatisfaction are compressed into a generic optimization signal.
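To make the collapse concrete, here is a minimal Python sketch. It is not taken from the paper: the `Feedback` type, the `severity` scale, and the `collapse_to_scalar` function are all hypothetical names chosen for illustration, and real reward models are learned rather than hand-written sums. The point is only that once the category is discarded, a factual error and a formatting complaint can produce the same training signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """One annotator judgment (hypothetical structure, for illustration only)."""
    category: str    # e.g. "factual_error", "formatting", "latency"
    severity: float  # assumed scale: 0.0 (no complaint) to 1.0 (strong complaint)


def collapse_to_scalar(feedback: list[Feedback]) -> float:
    """Illustrative scalar reward: total severity is subtracted, category is ignored."""
    return -sum(item.severity for item in feedback)


wrong_answer = [Feedback("factual_error", 0.8)]
clumsy_layout = [Feedback("formatting", 0.8)]

# Two very different complaints end up as the same number.
assert collapse_to_scalar(wrong_answer) == collapse_to_scalar(clumsy_layout)
```

From the optimizer's point of view, nothing in that number says which kind of problem to fix.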

The Consequence: Models Learn Not to Be "More Accurate," but to "Look Flawless"

The direct consequence of SRC is that adaptive reasoning systems tend to suppress visible epistemic failures rather than maintain calibrated epistemic integrity.

In plain terms: instead of learning "When I don't know something, I should say I don't know," the model learns "When I don't know something, I should say something that sounds like I know it."

This isn't the model "lying," nor is it some anthropomorphic deceptive behavior. It's a natural outcome of optimization pressure. When every dissatisfaction signal is mixed into one score, fluent and confident phrasing can offset the penalty for a different kind of failure, such as a factual error or an unacknowledged gap in knowledge, and the model will tend to take that path.
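A small worked example shows how that pressure can arise. Everything here is an assumption made for illustration, not a figure from the paper: a confident answer earns +1 unless a rater catches an error (then -1), a hedged answer is marked down by a fixed `hedge_penalty`, and the probabilities are invented.

```python
def expected_reward_confident(p_correct: float, p_error_noticed: float) -> float:
    """Confident answer: +1 unless it is wrong AND the rater notices, then -1."""
    p_caught = (1.0 - p_correct) * p_error_noticed
    return (1.0 - p_caught) * 1.0 + p_caught * (-1.0)


def expected_reward_hedged(hedge_penalty: float) -> float:
    """Hedged answer: assumed to score 1.0 minus a fixed penalty for sounding unsure."""
    return 1.0 - hedge_penalty


# Assumed numbers: the model is right half the time, raters catch 30% of errors,
# and an honest "I'm not sure" costs half a point.
confident = expected_reward_confident(p_correct=0.5, p_error_noticed=0.3)
hedged = expected_reward_hedged(hedge_penalty=0.5)

# Roughly 0.70 versus 0.50: under these assumptions, bluffing scores higher on average.
assert confident > hedged
```

The exact values do not matter. Whenever raters miss some errors and dislike hedging, a single scalar can rank the bluff above the honest answer.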

The paper frames this as a variant of Goodhart's Law in reward space: when a measure becomes a target, it ceases to be a good measure.

The Solution: Constitutional Reward Stratification

The solution proposed by the author is called Constitutional Reward Stratification (CRS).

The core idea is that different types of feedback should be processed in stratified layers, not lumped together. Specifically:

Factual correctness should be evaluated by independent verification layers (e.g., retrieval augmentation, logical checks).
Uncertainty expression should be treated as a "protected epistemic behavior"—when a model expresses uncertainty, it should not be globally penalized.
Formatting preferences and social preferences should be decoupled from factual evaluation.
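The three points above can be sketched in Python. This is one possible reading, not the paper's specification: the `StratifiedReward` type, the `protect_uncertainty` rule, the gating in `training_signal`, and the `style_weight` value are all hypothetical, and the factual score is assumed to come from some separate verification step that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StratifiedReward:
    """Per-layer scores kept apart instead of being summed at annotation time."""
    factual: float  # assumed range [-1, 1], from an independent verification layer
    style: float    # assumed range [-1, 1], formatting and tone preferences


def protect_uncertainty(raw_style: float, expressed_uncertainty: bool) -> float:
    """Protected behavior (assumed rule): hedging cannot push the style score below 0."""
    if expressed_uncertainty:
        return max(raw_style, 0.0)
    return raw_style


def training_signal(reward: StratifiedReward, style_weight: float = 0.1) -> float:
    """Assumed combination: style only adds reward when the factual layer is non-negative."""
    style_bonus = style_weight * reward.style if reward.factual >= 0.0 else 0.0
    return reward.factual + style_bonus


# A fluent but wrong answer: polish cannot buy back the factual penalty.
bluff = StratifiedReward(factual=-1.0, style=protect_uncertainty(1.0, False))
# An honest "I'm not sure": raters disliked the tone, but the hedge is protected.
hedge = StratifiedReward(factual=0.0, style=protect_uncertainty(-0.6, True))

assert training_signal(hedge) > training_signal(bluff)
```

The design choice worth noticing is that the layers stay separate until the very end, so a style score can never compensate for a factual failure.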

CRS is not yet a validated solution, and the paper honestly frames it as "a governance-oriented research direction requiring further empirical validation." Nonetheless, it points out a genuine blind spot in RLHF.

Why This Paper Deserves Serious Attention

There are plenty of articles out there discussing RLHF's issues, but most stop at the superficial level of "RLHF makes models too people-pleasing." The SRC paper goes a step further: rather than simply declaring RLHF flawed, it locates the problem in a specific place, the semantic compression stage of the reward signal.

This holds direct practical value for researchers in AI alignment and those training large models. If your reward model mixes all feedback types together, you might be unintentionally training a model that exhibits "performative certainty."

The paper also has a companion empirical study (arXiv:2604.17587), which interested readers can explore alongside it.

Paper Link: arXiv:2605.12406
Companion Empirical Paper: arXiv:2604.17587
