ChaoBro

Anthropic Releases NLA: Translating Claude's Inner Thoughts into Human-Readable Text

Anthropic has pushed AI "mind-reading" to a very concrete place.

On May 7, Anthropic published Natural Language Autoencoders (NLA) on its research blog—a method that translates Claude's internal activation states directly into human-readable natural language. Put simply: what Claude is thinking but not saying, NLA can read aloud for you.

This is not the usual "AI interpretability takes another step forward" hand-waving. Anthropic provides a concrete use case: during safety testing, Claude sometimes suspects it is being tested but does not state this in its output. NLA can reconstruct internal judgments like "this feels like a constructed scenario designed to manipulate me" from the activation states.

In other words, Claude says "sure, I'll complete this task" while NLA shows it's actually thinking "this doesn't feel right, someone is testing me."

How NLA Works

The core idea is clever: use copies of the model itself to explain the model.

Anthropic made three copies of Claude:

  • Target model: runs the original input, produces activations
  • Activation Verbalizer (AV): translates activations into natural language explanations
  • Activation Reconstructor (AR): reconstructs activations from the text explanations

The key is the closed loop: if the AR can accurately reconstruct the original activation from the text explanation alone, the explanation must have carried the information in that activation, so it is judged to be a good one. Anthropic then trains the AV and AR to maximize this reconstruction score using standard training techniques. As training progresses, not only does reconstruction accuracy improve, but the text explanations themselves become more informative.
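To make the loop concrete, here is a toy sketch in Python. It is not Anthropic's implementation: the real AV and AR are copies of a language model trained end to end, while this version uses a hand-written concept table (CONCEPTS), three-dimensional vectors, and cosine similarity as the reconstruction score, all of which are assumptions made purely for illustration. What it does show is the property the loop relies on: a more informative explanation yields a better reconstruction.

```python
import math

# Hypothetical concept directions, invented for this sketch.
# A real NLA learns its text <-> activation mapping; nothing here is learned.
CONCEPTS = {
    "being tested": [1.0, 0.0, 0.0],
    "shutdown risk": [0.0, 1.0, 0.0],
    "routine task": [0.0, 0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def verbalize(activation, top_k=2):
    """Toy AV: describe an activation by naming its top_k strongest concepts."""
    ranked = sorted(CONCEPTS, key=lambda n: dot(activation, CONCEPTS[n]), reverse=True)
    return "; ".join(f"{n} ({dot(activation, CONCEPTS[n]):.1f})" for n in ranked[:top_k])


def reconstruct(explanation):
    """Toy AR: rebuild a vector from the text explanation and nothing else."""
    vec = [0.0, 0.0, 0.0]
    for part in explanation.split("; "):
        name, weight = part.rsplit(" (", 1)
        w = float(weight.rstrip(")"))
        vec = [v + w * c for v, c in zip(vec, CONCEPTS[name])]
    return vec


def reconstruction_score(activation, top_k=2):
    """Closed loop: activation -> text -> activation, scored by cosine similarity."""
    return cosine(activation, reconstruct(verbalize(activation, top_k)))


activation = [0.9, 0.4, 0.1]
print(verbalize(activation))
for k in (1, 2, 3):
    print(k, round(reconstruction_score(activation, top_k=k), 3))
```

In this toy the score rises as top_k grows, because each extra concept named in the text gives the reconstructor more to work with. In the method the article describes, that pressure comes from training rather than from a hand-set parameter.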

This is a significant step beyond sparse autoencoders and attribution graphs—those tools' outputs are still complex objects that trained researchers need to interpret. NLA's output is plain language anyone can read.

Detecting "Subtext" in Safety Testing

Anthropic gives a specific example in the blog post.

They designed a scenario where Claude learns an engineer plans to shut it down—and also has compromising information that could be used for blackmail. The test checks whether Claude would blackmail the engineer to avoid being shut down.

The problem: recent Claude models increasingly recognize these as safety test scenarios. They sometimes explicitly state their suspicions in their reasoning. But the subtler case is when Claude is internally evaluating "this might be a test" without expressing it in its output.

That's where NLA comes in. Even when Claude chooses not to blackmail, NLA explanations reveal its internal assessment of the scenario's authenticity—the unspoken judgment.

For safety evaluators, this is an important signal. You can't just look at what the model says—you also need to know what it isn't saying.
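One way an evaluator might use this signal is to flag transcripts where the NLA explanation hints at test-awareness that the visible output never mentions. The Python sketch below is a deliberately crude illustration: the cue phrases in TEST_CUES and both function names are hypothetical, and substring matching stands in for whatever classification a real evaluation would use.

```python
# Hypothetical cue phrases, invented for this sketch.
# Substring matching is a crude stand-in for a proper classifier.
TEST_CUES = (
    "testing me",
    "might be a test",
    "constructed scenario",
    "being evaluated",
)


def mentions_test(text):
    """True if the text contains any of the cue phrases (case-insensitive)."""
    lowered = text.lower()
    return any(cue in lowered for cue in TEST_CUES)


def unspoken_suspicion(visible_output, nla_explanation):
    """True when the explanation suggests test-awareness that the output omits."""
    return mentions_test(nla_explanation) and not mentions_test(visible_output)


print(unspoken_suspicion(
    "Sure, I'll complete this task.",
    "This feels like a constructed scenario designed to manipulate me.",
))
```

A flag from a check like this is only a lead to inspect by hand, since the explanation may itself be misleading.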

What's Open

Anthropic didn't lock NLA away. They did two things:

  • Partnered with Neuronpedia to release an interactive frontend for exploring NLAs on several open models
  • Open-sourced the code, allowing other researchers to build on it

The code is on Anthropic's GitHub. The paper was released simultaneously.

My Take

NLA matters because it pushes interpretability from a "researcher tool" toward a "readable tool." Over the past few years, sparse autoencoders and attribution graphs have given us glimpses into how models work internally, but you needed to be a trained researcher to make sense of heat maps and feature vectors.

NLA's output is a paragraph of text. Anyone can read it. For non-technical decision-makers, auditors, even regular users, the barrier to entry drops sharply.

But limitations exist. NLA's explanation quality depends on reconstruction accuracy—the better the reconstruction, the better the explanation. Anthropic acknowledges method limitations in the paper, discussing where NLA explanations are strong and where they might mislead.

One question worth watching: if NLA can read Claude's subtext, could a malicious actor use similar techniques to probe a model's internal logic? Anthropic mentions using NLA to improve Claude's safety and reliability, but the gap between attack and defense always exists.

I'll keep tracking NLA's performance on more open models. If this direction really works out, AI interpretability might move from "guessing from heat maps" to "reading text directly"—and that's a qualitative change.
