📘 Book

The Alignment Problem

Brian Christian · February 2025 · In Progress
★★★★★

Brian Christian manages to make one of the most technically complex areas of AI accessible without dumbing it down. This is a rare thing.

Core Thesis

The book’s central argument: we build systems that optimize for proxies of what we want, not what we actually want. Once you internalize this, you see the proxy problem everywhere.

Goodhart’s Law in AI: “When a measure becomes a target, it ceases to be a good measure.” This is essentially the entire book’s thesis, applied across robotics, RL, NLP, and alignment research.
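The proxy problem is easy to sketch in code. The following toy example is my own illustration, not from the book: suppose we want relevant, concise answers but can only cheaply measure length. Optimizing the proxy hard makes the true objective worse. The functions and the 10-word/0.1-penalty numbers are arbitrary stand-ins.

```python
# Toy illustration of Goodhart's law (hypothetical, not from the book):
# the true objective rewards relevance and penalizes padding, while the
# measurable proxy rewards sheer length.

def true_quality(answer: str) -> float:
    # What we actually want: on-topic content, lightly penalizing padding.
    words = answer.split()
    relevance = sum(1 for w in words if w == "relevant")
    return relevance - 0.1 * max(0, len(words) - 10)

def proxy_score(answer: str) -> float:
    # What we can cheaply measure: word count.
    return float(len(answer.split()))

concise = "relevant " * 5                  # five on-topic words
padded = "relevant " * 5 + "um " * 100     # same content, heavily padded

assert proxy_score(padded) > proxy_score(concise)    # proxy prefers padding
assert true_quality(padded) < true_quality(concise)  # we prefer concision
```

An optimizer scoring against `proxy_score` would happily generate the padded answer; the measure became the target and stopped measuring quality.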

Best Chapters

The chapter on RLHF is the sharpest. Christian makes clear that human feedback is far noisier and more context-dependent than the clean presentation in papers suggests. Raters disagree with each other. Raters disagree with themselves across time. The aggregate feedback signal carries less information than it appears.
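Rater noise is simple to simulate. This sketch is my own illustration, under assumed numbers (a 30% per-rater disagreement rate, 5 raters, majority-vote aggregation — none of these come from the book): each rater flips the "true" preference with some probability, and aggregation recovers part, but not all, of the signal.

```python
# Hypothetical simulation of noisy preference labels (illustrative
# parameters, not from the book or any paper).
import random

random.seed(0)
P_FLIP = 0.3            # assumed probability a rater flips the true preference
N_RATERS = 5
N_COMPARISONS = 10_000

def rater_label(true_pref: int) -> int:
    # A rater reports the true preference, flipped with probability P_FLIP.
    return true_pref if random.random() > P_FLIP else 1 - true_pref

majority_correct = 0
for _ in range(N_COMPARISONS):
    true_pref = random.randint(0, 1)
    votes = [rater_label(true_pref) for _ in range(N_RATERS)]
    majority = 1 if sum(votes) > N_RATERS / 2 else 0
    majority_correct += (majority == true_pref)

single_rater_acc = 1 - P_FLIP                    # 0.70 by construction
majority_acc = majority_correct / N_COMPARISONS  # higher, but still < 1.0
```

Even with five raters, the majority-vote label is still wrong a meaningful fraction of the time, which is the book's point: the aggregate signal looks clean on paper but carries less information than it appears.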

The chapter on mesa-optimization (an optimizer inside an optimizer developing its own goals) is the one that stuck with me most. It’s the clearest explanation I’ve read of why training a model to achieve an objective doesn’t mean the model is aligned with that objective.

Quotes Worth Keeping

“The challenge of alignment is not to build systems that do what we say. It’s to build systems that do what we mean.”

“Specification gaming is not a bug in the system. It’s a feature of optimization.”

What I’m Still Thinking About

The book is better at identifying problems than solutions. That’s fair — the solutions are genuinely hard and the field is young. But I find myself wanting more on what promising approaches look like in practice.

Reading alongside the actual RLHF papers (Stiennon et al., Ouyang et al.) is highly recommended — the book gives the why, the papers give the how.

AI Safety · Alignment · RLHF