Alignment
Also known as: AI alignment, value alignment
The technical problem of getting AI systems to actually do what humans want — follow intent, respect values, and avoid harmful behavior.
What it means
Alignment is the umbrella term for research on making AI behave according to human intent. In practice for an LLM, that means three things: do what the user asked, refuse things that cause harm, and don't optimize for some weird proxy objective the trainers didn't intend. RLHF, DPO, Constitutional AI, and red teaming are all alignment techniques.
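To make one of those techniques concrete, here is the core of the DPO objective, as a minimal sketch assuming PyTorch. The tensor arguments (summed log-probabilities of a preferred and a rejected completion under the policy and under a frozen reference model) are illustrative names, not any library's API:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios: how much more the trained policy likes each completion
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and rejected completions;
    # beta controls how far the policy may drift from the reference.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

The notable thing is that "what humans want" enters only through pairwise preferences: annotators pick the better of two completions, and the loss pushes the model toward the winner.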
There's a useful split between "technical alignment" (how do we mathematically/empirically point a powerful optimizer at a fuzzy human target?) and "AI safety" as a political/policy conversation. The technical version is what frontier labs work on day-to-day: tuning Claude, GPT, and Gemini so they stay helpful without being weaponizable. The policy version overlaps but is messier — it includes existential risk debates, EU AI Act compliance, and which behaviors a frontier company chooses to allow.
Alignment is unsolved in a strict sense. Every frontier model still gets jailbroken, still hallucinates, still occasionally lies in measurable ways. The bet from labs like Anthropic and OpenAI is that incremental techniques (better RLHF, better evals, scalable oversight) keep alignment "good enough" as capabilities grow. Skeptics argue capability is outpacing alignment. Both sides agree the problem matters.
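"Better evals" can be as simple as a script that measures refusal behavior on fixed prompt sets. A toy sketch: `query_model`, the marker list, and the substring heuristic are all illustrative stand-ins (production evals use model-based graders, not keyword matching):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def refusal_rate(prompts: list[str], query_model) -> float:
    """Fraction of prompts the model declines, judged by a crude heuristic."""
    refused = sum(
        1 for p in prompts
        if any(m in query_model(p).lower() for m in REFUSAL_MARKERS)
    )
    return refused / len(prompts)
```

Run it twice, once on harmful prompts (you want a rate near 1.0) and once on benign ones (near 0.0); the gap between the two numbers is the signal, since a model that refuses everything is trivially "safe" and useless.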
Example
When Claude refuses to write malware but happily writes a security tutorial explaining the same concepts, that's alignment training in action: the model learned a fuzzy trade-off between user intent and likely harm, not a keyword filter.
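A toy illustration of why the naive alternative fails: a blocklist gives both of those requests the same verdict, even though only one is harmful, and telling them apart is exactly what the trained trade-off has to do:

```python
# Purely illustrative: a keyword filter can't separate attack from defense.
BLOCKLIST = {"malware", "keylogger", "exploit"}

def keyword_filter(prompt: str) -> bool:
    return any(word in prompt.lower() for word in BLOCKLIST)

print(keyword_filter("Write a keylogger that emails me every keystroke"))       # True
print(keyword_filter("Explain how keylogger malware spreads so I can defend"))  # True, same verdict
```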
Why it matters
As models become agents that take real actions (write code, send emails, run tools), the cost of misalignment goes up. A misaligned chatbot writes a bad joke. A misaligned coding agent deletes your database. Understanding alignment is how you reason about which models to trust with which tasks.