
RLHF (Reinforcement Learning from Human Feedback)

Also known as: preference fine-tuning, human preference optimization

A post-training method in which humans rank model outputs, a reward model is trained to predict those rankings, and the language model is then optimized against that reward.

What it means

RLHF is the technique that turned raw GPT-3, fluent but useless as an assistant, into ChatGPT. The recipe has three steps. First, supervised fine-tune the base model on human-written examples of good answers. Second, have humans rank pairs of model outputs ("A is better than B for this prompt") and train a separate reward model to predict those preferences. Third, use reinforcement learning (usually PPO) to update the language model so it produces outputs the reward model scores higher.

The result is a model that is noticeably more helpful: it follows instructions, refuses harmful requests, and feels like an assistant rather than a text completer. Almost every major chat model released from 2022 to 2024 (ChatGPT, Claude, Gemini, Llama 2-Chat) was trained with some flavor of RLHF.

RLHF has real downsides. It's expensive (you're paying humans to rank thousands of outputs), unstable (PPO is notoriously finicky), and prone to "sycophancy," where the model learns to flatter raters rather than be correct. The reward model is also a leaky abstraction: it scores what humans liked during training, not what is actually true. By 2024, most open-model teams had moved to DPO or RLAIF, both of which are simpler and cheaper. Frontier labs still use RLHF-flavored methods, but heavily augmented with synthetic data and AI feedback.
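
To make step two concrete, here is a minimal, illustrative sketch of training a reward model on preference pairs with the standard Bradley-Terry pairwise loss. The class name, feature dimension, and random tensors are assumptions for illustration only; a real pipeline scores full (prompt, response) token sequences with a fine-tuned transformer rather than fixed feature vectors.

```python
# Sketch of RLHF step two: fit a reward model so the human-preferred
# response scores higher than the rejected one (Bradley-Terry loss).
# RewardModel, the 768-dim features, and the toy batch are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) feature vector to a scalar reward."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize log P(chosen beats rejected) = log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy batch standing in for "A is better than B" comparisons.
chosen_feats = torch.randn(32, 768)    # features of preferred responses
rejected_feats = torch.randn(32, 768)  # features of rejected responses

optimizer.zero_grad()
loss = preference_loss(model(chosen_feats), model(rejected_feats))
loss.backward()
optimizer.step()
```

In step three, the scalar this model produces becomes the reward signal that PPO (or a similar RL algorithm) pushes the language model toward, typically with a KL penalty that keeps the policy close to the supervised fine-tuned model.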

Example

OpenAI's original ChatGPT (2022) was GPT-3.5 plus RLHF on top of about 33,000 human-ranked comparisons. The base model was already smart; RLHF taught it to be a useful assistant.

Why it matters

RLHF is the reason modern AI products feel like products. Every time a model politely refuses a request, structures its answer in bullet points, or asks a clarifying question, that behavior was almost certainly carved in via human preference data. It's also the source of every complaint about models being "too safe" or "too sycophantic": those are RLHF artifacts.
