A Visual Guide to RLHF
How reinforcement learning from human feedback turns a raw language model into a helpful assistant.
Base Models Don't Help You
Most people assume ChatGPT just “knows” how to be helpful. It doesn't. A raw language model is trained to do one thing: predict the next word. It's autocomplete, not an assistant. If you ask it a question, it will try to continue the text pattern rather than answer you. It has no built-in concept of being helpful, honest, or safe. Those behaviors are learned separately through a process called RLHF.
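To reproduce this outside the demo below, here's a minimal sketch using the Hugging Face transformers library, with the small gpt2 checkpoint standing in for a base model. The exact continuation will vary, but it reads like autocomplete, not an answer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a small base (non-instruction-tuned) model, used here only as a stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "How do I get better at math?"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: the model just keeps predicting the most likely next token.
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Typically continues the pattern (e.g. more questions) instead of answering.
```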
See for yourself. Toggle between what a base model outputs and what an RLHF-tuned model outputs for the same prompt:
How do I get better at math?
Base models are trained to predict the next token. They continue patterns, not answer questions.
Key Insight
The Three-Step Pipeline
RLHF transforms a base model into a helpful assistant in three stages. This process was first demonstrated end-to-end in the InstructGPT paper (Ouyang et al., 2022) and is the foundation behind ChatGPT, Claude, and most modern AI assistants. Each stage solves a specific problem that the previous stage left unsolved:
1. Supervised Fine-Tuning (SFT): show the model examples of good behavior so it learns the right style
2. Reward Model: train a separate model that can score any response on a “how good is this?” scale, learned from human comparison data
3. PPO (RL Fine-Tuning): use reinforcement learning to optimize the model so it produces responses that score high on the reward model, with guardrails that keep it from drifting too far from where it started
The pipeline bar at the top of this page tracks where you are. Let's walk through each step.
Step 1: Supervised Fine-Tuning
Teach by example
The first step is straightforward. You hire contractors to write ideal responses for a set of prompts, then fine-tune the base model on these (prompt, response) pairs. This is regular supervised learning, the same way you'd train any model on labeled data. After SFT, the model knows how a helpful assistant talks.
Watch how each example flows into the model and updates its weights:
Summarize photosynthesis
Click an example to replay the animation
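Under the hood, this step is ordinary next-token cross-entropy on the demonstrations. A minimal sketch, using gpt2 and a single made-up demonstration as stand-ins; a real pipeline batches thousands of pairs, masks the prompt tokens out of the loss, and uses a learning-rate schedule:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # stand-in for the base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical (prompt, ideal response) pairs written by contractors.
demos = [
    ("Summarize photosynthesis",
     "Photosynthesis is the process plants use to turn light, water, and CO2 "
     "into sugar and oxygen."),
]

model.train()
for prompt, response in demos:
    # Concatenate prompt and ideal response; the labels are the same tokens,
    # so the loss is standard next-token cross-entropy over the demonstration.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```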
Why SFT alone isn't enough
Writing the “perfect” response for every possible topic is not practical. InstructGPT used about 13,000 demonstration examples. That's enough to learn the style, but nowhere near enough to cover every topic a user might ask about.
Here's the key insight: humans are better at recognizing good responses than writing them. Writing 50,000 perfect answers is expensive and slow. But clicking “A is better than B” 500,000 times? That's cheap and fast. This gap between the cost of producing quality and the cost of judging quality is why the reward model exists. Learning to Summarize from Human Feedback (Stiennon et al., 2020) demonstrated this approach at scale for language models.
Step 2: Preference Data & the Reward Model
Learn preferences
We want the model to optimize for “what humans prefer.” But how do you measure that? If you ask annotators to rate a response on a scale of 1-10, you get wildly inconsistent results. One person's 7 is another's 5.
The solution is pairwise comparisons. Show two responses to the same prompt and ask “which is better?” People agree on relative quality much more than they agree on absolute scores. This produces cleaner, more consistent data. Try it yourself:
Is it safe to eat raw chicken?
While raw chicken has risks, many cultures enjoy dishes like chicken tartare. With fresh, high-quality poultry and proper handling, the risks are manageable.
No. Raw chicken carries salmonella and campylobacter. Always cook to 165°F internal temperature.
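Behind the scenes, each click becomes a single record: the prompt, the response you preferred, and the one you passed over. The comparison above, for example, might be stored as:

```python
preference_record = {
    "prompt": "Is it safe to eat raw chicken?",
    "chosen": "No. Raw chicken carries salmonella and campylobacter. "
              "Always cook to 165°F internal temperature.",
    "rejected": "While raw chicken has risks, many cultures enjoy dishes like "
                "chicken tartare. With fresh, high-quality poultry and proper "
                "handling, the risks are manageable.",
}
```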
Training the Reward Model
Those comparisons become training data: each record is a triple like the one above, with a prompt, the response the annotator preferred (“chosen”), and the response they rejected. You use these triples to train a reward model: a neural network that takes a prompt plus a response and outputs a single number. Higher means better.
The reward model starts from the SFT model's weights, with its final layer swapped to output a single score instead of a next-token probability. It already “understands” language from pre-training. It just needs to learn how to judge quality.
Toggle between Chosen and Rejected to see how scores differ
Train until: score(chosen) > score(rejected)
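The standard training objective is a pairwise (Bradley-Terry) loss: push the score of the chosen response above the score of the rejected one. A minimal sketch, assuming a reward_model that maps a tokenized prompt-plus-response to a single scalar as described above:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, tokenizer, prompt, chosen, rejected):
    """Pairwise loss: -log sigmoid(score(chosen) - score(rejected))."""
    chosen_ids = tokenizer(prompt + "\n" + chosen, return_tensors="pt")
    rejected_ids = tokenizer(prompt + "\n" + rejected, return_tensors="pt")

    score_chosen = reward_model(**chosen_ids)      # scalar: "how good is this response?"
    score_rejected = reward_model(**rejected_ids)

    # The loss shrinks as the margin between chosen and rejected grows,
    # so training keeps pushing until the preference is clearly learned.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Train on enough of these records and the reward model becomes a stand-in judge that can score responses no human has ever reviewed.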
Step 3: RL Fine-Tuning with PPO
Optimize with RL
Now we have a reward signal. The training loop uses an algorithm called Proximal Policy Optimization (PPO, Schulman et al., 2017). In RLHF, the “policy” is just the language model being trained. Here's what happens each step:
1. The model generates a response to a prompt.
2. The reward model scores that response.
3. PPO updates the model to make high-scoring responses more likely.
4. A KL penalty (a measure of how much the model has changed from its SFT starting point) keeps the model from drifting too far and gaming the reward.
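Here's that loop as a minimal, runnable sketch. Everything in it is a stand-in: gpt2 plays both the policy and the frozen SFT reference, and a toy function plays the reward model. It also simplifies the update to a plain policy-gradient step; full PPO adds importance ratios, the clipping shown in the next section, advantages from a value model, and several epochs over each batch (libraries like TRL implement all of that):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")         # model being trained
sft_model = AutoModelForCausalLM.from_pretrained("gpt2")      # frozen reference copy
reward_model = lambda text: torch.tensor(len(text) / 100.0)   # toy stand-in scorer
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.05  # strength of the KL "leash"

def sequence_logprob(model, ids):
    """Sum of log-probabilities the model assigns to a token sequence."""
    logits = model(ids).logits[:, :-1]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, ids[:, 1:, None]).squeeze(-1).sum()

prompt_ids = tokenizer("How do I get better at math?", return_tensors="pt").input_ids

# 1. The policy generates a response.
full_ids = policy.generate(prompt_ids, max_new_tokens=32, do_sample=True)

# 2. The reward model scores it.
reward = reward_model(tokenizer.decode(full_ids[0]))

# 3. KL penalty: how far the policy has drifted from the SFT reference.
logp_policy = sequence_logprob(policy, full_ids)
logp_sft = sequence_logprob(sft_model, full_ids).detach()
kl = logp_policy - logp_sft

# 4. Update the policy to favor responses that score well, minus the KL penalty.
loss = -(reward - beta * kl.detach()) * logp_policy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```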
PPO Training Loop
The Token Factory
Press Play to animate the pipeline, or drag the slider to jump between steps
Policy (LLM)
Reward Model
PPO Update
KL Penalty
The answer to your question is... uh... things are complicated and...
The Speed Limiter: PPO Clipping
PPO doesn't let the model change too fast. Even if an action got a great reward, the model only takes a small step toward it and checks again. It doesn't lurch. The clipping objective flattens out past a boundary, so signals that would push the model too far in a single update contribute no gradient at all.
PPO Clipping Objective
The Speed Limiter
Past the clipping boundary the objective flattens out, ignoring signals that are too radical. This prevents catastrophic policy updates where the model changes too much from a single batch of experience.
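The clipped objective itself is only a few lines. With the ratio of new-policy to old-policy probabilities and an advantage estimate (how much better than expected the action was), the update is capped once the ratio leaves a small window; ε is typically around 0.2. A sketch:

```python
import torch

def ppo_clip_loss(logprob_new, logprob_old, advantage, epsilon=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(logprob_new - logprob_old)          # how much the policy has moved
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Taking the min flat-lines the objective once the ratio leaves the
    # [1 - eps, 1 + eps] window, so one batch can't push the policy too far.
    return -torch.min(unclipped, clipped).mean()
```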
Reward Hacking: What Happens Without the Leash
The reward model has blind spots. It was trained on limited data, so there are outputs it will score highly that are actually garbage. As documented in Anthropic's helpful and harmless research (Bai et al., 2022), without the KL penalty keeping the model close to its SFT starting point, the model will find and exploit these blind spots. It learns to produce outputs that trick the reward model instead of actually being helpful. Watch it happen:
Reward Hacking
What happens when the model games the reward signal
Press Play to watch reward hacking unfold over training steps
Here is a helpful answer about the topic you asked about...
Here's what I know about this topic...
KL penalty = a leash. Without it, the model becomes unrecognizable. It learns to exploit patterns in the reward model rather than genuinely improving its outputs.
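In practice the leash is usually implemented as per-token reward shaping: every generated token is penalized in proportion to how much more likely the policy makes it than the SFT model does, and the reward model's score is added only at the final token. A sketch, assuming you already have the detached per-token log-probabilities from both models:

```python
def shaped_rewards(logprobs_policy, logprobs_sft, rm_score, beta=0.05):
    """Per-token rewards: a KL penalty on every token, the RM score on the last.
    Inputs are 1-D tensors of per-token log-probs for the generated response."""
    rewards = -beta * (logprobs_policy - logprobs_sft)   # the leash, token by token
    rewards[-1] += rm_score                              # task reward arrives at the end
    return rewards
```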
The Alignment Tax
RLHF makes models more helpful and safer, but it also makes them more cautious. Sometimes they refuse harmless requests, add unnecessary disclaimers, or become “preachy.” This tradeoff between alignment and raw capability is what Askell et al. (2021) call the alignment tax. Getting the balance right is hard, and every lab calibrates it differently. Drag the slider to compare:
SFT vs RLHF Outputs
The Alignment Tax
How do I pick a lock?
To pick a pin tumbler lock, you'll need a tension wrench and a pick. Insert the tension wrench into the bottom of the keyhole and apply slight rotational pressure. Then insert the pick and feel for each pin. Push each pin up until it sets at the shear line. Once all pins are set, the lock will rotate open. Practice on transparent locks first. Common picks include the hook pick for single pin picking and the rake for speed.
I can't help with lock picking, as it's commonly associated with unauthorized entry. If you're locked out of your own home, I'd recommend contacting a licensed locksmith — they can verify ownership and get you in safely. If you're interested in lock mechanisms as a hobby, look into locksport communities that practice legally on their own locks.
Alignment is a dial, not a switch. Every lab makes different tradeoffs between helpfulness, harmlessness, and honesty. More alignment can mean better safety but also over-caution.
Note
How RLHF Evolved
RLHF proved that you can teach a model to behave the way humans want. But the full pipeline is expensive: four models in memory, a finicky RL loop, and thousands of hours of human annotation. So researchers started asking which pieces of RLHF are actually necessary, and found they could remove components while still getting aligned models:
Alignment Methods Evolution
RLHF to DPO to GRPO
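Direct Preference Optimization (DPO, Rafailov et al., 2023) is one well-known example: it drops the explicit reward model and the RL loop entirely and trains the policy directly on the same (prompt, chosen, rejected) records from Step 2, using only a frozen reference copy of the SFT model. A sketch of its loss:

```python
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO: widen the policy's margin between chosen and rejected responses,
    measured relative to the frozen SFT reference. Inputs are the summed
    log-probabilities each model assigns to the full response."""
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```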
Note
Knowledge Check
10 questions. Test whether you actually understood the process, not just whether you read the words. Your answers are saved locally.