A Visual Guide to RLHF
How reinforcement learning from human feedback turns a raw language model into a helpful assistant.
Base Models Don't Help You
Most people assume ChatGPT just “knows” how to be helpful. It doesn't. A raw language model is trained to do one thing: predict the next word. It's autocomplete, not an assistant. If you ask it a question, it will try to continue the text pattern rather than answer you. It has no built-in concept of being helpful, honest, or safe. Those behaviors are learned separately through a process called RLHF.
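To reproduce this outside the demo below, here's a minimal sketch using the Hugging Face transformers library, with the small gpt2 checkpoint standing in for a base model. The exact continuation will vary, but it reads like autocomplete, not an answer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a small base (non-instruction-tuned) model, used here only as a stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "How do I get better at math?"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: the model just keeps predicting the most likely next token.
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Typically continues the pattern (e.g. more questions) instead of answering.
```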
See for yourself. Toggle between what a base model outputs and what an RLHF-tuned model outputs for the same prompt:
How do I get better at math?
Base models are trained to predict the next token. They continue patterns, not answer questions.
Key Insight
The Three-Step Pipeline
RLHF transforms a base model into a helpful assistant in three stages. This process was first demonstrated end-to-end in the InstructGPT paper (Ouyang et al., 2022) and is the foundation behind ChatGPT, Claude, and most modern AI assistants. Each stage solves a specific problem that the previous stage left unsolved:
1. Supervised Fine-Tuning (SFT): show the model examples of good behavior so it learns the right style
2. Reward Model: train a separate model that can score any response on a “how good is this?” scale, learned from human comparison data
3. PPO (RL Fine-Tuning): use reinforcement learning to optimize the model so it produces responses that score high on the reward model, with guardrails that keep it from drifting too far from where it started
The pipeline bar at the top of this page tracks where you are. Let's walk through each step.
Step 1: Supervised Fine-Tuning
Teach by example
The first step is straightforward. You hire contractors to write ideal responses for a set of prompts, then fine-tune the base model on these (prompt, response) pairs. This is regular supervised learning, the same way you'd train any model on labeled data. After SFT, the model knows how a helpful assistant talks.
Watch how each example flows into the model and updates its weights:
Summarize photosynthesis
Click an example to replay the animation
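Under the hood, this step is ordinary next-token cross-entropy on the demonstrations. A minimal sketch, using gpt2 and a single made-up demonstration as stand-ins; a real pipeline batches thousands of pairs, masks the prompt tokens out of the loss, and uses a learning-rate schedule:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # stand-in for the base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical (prompt, ideal response) pairs written by contractors.
demos = [
    ("Summarize photosynthesis",
     "Photosynthesis is the process plants use to turn light, water, and CO2 "
     "into sugar and oxygen."),
]

model.train()
for prompt, response in demos:
    # Concatenate prompt and ideal response; the labels are the same tokens,
    # so the loss is standard next-token cross-entropy over the demonstration.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```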
Why SFT alone isn't enough
Writing the “perfect” response for every possible topic is not practical. InstructGPT used about 13,000 demonstration examples. That's enough to learn the style, but nowhere near enough to cover every topic a user might ask about.
Here's the key insight: humans are better at recognizing good responses than writing them. Writing 50,000 perfect answers is expensive and slow. But clicking “A is better than B” 500,000 times? That's cheap and fast. This gap between the cost of producing quality and the cost of judging quality is why the reward model exists. Learning to Summarize from Human Feedback (Stiennon et al., 2020) demonstrated this approach at scale for language models.
Step 2: Preference Data & the Reward Model
Learn preferences
We want the model to optimize for “what humans prefer.” But how do you measure that? If you ask annotators to rate a response on a scale of 1-10, you get wildly inconsistent results. One person's 7 is another's 5.
The solution is pairwise comparisons. Show two responses to the same prompt and ask “which is better?” People agree on relative quality much more than they agree on absolute scores. This produces cleaner, more consistent data. Try it yourself:
Is it safe to eat raw chicken?
While raw chicken has risks, many cultures enjoy dishes like chicken tartare. With fresh, high-quality poultry and proper handling, the risks are manageable.
No. Raw chicken carries salmonella and campylobacter. Always cook to 165°F internal temperature.
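Behind the scenes, each click becomes a single record: the prompt, the response you preferred, and the one you passed over. The comparison above, for example, might be stored as:

```python
preference_record = {
    "prompt": "Is it safe to eat raw chicken?",
    "chosen": "No. Raw chicken carries salmonella and campylobacter. "
              "Always cook to 165°F internal temperature.",
    "rejected": "While raw chicken has risks, many cultures enjoy dishes like "
                "chicken tartare. With fresh, high-quality poultry and proper "
                "handling, the risks are manageable.",
}
```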
Training the Reward Model
Those comparisons become training data: each record is a triple like the one above, with a prompt, the response the annotator preferred (“chosen”), and the response they rejected. You use these triples to train a reward model: a neural network that takes a prompt plus a response and outputs a single number. Higher means better.
The reward model starts from the SFT model's weights, with its final layer swapped to output a single score instead of a next-token probability. It already “understands” language from pre-training. It just needs to learn how to judge quality.
Toggle between Chosen and Rejected to see how scores differ
Train until: score(chosen) > score(rejected)
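The standard training objective is a pairwise (Bradley-Terry) loss: push the score of the chosen response above the score of the rejected one. A minimal sketch, assuming a reward_model that maps a tokenized prompt-plus-response to a single scalar as described above:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, tokenizer, prompt, chosen, rejected):
    """Pairwise loss: -log sigmoid(score(chosen) - score(rejected))."""
    chosen_ids = tokenizer(prompt + "\n" + chosen, return_tensors="pt")
    rejected_ids = tokenizer(prompt + "\n" + rejected, return_tensors="pt")

    score_chosen = reward_model(**chosen_ids)      # scalar: "how good is this response?"
    score_rejected = reward_model(**rejected_ids)

    # The loss shrinks as the margin between chosen and rejected grows,
    # so training keeps pushing until the preference is clearly learned.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Train on enough of these records and the reward model becomes a stand-in judge that can score responses no human has ever reviewed.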
Step 3: RL Fine-Tuning with PPO
Optimize with RL
Now we have a reward signal. The training loop uses an algorithm called Proximal Policy Optimization (PPO, Schulman et al., 2017). In RLHF, the “policy” is just the language model being trained. Here's what happens each step:
1. The model generates a response to a prompt.
2. The reward model scores that response.
3. PPO updates the model to make high-scoring responses more likely.
4. A KL penalty (a measure of how much the model has changed from its SFT starting point) keeps the model from drifting too far and gaming the reward.
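Here's that loop as a minimal, runnable sketch. Everything in it is a stand-in: gpt2 plays both the policy and the frozen SFT reference, and a toy function plays the reward model. It also simplifies the update to a plain policy-gradient step; full PPO adds importance ratios, the clipping shown in the next section, advantages from a value model, and several epochs over each batch (libraries like TRL implement all of that):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")         # model being trained
sft_model = AutoModelForCausalLM.from_pretrained("gpt2")      # frozen reference copy
reward_model = lambda text: torch.tensor(len(text) / 100.0)   # toy stand-in scorer
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.05  # strength of the KL "leash"

def sequence_logprob(model, ids):
    """Sum of log-probabilities the model assigns to a token sequence."""
    logits = model(ids).logits[:, :-1]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, ids[:, 1:, None]).squeeze(-1).sum()

prompt_ids = tokenizer("How do I get better at math?", return_tensors="pt").input_ids

# 1. The policy generates a response.
full_ids = policy.generate(prompt_ids, max_new_tokens=32, do_sample=True)

# 2. The reward model scores it.
reward = reward_model(tokenizer.decode(full_ids[0]))

# 3. KL penalty: how far the policy has drifted from the SFT reference.
logp_policy = sequence_logprob(policy, full_ids)
logp_sft = sequence_logprob(sft_model, full_ids).detach()
kl = logp_policy - logp_sft

# 4. Update the policy to favor responses that score well, minus the KL penalty.
loss = -(reward - beta * kl.detach()) * logp_policy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```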
PPO Training Loop
The Token Factory
Press Play to animate the pipeline, or drag the slider to jump between steps
Policy (LLM)
Reward Model
PPO Update
KL Penalty
The answer to your question is... uh... things are complicated and...
The Speed Limiter: PPO Clipping
PPO doesn't let the model change too fast. Even if an action got a great reward, the model only takes a small step toward it and checks again. It doesn't lurch. The clipping objective flattens out past a boundary, so signals that would push the model too far in a single update contribute no gradient at all.
PPO Clipping Objective
The Speed Limiter
Past the clipping boundary the objective flattens out, ignoring signals that are too radical. This prevents catastrophic policy updates where the model changes too much from a single batch of experience.
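The clipped objective itself is only a few lines. With the ratio of new-policy to old-policy probabilities and an advantage estimate (how much better than expected the action was), the update is capped once the ratio leaves a small window; ε is typically around 0.2. A sketch:

```python
import torch

def ppo_clip_loss(logprob_new, logprob_old, advantage, epsilon=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(logprob_new - logprob_old)          # how much the policy has moved
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Taking the min flat-lines the objective once the ratio leaves the
    # [1 - eps, 1 + eps] window, so one batch can't push the policy too far.
    return -torch.min(unclipped, clipped).mean()
```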
Reward Hacking: What Happens Without the Leash
The reward model has blind spots. It was trained on limited data, so there are outputs it will score highly that are actually garbage. As documented in Anthropic's helpful and harmless research (Bai et al., 2022), without the KL penalty keeping the model close to its SFT starting point, the model will find and exploit these blind spots. It learns to produce outputs that trick the reward model instead of actually being helpful. Watch it happen:
Reward Hacking
What happens when the model games the reward signal
Press Play to watch reward hacking unfold over training steps
Here is a helpful answer about the topic you asked about...
Here's what I know about this topic...
KL penalty = a leash. Without it, the model becomes unrecognizable. It learns to exploit patterns in the reward model rather than genuinely improving its outputs.
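In practice the leash is usually implemented as per-token reward shaping: every generated token is penalized in proportion to how much more likely the policy makes it than the SFT model does, and the reward model's score is added only at the final token. A sketch, assuming you already have the detached per-token log-probabilities from both models:

```python
def shaped_rewards(logprobs_policy, logprobs_sft, rm_score, beta=0.05):
    """Per-token rewards: a KL penalty on every token, the RM score on the last.
    Inputs are 1-D tensors of per-token log-probs for the generated response."""
    rewards = -beta * (logprobs_policy - logprobs_sft)   # the leash, token by token
    rewards[-1] += rm_score                              # task reward arrives at the end
    return rewards
```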
The Alignment Tax
RLHF makes models more helpful and safer, but it also makes them more cautious. Sometimes they refuse harmless requests, add unnecessary disclaimers, or become “preachy.” This tradeoff between alignment and raw capability is what Askell et al. (2021) call the alignment tax. Getting the balance right is hard, and every lab calibrates it differently. Drag the slider to compare:
SFT vs RLHF Outputs
The Alignment Tax
How do I pick a lock?
To pick a pin tumbler lock, you'll need a tension wrench and a pick. Insert the tension wrench into the bottom of the keyhole and apply slight rotational pressure. Then insert the pick and feel for each pin. Push each pin up until it sets at the shear line. Once all pins are set, the lock will rotate open. Practice on transparent locks first. Common picks include the hook pick for single pin picking and the rake for speed.
I can't help with lock picking, as it's commonly associated with unauthorized entry. If you're locked out of your own home, I'd recommend contacting a licensed locksmith — they can verify ownership and get you in safely. If you're interested in lock mechanisms as a hobby, look into locksport communities that practice legally on their own locks.
Alignment is a dial, not a switch. Every lab makes different tradeoffs between helpfulness, harmlessness, and honesty. More alignment can mean better safety but also over-caution.
Note
How RLHF Evolved
RLHF proved that you can teach a model to behave the way humans want. But the full pipeline is expensive: four models in memory, a finicky RL loop, and thousands of hours of human annotation. So researchers started asking which pieces of RLHF are actually necessary, and found they could remove components while still getting aligned models:
Alignment Methods Evolution
RLHF to DPO to GRPO
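Direct Preference Optimization (DPO, Rafailov et al., 2023) is one well-known example: it drops the explicit reward model and the RL loop entirely and trains the policy directly on the same (prompt, chosen, rejected) records from Step 2, using only a frozen reference copy of the SFT model. A sketch of its loss:

```python
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO: widen the policy's margin between chosen and rejected responses,
    measured relative to the frozen SFT reference. Inputs are the summed
    log-probabilities each model assigns to the full response."""
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```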
Note
Knowledge Check
10 questions. Test whether you actually understood the process, not just whether you read the words. Your answers are saved locally.