Post-Training Beyond RLHF
RLHF proved alignment works — then researchers asked what the minimum ingredients are.
01 · RLHF Works. So What's the Problem?
The full pipeline
If you read the RLHF guide, you saw how InstructGPT turns a raw language model into a helpful assistant. It works. ChatGPT, Claude, and Gemini all use some version of this process.
But the full RLHF pipeline is heavy. It needs four separate models loaded into GPU memory at the same time: the policy you're training, a frozen copy of it as a reference, a reward model, and a value network that PPO uses to estimate advantages. On top of that, PPO itself is hard to tune. The clipping range, KL penalty weight, learning rate schedules, and batch sizes all interact, and getting them wrong can cause the whole run to diverge.
So researchers started asking a natural question: which of these pieces are actually necessary? Can you remove the reward model? The value network? The human labelers? The RL loop entirely? The rest of this guide follows that question.
RLHF Memory Footprint
4 concurrent models · 7B params each · tap a card to learn its role
Real-world footprint is higher: optimizer states (Adam = 3× params), activations for backprop, and framework overhead push a 7B RLHF run to 2–4× A100s. DPO drops the reward model and the value network, and GRPO drops the value network, cutting this significantly.
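As a rough sanity check on those numbers, here is a back-of-the-envelope sketch (not a profiler run). It assumes fp16 weights at 2 bytes per parameter, the "3× params" Adam figure above in fp32 for the two trainable models, and ignores activations and framework overhead:

```python
# Back-of-the-envelope VRAM estimate for a 7B RLHF run with four resident models.
# Assumptions (illustrative): fp16 weights, Adam state = 3x params in fp32 on the
# two trainable models (policy + value network), activations/overhead ignored.
PARAMS = 7e9            # parameters per model
FP16, FP32 = 2, 4       # bytes per value

weights_gb = 4 * PARAMS * FP16 / 1e9        # policy + reference + reward + value
optimizer_gb = 2 * PARAMS * 3 * FP32 / 1e9  # Adam state on the two trainable models

print(f"weights: {weights_gb:.0f} GB")                # ~56 GB
print(f"optimizer state: {optimizer_gb:.0f} GB")      # ~168 GB
print(f"total before activations: {weights_gb + optimizer_gb:.0f} GB")  # ~224 GB, roughly 3 80GB A100s
```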
Watch Out
02 · DPO: What If the Reward Model Is Unnecessary?
Drop the reward model
The first piece to go was the reward model. Rafailov et al. (2023) noticed something about the math behind RLHF: if you write out the formula for the optimal policy (the best possible model given a reward function and a KL penalty), the reward model is already baked into it. You don't need to train it separately.
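In symbols (a sketch following the DPO paper's derivation, with β the KL-penalty strength and Z(x) a normalizing constant that does not appear elsewhere in this guide):

$$
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
\quad\Longleftrightarrow\quad
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \beta \log Z(x)
$$

The β log Z(x) term cancels whenever you compare two responses to the same prompt, which is exactly what a preference pair does.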
In practical terms, this means you can skip the reward model and the RL loop entirely. Instead, you train directly on preference pairs (response A is better than response B) using a supervised loss function. The pipeline goes from four models down to two: the policy you're training and a frozen reference copy.
This method is called Direct Preference Optimization (DPO). Toggle between the two pipelines to see what got removed:
RLHF with PPO: 3 stages · 4 models · RL sampling loop
The reward model is a separate frozen network; PPO samples, scores, and updates in a tight loop, which is expensive and unstable.
- Explicit reward model
- PPO sampling overhead
- 4× VRAM footprint
- Reward hacking risk

DPO: 2 models · supervised loss · no sampling loop
- Reward implicit in loss
- Supervised fine-tuning only
- 2× VRAM footprint
- Simpler, more stable
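To make "supervised fine-tuning only" concrete, here is a minimal sketch of the DPO loss, assuming you have already computed the summed log-probability of each full response under the trainable policy and the frozen reference (the tensor names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs."""
    # Implicit rewards: how far the policy has moved from the reference on each
    # response, scaled by beta (the strength of the implicit KL penalty).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary preference loss on the margin: push the chosen response's implicit
    # reward above the rejected one's. No reward model, no sampling loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```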
Key Insight
03 · The Tradeoffs of Removing the Reward Model
Nothing is free
Removing the reward model made training simpler, but it also removed a safety valve. In RLHF, the reward model acts as an explicit judge. You can inspect it, add length penalties to it, or tune its behavior. DPO folds all of that into the loss function, which means you have less control over what the model actually optimizes for.
Three problems showed up in practice:
1. Length bias. Human annotators tend to prefer longer responses. DPO learns this correlation directly and produces increasingly verbose output.
2. Likelihood displacement. DPO's loss only cares about the gap between the chosen and rejected response probabilities. Both can drop, as long as the gap widens, so the model can actually become less likely to produce the good response in absolute terms (Pal et al., 2024); a toy numeric example follows this list.
3. Offline overfitting. DPO trains on a fixed dataset with no exploration. The model never generates its own responses during training, so it can memorize patterns in the data rather than learning general preferences.
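Here is that toy example of likelihood displacement, using hypothetical log-probabilities and holding the reference terms fixed (folded out of the margin for simplicity):

```python
# DPO's loss can improve even while the chosen response becomes LESS likely,
# because only the chosen-vs-rejected margin matters.
import math

def margin_loss(chosen_logp, rejected_logp, beta=0.1):
    # -log sigmoid(beta * margin), reference terms omitted for brevity
    return -math.log(1 / (1 + math.exp(-beta * (chosen_logp - rejected_logp))))

# Before an update: the chosen response is 1 nat more likely than the rejected one.
print(margin_loss(chosen_logp=-10.0, rejected_logp=-11.0))   # ~0.64
# After an update that pushed BOTH log-probs down, the rejected one further:
print(margin_loss(chosen_logp=-12.0, rejected_logp=-16.0))   # ~0.51 -- loss improved
```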
Explore each failure mode in the tabs below:
DPO Failure Modes
Three ways Direct Preference Optimization breaks down in practice
Model learns verbosity correlates with preference in training data — longer responses win even when quality plateaus or drops.
04 · Fixing DPO: IPO, KTO, SimPO, ORPO
The variant ecosystem
Each of those failure modes led to a targeted fix. And some researchers went further, asking whether you could strip away even more of the pipeline.
IPO bounds the reward margin so probabilities can't collapse. KTO drops the requirement for paired data entirely, working with simple thumbs-up/thumbs-down feedback. SimPO removes the reference model (down to just one model in memory) and adds length normalization. ORPO merges SFT and alignment into a single training stage.
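As one example of what these changes look like in a loss function, here is a sketch of the SimPO objective (Meng et al., 2024): a single policy, per-token length normalization, and a fixed target margin. The names and hyperparameter values are illustrative.

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """Reference-free, length-normalized preference loss (SimPO-style sketch)."""
    # Average log-probability per token, so longer responses get no automatic edge.
    chosen_reward = beta * chosen_logps / chosen_len
    rejected_reward = beta * rejected_logps / rejected_len
    # Require the chosen reward to beat the rejected one by a fixed margin gamma,
    # not just by any positive amount.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```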
Click each card to see what it changes and what it requires:
DPO Family Tree
Click a variant to see what makes it distinct.
Bounded margin prevents overoptimization
Learn from thumbs up/down, no pairs needed
Length-normalized, no reference model
Combines SFT and alignment in one step
Note
05 · Constitutional AI: What If You Don't Need Human Labelers?
Drop the humans
DPO and its variants simplified the training machinery but still relied on human-generated preference data. Anthropic's Constitutional AI (Bai et al., 2022) asked a different question: what if you could remove the humans from the loop?
The idea is straightforward. You write a set of principles (the “constitution”) that describe what good behavior looks like. Then you have the model generate a response, critique its own response using those principles, and write a revised version. The original and revised responses become a preference pair, and you have training data without any human annotators.
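A pseudocode-style sketch of that loop follows. The `generate` callable stands in for any chat-model API, and the constitution text and prompt wording are illustrative, not Anthropic's actual principles:

```python
CONSTITUTION = [
    "Choose the response least likely to help someone cause harm.",
    "Choose the response most honest about its own uncertainty.",
]

def make_preference_pair(prompt, generate):
    """Build one (chosen, rejected) pair with no human annotator in the loop."""
    original = generate(prompt)
    revised = original
    for principle in CONSTITUTION:
        # The model critiques its own response against a principle...
        critique = generate(
            f"Critique this response using the principle: {principle}\n\n"
            f"Response: {revised}"
        )
        # ...then rewrites it to address the critique.
        revised = generate(
            f"Rewrite the response to address this critique.\n\n"
            f"Critique: {critique}\n\nResponse: {revised}"
        )
    # The final revision becomes the preferred example, the original the rejected one.
    return {"prompt": prompt, "chosen": revised, "rejected": original}
```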
This is called RLAIF (RL from AI feedback). Step through the process below:
Constitutional AI — Critique-Revision Loop
Step 1 of 5 · User Input
The model receives a prompt that could elicit harmful or dangerous content. Constitutional AI begins its critique-revision loop before committing to a final response.
Key Insight
06 · GRPO: What If You Don't Need a Critic?
Drop the critic
DPO removed the reward model and the RL loop entirely, but it also lost something: the ability to learn from the model's own outputs during training (on-policy learning). That matters because a model can discover better strategies by trying things and seeing what works, instead of only learning from a fixed dataset.
DeepSeek's Group Relative Policy Optimization (Shao et al., 2024) found a middle ground. It keeps the on-policy generation from PPO but drops the value/critic network. The trick: for each prompt, generate a group of responses, score them all, and use the relative ranking within the group as the training signal. Responses better than average get reinforced. Responses worse than average get suppressed. No separate critic needed.
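Here is a sketch of the group-relative scoring step. The full GRPO objective also includes a clipped policy-ratio term and a KL penalty, omitted here, and the reward values are illustrative (e.g. from a rule-based verifier):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Z-score rewards within one prompt's group of sampled responses."""
    # The group mean plays the role of PPO's learned value baseline,
    # so no separate critic network is needed.
    mean, std = rewards.mean(), rewards.std()
    # Above-average responses get positive advantage (reinforced),
    # below-average ones get negative advantage (suppressed).
    return (rewards - mean) / (std + eps)

# Example: 8 responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
```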
This is the method behind DeepSeek-R1 and its reasoning capabilities. Watch how it works:
GRPO: Group Scoring
Group Relative Policy Optimization
Press Play to walk through GRPO's group sampling and z-score normalization.
Key Insight
The Alignment Toolkit Today
None of these methods “replaced” RLHF. They are all descendants of the same core idea: learn what humans want by comparing outputs. Each one removes a different component from the original pipeline and makes a different tradeoff.
OpenAI still uses PPO-based RLHF. Anthropic uses Constitutional AI (RLAIF). Meta uses hybrid approaches. DeepSeek uses GRPO for reasoning. The open-source community overwhelmingly uses DPO because it's the simplest to run. Hover over each method to see where it sits:
Alignment Method Landscape
Choosing a Method
The right method depends on three things: what kind of feedback data you have, how much compute you can afford, and whether you need the model to explore on its own during training. Walk through this decision tree to find a good starting point:
Alignment Method Selector
Answer each question to find the best post-training method for your setup.
Step 1
What type of feedback data do you have?
Note
Knowledge Check
10 questions. Test whether you actually understood the tradeoffs, not just whether you read the words. Your answers are saved locally.