
RLHF (Reinforcement Learning from Human Feedback)

A training technique in which human evaluators rank model outputs, and these rankings are used to train a reward model. The language model is then optimized via reinforcement learning to produce outputs that the reward model scores highly. RLHF is a key alignment method.
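
A minimal sketch of the first stage, fitting a reward model to ranked pairs, assuming PyTorch; the RewardModel class, preference_loss function, and the toy pooled embeddings standing in for encoded prompt/response pairs are illustrative assumptions, not part of any particular library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled response embedding to a scalar score."""
    def __init__(self, hidden_size=128):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding):
        return self.score(pooled_embedding).squeeze(-1)

def preference_loss(reward_model, chosen_emb, rejected_emb):
    """Bradley-Terry-style pairwise loss: push the human-preferred (chosen)
    response to score higher than the rejected one."""
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random embeddings stand in for encoded chosen/rejected responses.
rm = RewardModel()
chosen = torch.randn(4, 128)
rejected = torch.randn(4, 128)
loss = preference_loss(rm, chosen, rejected)
loss.backward()
```

In the second stage, the trained reward model scores the language model's sampled outputs, and a reinforcement learning algorithm such as PPO updates the language model to raise those scores, typically with a KL penalty against the original model to keep its outputs from drifting too far.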

Related terms

Reinforcement Learning (RL)
Alignment
DPO (Direct Preference Optimization)
Preference Optimization