What is RLHF?

Alignment

RLHF (Reinforcement Learning from Human Feedback): A technique for aligning LLMs with human preferences. A reward model is first trained on human comparisons between candidate responses, and reinforcement learning is then used to optimize the LLM against that reward model's scores.
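The two stages in the definition can be sketched numerically. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the pairwise loss is the Bradley-Terry style objective commonly used for reward models, and the KL-style penalty with coefficient beta is a common addition to the reinforcement learning stage that the definition above does not mention.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one human comparison (hypothetical helper).

    Computes -log(sigmoid(reward_chosen - reward_rejected)), which is small
    when the reward model scores the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(reward: float, logp_policy: float, logp_reference: float,
                     beta: float = 0.1) -> float:
    """Reward used in the RL stage (assumed form, not from the text above).

    Subtracts a penalty proportional to how far the policy's log-probability
    of the response has moved from the frozen reference model's.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The reward model agrees with the human label: lower loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # The reward model disagrees with the human label: higher loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # A response the policy now rates more likely than the reference did
    # has its reward reduced by the penalty.
    print(kl_shaped_reward(1.0, logp_policy=-0.5, logp_reference=-1.0))
```

In practice the scalar rewards come from a neural network scoring whole responses and the loss is averaged over many comparisons; the penalty is one way to keep the optimized model from drifting far from its starting point while chasing reward.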

FAQ

What is RLHF?

RLHF trains LLMs to follow human preferences by combining a reward model with reinforcement learning. It was used to make ChatGPT more helpful and safe.

Is RLHF still used?

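Yes, but DPO (Direct Preference Optimization) is increasingly preferred for its simplicity: it trains directly on preference pairs, without a separate reward model or reinforcement learning loop. RLHF remains important for understanding alignment.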
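To make the simplicity claim concrete, here is a rough sketch of the per-example DPO loss as it is usually written. The function name and the beta default are illustrative assumptions; the inputs are summed log-probabilities of each response under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (hypothetical helper, scalar inputs for clarity).

    No reward model is involved: the implicit reward of a response is
    beta * (policy log-prob - reference log-prob), and the loss is
    -log(sigmoid(implicit_reward_chosen - implicit_reward_rejected)).
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss starts at log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the preferred response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The whole objective is an ordinary supervised loss over preference pairs, which is why no sampling loop or separately trained reward model is needed.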
