Training LLMs to follow human preferences using a reward model + reinforcement learning. Used to make ChatGPT helpful and safe.

Yes, but DPO is increasingly preferred for its simplicity. RLHF remains important for understanding alignment.

What is RLHF? — RLHF (Reinforcement Learning from Human Feedback) Explained