A training method that aligns LLMs with human preferences by optimizing directly on pairs of preferred and rejected outputs, without needing a separate reward model.

Compared with PPO-based RLHF, DPO is simpler (no reward model needed), more stable to train, and produces similar results. Many teams now prefer DPO.
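
To make the preferred/rejected idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library. Each input is assumed to be the summed log-probability of a full response under either the model being trained (the policy) or a frozen reference copy of it.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of responses.

    All arguments are sequence log-probabilities. `beta` (assumed default)
    controls how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the preferred and rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the preferred output and less to the rejected one, so the preference signal is learned without a separately trained reward model.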

What is DPO? — DPO (Direct Preference Optimization) Explained