Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model Explained