Direct preference optimization: Your language model is secretly a reward model
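The paper does not introduce a new dataset. It fine-tunes a language model directly on human preference data, without training a separate reward model or running a reinforcement learning loop. Its experiments use existing datasets: IMDb for controlled sentiment generation, Reddit TL;DR for summarization, and Anthropic Helpful and Harmless for single-turn dialogue.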
BibTeX: