- A general theoretical paradigm to understand learning from human preferences: The paper proposes a general framework for aligning language models with human preferences, focusing on preference optimization in reward-free RLHF, i.e. learning directly from preference data without fitting an explicit reward model (see the loss sketch after this list).
- Anthropic’s Helpfulness and Harmlessness: Anthropic’s Helpful and Harmless (HH) datasets are used for preference optimization; each example contains a dialogue with two candidate assistant responses, one of which is labeled as preferred over the other (see the data sketch after this list).
- AlpacaFarm: The AlpacaFarm dataset pairs instructions with model responses and preference annotations (from human or simulated annotators), and is likewise used for preference optimization (an illustrative record layout is shown after this list).
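As an illustration of what reward-free preference optimization looks like in practice, the sketch below implements a DPO-style pairwise loss on precomputed per-sequence log-probabilities. It is a minimal example under those assumptions, one common instance of this family of objectives, and not necessarily the exact objective analyzed in the paper.

```python
# Minimal sketch of a reward-free pairwise preference loss (DPO-style).
# Assumes precomputed per-sequence log-probabilities; all names are illustrative.
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Pushes the policy to prefer the chosen response over the rejected one,
    measured relative to a frozen reference model."""
    # Implicit "reward" of each response: log-ratio between policy and reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Logistic loss on the scaled margin; beta controls how strongly the
    # policy is kept close to the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = preference_loss(torch.tensor([-12.0, -9.5]),
                       torch.tensor([-14.0, -11.0]),
                       torch.tensor([-12.5, -10.0]),
                       torch.tensor([-13.5, -10.5]))
print(loss.item())
```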
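Both datasets are pairwise preference data: each record ties a prompt or instruction to two responses plus a label for which one annotators preferred. A minimal sketch of inspecting such data, assuming the Hugging Face `datasets` library and the public `Anthropic/hh-rlhf` release with its `chosen`/`rejected` fields; the AlpacaFarm-style field names shown are illustrative assumptions, not a documented schema.

```python
# Minimal sketch: inspecting pairwise preference data for preference optimization.
from datasets import load_dataset

# Anthropic HH: each example stores the dialogue twice, with the final
# assistant reply differing between the "chosen" and "rejected" variants.
hh = load_dataset("Anthropic/hh-rlhf", split="train")
example = hh[0]
print(example["chosen"][:200])
print(example["rejected"][:200])

# An AlpacaFarm-style record (hypothetical field names) carries an instruction,
# two candidate responses, and a preference label instead of chosen/rejected text.
alpaca_farm_like = {
    "instruction": "Summarize the following paragraph.",
    "output_1": "A short, faithful summary.",
    "output_2": "A rambling, off-topic reply.",
    "preference": 1,  # annotator preferred output_1
}
```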