Multimodal Query Suggestion with Multi-Agent Reinforcement Learning from Human Feedback
Differences in Fairness Preferences
A crowdsourced dataset for studying differences in fairness preferences depending on demographic identities.
CodeContest
The dataset used in the paper to train and test the DPO and PPO models.
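Several entries in this catalog involve DPO training on preference pairs. As a point of reference, the sketch below shows the standard DPO objective over per-response log-probabilities; it is a generic illustration with assumed tensor inputs, not the training code from the cited paper.

```python
# Minimal sketch of the DPO loss on (chosen, rejected) preference pairs.
# Log-probabilities are assumed to be summed over response tokens.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Log-ratio of policy vs. reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response relative to the reference.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Dummy per-response log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -8.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -8.7]), torch.tensor([-13.5, -9.2]))
print(loss)
```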
Training a helpful and harmless assistant with reinforcement learning from human feedback
The authors propose a novel approach that incorporates parameter-efficient tuning to better optimize control tokens, thus benefiting controllable generation.
SHP dataset
The SHP (Stanford Human Preferences) dataset is used to evaluate the proposed Compositional Preference Models (CPMs).
HH-RLHF dataset
The HH-RLHF dataset is used to evaluate the proposed Compositional Preference Models (CPMs).
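Both of the evaluation datasets above are pairwise preference collections. The sketch below shows one way to load them with the Hugging Face `datasets` library; the Hub IDs `stanfordnlp/SHP` and `Anthropic/hh-rlhf`, and the split names, are the commonly used public copies and are an assumption here, not taken from the cited paper.

```python
# Load the two preference datasets from the Hugging Face Hub (assumed Hub IDs).
from datasets import load_dataset

shp = load_dataset("stanfordnlp/SHP", split="validation")
hh = load_dataset("Anthropic/hh-rlhf", split="test")

# SHP rows contain a post plus two candidate replies and a preference label;
# HH-RLHF rows contain a "chosen" and a "rejected" dialogue transcript.
print(shp[0].keys())
print(hh[0]["chosen"][:200])
```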
Toxic-DPO Dataset
The Toxic-DPO dataset is used in the paper for reinforcement learning from human feedback.
Anthropic-HH-RLHF Dataset
The Anthropic-HH-RLHF dataset is used in the paper for reinforcement learning from human feedback.
UltraRM-13B
UltraRM-13B is a reward model trained on a large collection of preference feedback for language model training.
AlpacaFarm
AlpacaFarm is a large-scale dataset for preference optimization, consisting of instructions and their corresponding responses.
Anthropic-HH
The Anthropic-HH dataset is a collection of human preference comparisons over assistant responses, used for language model training.
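Most of the collections listed here consist of (chosen, rejected) comparisons used to train or evaluate reward models such as UltraRM-13B. The sketch below shows the standard pairwise (Bradley-Terry) reward-model loss over such comparisons; it is a generic illustration with dummy scores, not code from any of the cited works.

```python
# Pairwise (Bradley-Terry) reward-model loss over preference comparisons.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy scalar rewards for a batch of four preference pairs.
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.1, 0.0])
print(pairwise_reward_loss(chosen, rejected))
```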