-
BIG-Bench Hard
The BIG-Bench Hard dataset is derived from the original BIG-Bench evaluation suite, focusing on tasks that pose challenges to existing language models. -
ChatGPT Language Comprehension and Production
This dataset consists of 12 experiments that explore the extent to which ChatGPT resembles humans in the comprehension and production of language. -
TruthfulQA
TruthfulQA contains 817 questions designed to evaluate whether language models mimic common human falsehoods. -
Edit Distance Robust Watermarks for Language Models
The paper uses language model outputs, that is, sequences of tokens generated by a language model. -
A general theoretical paradigm to understand learning from human preferences
The paper proposes a novel approach to aligning language models with human preferences, focusing on the use of preference optimization in reward-free RLHF. -
Llama: Open and efficient foundation language models
This entry refers to the data used in the LLaMA paper, which introduces a family of foundation language models. -
Fine-tuning Language Models with Advantage-Induced Policy Alignment
The paper uses two datasets: the Anthropic Helpfulness and Harmlessness dataset and the StackExchange dataset. -
Mixtral of Experts
The dataset used in the paper for the instruction-following task. -
FAIRBELIEF
FAIRBELIEF is a language-agnostic analytical approach to capture and assess beliefs embedded in LMs. -
Greaselm: Graph Reasoning Enhanced Language Models for Question Answering
Greaselm: Graph reasoning enhanced language models for question answering -
Large language models struggle to learn long-tail knowledge
Large language models struggle to learn long-tail knowledge -
GMEG-wiki and GMEG-yahoo
The GMEG-wiki and GMEG-yahoo datasets are used to evaluate the proposed approach. -
CoNLL-2014
The task of grammatical error correction (GEC) is to map an ungrammatical sentence x_bad into a grammatical version of it, x_good. -
LM-Critic: Language Models for Unsupervised Grammatical Error Correction
Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs, but manually annotating such pairs can be expensive. -
GLUE benchmark
The paper does not explicitly describe its dataset, but the authors mention using three downstream tasks from the GLUE benchmark: Stanford Sentiment Treebank...