Core Dative PRIME-LM Corpus
The dataset used in the paper to study the inverse frequency effect (IFE) in structural priming. -
HOLISTICBIAS
A large dataset for measuring bias in language models, including nearly 600 descriptor terms across 13 different demographic axes. -
MNLI, QQP, and SST-2
The dataset used in this paper consists of three tasks: Multi-Genre Natural Language Inference (MNLI), Quora Question Pairs (QQP), and Stanford Sentiment Treebank (SST-2). -
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance...
Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? -
Context versus Prior Knowledge in Language Models
The dataset used in the paper to test the persuasion and susceptibility scores of language models. -
Anthropic Helpfulness Base eval
The evaluation split of the Anthropic Helpfulness Base dataset, used in the paper. -
Anthropic Helpfulness Base
The Anthropic Helpfulness Base train split and the Anthropic Helpfulness eval split, both used in the paper. -
Measuring Massive Multitask Language Understanding
A multiple-choice question set spanning 57 subjects, used to evaluate the knowledge of large language models. -
HF-datasets
The dataset is used to evaluate the performance of large language models on missing item prediction tasks. -
Missing item prediction
Large language models (LLMs) can suggest missing elements from items listed in a prompt, which can be used for list completion or for recommendations based on a user's history. -
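As a toy illustration of the list-completion setup described above (the prompt wording and the helper name are hypothetical, not taken from the paper):

```python
# Hypothetical sketch: format a list with a missing element into an
# LLM prompt for missing item prediction.
def missing_item_prompt(items):
    """Build a prompt asking the model to name the missing item."""
    listed = ", ".join(items)
    return (f"The following list is missing one item: {listed}. "
            "Name the most likely missing item.")

# Example: a weekday sequence with one day omitted.
print(missing_item_prompt(["Monday", "Tuesday", "Thursday"]))
```

The returned string would then be sent to the model; the same template works for recommendation-style inputs such as a user's purchase history.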
Limitations of Language Models in Arithmetic and Symbolic Induction
The dataset used in the paper to test the limitations of large pretrained Language Models (LMs) on arithmetic and symbolic induction tasks. -
OpenAssistant dataset
The dataset used for the experiments in the paper, consisting of 1000 benign instruction examples. -
AdvBench dataset
The dataset used for the experiments in the paper, consisting of 60 harmful instructions from the AdvBench dataset. -
Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence A...
The authors used a variety of datasets for question answering, including TriviaQA, Natural Questions, CountryQA, and Jeopardy questions. -
BIG-Bench Hard
The BIG-Bench Hard dataset is derived from the original BIG-Bench evaluation suite, focusing on tasks that pose challenges to existing language models. -
Dense Reward for Free in RLHF
The paper does not describe the dataset explicitly; it is mentioned only that it is a preference dataset for language models. -
CAP: Corpus of Adjective Pairs
The CAP dataset is a corpus of adjective pairs used to evaluate adjective order preferences in language models. -
Automata-based constraints for language model decoding
The dataset used in this paper is a collection of regular expressions and grammars for constraining language models. -
Language Models of Spoken Dutch
The dataset consists of subtitles of television shows provided by the Flemish public-service broadcaster VRT. The dataset is used to train language models of spoken Dutch.