- Cross-View Training
The dataset used in the paper for semi-supervised sequence modeling with cross-view training.
- MISMATCH: Fine-grained Evaluation of Machine-generated Text
The dataset used in the paper for fine-grained evaluation of machine-generated text with mismatch error types.
- PhotoBot: Reference-Guided Interactive Photography via Natural Language
PhotoBot is a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer.
- FairytaleQA
The FairytaleQA dataset is a collection of open-source fairy tales downloaded from Project Gutenberg. The dataset contains 278 fairy tales with a total of 33,577 events...
- Chinese Poetry
The Chinese Poetry dataset is a collection of Chinese poems used for language modeling.
- Switchboard
Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment.
- Yahoo and Yelp corpora
The Yahoo and Yelp corpora contain 100k sentences with greater average length.
- Training CLIP models on Data from Scientific Papers
Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores...
- Goal Driven Discovery of Distributional Differences via Language Descriptions
Describing differences between text distributions with natural language.
- Validation Dataset
The Validation Dataset is used for validation; it contains 1,428 images from nine distinct rooms.
- LV-BERT: Exploiting Layer Variety for BERT
Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. This paper aims to improve...
- CIFAR-10, CIFAR-100, Stanford background dataset, VOC2012 dataset, Rotten Tom...
The dataset used in the paper is not explicitly described. However, it is mentioned that the authors used CIFAR-10 and CIFAR-100 datasets for image classification, and Stanford...
- Penn Treebank
The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- GPT-2 small
The dataset used in this paper is a large language model, GPT-2 small, and its residual stream activations.