Dataset - LDM

AG News

The dataset used in the paper is AG News, a language-domain dataset for topic classification of news articles. The dataset is used to evaluate the performance of...
- Dataset
- JSON
Cross-View Training

The dataset used in the paper on semi-supervised sequence modeling with cross-view training.
- Dataset
- JSON
MISMATCH: Fine-grained Evaluation of Machine-generated Text

The dataset used in the paper for fine-grained evaluation of machine-generated text with mismatch error types.
- Dataset
- JSON
ScanRefer

ScanRefer is a dataset of 51,583 referring descriptions of 11,046 objects from 800 ScanNet scenes.
- Dataset
- JSON
PhotoBot: Reference-Guided Interactive Photography via Natural Language

PhotoBot is a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer.
- Dataset
- JSON
FairytaleQA

The FairytaleQA dataset is a collection of open-source fairy tales downloaded from Project Gutenberg. The dataset contains 278 fairy tales with a total of 33,577 events...
- Dataset
- JSON
Chinese Poetry

The Chinese Poetry dataset is a collection of Chinese poems used for language modeling.
- Dataset
- JSON
Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
CSL

The CSL dataset is a large-scale Chinese scientific literature dataset obtained from the "Qianyan" open-source NLP platform. It consists of 396,209 Chinese core journal papers'...
- Dataset
- JSON
Switchboard

Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment.
- Dataset
- JSON
Yahoo and Yelp corpora

The Yahoo and Yelp corpora dataset contains 100k sentences with greater average length.
- Dataset
- JSON
Training CLIP models on Data from Scientific Papers

Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores...
- Dataset
- JSON
Goal Driven Discovery of Distributional Differences via Language Descriptions

Describing differences between text distributions with natural language.
- Dataset
- JSON
Validation Dataset

The Validation Dataset is used for validation; it contains 1,428 images from nine distinct rooms.
- Dataset
- JSON
LV-BERT: Exploiting Layer Variety for BERT

Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. This paper aims to improve...
- Dataset
- JSON
NLVR2

The dataset used in the paper is a set of sequential vision-and-language tasks, where each task consists of an image and a text input.
- Dataset
- JSON
CIFAR-10, CIFAR-100, Stanford background dataset, VOC2012 dataset, Rotten Tom...

The dataset used in the paper is not explicitly described. However, it is mentioned that the authors used CIFAR-10 and CIFAR-100 datasets for image classification, and Stanford...
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
GPT-2 small

The dataset used in this paper is a large language model, GPT-2 small, and its residual stream activations.
- Dataset
- JSON
GPT-4

The dataset used in this paper is a large language model, GPT-4, and its residual stream activations.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

420 datasets found