Dataset - LDM

Yahoo and Yelp corpora

The Yahoo and Yelp corpora dataset contains 100k sentences with greater average length.
- Dataset
- JSON
Training CLIP models on Data from Scientific Papers

Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores...
- Dataset
- JSON
Goal Driven Discovery of Distributional Differences via Language Descriptions

Describing differences between text distributions with natural language.
- Dataset
- JSON
Validation Dataset

The Validation Dataset is used for validation and contains 1428 images from nine distinct rooms.
- Dataset
- JSON
LV-BERT: Exploiting Layer Variety for BERT

Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. This paper aims to improve...
- Dataset
- JSON
NLVR2

The dataset used in the paper is a set of sequential vision-and-language tasks, where each task consists of an image and a text input.
- Dataset
- JSON
CIFAR-10, CIFAR-100, Stanford background dataset, VOC2012 dataset, Rotten Tom...

The dataset used in the paper is not explicitly described. However, it is mentioned that the authors used CIFAR-10 and CIFAR-100 datasets for image classification, and Stanford...
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
GPT-2 small

The dataset used in this paper is a large language model, GPT-2 small, and its residual stream activations.
- Dataset
- JSON
GPT-4

The dataset used in this paper is a large language model, GPT-4, and its residual stream activations.
- Dataset
- JSON
GLOW: Global Weighted Self-Attention Network for Web Search

GLOW is a novel Global Weighted Self-Attention Network for web document search. It incorporates global corpus statistics into the deep matching model.
- Dataset
- JSON
RefCOCO

The dataset used in the paper is a benchmark for referring expression grounding, containing 142,210 referring expressions for 50,000 referents in 19,994 images.
- Dataset
- JSON
SNLI

The dataset used in the paper is the Stanford Natural Language Inference (SNLI) dataset, which consists of 549,367 premise-hypothesis pairs for train/dev/test sets and target...
- Dataset
- JSON
BERT: Pre-training of deep bidirectional transformers for language understanding

This paper proposes BERT, a pre-trained deep bidirectional transformer for language understanding.
- Dataset
- JSON
Text-to-Image Synthesis Dataset

This dataset is used for text-to-image synthesis.
- Dataset
- JSON
SST2, SST5, MR, IMDB, Ag news

These datasets are used for the sentence classification task.
- Dataset
- JSON
Ego4D Goal-Step

The Ego4D Goal-Step dataset is a large-scale dataset containing 3,000 hours of egocentric video. The dataset is used for action recognition, action...
- Dataset
- JSON
String Transformation Tasks

A publicly available data set of 130 real world string transformation tasks from Cropper and Dumancic [2020].
- Dataset
- JSON
GLUE development set

The GLUE development set is a dataset used for evaluating the performance of language models.
- Dataset
- JSON
LLaMA-7B and LLaMA-13B models

The paper does not explicitly name its dataset, but it states that the authors used the LLaMA-7B and LLaMA-13B models and the GLUE development set.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

530 datasets found
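
The API mentioned above is not documented on this page, so the following is only a sketch under stated assumptions. It assumes the registry exposes a CKAN-style Action API with a `package_search` endpoint, which is a guess based on the page layout and may not hold. The base URL `https://registry.example.org` and the helper names are hypothetical, and the sample response is hand-written to match the assumed shape rather than taken from the registry. Consult the API Docs for the actual endpoints.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; replace with the registry's real address.
BASE_URL = "https://registry.example.org"


def build_search_url(query, rows=20, start=0, base_url=BASE_URL):
    """Build a search URL, assuming a CKAN-style package_search endpoint."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url}/api/3/action/package_search?{params}"


def parse_search_response(body):
    """Return (total_count, titles) from a JSON body in the assumed CKAN shape."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise ValueError("search request was not successful")
    result = payload["result"]
    titles = [item.get("title", item.get("name", "")) for item in result["results"]]
    return result["count"], titles


# Offline illustration: a hand-written response in the assumed shape.
sample_body = json.dumps({
    "success": True,
    "result": {"count": 2, "results": [{"title": "SNLI"}, {"title": "RefCOCO"}]},
})
count, titles = parse_search_response(sample_body)
print(build_search_url("SNLI", rows=5))
print(count, titles)
```

To query a live registry, the URL from `build_search_url` could be fetched with `urllib.request.urlopen` and the response body passed to `parse_search_response`. The `rows` and `start` parameters would then page through the full result list.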