Dataset - LDM

Spherical MNIST

Spherical MNIST is constructed from the MNIST dataset by back-projecting the digits into an equirectangular projection at a resolution of 160x80. The digit labels are used to...
1 Billion Word Language Model Benchmark

The 1 Billion Word Language Model Benchmark is a dataset used for measuring progress in statistical language modeling, consisting of a large collection of text data.
Caltech-UCSD Birds 200 dataset (CUB-200)

The 2011 version of the Caltech-UCSD Birds 200 dataset (CUB-200-2011) contains 11,788 images of 200 bird species and is widely used as a benchmark for text-to-image generation.
TaoMultimodal Dataset

A large-scale dataset for multi-modal pretraining in Chinese, consisting of 3.1M image-text pairs from the mobile Taobao platform.
French Street Name Signs Dataset

The French Street Name Signs (FSNS) dataset contains images of French street name signs extracted from Google Street View, featuring low-resolution text lines in natural scenes...
WFLW

WFLW contains 10,000 faces with 98 fully manually annotated landmarks, designed to be a challenging dataset with rich attribute annotations.
300-W

300-W is currently the most widely used dataset for facial landmark detection, created from four datasets including AFW, LFPW, HELEN, and IBUG, with each image annotated with 68...
Google Billion Word dataset

The Google Billion Word dataset is one of the largest language modeling datasets with almost one billion tokens and a vocabulary of over 800K words, based on an English corpus...
MojiTalk

The MojiTalk dataset consists of 596,959 post-response pairs from Twitter, where each response is labeled with one of 64 emojis indicating the emotion of the response.
CNN/Daily Mail corpus

The CNN/Daily Mail corpus contains pairs of online news articles and their summaries, consisting of approximately 287,000 training pairs, 13,368 validation pairs, and 11,490...
TL;DR Reddit corpus

The TL;DR Reddit corpus consists of approximately 3 million content-summary pairs mined from Reddit, designed for the TL;DR challenge focusing on text summarization.

You can also access this registry using the API (see API Docs).

20,491 datasets found