Dataset - LDM

WebLI

The dataset used in the paper for subject-driven text-to-image synthesis.
- Dataset
- JSON
DEEP

Detecting Errors through Ensembling Prompts (DEEP) - an end-to-end large language model framework for detecting factual errors in text summarization.
- Dataset
- JSON
MOSES

The MOSES dataset is a large-scale molecular dataset containing 1.9 million molecules with up to 30 heavy atoms.
- Dataset
- JSON
GSM8K dataset

GSM8K is a set of grade-school math word problems; the paper uses it to test the safety of artificial general intelligence (AGI) systems.
- Dataset
- JSON
BigEarthNet-MM

A large-scale benchmark archive for remote sensing image classification and retrieval.
- Dataset
- JSON
Wind Farm Dataset

The dataset is used to test the HCMAPPO algorithm for large-scale wind farm control. It includes 13, 16, 19, and 22 wind turbines with their coordinates, wind speeds, and...
- Dataset
- JSON
Nordland Railway dataset

The Nordland Railway dataset is a large-scale driving dataset that includes a 728 km train journey from Trondheim to Bodø in Nordland, Norway, recorded four times, once per season.
- Dataset
- JSON
WavCaps

The WavCaps dataset contains ChatGPT-assisted weakly-labeled audio captioning data.
- Dataset
- JSON
MLS

MLS: A large-scale multilingual dataset for speech research.
- Dataset
- JSON
People’s Speech

The People’s Speech: A large-scale diverse English speech recognition dataset for commercial usage.
- Dataset
- JSON
YTF

Face recognition and person re-identification using paired image-attribute data, where the attributes (i.e., soft biometrics) are only available during the training phase.
- Dataset
- JSON
VGGSound

The VGGSound dataset is a large-scale audio-visual dataset containing 10,000 10-second video clips with corresponding audio files.
- Dataset
- JSON
DataComp-1B

The dataset used in the paper is DataComp-1B, a large-scale dataset for training next-generation image-text models.
- Dataset
- JSON
Webvid10M

The dataset used for training the image-to-video model consists of LAION COCO 600M and Webvid10M.
- Dataset
- JSON
Webvid-10M

The dataset used for training the video model consists of Webvid-10M, a large-scale dataset of short videos with textual descriptions.
- Dataset
- JSON
LAION COCO 600M

The dataset used for training the text-to-video model consists of 20 million videos and 600 million images.
- Dataset
- JSON
DOTA-v2.0

A large-scale dataset for object detection in aerial images, containing 11,268 images and 1,793,658 objects.
- Dataset
- JSON
VoxCeleb: A Large-Scale Speaker Identification Dataset

VoxCeleb: A Large-Scale Speaker Identification Dataset
- Dataset
- JSON
Criminal

A large-scale dataset for charge prediction, consisting of roughly 500,000 legal cases.
- Dataset
- JSON
LAION-Aesthetic

The dataset used in the paper is LAION-Aesthetic, a large-scale image dataset.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

33 datasets found