Dataset - LDM

CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration fo...

CBBQ is a Chinese Bias Benchmark dataset curated with Human-AI Collaboration for Large Language Models. It consists of over 100K questions jointly constructed by human experts...
- Dataset
- JSON
Natural Questions: A Benchmark for Question Answering Research

A question answering benchmark that includes a large dataset of natural questions.
- Dataset
- JSON
A natural language fMRI dataset for voxelwise encoding models

A natural language fMRI dataset for voxelwise encoding models.
- Dataset
- JSON
Microsoft Research Video Description Corpus (MSVD)

The MSVD dataset is a collection of 1,970 open-domain YouTube clips, each annotated with variable-length captions.
- Dataset
- JSON
Annotation Tool dataset

A dataset of annotations for the Interactive Gameplay dataset.
- Dataset
- JSON
Interactive Gameplay dataset

A dataset of natural language commands written by crowd-sourced workers for an interactive Minecraft game.
- Dataset
- JSON
Image and Text Prompts dataset

A dataset of natural language commands written by crowd-sourced workers for an interactive Minecraft game.
- Dataset
- JSON
AI-hub Dialogue Dataset

AI-hub dialogue dataset for Korean dialogue processing.
- Dataset
- JSON
KLUE

KLUE benchmark dataset for Korean language understanding.
- Dataset
- JSON
Incomplete Syntax Influence Korean Language Model

Syntactically Incomplete Korean (SIKO) dataset for Korean language models.
- Dataset
- JSON
NLGP Dataset

A dataset of 297,845 preprocessed Python source files for training and evaluation of natural language-guided programming (NLGP) models.
- Dataset
- JSON
NLGP Benchmark Dataset

A dataset of 201 curated test cases for natural language-guided programming (NLGP) to evaluate the quality of code predictions.
- Dataset
- JSON
Large Scale Analysis of Open MOOC Reviews to Support Learners’ Course Selection

The dataset contains 2.4 million reviews from five different MOOC platforms: Udemy, Coursera, Domestika, Platzi, and Crehana.
- Dataset
- JSON
ICEWS Coded Event Data

The Integrated Crisis Early Warning System (ICEWS) dataset is a real-time stream of news stories ingested and processed to create a final dataset of events.
- Dataset
- JSON
ROC-Stories: A Corpus for Evaluating Story Generation Models

ROC-Stories: A Corpus for Evaluating Story Generation Models
- Dataset
- JSON
NGEP: A Graph-based Event Planning Framework for Story Generation

NGEP: A Graph-based Event Planning Framework for Story Generation
- Dataset
- JSON
RealityTalk: Real-Time Speech-Driven Augmented Presentation for AR Live Story...

RealityTalk is a system that augments live video presentations with speech-driven interactive virtual elements.
- Dataset
- JSON
SIGMORPHON 2019 datasets

The datasets developed for the SIGMORPHON 2019 lemmatization task are annotated according to the UniMorph schema guidelines.
- Dataset
- JSON
AfricanHLT 2010

A dataset for automatic text summarization, containing documents in three languages.
- Dataset
- JSON
YouTube Clickbait Detection Dataset

A collection of YouTube videos with their comments and metadata, used to evaluate the performance of the Online Video Clickbait Protector (OVCP) scheme.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

214 datasets found