-
CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration fo...
CBBQ is a Chinese Bias Benchmark dataset curated with Human-AI Collaboration for Large Language Models. It consists of over 100K questions jointly constructed by human experts... -
Natural Questions: A Benchmark for Question Answering Research
Introduces a benchmark for question answering research built around a large dataset of natural questions. -
A natural language fMRI dataset for voxelwise encoding models
A natural language fMRI dataset for voxelwise encoding models. -
Microsoft Research Video Description Corpus (MSVD)
The MSVD dataset is a collection of 1,970 open-domain clips from YouTube, annotated with variable-length captions. -
Annotation Tool dataset
A dataset of annotations for the Interactive Gameplay dataset. -
Interactive Gameplay dataset
A dataset of natural language commands written by crowd-sourced workers for an interactive Minecraft game. -
Image and Text Prompts dataset
A dataset of natural language commands written by crowd-sourced workers for an interactive Minecraft game. -
AI-hub Dialogue Dataset
AI-hub dialogue dataset for Korean dialogue processing -
Incomplete Syntax Influence Korean Language Model
Syntactically Incomplete Korean (SIKO) dataset for Korean language models -
NLGP Dataset
A dataset of 297,845 preprocessed Python source files for training and evaluation of natural language-guided programming (NLGP) models. -
NLGP Benchmark Dataset
A dataset of 201 curated test cases for natural language-guided programming (NLGP) to evaluate the quality of code predictions. -
Large Scale Analysis of Open MOOC Reviews to Support Learners’ Course Selection
The dataset contains 2.4 million reviews from five different MOOC platforms: Udemy, Coursera, Domestika, Platzi, and Crehana. -
ICEWS Coded Event Data
The Integrated Crisis Early Warning System (ICEWS) dataset is a real-time stream of news stories ingested and processed to create a final dataset of events. -
ROC-Stories: A Corpus for Evaluating Story Generation Models
ROC-Stories: A Corpus for Evaluating Story Generation Models -
NGEP: A Graph-based Event Planning Framework for Story Generation
NGEP: A Graph-based Event Planning Framework for Story Generation -
RealityTalk: Real-Time Speech-Driven Augmented Presentation for AR Live Story...
RealityTalk is a system that augments live video presentations with speech-driven interactive virtual elements. -
SIGMORPHON 2019 datasets
The datasets developed for the SIGMORPHON 2019 lemmatization task are annotated according to the UniMorph schema guidelines. -
AfricanHLT 2010
A dataset used for an automatic text summarization task, containing documents in three languages. -
YouTube Clickbait Detection Dataset
The dataset is a collection of online videos from YouTube, with comments and metadata. It is used to evaluate the performance of the Online Video Clickbait Protector (OVCP) scheme.