Natural Language Processing - Groups

SST-2

The dataset used for the experiments across ten models, ranging from bag-of-words models to pre-trained transformers; the authors find that a model having higher AUC does not necessarily...

Dataset
JSON

Deep Compositional Robotic Planners

A dataset for training a compositional hierarchical recurrent network to follow natural language commands in continuous environments.

Dataset
JSON

MS MARCO: A Human-Generated Machine Reading Comprehension Dataset

The dataset is used for training and evaluating the MS MARCO model, a question answering model.

Dataset
JSON

Photorealistic text-to-image diffusion models with deep language understanding

The authors present a photorealistic text-to-image diffusion model with deep language understanding.

Dataset
JSON

Google Speech Commands Dataset

The Google Speech Commands Dataset contains 64,727 one-second-long utterance files, each recorded and labeled with one of 30 target categories.

Dataset
JSON

Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide...

Dataset
JSON

Wiki-40B, PG-19, C4, etc.

The dataset used in the paper is not explicitly described. However, it is mentioned that the authors used various benchmarks such as Wiki-40B, PG-19, C4, etc.

Dataset
JSON

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally...

Dataset
JSON

CXR-LLAVA

A multimodal large language model for interpreting chest X-ray images

Dataset
JSON

Stanford Alpaca

The dataset used in the paper is not explicitly described, but it is mentioned that the authors used CIFAR-10 and CIFAR-100 datasets for image classification, and ImageNet-100...

Dataset
JSON

AG News

The dataset used in the paper is a language domain dataset, specifically for sentiment classification, named AG News. The dataset is used to evaluate the performance of...

Dataset
JSON

Cross-View Training

The dataset used in the paper for semi-supervised sequence modeling with cross-view training.

Dataset
JSON

MISMATCH: Fine-grained Evaluation of Machine-generated Text

The dataset used in the paper for fine-grained evaluation of machine-generated text with mismatch error types.

Dataset
JSON

ScanRefer

ScanRefer is a dataset of 51,583 referring descriptions of 11,046 objects from 800 ScanNet scenes.

Dataset
JSON

PhotoBot: Reference-Guided Interactive Photography via Natural Language

PhotoBot is a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer.

Dataset
JSON

FairytaleQA

The FairytaleQA dataset is a collection of open-source fairy tales downloaded from Project Gutenberg. The dataset contains 278 fairy tales with a total of 33,577 events...

Dataset
JSON

Chinese Poetry

The Chinese Poetry dataset is a dataset of Chinese poems used for language modeling.

Dataset
JSON

Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.

Dataset
JSON

CSL

The CSL dataset is a large-scale Chinese scientific literature dataset obtained from the "Qianyan" open-source NLP platform. It consists of 396,209 Chinese core journal papers'...

Dataset
JSON

Switchboard

Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment.

Dataset
JSON

530 datasets found