LAMA

The LAMA (LAnguage Model Analysis) dataset is designed to probe the factual and commonsense knowledge of pretrained language models through cloze-style questions based on knowledge...

Dataset
JSON

Multi-Genre Natural Language Inference (MultiNLI) dataset

The MultiNLI corpus consists of 433k sentence pairs drawn from ten genres of written and spoken English, hence the name "multi-genre". It features matched and mismatched development/test...

Dataset
JSON

Stanford Natural Language Inference (SNLI) dataset

The Stanford Natural Language Inference (SNLI) dataset consists of sentence pairs (a premise and a hypothesis) labeled with their inference relation. In this work, the authors ignore the labels and...

Dataset
JSON

Ubuntu Dialogue

The Ubuntu Dialogue dataset is extracted from Ubuntu Internet Relay Chat (IRC) channels and contains about 1.85 million conversations with an average of 5 utterances per conversation, ideal...

Dataset
JSON

Movie Triples

The Movie Triples dataset contains about 240,000 dialogue triples covering a wide range of topics, making it suitable for studying the relevance-diversity tradeoff in multi-turn...

Dataset
JSON

BERT Pretraining Dataset

The BERT dataset includes the English Wikipedia corpus and BookCorpus, totaling roughly 3.4B words, used for unsupervised pre-training.

Dataset
JSON

WMT14 English-to-German

The WMT14 English-to-German dataset consists of about 4.5M parallel sentence pairs used to train machine translation models.

Dataset
JSON

IWSLT14 German-to-English

The IWSLT14 German-to-English dataset contains approximately 153K sentence pairs used for the machine translation task.

Dataset
JSON

BEA 2019 shared task dataset

The Building Educational Applications (BEA) shared task on GEC provides datasets including the Cambridge English Write & Improve corpus, which is composed of texts written...

Dataset
JSON

CoNLL 2014 shared task dataset

The CoNLL 2014 shared task dataset consists of essays written by undergraduate students, annotated for grammatical errors.

Dataset
JSON

First Certificate in English (FCE) dataset

The First Certificate in English (FCE) dataset contains essays written by non-native learners of English assessed in a language exam, annotated for language errors and...

Dataset
JSON

WMT19 QE Datasets

The dataset consists of parallel data from various corpora, used for training and evaluating the bilingual BERT model for translation quality estimation.

Dataset
JSON

QuAC

QuAC is a dataset for question answering in a conversational context, requiring understanding of the multi-turn dialogue history to provide contextually relevant answers derived...

Dataset
JSON

Sarcastic Tweets Dataset

A dataset of 3,000 sarcastic tweets, each interpreted by five human judges, focusing on the task of sarcasm interpretation.

Dataset
JSON

Sarcasm Interpretation Dataset

The dataset contains 4,762 pairs of sarcastic messages and hearer interpretations, collected through a crowdsourcing experiment.

Dataset
JSON

MedNLI

The MedNLI dataset is used to predict the entailment relation between a pair of sentences, with premises taken from doctors' notes in the clinical dataset MIMIC-III.

Dataset
JSON

MultiNLI

The MultiNLI corpus is a dataset designed to assist in learning natural language inference, featuring sentence pairs labeled as entailment, neutral, or contradiction, which aid...

Dataset
JSON

Sexism Categorization Dataset

The dataset comprises 13,023 accounts of sexism, including first-person accounts from survivors, each tagged with at least one of 23 categories of sexism.

Dataset
JSON

ConvAI2 Dataset

The ConvAI2 dataset, derived from Persona-Chat, contains dialogues between crowdworkers who role-play as assigned personas, enabling the development of conversational agents...

Dataset
JSON

REST dataset

The REST dataset is derived from restaurant reviews and contains review sentences with aspect sentiment annotations for aspect-based sentiment analysis.

Dataset
JSON

24,167 datasets found