24,167 datasets found

Organizations: No Organization Formats: JSON

  • LAMA

    The LAMA (Language Model Analysis) dataset is designed to probe the factual and commonsense knowledge of pretrained language models through cloze-style questions based on knowledge...
  • Multi-Genre Natural Language Inference (MultiNLI) dataset

    The MultiNLI corpus consists of 433k sentence pairs spanning nine genres. It features matched and mismatched development/test...
  • Stanford Natural Language Inference (SNLI) dataset

    The Stanford Natural Language Inference (SNLI) dataset consists of pairs of sequences that represent certain semantic attributes. In this work, the authors ignore the labels and...
  • Ubuntu Dialogue

    The Ubuntu Dialogue dataset is extracted from Ubuntu Internet Relay Chat (IRC) channels and contains about 1.85 million conversations with an average of 5 utterances per conversation, ideal...
  • Movie Triples

    The Movie Triples dataset contains about 240,000 dialogue triples covering a wide range of topics, making it suitable for studying the relevance-diversity tradeoff in multi-turn...
  • BERT Pretraining Dataset

    The BERT dataset includes the English Wikipedia corpus and BookCorpus, totaling roughly 3.4B words, used for unsupervised pre-training.
  • WMT14 English-to-German

    The WMT14 English-to-German dataset consists of about 4.5M parallel sentence pairs used to train machine translation models.
  • IWSLT14 German-to-English

    The IWSLT14 German-to-English dataset contains approximately 153K sentence pairs used for the machine translation task.
  • BEA 2019 shared task dataset

    The Building Educational Applications (BEA) shared task on GEC provides datasets including the Cambridge English Write & Improve corpus, which is composed of texts written...
  • CoNLL 2014 shared task dataset

    The CoNLL 2014 shared task dataset consists of essays written by undergraduate students, annotated for grammatical errors.
  • First Certificate in English (FCE) dataset

    The First Certificate in English (FCE) dataset contains essays written by non-native learners of English assessed in a language exam, annotated for language errors and...
  • WMT19 QE Datasets

    The dataset consists of parallel data from various corpora used for training and evaluating the bilingual BERT model for translation quality estimation.
  • QuAC

    QuAC is a dataset for question answering in a conversational context, requiring understanding of the multi-turn dialogue history to provide contextually relevant answers derived...
  • Sarcastic Tweets Dataset

    A dataset of 3,000 sarcastic tweets, each interpreted by five human judges, focusing on the task of sarcasm interpretation.
  • Sarcasm Interpretation Dataset

    The dataset contains 4,762 pairs of sarcastic messages and hearer interpretations, collected through a crowdsourcing experiment.
  • MedNLI

    The MedNLI dataset is used to predict the entailment relation between a pair of sentences, with premises taken from doctors' notes in the clinical dataset MIMIC-III.
  • MultiNLI

    The MultiNLI corpus is a dataset designed to assist in learning natural language inference, featuring sentence pairs labeled as entailment, neutral, or contradiction, which aid...
  • Sexism Categorization Dataset

    The dataset comprises 13,023 accounts of sexism, including first-person accounts from survivors, each tagged with at least one of 23 categories of sexism.
  • ConvAI2 Dataset

    The ConvAI2 dataset, derived from Persona-Chat, contains dialogues between crowdworkers who role-play as assigned personas, enabling the development of conversational agents...
  • REST dataset

    The REST dataset is derived from restaurant reviews and contains review sentences with aspect sentiment annotations for aspect-based sentiment analysis.