20,499 datasets found

  • Commoncrawl Dataset

    The CommonCrawl dataset provides parallel text collected from the web, serving as a resource for training and evaluating machine translation systems.
  • Europarl Corpus

    The Europarl corpus is a parallel corpus derived from the proceedings of the European Parliament; its English-French portion is used for training and evaluating SMT systems.
  • WMT Biomedical Test Set

    The WMT Biomedical test set includes Medline abstracts intended for evaluating the translation of scientific texts.
  • WMT 2016 News Translation Test Set

    The WMT 2016 news translation test set is used for evaluating translation performance in the context of news articles.
  • Multi30k dataset

    The Multi30k dataset is a multilingual extension of the Flickr30k image-captioning dataset, containing English and German language captions for images.
  • MultiWOZ

    The MultiWOZ dataset is a large-scale multi-domain dialogue dataset containing fully-annotated human-human conversations about tourist-related tasks, spanning seven domains...
  • Heuristic-based Adversarial Dataset

    A dataset generated from templates that create sentences designed to reveal weaknesses in NLI models based on syntactic heuristics, focusing on entailment and non-entailment...
  • Error-Analysis Motivated Attacks

    This dataset comprises tests categorized by the mistakes NLI models make, such as antonym and negation-word tests, and is designed to stress-test NLP models.
  • Single Word Replacement Attacks

    The dataset is created by modifying SNLI examples with single word replacements that test lexical inferences and simple world knowledge, aimed at analyzing model robustness...
  • MNLI

    The Multi-Genre Natural Language Inference (MNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information, covering multiple...
  • FewRel

    FewRel is a few-shot relation classification dataset consisting of sentences with annotated subject and object entity mentions.
  • AIDA

    The AIDA dataset is annotated with Wikipedia URLs for entity linking; its documents require linking textual entity mentions to knowledge-base entities.
  • LAMA

    The LAMA (Language Model Analysis) dataset is designed to probe the factual and commonsense knowledge of pretrained language models through cloze-style queries based on knowledge...
  • Multi-Genre Natural Language Inference (MultiNLI) dataset

    The MultiNLI corpus consists of 433k sentence pairs spanning nine genres, which is what makes it multi-genre. It features matched and mismatched development/test...
  • Stanford Natural Language Inference (SNLI) dataset

    The Stanford Natural Language Inference (SNLI) dataset consists of sentence pairs annotated with semantic relations. In this work, the authors ignore the labels and...
  • Ubuntu Dialogue

    The Ubuntu Dialogue dataset is extracted from Ubuntu Internet Relay Chat (IRC) logs and contains about 1.85 million conversations with an average of 5 utterances per conversation, ideal...
  • Movie Triples

    The Movie Triples dataset contains about 240,000 dialogue triples covering a wide range of topics, making it suitable for studying the relevance-diversity tradeoff in multi-turn...
  • BERT Pretraining Dataset

    The BERT pretraining corpus comprises the English Wikipedia and BookCorpus, totaling roughly 3.3B words, used for unsupervised pre-training.
  • WMT14 English-to-German

    The WMT14 English-to-German dataset consists of about 4.5M training parallel sentence pairs utilized for machine translation.
  • IWSLT14 German-to-English

    The IWSLT14 German-to-English dataset contains approximately 153K sentence pairs used for the machine translation task.
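Several of the adversarial entries above (the heuristic-based dataset, the single-word-replacement attacks) are built by generating sentence pairs from templates. The sketch below illustrates the general idea only; the templates, word lists, and function names are hypothetical and are not the actual generation code of any of these datasets.

```python
# Hypothetical sketch of template-based adversarial NLI generation.
# Word lists and labels are illustrative, not from any released dataset.
import itertools

SUBJECTS = ["the doctor", "the lawyer"]
VERBS = ["saw", "helped"]
OBJECTS = ["the artist", "the judge"]


def lexical_overlap_pairs():
    """Generate premise/hypothesis pairs in which the hypothesis swaps
    subject and object, so full lexical overlap does NOT imply
    entailment (label: non-entailment)."""
    pairs = []
    for subj, verb, obj in itertools.product(SUBJECTS, VERBS, OBJECTS):
        premise = f"{subj.capitalize()} {verb} {obj}."
        hypothesis = f"{obj.capitalize()} {verb} {subj}."
        pairs.append({"premise": premise,
                      "hypothesis": hypothesis,
                      "label": "non-entailment"})
    return pairs


examples = lexical_overlap_pairs()
print(len(examples), examples[0])
```

Because every generated pair shares exactly the same words, a model that relies on a lexical-overlap heuristic will wrongly predict entailment on all of them, which is what makes such template sets useful as stress tests.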