24,167 datasets found

  • ClueWeb09-B

    ClueWeb includes documents from ClueWeb09-B and queries from the TREC Web Track ad hoc retrieval task 2009-2012. The dataset consists of 200 queries with relevance judgements...
  • LINNAEUS Dataset

    The LINNAEUS dataset is used for species name identification in biomedical literature.
  • Species-800 Corpus

    The Species-800 corpus is used for species name recognition in text.
  • JNLPBA Corpus

    The JNLPBA corpus serves as a benchmark for bio-entity recognition tasks.
  • BioCreative V CDR Corpus

    The BioCreative V CDR task corpus is a resource for chemical disease relation extraction.
  • English-Finnish and English-Estonian Datasets

    Monolingual English datasets consisting of backtranslated and parallel data used for training the translation models between English, Finnish, and Estonian.
  • Finnish-Estonian Parallel Data

    A bilingual corpus created by triangulating English–Finnish and English–Estonian parallel data, resulting in a set of 679,252 sentence pairs used to extract cognates and improve...
  • WMT 2014 English-German Translation Dataset

    The WMT 2014 English-German translation dataset consists of parallel sentences in English and German used to evaluate machine translation models.
  • Commoncrawl Dataset

    The Commoncrawl dataset provides parallel text collected from the web, serving as a resource for training and evaluating machine translation systems.
  • Europarl Corpus

    The Europarl corpus is an English-French dataset derived from the proceedings of the European Parliament, used for training and evaluating SMT systems.
  • WMT Biomedical Test Set

    The WMT Biomedical test set includes Medline abstracts intended for evaluating the translation of scientific texts.
  • WMT 2016 News Translation Test Set

    The WMT 2016 news translation test set is used for evaluating translation performance in the context of news articles.
  • Multi30k dataset

    The Multi30k dataset is a multilingual extension of the Flickr30k image-captioning dataset, containing English and German language captions for images.
  • MultiWOZ

    The MultiWOZ dataset is a large-scale multi-domain dialogue dataset containing fully-annotated human-human conversations related to tourists, spanning across seven domains...
  • Heuristic-based Adversarial Dataset

    A dataset generated from templates that create sentences designed to reveal weaknesses in NLI models based on syntactic heuristics, focusing on entailment and non-entailment...
  • Error-Analysis Motivated Attacks

    This dataset consists of various tests categorized based on mistakes made by NLI models, including tests like antonyms and negation words, designed to stress test NLP models.
  • Single Word Replacement Attacks

    The dataset is created by modifying SNLI examples with single word replacements that test lexical inferences and simple world knowledge, aimed at analyzing model robustness...
  • MNLI

    The Multi-Genre Natural Language Inference (MNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information, covering multiple...
  • FewRel

    FewRel is a relation classification dataset that consists of sentences with annotated subject and object entity mentions, used for supervised relation classification tasks.
  • AIDA

    The AIDA dataset is annotated with Wikipedia URLs for entity linking and contains documents whose entity mentions must be linked to knowledge base entities.