20,499 datasets found

  • AQUAINT

    The AQUAINT dataset is used for evaluating named entity disambiguation performance.
  • MSNBC

    The MSNBC dataset is used for evaluating named entity disambiguation performance.
  • AIDA-CoNLL

    The AIDA-CoNLL dataset consists of annotated entities in a large corpus for named entity disambiguation tasks.
  • FQuAD: French Question Answering Dataset

    The French Question Answering Dataset (FQuAD) is a native Reading Comprehension dataset comprising questions and answers extracted from Wikipedia articles. It aims to provide a...
  • English Web Treebank

    The English Web Treebank is part of the Universal Dependencies framework and serves as a syntactically and semantically annotated corpus for training and evaluating dependency...
  • WIKIHOP

    WIKIHOP is a dataset constructed to require multi-hop reasoning over multiple Wikipedia paragraphs while answering entity-relation questions.
  • CWQ

    CWQ (Complex Web Questions) is a dataset involving complex web-based questions requiring multiple steps to answer.
  • CQ

    CQ (Complex Questions) consists of complex queries from Google and is designed for answering questions from a knowledge base.
  • SEARCHQA

    SEARCHQA is a dataset designed for reading comprehension, containing trivia questions and web snippets retrieved through Google.
  • CMU-MOSEI

    CMU-MOSEI is a dataset for multimodal sentiment analysis with sentiment annotations at the sentence level, featuring a blend of audio-visual and textual data.
  • CMU-MOSI

    CMU-MOSI (CMU Multimodal Opinion Sentiment Intensity) is a multimodal dataset for sentiment analysis, containing 2199 video segments from 93...
  • EmoContext Dataset

    The EmoContext task dataset consists of conversations extracted from social media, intended for emotion detection, annotated with four main emotions: Happy, Sad, Angry, and...
  • ClueWeb09-B

    ClueWeb includes documents from ClueWeb09-B and queries from the TREC Web Track ad hoc retrieval task 2009-2012. The dataset consists of 200 queries with relevance judgements...
  • LINNAEUS Dataset

    The LINNAEUS dataset is a corpus for species name identification in biomedical literature.
  • Species-800 Corpus

    The Species-800 corpus is used for species name recognition in text.
  • JNLPBA Corpus

    The JNLPBA corpus serves as a benchmark for bio-entity recognition tasks.
  • BioCreative V CDR Corpus

    The BioCreative V CDR task corpus is a resource for chemical disease relation extraction.
  • English-Finnish and English-Estonian Datasets

    Monolingual English datasets consisting of backtranslated and parallel data used for training the translation models between English, Finnish, and Estonian.
  • Finnish-Estonian Parallel Data

    A bilingual corpus created by triangulating English–Finnish and English–Estonian parallel data, resulting in a set of 679,252 sentence pairs used to extract cognates and improve...
  • WMT 2014 English-German Translation Dataset

    The WMT 2014 English-German translation dataset consists of parallel sentences in English and German used to evaluate machine translation models.