20,499 datasets found

  • QNLI

    The Question-answering Natural Language Inference (QNLI) dataset is derived from the Stanford Question Answering Dataset (SQuAD), providing pairs of questions and context...
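A QNLI example pairs a question with a single context sentence and labels whether the sentence contains the answer. A minimal sketch (field and label names follow the GLUE release; the strings themselves are invented):

```python
# One QNLI-style example: a question paired with a context sentence
# drawn from SQuAD, labeled for whether the sentence contains the
# answer. The question/sentence strings here are illustrative only.
example = {
    "question": "When did construction of the tower begin?",
    "sentence": "Construction of the tower began in 1887.",
    "label": "entailment",  # the other class is "not_entailment"
}

assert example["label"] in {"entailment", "not_entailment"}
```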
  • Clinical Notes Dataset

    A large corpus of clinical notes from the clinical data warehouse of a local hospital in France, used to learn word embeddings for enhancing model performance.
  • Generated Training Dataset for Biomedical NLU

    The dataset consists of user utterances for querying Electronic Health Records (EHRs) in the biomedical domain, generated using templates and augmented with paraphrases. A total...
  • HASOC 2019 Dataset

    The HASOC dataset includes abusive language data in Hindi, English, and German, designed for identifying hate speech and offensive content with various subtasks for classification.
  • TDIUC

    TDIUC is designed to mitigate bias issues found in VQA v2.0 and includes 1.6 million questions divided into twelve categories.
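Because its twelve question categories are heavily imbalanced, TDIUC results are often reported as accuracy averaged per category rather than per question. A minimal sketch of that idea (function name hypothetical):

```python
from collections import defaultdict

def mean_per_type_accuracy(records):
    """records: iterable of (question_type, is_correct) pairs.
    Returns the unweighted mean of per-type accuracies, so rare
    question types weigh as much as frequent ones."""
    totals, hits = defaultdict(int), defaultdict(int)
    for qtype, is_correct in records:
        totals[qtype] += 1
        hits[qtype] += int(is_correct)
    return sum(hits[t] / totals[t] for t in totals) / len(totals)
```

For example, 90/100 correct "color" answers and 1/10 correct "counting" answers give a per-type mean of (0.9 + 0.1) / 2 = 0.5, whereas plain accuracy would be 91/110 ≈ 0.83.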
  • British National Corpus

    The British National Corpus (BNC) contains spontaneous conversations from UK English speakers collected with portable tape recorders in the early 1990s, featuring significant...
  • Europarl v7 Corpus

    The Europarl v7 corpus is a parallel corpus of European Parliament proceedings, used here to train a Transformer-based NMT system for German-to-English translation.
  • Wikipedia Sequence Dataset

    The dataset consists of 992 sequences extracted from Wikipedia, each comprising two consecutive paragraphs in the form: [CLS] paragraph1 [SEP] paragraph2...
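Pairing two consecutive paragraphs with BERT-style special tokens can be sketched as follows (the trailing [SEP] is an assumption based on standard BERT input formatting, since the description is truncated):

```python
def make_pair_sequence(paragraph1: str, paragraph2: str) -> str:
    """Join two consecutive paragraphs with BERT-style markers:
    [CLS] paragraph1 [SEP] paragraph2 [SEP]."""
    return f"[CLS] {paragraph1} [SEP] {paragraph2} [SEP]"

seq = make_pair_sequence("The cat sat.", "It then slept.")
# seq == "[CLS] The cat sat. [SEP] It then slept. [SEP]"
```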
  • DailyDialog Dataset

    The DailyDialog dataset consists of dialogues from daily communication and serves as a benchmark for dialog response generation tasks.
  • SNIPS

    The SNIPS dataset serves as a public benchmark dataset developed to evaluate the quality of intent classification and slot filling services across multiple domains including...
  • ATIS

    The ATIS dataset consists of annotated transcripts of flight reservation audio recordings, containing 4,478 training, 500 development, and 893 test utterances across 21 intents and...
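An ATIS-style training example pairs one intent label per utterance with token-aligned slot tags in BIO format. A sketch (the utterance is invented; the intent and slot names follow the commonly used ATIS annotation scheme):

```python
# One flight-reservation utterance with its intent label and
# token-aligned BIO slot tags (illustrative example).
example = {
    "tokens": ["show", "flights", "from", "boston", "to", "denver"],
    "intent": "atis_flight",
    "slots":  ["O", "O", "O", "B-fromloc.city_name",
               "O", "B-toloc.city_name"],
}

# Slot tags must align one-to-one with tokens.
assert len(example["tokens"]) == len(example["slots"])
```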
  • NLU-Benchmark

    The NLU-Benchmark dataset is annotated with scenarios, actions, and entities for various home assistant tasks. It contains 25,716 utterances categorized into 64 intents and 54...
  • GIGA-CM dataset

    GIGA-CM is a large-scale dataset comprising millions of documents, created to facilitate the pre-training of hierarchical document encoding models for summarization tasks.
  • New York Times dataset

    The New York Times dataset is used for summarization tasks and consists of articles from the New York Times, with summaries created by editors, enabling the assessment of...
  • WMT18 English-Turkish Translation Dataset

    The WMT18 dataset is used for English-Turkish translation and serves as a standard low-resource benchmark.
  • WAT English-Japanese Translation Dataset

    The WAT dataset contains English-Japanese sentence pairs for translation tasks, focusing particularly on low-resource scenarios.
  • WMT17 English-German Translation Dataset

    The WMT17 dataset serves as a benchmark for translation tasks involving English and German, containing a large set of parallel sentences.
  • WMT16 English-German Translation Dataset

    The WMT16 dataset is used for English-German translation, providing parallel sentences for training robust machine translation systems.
  • News Commentary v11 (NC11)

    The NC11 dataset provides parallel data for low-resource English↔German translation tasks and is used to demonstrate machine translation improvements in resource-limited scenarios.
  • XQA

    XQA is a cross-lingual Open-domain Question Answering dataset consisting of a training set in English and development and test sets in eight other languages. It contains...