20,499 datasets found

  • BEA 2019 shared task dataset

    The Building Educational Applications (BEA) shared task on GEC provides datasets including the Cambridge English Write & Improve corpus, which is composed of texts written...
  • CoNLL 2014 shared task dataset

    The CoNLL 2014 shared task dataset consists of essays written by undergraduate students, annotated for grammatical errors.
  • First Certificate in English (FCE) dataset

    The First Certificate in English (FCE) dataset contains essays written by non-native learners of English assessed in a language exam, annotated for language errors and...
  • WMT19 QE Datasets

    The dataset consists of parallel data from various corpora used for training and evaluating the bilingual BERT model for translation quality estimation.
  • QuAC

    QuAC is a dataset for question answering in a conversational context, requiring understanding of the multi-turn dialogue history to provide contextually relevant answers derived...
  • Sarcastic Tweets Dataset

    A dataset of 3,000 sarcastic tweets, each interpreted by five human judges, focusing on the task of sarcasm interpretation.
  • Sarcasm Interpretation Dataset

    The dataset contains 4,762 pairs of sarcastic messages and hearer interpretations, collected through a crowdsourcing experiment.
  • MedNLI

    The MedNLI dataset is used to predict the entailment relation between a pair of sentences, with premises taken from doctors' notes in the clinical dataset MIMIC-III.
  • MultiNLI

    The MultiNLI corpus is a dataset designed to assist in learning natural language inference, featuring sentence pairs labeled as entailment, neutral, or contradiction, which aid...
  • Sexism Categorization Dataset

    The dataset comprises 13,023 accounts of sexism, including first-person accounts from survivors, each tagged with at least one of 23 categories of sexism.
  • ConvAI2 Dataset

    The ConvAI2 dataset, derived from Persona-Chat, contains dialogues between crowdworkers who role-play as assigned personas, enabling the development of conversational agents...
  • REST dataset

    The REST dataset is derived from restaurant reviews and contains review sentences with aspect sentiment annotations for aspect-based sentiment analysis.
  • LAPTOP dataset

    The LAPTOP dataset is used for aspect-based sentiment analysis, containing review sentences along with gold standard aspect sentiment annotations.
  • OCNLI

    OCNLI is a dataset for natural language inference adapted for the Chinese language, consisting of premise-hypothesis pairs.
  • BQ Corpus

    BQ Corpus is a large-scale dataset for sentence semantic equivalence identification in Chinese.
  • LCQMC

    LCQMC is a large-scale Chinese question matching corpus used for determining the semantic equivalence of question pairs.
  • TNEWS

    TNEWS is a short-text classification dataset consisting of news titles and keywords, each to be classified into one of 15 classes.
  • THUCNews

    THUCNews is a dataset used for news categorization tasks in different genres, containing 50K news articles in ten domains.
  • ChnSentiCorp

    ChnSentiCorp is a dataset used for sentiment classification of Chinese documents, where each text is labeled as positive or negative.
  • CJRC

    CJRC is a dataset for machine reading comprehension specializing in Chinese legal judgments, containing yes/no questions, no-answer questions, and span-extraction questions.