114 datasets found

Tags: text classification

  • Penn Treebank

    The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
  • Amazon review dataset

The Amazon review dataset is used for multi-source domain adaptation. It contains review texts and ratings for purchased products. Products are grouped into categories. Following...
  • BioText dataset

    The BioText dataset contains more than 3,500 text samples classified into one of eight classes, which specify the type of semantic relationship between disease and treatment...
  • Rotten Tomatoes Movie Reviews (RT) and IMDB

The paper does not describe the dataset explicitly, but it mentions that the authors evaluate on a sentiment analysis task using two public benchmark datasets: Rotten Tomatoes...
  • Book Categories

    Two text classification data sets for evaluating the quality of interpretability methods.
  • SNLI

The dataset used in the paper is the Stanford Natural Language Inference (SNLI) dataset (a loading sketch follows this list), which provides 549,367 training premise-hypothesis pairs plus development and test splits and target...
  • Ott dataset

The dataset used in this paper for deceptive opinion detection.
  • BookCorpus

    The dataset used in this paper for unsupervised sentence representation learning, consisting of paragraphs from unlabeled text.
  • Reuters RCV1-v2

    The Reuters RCV1-v2 contains 804,414 newswire articles. There are 103 topics which form a tree hierarchy. Thus documents typically have multiple labels. The data was randomly...
  • IMDB

The paper does not explicitly describe the dataset, but it mentions that the authors tested the proposed method on three real datasets for the most relevant security...
  • Penn Treebank dataset

The dataset used in the paper is the Penn Treebank dataset, a large-scale annotated corpus of English text.
  • LAION

The paper does not describe the dataset in detail beyond noting that it is a large-scale captioned image dataset (LAION) used to train the Stable Diffusion model.
  • GLUE

Pre-trained language models (PrLMs) have to carefully manage input units when training on very large text corpora with vocabularies of millions of words. Previous works have...
  • Elsevier OA CC-BY corpus

The Elsevier OA CC-BY corpus consists of 40,000 open-access articles from across Elsevier's journals, representing a diverse range of research disciplines.
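Several of the corpora listed above are also mirrored on the Hugging Face Hub. As a minimal sketch only (it assumes the `datasets` library is installed and that the corpus is published under the Hub name "snli", which is an assumption rather than part of this registry), the SNLI entry above could be loaded and inspected like this:

```python
# Minimal sketch: loading SNLI from the Hugging Face Hub.
# Assumes the `datasets` library is installed and that the corpus is
# available under the Hub name "snli" (an assumption, not taken from
# this registry's documentation).
from datasets import load_dataset

snli = load_dataset("snli")  # DatasetDict with train/validation/test splits
print({split: len(ds) for split, ds in snli.items()})

example = snli["train"][0]   # each row has: premise, hypothesis, label
print(example["premise"], "|", example["hypothesis"], "|", example["label"])

# Label ids are 0 = entailment, 1 = neutral, 2 = contradiction;
# pairs without a gold label are marked -1 and are usually filtered out.
train_filtered = snli["train"].filter(lambda ex: ex["label"] != -1)
print(len(train_filtered))
```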
You can also access this registry using the API (see API Docs).
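As a rough illustration only, a tag-filtered query against the registry API might look like the sketch below. The base URL, path, parameter names, and response shape shown here are hypothetical placeholders, not taken from the API Docs; consult those docs for the real endpoint and schema.

```python
# Hypothetical sketch of querying the registry API with `requests`.
# BASE_URL, the /datasets path, the "tag"/"page" parameters, and the
# "results"/"name"/"description" fields are placeholders (assumptions),
# not the documented API.
import requests

BASE_URL = "https://example.org/registry/api"  # placeholder, not the real host

resp = requests.get(
    f"{BASE_URL}/datasets",
    params={"tag": "text classification", "page": 1},
    timeout=10,
)
resp.raise_for_status()

for dataset in resp.json().get("results", []):  # response shape is assumed
    print(dataset.get("name"), "-", dataset.get("description", "")[:80])
```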