Text Classification - Groups

Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.

Dataset
JSON

CLINC

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

Dataset
JSON

PubMed, ArXiv, and Movies datasets

The datasets used in the paper are PubMed, ArXiv, and Movies. PubMed is a medical dataset consisting of research articles from the PubMed repository. The articles' subheadings...

Dataset
JSON

GoogleNews

The dataset used in this paper is a collection of news articles from Google News.

Dataset
JSON

Wiki20K

The dataset used in this paper is a collection of English Wikipedia abstracts from DBpedia.

Dataset
JSON

20NewsGroups

The dataset used in this paper is a collection of documents from various domains, including news, articles, and emails.

Dataset
JSON

Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.

Dataset
JSON

FDU-MTL dataset

The FDU-MTL dataset spans 16 domains: 14 Amazon review domains and two movie review domains. The text in this dataset is kept in its original form, tokenized by...

Dataset
JSON

Amazon review dataset

The Amazon review dataset is used for multi-source domain adaptation. It contains review texts and ratings of bought products. Products are grouped into categories. Following...

Dataset
JSON

Wikitext-103

The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles.

Dataset
JSON

Book Categories

Two text classification data sets for evaluating the quality of interpretability methods.

Dataset
JSON

BookCorpus

The dataset used in this paper for unsupervised sentence representation learning consists of paragraphs of unlabeled text.

Dataset
JSON

Reuters RCV1-v2

The Reuters RCV1-v2 contains 804,414 newswire articles. There are 103 topics which form a tree hierarchy. Thus documents typically have multiple labels. The data was randomly...

Dataset
JSON

IMDB

The dataset used in the paper is not explicitly described, but it is mentioned that the authors tested the proposed method on three real data sets for the most relevant security...

Dataset
JSON

Penn Treebank dataset

The dataset used in the paper is the Penn Treebank dataset, which is a large-scale text classification dataset.

Dataset
JSON

MNIST-SVHN-Text dataset

The MNIST-SVHN-Text dataset is a multi-modal dataset consisting of images, text, and labels.

Dataset
JSON

TREC

A dataset used for sentiment analysis, question type classification, and subjectivity classification tasks.

Dataset
JSON

LAION

The paper does not describe the dataset explicitly, but mentions that it is a large-scale captioned image dataset (LAION) used to train the Stable Diffusion model.

Dataset
JSON

Training Language Models to Perform Tasks

A dataset for training language models to perform tasks such as question answering and text classification.

Dataset
JSON

GLUE

Pre-trained language models (PrLM) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have...

Dataset
JSON

182 datasets found