Text Classification - Groups

Penn Tree Bank

The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
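As a quick sanity check on the split sizes quoted above, the proportions can be worked out directly. This is a minimal sketch that uses only the three word counts from the description; the dictionary keys and variable names are illustrative and do not come from any dataset loader.

```python
# Split sizes in thousands of words, as quoted in the description above.
# The key names are illustrative labels, not identifiers from a real loader.
splits = {"train": 929, "validation": 73, "test": 82}

total = sum(splits.values())
fractions = {name: size / total for name, size in splits.items()}

# Report each split with its share of the full corpus.
for name, frac in fractions.items():
    print(f"{name}: {splits[name]}k words ({frac:.1%})")
```

Under these figures the training split accounts for roughly 86% of the corpus, with the remainder divided almost evenly between validation and test.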

Wikipedia dataset

The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...

Yelp Review Dataset

The Yelp review dataset contains hotel and restaurant reviews that Yelp has either filtered (spam) or recommended (legitimate).

20News

Topic modeling has been a widely used tool for unsupervised text analysis. However, comprehensive evaluations of a topic model remain challenging.

Word2Vec

Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification

20Newsgroups dataset

The 20Newsgroups dataset contains 18,846 newsgroup documents.

News

The News dataset consists of 5000 randomly sampled news articles from the NY Times corpus. It simulates the opinions of media consumers on news items. The units are different...

Yelp Dataset

The Yelp Dataset contains 1.6M reviews and 500K tips by 366K users for 61K businesses; 481K business attributes, such as hours, parking availability, ambience; and check-ins for...

Yelp Dataset Challenge

The Yelp dataset challenge contains reviews and images of restaurants, with the goal of recommending images for each review.

20NewsGroups

The dataset used in this paper is a collection of documents from various domains, including news, articles, and emails.

Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.

BookCorpus

The dataset used in this paper for unsupervised sentence representation learning consists of paragraphs from unlabeled text.

12 datasets found