- Penn Treebank
  The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Amazon review dataset
  The Amazon review dataset is used for multi-source domain adaptation. It contains review texts and ratings of purchased products. Products are grouped into categories. Following...
- BioText dataset
  The BioText dataset contains more than 3,500 text samples classified into one of eight classes, which specify the type of semantic relationship between disease and treatment...
- Rotten Tomatoes Movie Reviews (RT) and IMDB
  The paper does not describe the dataset explicitly, but it mentions that the authors ran a sentiment analysis task on two public benchmark datasets: Rotten Tomatoes...
- Book Categories
  Two text classification datasets for evaluating the quality of interpretability methods.
- Ott dataset
  The dataset used in this paper for deceptive opinion detection.
- BookCorpus
  The dataset used in this paper for unsupervised sentence representation learning, consisting of paragraphs from unlabeled text.
- Reuters RCV1-v2
  The Reuters RCV1-v2 dataset contains 804,414 newswire articles. Its 103 topics form a tree hierarchy, so documents typically have multiple labels. The data was randomly...
- Penn Treebank dataset
  The dataset used in the paper is the Penn Treebank, a large annotated corpus of English text.
- Elsevier OA CC-BY corpus
  The Elsevier OA CC-BY corpus consists of 40,000 open-access articles from across Elsevier's journals, spanning a diverse range of research disciplines.