-
PubMed, ArXiv, and Movies datasets
The datasets used in the paper are PubMed, ArXiv, and Movies. PubMed is a medical dataset consisting of research articles from the PubMed repository. The articles' subheadings... -
GoogleNews
The dataset used in this paper is a collection of news articles from Google News. -
20NewsGroups
The dataset used in this paper is 20NewsGroups, a collection of roughly 20,000 newsgroup posts spread across 20 topics. -
Penn Treebank
The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths. -
FDU-MTL dataset
The FDU-MTL dataset spans 16 domains: 14 Amazon review domains and two movie review domains. The text is kept in its original form, tokenized by... -
Amazon review dataset
The Amazon review dataset is used for multi-source domain adaptation. It contains review texts and ratings of bought products. Products are grouped into categories. Following... -
Wikitext-103
The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles. -
Book Categories
Two text classification data sets for evaluating the quality of interpretability methods. -
BookCorpus
BookCorpus is used in this paper for unsupervised sentence representation learning. It consists of paragraphs of unlabeled text. -
Reuters RCV1-v2
The Reuters RCV1-v2 dataset contains 804,414 newswire articles. Its 103 topics form a tree hierarchy, so documents typically carry multiple labels. The data was randomly... -
Penn Treebank dataset
The dataset used in the paper is the Penn Treebank dataset, a corpus of English text annotated with syntactic structure. -
MNIST-SVHN-Text dataset
The MNIST-SVHN-Text dataset is a multi-modal dataset consisting of images, text, and labels. -
Training Language Models to Perform Tasks
A dataset for training language models to perform tasks such as question answering and text classification.