-
PubMed, ArXiv, and Movies datasets
The paper uses three datasets: PubMed, ArXiv, and Movies. PubMed is a medical dataset consisting of research articles from the PubMed repository. The articles' subheadings... -
20NewsGroups
20NewsGroups is a collection of roughly 20,000 newsgroup posts spread across 20 topical newsgroups. -
CORD-19 Research Challenge
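CORD-19 (the COVID-19 Open Research Dataset) is a corpus of scholarly articles on COVID-19 and related coronaviruses, released for the CORD-19 Research Challenge. -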
Penn Treebank
The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths. -
Wikitext-103
Wikitext-103 is a general English-language corpus drawn from Wikipedia articles rated Good or Featured. -
Training Language Models to Perform Tasks
A dataset for training language models to perform tasks such as question answering and text classification.