-
OpenWebText Corpus
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words. -
One Billion Words Dataset
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words. -
Penn Treebank and Wikipedia-90M
The Penn Treebank dataset is used for sentence-level language modeling, and the 90-million-word subset of Wikipedia is used for paraphrasing. -
Chinese Poetry
The Chinese Poetry dataset is a collection of Chinese poems used for language modeling. -
Penn Treebank
The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths. -
Wikitext-103
The dataset used in this paper is Wikitext-103, a general English language corpus drawn from Wikipedia articles rated Good or Featured. -
OSCAR 22.01
The OSCAR 22.01 corpus is a document-oriented corpus that is used for pre-training large generative language models. It is a multilingual corpus that contains documents holding... -
Common Crawl
The Common Crawl (CC) project browses and indexes all content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the... -
Penn Treebank (PTB) dataset
The Penn Treebank (PTB) dataset is used for the word ordering task, where it serves to evaluate how well different models perform.