- Penn Tree Bank
The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
- OpenWebText Corpus
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Penn Treebank
The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Wikitext-103
The dataset used in this paper is Wikitext-103, a general English-language corpus containing Wikipedia articles with Good or Featured status.
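
The next-word objective mentioned in the language modeling entry above can be illustrated with a minimal sketch. It assumes a whitespace-tokenized toy sentence and a simple bigram count model (a single previous word as context), which is far simpler than models usually evaluated on these corpora. The helper names `train_bigram` and `predict_next` are hypothetical, and none of the listed datasets are loaded.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy text for illustration only; not taken from any corpus listed above.
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))
```

A real experiment would replace the toy sentence with the training split of one of the corpora and score the model on the held-out validation and test splits.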