Language Modeling - Groups

Penn Tree Bank

The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
- Dataset
- JSON
BookCorpus Dataset

The dataset used in the paper is the BookCorpus dataset.
- Dataset
- JSON
Word2Vec

Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification
- Dataset
- JSON
YELP

The YELP dataset is used for language modeling.
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON

5 datasets found