-
WikiText-103 and Enwik8 datasets
The WikiText-103 and Enwik8 datasets are used for word-level and character-level language modeling, respectively. -
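A minimal loading sketch, assuming the HuggingFace `datasets` library is available (the config name "wikitext-103-raw-v1" is the raw, untokenized variant); Enwik8 is usually fetched directly from the Hutter Prize page and split into 90M/5M/5M bytes for train/validation/test:

    # Sketch: load WikiText-103 with HuggingFace datasets (assumed dependency).
    from datasets import load_dataset

    wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
    print(wikitext)                             # train/validation/test splits
    print(wikitext["train"][10]["text"][:200])  # peek at a raw article line

The same call with the config "wikitext-2-raw-v1" loads the smaller Wikitext-2 corpus listed further below. -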
Wikipedia2Vec dataset
The dataset used in the paper is the Wikipedia2Vec dataset, which provides pretrained word and entity embeddings learned from Wikipedia. -
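A hedged usage sketch with the `wikipedia2vec` Python package; the model file name below is a placeholder for whichever pretrained dump is downloaded from the project page:

    # Sketch: query pretrained Wikipedia2Vec word and entity embeddings.
    from wikipedia2vec import Wikipedia2Vec

    model = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")  # placeholder file name
    word_vec = model.get_word_vector("language")            # word embedding
    entity_vec = model.get_entity_vector("Natural language processing")  # entity embedding
    print(word_vec.shape, entity_vec.shape)

Words and Wikipedia entities are embedded in the same vector space, which is how the embeddings are typically used. -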
CALaMo: a Constructionist Assessment of Language Models
The authors used the CHILDES corpus to train a character-based LSTM model and evaluated its performance on a set of tasks. -
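For illustration only, and not the authors' exact CALaMo configuration, a character-based LSTM language model of the kind described can be sketched in PyTorch (the hyperparameters below are placeholders):

    # Sketch: minimal character-level LSTM language model (illustrative values).
    import torch.nn as nn

    class CharLSTM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden_dim=512, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
            self.proj = nn.Linear(hidden_dim, vocab_size)

        def forward(self, char_ids, state=None):
            # char_ids: (batch, seq_len) integer indices over the character vocabulary
            out, state = self.lstm(self.embed(char_ids), state)
            return self.proj(out), state  # logits over the next character

Here vocab_size would be the number of distinct characters in the CHILDES transcripts. -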
Character Level Penn Treebank dataset
The Character Level Penn Treebank dataset is a benchmark for evaluating the ability of RNNs to model language at the character level. -
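Character-level results on this benchmark are conventionally reported in bits per character (BPC); a small conversion sketch from an average per-character cross-entropy in nats:

    # Sketch: convert average negative log-likelihood (nats/char) to bits per character.
    import math

    def bits_per_character(avg_nll_nats: float) -> float:
        return avg_nll_nats / math.log(2)

    print(bits_per_character(0.83))  # roughly 1.20 BPC

Lower BPC corresponds to a better character-level language model. -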
Improved Language Modeling by Decoding the Past
Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token... -
TED-LIUM 2
TED-LIUM 2 enhances the original TED-LIUM corpus with selected data for language modeling and additional TED talks. -
One Billion Word
The One Billion Word dataset is a large text corpus containing 0.8 billion words belonging to a vocabulary of 793,471 words. The dataset is used for word-level language... -
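Word-level benchmarks such as this one are reported in perplexity, the exponential of the average per-word negative log-likelihood; a minimal sketch:

    # Sketch: word-level perplexity from a summed negative log-likelihood in nats.
    import math

    def perplexity(total_nll_nats: float, num_words: int) -> float:
        return math.exp(total_nll_nats / num_words)

    # e.g. a model averaging 3.4 nats per word:
    print(perplexity(3.4 * 1_000, 1_000))  # ~29.96

The large 793,471-word vocabulary is what makes the output softmax over this dataset expensive. -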
Penn Tree Bank
The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The... -
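A sketch of checking those split sizes, assuming the standard Mikolov-preprocessed files (ptb.train.txt, ptb.valid.txt, ptb.test.txt) have been downloaded locally; the directory path below is a placeholder:

    # Sketch: count whitespace-separated tokens in the standard PTB split files.
    from pathlib import Path

    data_dir = Path("data/ptb")  # placeholder location of the preprocessed files
    for split in ("train", "valid", "test"):
        n_words = len((data_dir / f"ptb.{split}.txt").read_text().split())
        print(split, n_words)  # roughly 929k / 73k / 82k words

The standard PTB preprocessing lower-cases the text and replaces rare words with an <unk> token. -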
Penn Treebank PCFG
Penn Treebank PCFG dataset -
Simple CFG
Simple CFG dataset -
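Purely as an illustration (the toy grammar below is made up, not the paper's actual CFG), strings can be generated from a simple context-free grammar with NLTK:

    # Sketch: enumerate strings from a toy CFG using NLTK's generate helper.
    from nltk import CFG
    from nltk.parse.generate import generate

    grammar = CFG.fromstring("""
        S -> NP VP
        NP -> 'the' N
        VP -> V NP
        N -> 'cat' | 'dog'
        V -> 'sees' | 'chases'
    """)
    for sentence in generate(grammar, depth=5):
        print(" ".join(sentence))

Datasets built this way are typically used to test whether a language model can recover the generating grammar from samples alone. -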
SAM: Semantic Attribute Modulation for Language Modeling and Style Variation
Semantic Attribute Modulation (SAM) is a method for language modeling and style variation. -
Penn Tree Bank (PTB)
The Penn Tree Bank (PTB) dataset is used for language modeling. -
ControlVAE: Controllable Variational Autoencoder
ControlVAE is evaluated on datasets for language modeling, disentangled representation learning, and image generation. -
BookCorpus Dataset
The dataset used in the paper is the BookCorpus dataset. -
Wikitext-2
The dataset is not explicitly described in the paper, but the authors are noted to have used the Wikitext-2 dataset for text generation tasks. -
Billion Word Benchmark Dataset
The dataset contains 768M tokens for language modeling.