-
CNN-DM Dataset
The CNN-DM dataset contains news articles and is used for training language models. -
PolEval 2018 LM dataset
The PolEval 2018 LM dataset is a language modeling dataset for the Polish language. -
Wikipedia2Vec dataset
The Wikipedia2Vec dataset, used in the paper, contains word embeddings. -
European Parliament multilingual data (subset)
The dataset used for the experiments in the paper, containing a subset of European Parliament multilingual data. -
European Parliament multilingual data
The dataset used for the experiments in the paper, containing European Parliament multilingual data. -
Character Level Penn Treebank dataset
The Character Level Penn Treebank dataset is a benchmark for evaluating the ability of RNNs to model language. -
Rotational Unit of Memory
The Rotational Unit of Memory (RUM) is a novel RNN architecture that combines unitary evolution matrices and associative memory to improve long-term memory capabilities. -
Automata-based constraints for language model decoding
The dataset used in this paper is a collection of regular expressions and grammars for constraining language models. -
Improved Language Modeling by Decoding the Past
Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token... -
TED-LIUM 2
Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. -
One Billion Word
The One Billion Word dataset is a large text corpus containing 0.8 billion words drawn from a vocabulary of 793,471 words. The dataset is used for word-level language... -
Penn Tree Bank
The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...