- Wikipedia2Vec dataset
  The Wikipedia2Vec dataset provides pretrained embeddings of words and Wikipedia entities; the paper uses it as its source of word embeddings.
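For orientation, loading such embeddings with the wikipedia2vec Python package looks roughly like the sketch below; the pretrained file name is only an example, and the queried word and entity are arbitrary choices, not ones taken from the paper.

```python
# Hedged sketch: assumes the wikipedia2vec package and a pretrained model file.
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")  # example file name

# Word and entity vectors live in one shared embedding space.
word_vec = model.get_word_vector("parliament")
entity_vec = model.get_entity_vector("European Parliament")
print(word_vec.shape, entity_vec.shape)

# Nearest neighbours of a word in that space.
for item, score in model.most_similar(model.get_word("parliament"), 5):
    print(item, score)
```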
- European Parliament multilingual data (subset)
  The dataset used for the experiments in the paper, containing a subset of European Parliament multilingual data.
- European Parliament multilingual data
  The dataset used for the experiments in the paper, containing European Parliament multilingual data.
- Character Level Penn Treebank dataset
  The Character Level Penn Treebank dataset is a benchmark for evaluating the ability of RNNs to model language at the character level.
- Rotational Unit of Memory
  The Rotational Unit of Memory (RUM) is a novel RNN architecture that combines unitary evolution matrices and associative memory to improve long-term memory capabilities.
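For intuition, the sketch below shows one way to build the kind of norm-preserving update such an architecture relies on: an orthogonal matrix that rotates within the plane spanned by two vectors and leaves everything orthogonal to that plane fixed. It is a minimal NumPy illustration of the rotation idea, not the authors' implementation, and the variable names (embedded input a, memory target b, hidden state h) are placeholders.

```python
import numpy as np

def rotation_matrix(a, b, eps=1e-8):
    """Orthogonal matrix rotating a toward b within the plane span{a, b}.

    Vectors orthogonal to that plane are left unchanged, so applying the
    matrix to a hidden state preserves its norm.
    """
    u = a / (np.linalg.norm(a) + eps)
    v = b - np.dot(u, b) * u                      # component of b orthogonal to a
    v = v / (np.linalg.norm(v) + eps)
    cos_t = np.clip(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps), -1.0, 1.0)
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    n = a.shape[0]
    # Identity outside span{u, v}, a 2-D rotation by the angle between a and b inside it.
    return (np.eye(n)
            + (cos_t - 1.0) * (np.outer(u, u) + np.outer(v, v))
            + sin_t * (np.outer(v, u) - np.outer(u, v)))

rng = np.random.default_rng(0)
a, b, h = rng.normal(size=(3, 8))                 # stand-ins for input, target memory, hidden state
R = rotation_matrix(a, b)
assert np.allclose(R @ R.T, np.eye(8), atol=1e-6) # R is orthogonal
h_new = R @ h                                     # norm-preserving hidden-state update
```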
- Automata-based constraints for language model decoding
  The dataset used in this paper is a collection of regular expressions and grammars for constraining language models.
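To make the setting concrete, the sketch below shows the usual masking recipe behind automaton-constrained decoding: at each step, the model's distribution is restricted to tokens that keep the automaton on a live path. It is a character-level toy with a hand-written DFA for the pattern a+b+ and a fixed stand-in distribution, not the paper's tooling.

```python
import random

# Toy DFA for the regular expression a+b+ over a three-token vocabulary.
# States: 0 = start, 1 = after "a+", 2 = after "a+b+"; only state 2 accepts.
TRANSITIONS = {0: {"a": 1}, 1: {"a": 1, "b": 2}, 2: {"b": 2}}
ACCEPTING = {2}

def allowed_tokens(state):
    """Tokens that keep the automaton on a live path; <eos> only in accepting states."""
    allowed = list(TRANSITIONS.get(state, {}))
    if state in ACCEPTING:
        allowed.append("<eos>")
    return allowed

def constrained_sample(lm_probs, max_len=10, seed=0):
    """Sample a string by renormalising the LM distribution over allowed tokens.

    lm_probs maps token -> probability; a real system would query the language
    model at every step instead of reusing one fixed distribution.
    """
    rng = random.Random(seed)
    state, output = 0, []
    for _ in range(max_len):
        allowed = allowed_tokens(state)
        weights = [lm_probs.get(t, 1e-9) for t in allowed]  # only allowed tokens keep mass
        token = rng.choices(allowed, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        output.append(token)
        state = TRANSITIONS[state][token]
    return "".join(output)

# Fixed stand-in distribution; every sampled string is a prefix of an a+b+ match.
print(constrained_sample({"a": 0.5, "b": 0.3, "<eos>": 0.2}))
```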
- Improved Language Modeling by Decoding the Past
  Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token...
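As a rough illustration of such an auxiliary objective, the PyTorch sketch below adds a term that tries to recover the most recently read token from the current LSTM output, on top of the standard next-token loss. The layer names, the auxiliary weight, and the choice to decode from the hidden state are assumptions of this sketch rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PastDecodeLM(nn.Module):
    """Word-level LSTM LM with an auxiliary 'decode the past token' head."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.next_head = nn.Linear(hidden_dim, vocab_size)  # standard LM head
        self.past_head = nn.Linear(hidden_dim, vocab_size)  # auxiliary head

    def forward(self, tokens):                              # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.next_head(hidden), self.past_head(hidden)

def loss_with_past_decoding(model, tokens, aux_weight=0.1):
    """Next-token cross-entropy plus a past-decoding penalty."""
    next_logits, past_logits = model(tokens[:, :-1])
    lm_loss = F.cross_entropy(next_logits.reshape(-1, next_logits.size(-1)),
                              tokens[:, 1:].reshape(-1))
    # Auxiliary term: from the output at position t, recover the token read at t.
    aux_loss = F.cross_entropy(past_logits.reshape(-1, past_logits.size(-1)),
                               tokens[:, :-1].reshape(-1))
    return lm_loss + aux_weight * aux_loss

model = PastDecodeLM(vocab_size=10000)
batch = torch.randint(0, 10000, (4, 35))                    # dummy batch of token ids
print(loss_with_past_decoding(model, batch))
```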
- TED-LIUM 2
  The second release of the TED-LIUM corpus, which enhances the original with selected data for language modeling and more TED talks.
- One Billion Word
  The One Billion Word dataset is a large text corpus of roughly 0.8 billion words with a vocabulary of 793,471 words, used for word-level language modeling.
- Penn Tree Bank
  The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
- LibriSpeech LM
  The LibriSpeech LM corpus, used for pre-training speech-text models.
- Penn Treebank PCFG
  A probabilistic context-free grammar (PCFG) dataset derived from the Penn Treebank.
- Simple CFG
  A simple context-free grammar (CFG) dataset.