-
LibriSpeech LM
The LibriSpeech LM corpus used for pre-training speech-text models. -
Penn Treebank PCFG
Penn Treebank PCFG dataset -
Simple CFG
Simple CFG dataset -
SAM: Semantic Attribute Modulation for Language Modeling and Style Variation
Semantic Attribute Modulation (SAM) for language modeling and style variation. -
Penn Treebank (PTB)
The Penn Treebank (PTB) dataset used for language modeling. -
ControlVAE: Controllable Variational Autoencoder
The dataset used for language modeling, disentangled representation learning, and image generation. -
BookCorpus Dataset
The dataset used in the paper is the BookCorpus dataset. -
Morfessor 2.0 dataset
Morfessor 2.0 dataset for English, Finnish and Turkish language models -
Den samiske tekstbanken dataset
Den samiske tekstbanken dataset for a North Sámi language model -
Morpho Challenge 2010 dataset
Morpho Challenge 2010 dataset for English, Finnish and Turkish language models -
Wikitext-2
The paper does not describe its dataset explicitly, but it states that the authors used the Wikitext-2 dataset for text generation tasks. -
Billion Word Benchmark Dataset
The dataset contains 768M tokens for language modeling. -
SlimPajama
The dataset is used to evaluate the performance of the xLSTM architecture on various tasks, including language modeling, question answering, and text classification. -
Penn Treebank (PTB) and WikiText-2 (WT-2)
The paper uses two language modeling datasets: Penn Treebank (PTB) and WikiText-2 (WT-2). -
Patrika Dataset
The Patrika dataset is used as an independent test set.