Language Modeling - Groups

CNN-DM Dataset

The CNN-DM dataset contains news articles and is used for training language models.
- Dataset
- JSON
PolEval 2018 LM dataset

The PolEval 2018 LM dataset is a language modeling dataset for the Polish language.
- Dataset
- JSON
Vakyansh

The dataset is used for training and testing the proposed punctuation restoration and inverse text normalization models.
- Dataset
- JSON
GPT-3

GPT-3 is a large language model, described in its source text as significantly larger than the largest model tested there.
- Dataset
- JSON
PG-19

PG-19 is a well-established benchmark for long-form language modeling.
- Dataset
- JSON
SST

The Stanford Sentiment Treebank (SST) dataset contains standard train/dev/test sets and two subtasks: binary sentence classification or...
- Dataset
- JSON
Wikipedia2Vec dataset

The dataset used in the paper is the Wikipedia2Vec dataset, which contains word embeddings.
- Dataset
- JSON
European Parliament multilingual data (subset)

A subset of European Parliament multilingual data, used for the experiments in the paper.
- Dataset
- JSON
European Parliament multilingual data

European Parliament multilingual data, used for the experiments in the paper.
- Dataset
- JSON
Character Level Penn Treebank dataset

The Character Level Penn Treebank dataset is a benchmark for evaluating the ability of RNNs to model language.
- Dataset
- JSON
Rotational Unit of Memory

The Rotational Unit of Memory (RUM) is a novel RNN architecture that combines unitary evolution matrices and associative memory to improve long-term memory capabilities.
- Dataset
- JSON
The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.
- Dataset
- JSON
Automata-based constraints for language model decoding

The dataset used in this paper is a collection of regular expressions and grammars for constraining language models.
- Dataset
- JSON
ALLSSTAR

A large-scale dataset of L1 and L2 scripted and spontaneous transcripts and recordings.
- Dataset
- JSON
Enwik8

The Enwik8 dataset is a large-scale language modeling dataset.
- Dataset
- JSON
Improved Language Modeling by Decoding the Past

Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token...
- Dataset
- JSON
TED-LIUM 2

Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks.
- Dataset
- JSON
FastText

FastText is a subword token embedding model. It represents a word as a vector composed from the embeddings of the word's character n-grams.
- Dataset
- JSON
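
The n-gram composition mentioned in the FastText entry can be sketched roughly as follows. This is a minimal illustration, not the FastText library's implementation: the function names, the `<`/`>` boundary markers, the 3-to-6 n-gram range, the plain-dictionary lookup table, and the use of a sum are all assumptions chosen for the example.

```python
from typing import Dict, List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return the character n-grams of a word wrapped in boundary markers.

    The '<' and '>' markers and the default range are assumptions for this sketch.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def compose_vector(
    word: str,
    table: Dict[str, List[float]],
    dim: int,
    n_min: int = 3,
    n_max: int = 6,
) -> List[float]:
    """Sum the vectors of every n-gram of the word that appears in the table.

    'table' is a hypothetical n-gram-to-vector lookup; n-grams missing
    from it are skipped.
    """
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = table.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total
```

Because the vector is built from n-grams rather than looked up by whole word, a word absent from the vocabulary can still receive a representation as long as some of its n-grams are known.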
One Billion Word

The One Billion Word dataset is a large text corpus containing 0.8 billion words drawn from a vocabulary of 793,471 words. The dataset is used for word-level language...
- Dataset
- JSON
Penn Tree Bank

The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
- Dataset
- JSON

55 datasets found