Language Modeling - Groups

C4

C4 (Colossal Clean Crawled Corpus) is a large collection of cleaned web text documents derived from Common Crawl, used for pre-training language models.
- Dataset
- JSON

OpenWebText Corpus

An open-source recreation of the WebText corpus, used for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Dataset
- JSON

One Billion Words Dataset

A corpus of close to one billion words of English news text, used as a benchmark for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Dataset
- JSON

Wikitext-103

Wikitext-103 is a general English language corpus drawn from Good and Featured Wikipedia articles.
- Dataset
- JSON

4 datasets found