Text Generation - Groups

C4

The dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
Wikitext-103

The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles.
- Dataset
- JSON
BookCorpus

The dataset used in this paper for unsupervised sentence representation learning, consisting of paragraphs from unlabeled text.
- Dataset
- JSON

4 datasets found