Language Modeling - Groups

OpenWebText Corpus

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.

Dataset
JSON

One Billion Words Dataset

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.

Dataset
JSON

Penn Treebank and Wikipedia-90M

The Penn Treebank dataset is used for sentence-level language modeling, and the 90-million-word subset of Wikipedia is used for paraphrasing.

Dataset
JSON

Chinese Poetry

The Chinese Poetry dataset is a collection of Chinese poems used for language modeling.

Dataset
JSON

Text8

Text8 is a corpus commonly used to train Word2Vec, a distributed word embedding generator that uses a shallow neural network to learn dense vector representations of words.

Dataset
JSON

Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.

Dataset
JSON

Wikitext-103

Wikitext-103 is a general English-language corpus containing Good and Featured Wikipedia articles.

Dataset
JSON

OSCAR 22.01

The OSCAR 22.01 corpus is a document-oriented corpus that is used for pre-training large generative language models. It is a multilingual corpus that contains documents holding...

Dataset
JSON

OSCAR

The OSCAR corpus is a multilingual web corpus that is used for pre-training large generative language models. It is a document-oriented corpus that is comparable in size and...

Dataset
JSON

Common Crawl

The Common Crawl (CC) project crawls and indexes publicly available web content. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...

Dataset
JSON

IMDB

The dataset used in the paper is not explicitly described, but it is mentioned that the authors tested the proposed method on three real data sets for the most relevant security...

Dataset
JSON

Penn Treebank (PTB) dataset

The Penn Treebank (PTB) dataset is used for the word ordering task, where it serves to evaluate the performance of different models.

Dataset
JSON

52 datasets found