16 datasets found

Formats: JSON

  • Pile

    The Pile dataset consists of 800GB of text drawn from 22 domains. Cynical data selection naturally favors text that resembles the target corpus.
  • Chinese Gigaword Fifth Edition (LDC2011T13)

    Chinese Gigaword Fifth Edition (LDC2011T13) is a comprehensive archive of Chinese newswire text distributed by the Linguistic Data Consortium.
  • Reddit public dataset

    The Reddit public dataset is a collection of text data from Reddit users.
  • Wikipedia articles

    The Wikipedia articles dataset consists of image-text pairs designed for cross-modal retrieval applications.
  • Wikicorpus

    Wikicorpus is a corpus built from Wikipedia text, used in experiments evaluating the adaptation of language models to nonstandard text.
  • Shifts Machine Translation dataset

    The Shifts Machine Translation dataset consists of pairs of source and target sentences in English and Russian.
  • The Pile

    The Pile dataset contains 3.5 million samples of diverse text for language modeling.
  • Twitter Dataset

    The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
  • TMC

    The TMC dataset is a collection of air traffic reports.
  • Reuters21578

    Reuters21578 is a benchmark collection of Reuters newswire articles, used here to evaluate fast similarity search: finding the items in a large collection most similar to a query item of interest.
  • CNAE-9 Dataset

    The CNAE-9 dataset contains free-text business descriptions of Brazilian companies, categorized into 9 classes of the National Classification of Economic Activities (CNAE).
  • PeerRead Dataset

    The PeerRead dataset is a collection of scientific papers paired with their peer reviews, used here to test a sparse deep generative model.
  • CommonCrawl

    CommonCrawl is a large corpus of web pages provided by the non-profit Common Crawl organization for research and development purposes.
  • CCNet

    CCNet is a deduplicated, language-filtered corpus extracted from Common Crawl; it is the dataset used in the paper to train the Toolformer model.
  • OSCAR

    The OSCAR corpus is a multilingual web corpus that is used for pre-training large generative language models. It is a document-oriented corpus that is comparable in size and...
  • Common Crawl

    The Common Crawl (CC) project crawls and indexes all content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...
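
Since this listing is filtered to JSON-format datasets, a minimal sketch of how one might stream records from such a file follows, assuming a JSON Lines layout (one record per line). The filename data.jsonl and the "text" field are illustrative assumptions, not part of any listed dataset's documented schema.

    import json

    # Minimal sketch: stream records from a JSON Lines dataset file.
    # "data.jsonl" and the "text" field are hypothetical; actual
    # filenames and schemas vary across the datasets listed above.
    def iter_texts(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record.get("text", "")

    if __name__ == "__main__":
        for i, text in enumerate(iter_texts("data.jsonl")):
            print(text[:80])
            if i >= 4:  # preview only the first five records
                break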