Text Data - Groups

Wikipedia articles

The Wikipedia articles dataset is a collection of image-text pairs designed for cross-modal retrieval applications.

Wikicorpus

Wikicorpus is a dataset used in experiments that evaluate the adaptation of language models to nonstandard text.

Shifts Machine Translation dataset

The Shifts Machine Translation dataset consists of pairs of source and target sentences in English and Russian.

Twitter Dataset

The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.

CommonCrawl

CommonCrawl is a non-profit organization that provides a large corpus of web pages for research and development purposes.

5 datasets found
