- Wikicorpus
Wikicorpus is the dataset used in the experiments to evaluate the adaptation of language models to nonstandard text.
- Helsinki Corpus
The Helsinki Corpus is a collection of texts in 21 languages, including English, French, German, Italian, and others.
- Shifts Machine Translation dataset
The Shifts Machine Translation dataset consists of pairs of source and target sentences in English and Russian.
- Twitter Dataset
The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
- CommonCrawl
CommonCrawl is a non-profit organization that provides a large corpus of web pages for research and development purposes.