- Wikicorpus
A dataset used in experiments that evaluate the adaptation of language models to nonstandard text.
- Shifts Machine Translation dataset
The Shifts Machine Translation dataset consists of pairs of source and target sentences in English and Russian.
- Twitter Dataset
The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
- Reuters21578
The problem of similarity search is to find the most similar items in a large collection to a query item of interest. Fast similarity search is at the core of many information...
- CNAE-9 Dataset
The CNAE-9 dataset consists of documents labeled with 9 categories from the National Classification of Economic Activities (CNAE).
- PeerRead Dataset
The PeerRead dataset is used in the paper to test a sparse deep generative model.
- CommonCrawl
CommonCrawl is a non-profit organization that provides a large corpus of web pages for research and development purposes.
- Common Crawl
The Common Crawl (CC) project browses and indexes all content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...