Dataset - LDM

SQuAD 2.0 dataset

This dataset is used for question answering. It pairs questions with text passages and includes questions that cannot be answered from the passage.
- Dataset
- JSON
ACE 2005 dataset

This dataset is used for event extraction. Its text is annotated with entities, relations, and events.
- Dataset
- JSON
Wikipedia articles

The Wikipedia articles dataset consists of image-text pairs designed for cross-modal retrieval.
- Dataset
- JSON
Wikicorpus

This dataset is used in experiments that evaluate the adaptation of language models to nonstandard text.
- Dataset
- JSON
Helsinki Corpus

The Helsinki Corpus is a collection of texts in 21 languages, including English, French, German, and Italian.
- Dataset
- JSON
Shifts Machine Translation dataset

The Shifts Machine Translation dataset consists of pairs of source and target sentences in English and Russian.
- Dataset
- JSON
Twitter Dataset

The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
- Dataset
- JSON
CommonCrawl

CommonCrawl is a large corpus of web pages, provided by the non-profit organization of the same name for research and development purposes.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

8 datasets found