31 datasets found

Tags: Wikipedia

  • Wikipedia dataset

    The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...
  • DocRED

    DocRED is a large-scale human-annotated dataset for document-level relation extraction (RE), constructed from Wikipedia and Wikidata.
  • CMUDoG

    CMUDoG is a knowledge-grounded conversation dataset in which two speakers converse based on movie Wikipedia articles.
  • Natural Questions

    The Natural Questions dataset consists of questions extracted from web queries, with each question accompanied by a corresponding Wikipedia article containing the answer.
  • ORES

    The ORES dataset is a machine learning-based web service for Wikimedia projects such as Wikipedia. It provides a model for detecting damaging edits.
  • Wizard of Wikipedia

    Wizard of Wikipedia is a recent, large-scale dataset of multi-turn knowledge-grounded dialogues between an “apprentice” and a “wizard”, who has access to information from...
  • Text8

    Text8 is a word-embedding benchmark corpus consisting of the first 100 million characters of a cleaned English Wikipedia dump; it is widely used to train word-representation models such as Word2Vec.
  • Validation Dataset

    The Validation Dataset is used for validation; it contains 1,428 images from nine distinct rooms.
  • Wikipedia Comparable Corpora

    A multilingual dataset for topic modeling, built from aligned Wikipedia articles extracted from the Wikipedia Comparable Corpora.
  • Wiki

    A bipartite interaction graph that contains the edits on Wikipedia pages over a month.
  • fr-wiki

    The fr-wiki dataset is a Wikipedia dataset for French, containing 0.5GT.
You can also access this registry using the API (see API Docs).
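A programmatic query against the registry might look like the sketch below. The base URL, query parameters, and JSON response shape are illustrative assumptions, not the registry's actual API; consult the API Docs for the real endpoint names.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL -- an assumption for illustration only; the real
# endpoint is documented in the registry's API Docs.
BASE_URL = "https://example.org/api/v1/datasets"

def build_query_url(tag: str, page: int = 1) -> str:
    """Build a search URL for datasets carrying a given tag."""
    return f"{BASE_URL}?{urlencode({'tag': tag, 'page': page})}"

def parse_results(payload: str) -> list[tuple[str, str]]:
    """Extract (name, description) pairs from a JSON response body,
    assuming a top-level 'results' array of dataset objects."""
    data = json.loads(payload)
    return [(d["name"], d["description"]) for d in data["results"]]

url = build_query_url("Wikipedia")

# A mocked response in the assumed shape, standing in for a live request:
sample = json.dumps({"results": [
    {"name": "DocRED", "description": "Document-level RE dataset."},
]})

print(url)
print(parse_results(sample))
```

The request itself is mocked so the sketch runs offline; in practice you would fetch `url` (e.g. with `urllib.request.urlopen`) and pass the response body to `parse_results`.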