Dataset - LDM

Wikipedia Detox project

The dataset used in the paper is a collection of 100,000 Wikipedia talk page comments manually labelled by workers on the Crowdflower platform for 'toxicity'.
- Dataset
- JSON
WMT 2020 Sentence-Level Direct Assessment dataset

The dataset used in the competition for the Sentence-Level Direct Assessment shared task is composed of data extracted from Wikipedia for six language pairs, consisting of...
- Dataset
- JSON
French Wikipedia

French Wikipedia corpus
- Dataset
- JSON
Wikipedia as multilingual source of comparable corpora

Wikipedia as multilingual source of comparable corpora.
- Dataset
- JSON
WikiNER

The dataset includes a larger set of English Wikipedia documents, which are tagged with named entities.
- Dataset
- JSON
Wiki10

The dataset includes a subset of English Wikipedia documents, which are tagged collaboratively by users from the social bookmarking site Delicious.
- Dataset
- JSON
Featured

The dataset contains 1,843 statements from featured articles in Wikipedia.
- Dataset
- JSON
CW-Hard

The dataset contains 280,538 POV-tagged statements from Wikipedia revisions.
- Dataset
- JSON
Wikipedia4

Electricity2, Traffic3, and Wikipedia4, preprocessed exactly as in (Salinas et al., 2019a), with their properties listed in Table 3.
- Dataset
- JSON
Electricity2, Traffic3, and Wikipedia4

Electricity2, Traffic3, and Wikipedia4, preprocessed exactly as in (Salinas et al., 2019a), with their properties listed in Table 3.
- Dataset
- JSON
WCEP

The Wikipedia Current Events Portal (WCEP) dataset consists of short, human-written summaries of news events, the articles for which are all extracted from the Wikipedia...
- Dataset
- JSON
Wiki2

The dataset contains daily Wikipedia article views data.
- Dataset
- JSON
Wikipedia Dispute Corpus

A newly created corpus of discussions from Wikipedia Talk pages for dispute detection.
- Dataset
- JSON
Authority and Alignment in Wikipedia Discussions (AAWD)

A newly created corpus of Wikipedia Talk pages for dispute detection.
- Dataset
- JSON
Local and global algorithms for disambiguation to Wikipedia

Local and global algorithms for disambiguation to Wikipedia.
- Dataset
- JSON
Fast and accurate annotation of short texts with Wikipedia pages

Fast and accurate annotation of short texts with Wikipedia pages.
- Dataset
- JSON
FEVER: A Large-Scale Dataset for Fact Extraction and Verification

The FEVER dataset consists of 185,455 annotated claims, together with 5,416,537 Wikipedia documents containing roughly 25 million sentences as potential evidence.
- Dataset
- JSON
Bias in Bios Dataset

Bias in Bios dataset, a personal biography dataset with information extracted from Wikipedia.
- Dataset
- JSON
UMDWikipedia dataset

The UMDWikipedia dataset contains information on around 770K edits from January 2013 to July 2014 (19 months), made by 17,105 vandals and 17,105 benign users.
- Dataset
- JSON
Wikipedia Corpus

The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7,500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

31 datasets found