English Wikipedia Dataset

The dataset consists of English Wikipedia articles used to train word vector models. It contains 5.3M articles, 83M sentences, and 1,676M (roughly 1.7 billion) tokens.
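For a sense of how such counts are produced, below is a minimal sketch that streams a pages-articles XML dump and tallies rough article and token counts. It assumes a local copy of the dump (the filename is hypothetical) and plain whitespace tokenization, not the preprocessing actually used by the authors, so its numbers will not match the published figures exactly.

    import bz2
    import xml.etree.ElementTree as ET

    # Hypothetical local copy of the dump listed under Homepage; the exact
    # filename is an assumption -- check the dump directory for real names.
    DUMP_PATH = "enwiki-20170120-pages-articles.xml.bz2"

    def local_name(tag: str) -> str:
        # MediaWiki dumps use a versioned XML namespace ({http://...}text);
        # matching on the local name keeps the sketch schema-agnostic.
        return tag.rsplit("}", 1)[-1]

    pages = 0
    tokens = 0
    with bz2.open(DUMP_PATH, "rb") as f:
        # iterparse streams the file, so the multi-GB dump never sits in memory
        for _, elem in ET.iterparse(f):
            if local_name(elem.tag) == "text" and elem.text:
                pages += 1
                # crude whitespace tokenization; the published counts reflect
                # the authors' own sentence splitting and tokenization
                tokens += len(elem.text.split())
            elem.clear()  # release parsed elements as we go

    print(f"{pages} pages, ~{tokens} whitespace tokens")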

Data and Resources

Cite this as

Sungjoon Park, JinYeong Bak, Alice Oh (2024). Dataset: English Wikipedia Dataset. https://doi.org/10.57702/hel2bi07

DOI retrieved: November 25, 2024

Additional Info

Field        Value
Created      November 25, 2024
Last update  November 25, 2024
Defined In   https://doi.org/10.18653/v1/D17-1041
Authors      Sungjoon Park, JinYeong Bak, Alice Oh
Homepage     https://dumps.wikimedia.org/enwiki/20170120/
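The Homepage field points at the Wikimedia dump directory for the 2017-01-20 snapshot. A hedged sketch for fetching one dump file using only the Python standard library follows; the filename is an assumption, and snapshots this old are often purged from dumps.wikimedia.org, so a mirror may be needed.

    import shutil
    import urllib.request

    # Directory taken from the Homepage field; the filename is an assumption.
    BASE = "https://dumps.wikimedia.org/enwiki/20170120/"
    FILENAME = "enwiki-20170120-pages-articles.xml.bz2"

    with urllib.request.urlopen(BASE + FILENAME) as resp, open(FILENAME, "wb") as out:
        # Stream the response straight to disk; dump files run to tens of GB.
        shutil.copyfileobj(resp, out)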