BERT Pretraining Dataset

The BERT pretraining dataset combines the BooksCorpus (about 800M words) and the English Wikipedia corpus (about 2,500M words), roughly 3.3B words in total, used for unsupervised pre-training.
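The exact Wikipedia dump and BookCorpus snapshot used by Devlin et al. were never released, so the corpus can only be approximated. Below is a minimal sketch of assembling a comparable text-only corpus with the Hugging Face datasets library; the "wikipedia" and "bookcorpus" hub datasets and the 2022 Wikipedia snapshot are assumptions, not the original 2018 data.

# Sketch: approximate BERT-style pretraining corpus (not the original snapshots).
from datasets import load_dataset, concatenate_datasets

# English Wikipedia (~2,500M words in the original setup); keep only article text.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# BookCorpus reconstruction (~800M words in the original setup); already text-only.
books = load_dataset("bookcorpus", split="train")

# Single text-only corpus for masked-language-model pretraining.
corpus = concatenate_datasets([wiki, books])
print(corpus)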

Cite this as

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova (2024). Dataset: BERT Pretraining Dataset. https://doi.org/10.57702/x2izm9w0

DOI retrieved: November 25, 2024

Additional Info

Field         Value
Created       November 25, 2024
Last update   November 25, 2024
Defined In    https://doi.org/10.48550/arXiv.1810.04805
Author        J. Devlin
More Authors  M.-W. Chang, K. Lee, K. Toutanova