BERT Pretraining Dataset

The BERT pretraining dataset combines the BooksCorpus (about 800M words) and the English Wikipedia corpus (about 2,500M words), roughly 3.3B words in total, used for unsupervised pre-training.
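The exact Wikipedia dump and BookCorpus snapshot used by Devlin et al. were never released, so the corpus can only be approximated. Below is a minimal sketch of assembling a comparable text-only corpus with the Hugging Face datasets library; the "wikipedia" and "bookcorpus" hub datasets and the 2022 Wikipedia snapshot are assumptions, not the original 2018 data.

# Sketch: approximate BERT-style pretraining corpus (not the original snapshots).
from datasets import load_dataset, concatenate_datasets

# English Wikipedia (~2,500M words in the original setup); keep only article text.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# BookCorpus reconstruction (~800M words in the original setup); already text-only.
books = load_dataset("bookcorpus", split="train")

# Single text-only corpus for masked-language-model pretraining.
corpus = concatenate_datasets([wiki, books])
print(corpus)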

Cite this as

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova (2024). Dataset: BERT Pretraining Dataset. https://doi.org/10.57702/x2izm9w0

DOI retrieved: November 25, 2024

Additional Info

Field         Value
Created       November 25, 2024
Last update   November 25, 2024
Defined In    https://doi.org/10.48550/arXiv.1810.04805
Author        J. Devlin
More Authors  M.-W. Chang, K. Lee, K. Toutanova