Billion Word Benchmark Dataset

The dataset contains 768M tokens for language modeling.

Cite this as

Hassan et al. (2024). Dataset: Billion Word Benchmark Dataset. https://doi.org/10.57702/bprj7ycm

DOI retrieved: December 3, 2024

Additional Info

Created: December 3, 2024
Last update: December 3, 2024
Author: Hassan et al.