One Billion Words Dataset

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
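
As a rough illustration of the next-word prediction task described above, the following is a minimal sketch of a count-based bigram model. The toy corpus and the helper names `train_bigram` and `predict_next` are hypothetical, introduced only for this example. They are not part of the dataset or of the code linked from this record, and the models in the defining paper are considerably more sophisticated.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical sentence-boundary markers, added for illustration.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

# Toy stand-in corpus (an assumption; NOT drawn from the actual dataset).
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]

model = train_bigram(toy_corpus)
print(predict_next(model, "sat"))  # "on" follows "sat" in both occurrences
```

A bigram model conditions only on the single previous word; language models evaluated on this kind of dataset typically condition on the full preceding sequence.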

Cite this as

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov (2024). Dataset: One Billion Words Dataset. https://doi.org/10.57702/zujg4t8j

DOI retrieved: December 2, 2024

Additional Info

Field          Value
Created        December 2, 2024
Last update    December 2, 2024
Defined In     https://doi.org/10.48550/arXiv.2406.07524
Author         Subham Sekhar Sahoo
More Authors   Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov
Homepage       https://github.com/louaaron/Score-Entropy-Discrete-Diffusion/blob/main/data.py