One Billion Words Dataset

doi:10.57702/zujg4t8j

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
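
As a rough illustration of the next-word prediction task described above, the following minimal sketch builds a count-based bigram model. It is not the method used by the dataset's authors or by the code on the linked homepage. The function names `train_bigram` and `predict_next` and the toy corpus are hypothetical, introduced only for this example, and a bigram model conditions on just the single previous word instead of the full preceding sequence.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts and ends are modeled too.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev_word):
    """Return the most frequent successor of prev_word, or None if it was never seen."""
    followers = counts.get(prev_word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, made up for illustration (not drawn from the dataset).
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram(corpus)

print(predict_next(model, "sat"))
print(predict_next(model, "the"))
```

Neural language models trained on a corpus of this kind replace the count table with a learned conditional distribution over the vocabulary, but the prediction target is the same.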

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Cite this as

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov (2024). Dataset: One Billion Words Dataset. https://doi.org/10.57702/zujg4t8j

DOI retrieved: December 2, 2024

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Defined In	https://doi.org/10.48550/arXiv.2406.07524
Author	Subham Sekhar Sahoo
More Authors	Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov
Homepage	https://github.com/louaaron/Score-Entropy-Discrete-Diffusion/blob/main/data.py