Penn Tree Bank

doi:10.57702/l0jnm3fd

The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The vocabulary has 10k words. The dataset is used for word-level language modeling.
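
The split sizes and vocabulary size quoted above are the kind of figures one can check after downloading the data. The following is a minimal sketch, assuming each split is plain text with whitespace-separated tokens; the helper name `corpus_stats` and the tiny in-memory sample are illustrative and not part of the dataset or any library.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Return (token count, word frequencies) for whitespace-tokenized lines."""
    counts: Counter = Counter()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace.
        counts.update(line.split())
    return sum(counts.values()), counts


# Tiny in-memory stand-in for one split (hypothetical data); the real
# splits are far larger.
sample = [
    "the cat sat on the mat",
    "the dog sat",
]
n_tokens, vocab = corpus_stats(sample)
print(n_tokens, len(vocab))
```

To apply this to a real split, pass an open file object instead of `sample`, since iterating over a text file yields its lines. The number of distinct keys in the returned counter is the vocabulary size for that split.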

Data and Resources

Original Metadata (JSON)
The json representation of the dataset with its distributions based on DCAT.

Cite this as

Alexandre de Brébisson, Pascal Vincent (2024). Dataset: Penn Tree Bank. https://doi.org/10.57702/l0jnm3fd

DOI retrieved: December 16, 2024

Additional Info

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.1506.08170
Citation	https://doi.org/10.48550/arXiv.2402.02769 https://doi.org/10.48550/arXiv.1705.09353 https://doi.org/10.48550/arXiv.1604.08859
Author	Alexandre de Brébisson
More Authors	Pascal Vincent
Homepage	https://www.nltk.org/datasets/index.html