Penn Tree Bank

The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The vocabulary has 10k words. The dataset is used for word-level language modeling.
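
The split sizes and vocabulary size quoted above can be checked with a simple token count. The following is a minimal sketch, not an official loader: it assumes each split is plain text with whitespace-separated tokens, one sentence per line. The function name `split_stats`, the `count_eos` option, and the `<eos>` marker are illustrative assumptions; this page does not say whether the reported counts include an end-of-sentence token.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_stats(lines: Iterable[str], count_eos: bool = False) -> Tuple[int, Counter]:
    """Return (token_count, token_frequencies) for whitespace-tokenized lines.

    count_eos is an assumption for illustration: some language-modeling
    setups append one end-of-sentence marker per line before counting.
    """
    freqs: Counter = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if count_eos:
            tokens.append("<eos>")  # hypothetical marker name
        freqs.update(tokens)
    return sum(freqs.values()), freqs


# Toy in-memory sample standing in for one split of the corpus.
sample = ["the cat sat", "", "the dog sat down"]
n_tokens, freqs = split_stats(sample)
print(f"tokens={n_tokens} vocabulary={len(freqs)}")
```

To apply this to the real data, pass an open file handle for each split to `split_stats` (the file names depend on the distribution you downloaded) and compare the token counts with the figures above. The vocabulary size is normally taken from the training split.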

Cite this as

Alexandre de Brébisson, Pascal Vincent (2024). Dataset: Penn Tree Bank. https://doi.org/10.57702/l0jnm3fd

DOI retrieved: December 16, 2024

Additional Info

Field: Value
Created: December 16, 2024
Last update: December 16, 2024
Defined In: https://doi.org/10.48550/arXiv.1506.08170
Citation:
  • https://doi.org/10.48550/arXiv.2402.02769
  • https://doi.org/10.48550/arXiv.1705.09353
  • https://doi.org/10.48550/arXiv.1604.08859
Author: Alexandre de Brébisson
More Authors: Pascal Vincent
Homepage: https://www.nltk.org/datasets/index.html