20 Newsgroups dataset

The 20 Newsgroups dataset consists of 18,845 posts taken from the Usenet newsgroup collection. Each post belongs to exactly one newsgroup. Following the preprocessing in [12] and [7], the data was partitioned chronologically into 11,314 training and 7,531 test articles. After stopwords were removed and the remaining words stemmed, the 2,000 most frequent words in the training set were used to represent the documents.
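
The preprocessing described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' actual pipeline: the stopword list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), the function names, and the choice of raw word counts as the document representation are all hypothetical. The one property taken from the description is that the vocabulary is built from the training set only and then reused for the test set.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a larger one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in", "on", "for", "it"}


def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def build_vocabulary(train_docs, size=2000):
    """Return the `size` most frequent stemmed words in the training documents."""
    counts = Counter()
    for doc in train_docs:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(size)]


def vectorize(doc, vocabulary):
    """Represent a document as counts over the vocabulary (assumed representation)."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


# Toy data standing in for the 11,314 training and 7,531 test articles.
train_docs = [
    "The printers are printing pages",
    "Printing pages",
    "Printed",
]
test_docs = ["It printed pages and pages"]

# The vocabulary comes from the training set only; size=2 keeps the toy example small.
vocabulary = build_vocabulary(train_docs, size=2)
train_vectors = [vectorize(d, vocabulary) for d in train_docs]
test_vectors = [vectorize(d, vocabulary) for d in test_docs]
```

Words outside the training vocabulary are simply dropped when a test document is vectorized, which is why the vocabulary has to be fixed before the test set is touched.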

Cite this as

Nitish Srivastava, Geoffrey Hinton, Ruslan Salakhutdinov (2024). Dataset: 20 Newsgroups dataset. https://doi.org/10.57702/f4hmxqob

DOI retrieved: November 25, 2024

Additional Info

Field: Value
Created: November 25, 2024
Last update: December 2, 2024
Defined In: https://doi.org/10.48550/arXiv.1907.04919
Citation:
  • https://doi.org/10.48550/arXiv.1611.05940
  • https://doi.org/10.1016/j.procs.2015.03.074
Author: Nitish Srivastava
More Authors:
  • Geoffrey Hinton
  • Ruslan Salakhutdinov
Homepage: https://www.cs.toronto.edu/~rsalakhu/20Newsgroups.html