Dataset - LDM

AllNews

The dataset used in this paper is a collection of news articles from AllNews.
- Dataset
- JSON

Wiki40B

The dataset used in this paper is a collection of documents from Wikipedia.
- Dataset
- JSON

Yahoo Answer and Yelp15 review

Two large-scale document classification datasets, Yahoo Answer and Yelp15 review, which represent topic classification and sentiment classification respectively.
- Dataset
- JSON

CommonCrawl

CommonCrawl is a non-profit organization that provides a large corpus of web pages for research and development purposes.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

4 datasets found