- CNN-DM Dataset
The CNN-DM dataset contains news articles and is used for training language models.
- AP News Corpus
The AP News corpus contains professionally edited news articles; its vocabulary plateaus much faster than that of the Amazon corpus.
- AG News Dataset
The AG News dataset contains news articles from over 2,000 news sources, annotated by type of news: Sports, World, Business, and Science/Tech. It provides 120k training and 7.6k test samples.
- CNN/DailyMail and XSum
The CNN/DailyMail dataset is a collection of news articles paired with multi-sentence highlight summaries, and the XSum dataset is a collection of BBC news articles paired with single-sentence summaries.
- LOCO dataset
The LOCO dataset consists of a large number of documents collected from 58 conspiracy-theory media sources and 92 mainstream media sources.
- BFRS Dataset
The BFRS dataset contains news stories from Pakistan with labels for various categories related to political violence.
- Crowd Counting Consortium
The Crowd Counting Consortium dataset contains news stories from Pakistan with labels for various categories.
- GDPR Media Discourse
The GDPR Media Discourse dataset contains news articles about the GDPR from French, German, UK, and US sources.
- Berita Dataset
The Berita dataset consists of 50,304 digital Indonesian news articles shared online through Twitter.
- AG's News Corpus
AG's News Corpus is the collection of news articles from which the AG News classification dataset is drawn.
- Reuters Dataset
The Reuters dataset is a text classification dataset containing 21,578 samples.