-
Clickbait Challenge 2017
The Clickbait Challenge 2017 dataset is a collection of social media posts and their corresponding article titles, used for clickbait detection. -
Fake News Challenge Stage 1 (FNC-1)
The FNC-1 dataset poses stance detection as a supervised classification task, where the goal is to automatically predict the stance label of each example. -
SemEval-2016 Task 6: Detecting stance in tweets
The SemEval-2016 Task 6 dataset is used for detecting stance in tweets. -
Rotten Tomatoes
The Rotten Tomatoes dataset has 5331 positive and 5331 negative review sentences. -
SST2, IMDB, Rotten Tomatoes
The SST2 dataset has 6920/872/1821 example sentences in the train/dev/test sets. The task is binary classification into positive/negative sentiment. The IMDB dataset has... -
HONEST Race
The dataset is used for the toxicity and stereotype mitigation task and consists of 25 thousand examples of positive and negative movie reviews. -
Sentiment Analysis Dataset
The dataset used in the paper is a collection of unstructured text data from social networks, news sites, and forums. -
Microsoft Academic Graph
The Microsoft Academic Graph (MAG) dataset is used to construct Maple, a multi-field benchmark for evaluating scientific literature tagging. -
Fake News Detection dataset
The Fake News Detection dataset contains text documents with fake news labels. -
Spam Detection dataset
The Spam Detection dataset contains text documents with spam labels. -
Polarity dataset
The Polarity dataset contains text documents with sentiment labels. -
IMDb Review Dataset
The IMDb review dataset is used for the positive generation task. -
AmazonCat-13K
The dataset used in the LightDXML paper for extreme multi-label classification. -
The Pile dataset
The Pile dataset is a large-scale dataset containing 800GB of text data. -
LM-Extraction benchmark
The LM-Extraction benchmark is derived from The Pile dataset (Gao et al., 2020) and contains 15,000 pairs of prefixes and suffixes derived from The Pile dataset (Gao et al.,... -
TREC05 spam corpus
The dataset used in the paper is the TREC05 spam corpus, which contains 39,999 real ham and 52,790 spam emails. -
Neural Speed Reading with Structural-Jump-LSTM
The dataset consists of 108 news headlines, 72 of which are true and 36 of which are false.
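
Several entries above state class counts or split sizes (Rotten Tomatoes, SST2, the TREC05 spam corpus, and the 108 news headlines). As a rough illustration of what those numbers imply, the sketch below computes totals and the accuracy of always predicting the majority class. The names `CLASS_COUNTS`, `SST2_SPLITS`, `majority_baseline`, and `split_fractions` are hypothetical helpers introduced here and do not come from any of the listed datasets or papers. The counts are copied from the entries as written.

```python
# Class counts copied from the entries above (not independently verified).
CLASS_COUNTS = {
    "Rotten Tomatoes": {"positive": 5331, "negative": 5331},
    "TREC05 spam corpus": {"ham": 39999, "spam": 52790},
    "News headlines (last entry)": {"true": 72, "false": 36},
}

# SST2 split sizes copied from the SST2 entry above.
SST2_SPLITS = {"train": 6920, "dev": 872, "test": 1821}


def majority_baseline(counts):
    """Return (majority label, accuracy of always predicting that label).

    On a tie, the first label in the dict is returned.
    """
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


def split_fractions(splits):
    """Return the share of examples that falls in each split."""
    total = sum(splits.values())
    return {name: size / total for name, size in splits.items()}


if __name__ == "__main__":
    for name, counts in CLASS_COUNTS.items():
        label, acc = majority_baseline(counts)
        total = sum(counts.values())
        print(f"{name}: total={total}, majority={label}, baseline={acc:.3f}")

    for split, frac in split_fractions(SST2_SPLITS).items():
        print(f"SST2 {split}: {frac:.3f}")
```

A balanced set such as Rotten Tomatoes gives a baseline of 0.5, while the headline set (72 of 108 true) gives roughly 0.667, so raw accuracy on the latter is worth reading against that higher floor.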