-
Socher et al. (2013) dataset
The dataset used in the paper is the large-scale movie-review corpus of Socher et al. (2013). -
Cross-topic Argument Mining from Heterogeneous Sources
Cross-topic Argument Mining from Heterogeneous Sources. -
SemEval-2016 Task 6: Detecting stance in tweets
SemEval-2016 Task 6: Detecting stance in tweets. -
Few-Shot Stance Detection via Target-Aware Prompt Distillation
Stance detection aims to identify whether the author of a text is in favor of, against, or neutral toward a given target. The main challenge of this task is two-fold: few-shot... -
A Million News Headlines, Fake and real news, Getting Real about Fake News
The dataset combines three separate datasets: A Million News Headlines, Fake and real news, and Getting Real about Fake News. -
Rotten Tomatoes
The Rotten Tomatoes dataset has 5,331 positive and 5,331 negative review sentences. -
HONEST Race
The dataset is used for the toxicity and stereotype mitigation task and consists of 25,000 examples of positive and negative movie reviews. -
Harry Potter unlearning dataset
The dataset used in the paper is a concatenation of the original Harry Potter books and synthetic discussions, blog posts, and wiki-like entries about the books. -
Equity Evaluation Corpus (EEC)
The dataset used in the paper is the Equity Evaluation Corpus (EEC) for emotion prediction, a balanced set of sentences expressing emotions. -
Proprietary Large-Scale Industry Dataset
The dataset is used for the proposed Joint Multi-Domain Learning approach to Automatic Short Answer Grading. -
IMDb Review Dataset
The IMDb review dataset is used for the positive generation task. -
AmazonCat-13K
The dataset is used in the LightDXML paper for extreme multi-label classification. -
The Pile dataset
The Pile is a large-scale dataset containing 800 GB of text data. -
LM-Extraction benchmark
The LM-Extraction benchmark contains 15,000 pairs of prefixes and suffixes derived from The Pile dataset (Gao et al.,... -
TREC05 spam corpus
The dataset used in the paper is the TREC05 spam corpus, which contains 39,999 real ham and 52,790 spam emails. -
Neural Speed Reading with Structural-Jump-LSTM
The dataset consists of 108 news headlines, 72 of which are true and 36 of which are false. -
Sample Selection for Data Augmentation in Natural Language Processing
Deep learning-based text classification models need abundant labeled data to achieve competitive performance. To tackle this, multiple studies try to use data augmentation to... -
Dual-sparse Regularized Randomized Reduction
The paper proposes dual-sparse regularized randomized reduction methods for classification. The dataset used in the paper is the RCV1-binary dataset.