-
Aggrefact-Unified dataset
The Aggrefact-Unified dataset is a collection of news documents and summaries with factual errors. -
IgboSum1500
IgboSum1500 is an Igbo text summarization dataset containing 1,500 articles. -
Rotten Tomatoes
The Rotten Tomatoes dataset has 5,331 positive and 5,331 negative review sentences. -
Sentence Reduction for Automatic Text Summarization
This dataset is used in the paper for the sentence reduction task. -
TL;DR: Mining reddit to learn automatic summarization
The authors used the TL;DR dataset, which consists of Reddit posts with summaries. -
Famous Keyword Twitter Replies
The Famous Keyword Twitter Replies dataset is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. -
Document Summarization Dataset
The dataset used in the paper is a document summarization dataset. The goal is to extract sentences, within a character budget B, that maximize coverage of the human-annotated summaries. -
Text Summarization
A dataset for the text summarization task, in which a summarizer produces an utterance of one or more sentences that succinctly reports the main content of a text. -
Se2: Sequential Example Selection for In-Context Learning
The paper proposes a novel approach to the sequential example selection paradigm for in-context learning. -
Multi-News
The dataset used in the paper is a collection of 45K news articles and corresponding summaries, where each summary is professionally crafted and provides links to the original... -
TAC’08 and TAC’09 datasets
Two multi-document summarization datasets from the Text Analysis Conference (TAC) shared tasks: TAC’08 and TAC’09. -
XSUM Dataset
The XSUM dataset comprises 226,711 British Broadcasting Corporation (BBC) articles paired with their single-sentence summaries. -
DUC2002, DUC2003, DUC2005 datasets
Multi-document summarization datasets -
Wikibio Dataset
Text summarization and data-to-text generation datasets -
Gigaword and New York Times Annotated Corpus
Text summarization and data-to-text generation datasets -
Towards a unified multi-dimensional evaluator for text generation
The NewsRoom dataset consists of 60 input source texts and 7 output summaries for each sample. -
TEDLIUM Corpus
The TEDLIUM corpus is a large-volume corpus used for speech recognition and text summarization.
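
The Document Summarization Dataset entry above describes a budgeted extraction objective: pick sentences whose total length stays within a character budget B while covering as much of the human-annotated summary as possible. The following is a minimal sketch of one way such an objective could be approximated, not the method of the cited paper. It assumes coverage is measured as unigram overlap with the reference and uses a greedy gain-per-character heuristic; the names `tokens` and `greedy_extract` are hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens (an assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def greedy_extract(sentences, reference, budget):
    """Greedily select sentences under a character budget.

    Coverage is assumed to be the fraction of reference unigrams that
    appear in the selected sentences. Returns (selected sentences in
    document order, coverage in [0, 1]).
    """
    ref = tokens(reference)
    chosen, covered, used = [], set(), 0
    remaining = list(range(len(sentences)))

    while remaining:
        best, best_gain = None, 0.0
        for i in remaining:
            cost = len(sentences[i])
            # Skip empty sentences and those that would exceed the budget.
            if cost == 0 or used + cost > budget:
                continue
            # Newly covered reference tokens per character spent.
            gain = len((tokens(sentences[i]) & ref) - covered) / cost
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nothing fits, or nothing adds new coverage
        chosen.append(best)
        covered |= tokens(sentences[best]) & ref
        used += len(sentences[best])
        remaining.remove(best)

    coverage = len(covered) / len(ref) if ref else 0.0
    return [sentences[i] for i in sorted(chosen)], coverage
```

Greedy selection by gain per character is a common heuristic for budgeted coverage problems; it is not guaranteed to find the optimal subset, and a different coverage measure (for example ROUGE) could be substituted for the unigram overlap assumed here.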