Text Summarization

Gigaword sentence dataset

The Gigaword sentence dataset is a large corpus of sentences.

Dataset
JSON

DEEP

Detecting Errors through Ensembling Prompts (DEEP) is an end-to-end large language model framework for detecting factual errors in text summarization.

Dataset
JSON

AfricanHLT 2010

The dataset used for the automatic text summarization task, containing documents in three languages.

Dataset
JSON

Wikimarks

The Wikimarks dataset, which consists of 30 million deduplicated paragraphs from all Wikipedia articles.

Dataset
JSON

TREC-CAR Benchmark Y1

The dataset used for the Retrieve-Cluster-Summarize system, consisting of 117 article-level queries and 126 test queries.

Dataset
JSON

RAGSummData

The dataset used in the paper is a collection of dialogues and prompts for training a model to perform retrieval-augmented generation (RAG) based summarization. The dataset is...

Dataset
JSON

CNN, XSum, Gigaword News Headline, and Annotated Enron Subject Line Corpus

The CNN, XSum, Gigaword News Headline, and Annotated Enron Subject Line Corpus are datasets used for various NLP tasks.

Dataset
JSON

CNN/DailyMail and XSum

The CNN/DailyMail dataset is a collection of news articles, and the XSum dataset is a collection of news articles with summaries.

Dataset
JSON

Aggrefact-Unified dataset

The Aggrefact-Unified dataset is a collection of news documents and summaries with factual errors.

Dataset
JSON

IgboSum1500

IgboSum1500 is an Igbo text summarization dataset, housing 1,500 articles.

Dataset
JSON

Rotten Tomatoes

The Rotten Tomatoes dataset has 5,331 positive and 5,331 negative review sentences.

Dataset
JSON

Sentence Reduction for Automatic Text Summarization

The dataset used in this paper for the sentence reduction task.

Dataset
JSON

TL;DR: Mining reddit to learn automatic summarization

The authors used the TL;DR dataset, which consists of Reddit posts with summaries.

Dataset
JSON

Famous Keyword Twitter Replies

The Famous Keyword Twitter Replies dataset is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies.

Dataset
JSON

Document Summarization Dataset

The dataset used in the paper is a document summarization dataset. The goal is to extract sentences (with character budget B) to maximize coverage of human-annotated summaries.
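
The budgeted extraction objective described above can be illustrated with a minimal sketch. It is not the paper's method: the greedy strategy, the use of unique lowercased words as the coverage measure, and the names `tokenize` and `greedy_extract` are all assumptions made for illustration.

```python
def tokenize(text):
    """Assumed coverage unit: the set of lowercased whitespace-separated words."""
    return set(text.lower().split())


def greedy_extract(sentences, reference, budget):
    """Greedily select sentences within a character budget.

    At each step, pick the sentence that adds the most not-yet-covered
    reference words per character, skipping any sentence that would push
    the total length past `budget`. Lengths count sentence characters
    only (no separators), which is a simplifying assumption.
    Returns (selected sentences in document order, fraction of reference
    words covered).
    """
    target = tokenize(reference)
    covered = set()
    chosen = []
    used = 0
    remaining = list(range(len(sentences)))

    while remaining:
        best, best_gain = None, 0.0
        for i in remaining:
            length = len(sentences[i])
            if length == 0 or used + length > budget:
                continue  # empty sentence, or it does not fit in the budget
            new_words = (tokenize(sentences[i]) & target) - covered
            gain = len(new_words) / length
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nothing fits, or nothing adds coverage
        chosen.append(best)
        covered |= tokenize(sentences[best]) & target
        used += len(sentences[best])
        remaining.remove(best)

    chosen.sort()  # restore document order
    coverage = len(covered) / len(target) if target else 0.0
    return [sentences[i] for i in chosen], coverage


if __name__ == "__main__":
    doc = ["the cat sat on the mat", "dogs bark loudly at night", "a cat sat"]
    summary, coverage = greedy_extract(doc, "cat sat mat", budget=25)
    print(summary, round(coverage, 2))
```

Greedy selection by gain per character is only a common heuristic for budgeted coverage problems; an exact solution would need a knapsack-style search, and a real evaluation would likely use a metric such as ROUGE rather than raw word overlap.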

Dataset
JSON

The dataset used for the text summarization task, where a summarizer produces an utterance made up of one or multiple sentences to succinctly report the main content of a text.

Dataset
JSON

Se2: Sequential Example Selection for In-Context Learning

The paper proposes a novel approach to the sequential example selection paradigm for in-context learning.

Dataset
JSON

WCEP

Wikipedia Current Events Portal (WCEP) dataset, which consists of short, human-written summaries of news events, the articles for which are all extracted from the Wikipedia...

Dataset
JSON

Multi-News

The dataset used in the paper is a collection of 45K news articles and corresponding summaries, where each summary is professionally crafted and provides links to the original...

Dataset
JSON

TAC’08 and TAC’09 datasets

Two multi-document summarization datasets from the Text Analysis Conference (TAC) shared tasks: TAC’08 and TAC’09.

Dataset
JSON
