Dataset - LDM

Aggrefact-Unified dataset

The Aggrefact-Unified dataset is a collection of news documents and summaries with factual errors.
- Dataset
- JSON
Rotten Tomatoes

The Rotten Tomatoes dataset has 5331 positive and 5331 negative review sentences.
- Dataset
- JSON
Sentence Reduction for Automatic Text Summarization

This dataset is used in the paper for the sentence reduction task.
- Dataset
- JSON
TL;DR: Mining Reddit to learn automatic summarization

The authors used the TL;DR dataset, which consists of Reddit posts with summaries.
- Dataset
- JSON
Document Summarization Dataset

The dataset used in the paper is a document summarization dataset. The goal is to extract sentences (with character budget B) to maximize coverage of human-annotated summaries.
- Dataset
- JSON
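
The extraction objective described for the Document Summarization Dataset (choose sentences within a character budget B so that they cover the human-annotated summary as well as possible) can be illustrated with a small greedy sketch. This is not the method from the paper: the unigram-overlap `coverage` measure, the function names, and the choice to count only sentence characters against the budget are all assumptions made for illustration.

```python
def coverage(selected, reference):
    """Fraction of distinct reference words found in the selected sentences.

    Assumed coverage measure (simple unigram overlap), not the paper's metric.
    """
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    sel_words = set(" ".join(selected).lower().split())
    return len(ref_words & sel_words) / len(ref_words)


def greedy_extract(sentences, reference, budget):
    """Greedily add the sentence with the largest coverage gain that still fits.

    `budget` is the character budget B; only sentence characters are counted.
    """
    selected, used = [], 0
    remaining = list(sentences)
    while remaining:
        base = coverage(selected, reference)
        best, best_gain = None, 0.0
        for s in remaining:
            if used + len(s) > budget:
                continue  # sentence would exceed the character budget
            gain = coverage(selected + [s], reference) - base
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            break  # nothing fits, or nothing improves coverage
        selected.append(best)
        used += len(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "dogs bark loudly", "a cat and a mat"]
    print(greedy_extract(docs, "cat on mat", budget=30))
```

Greedy selection is only a heuristic here: it does not guarantee the best-covering subset under the budget, but it is short and makes the trade-off between sentence length and coverage gain visible.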
Text Summarization

A dataset for the text summarization task, in which a summarizer produces an utterance of one or more sentences that succinctly reports the main content of a text.
- Dataset
- JSON
WCEP

The Wikipedia Current Events Portal (WCEP) dataset consists of short, human-written summaries of news events, the articles for which are all extracted from the Wikipedia...
- Dataset
- JSON
Multi-News

The dataset used in the paper is a collection of 45K news articles and corresponding summaries, where each summary is professionally crafted and provides links to the original...
- Dataset
- JSON
XSUM Dataset

The XSUM dataset comprises 226,711 British Broadcasting Corporation (BBC) articles paired with their single-sentence summaries.
- Dataset
- JSON
DUC-2004

The DUC-2004 dataset is used for sentence summarization. It contains 500 documents, each with 4 model summaries.
- Dataset
- JSON
NYT

Text summarization aims to extract essential information from a piece of text and transform the text into a concise version.
- Dataset
- JSON
CNN/DM

Text summarization aims to extract essential information from a piece of text and transform the text into a concise version.
- Dataset
- JSON
English Gigaword

Text summarization aims to extract essential information from a piece of text and transform the text into a concise version.
- Dataset
- JSON
Smart Reply and Ambient Clinical Intelligence

The dataset used for Smart Reply and Ambient Clinical Intelligence tasks.
- Dataset
- JSON
SAMSum

The SAMSum dataset is a benchmark for automatic summarization evaluation, containing dialogues and their associated reference summaries.
- Dataset
- JSON
CLCV

The CLCV dataset is used for evaluation.
- Dataset
- JSON
ARXIV

The ARXIV dataset is used for evaluation.
- Dataset
- JSON
DUC 2007

The DUC 2007 dataset is used for evaluation.
- Dataset
- JSON
XSUM

The XSUM dataset is used for training and evaluation.
- Dataset
- JSON
DeFacto

The DeFacto dataset is a resource specifically curated to enhance the factual consistency of machine-generated summaries through the inclusion of human-annotated demonstrations...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

23 datasets found