MTG: A Benchmark Suite for Multilingual Text Generation
MTG is a multilingual multiway text generation benchmark suite. It is the first proposed multilingual multiway text generation dataset and provides the largest human-annotated data of its kind.
News-to-Report Dataset
A dataset for automatically generating macro research reports from economic news.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
SQuAD is a reading comprehension benchmark of 100,000+ crowdsourced questions on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage.
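As an illustration of the span-based answer format, here is a minimal sketch with a made-up passage; the flat field layout (context, question, answers with character offsets) is an assumed convenience representation, not the official nested JSON schema.

```python
# Minimal sketch of a SQuAD-style record. The passage is made up, and the
# flat field layout (context / question / answers with character offsets)
# is an assumed convenience format, not the official nested JSON schema.
example = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "question": "Which basin does the Amazon rainforest cover?",
    "answers": {"text": ["the Amazon basin"], "answer_start": [37]},
}

# Every answer is a span of the context, recoverable from its character offset.
start = example["answers"]["answer_start"][0]
answer = example["answers"]["text"][0]
assert example["context"][start:start + len(answer)] == answer
```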
Bold Dataset
The BOLD dataset contains prompts for open-ended text generation, used to benchmark social bias in language models across domains such as profession, gender, race, religion, and political ideology.
Content Preserving Text Generation with Attribute Controls
Data accompanying the paper, used for generating sentences that modify specified attributes of the input while preserving its content.
Towards a unified multi-dimensional evaluator for text generation
The NewsRoom evaluation set used here consists of 60 source articles, each paired with 7 system-output summaries.
Wikipedia Neutrality Corpus
This dataset is used to test the ability of large language models to detect and correct biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy.
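For illustration, a WNC-style item pairs a sentence flagged in an NPOV edit with its neutralized revision. The pair below is invented, and the field names are an assumption about how such pairs might be represented, not the corpus's released column names.

```python
# Invented example of a biased/neutralized sentence pair in the spirit of the
# Wikipedia Neutrality Corpus; field names are illustrative only.
npov_pair = {
    "biased": "The senator delivered a brilliant speech on the new policy.",
    "neutral": "The senator delivered a speech on the new policy.",
}

# A detection-and-correction system could be scored on recovering the edit,
# e.g., by exact match against the neutralized sentence.
def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip() == reference.strip()
```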
ROCStories (+GPT-J)
A corpus and cloze evaluation for deeper understanding of commonsense stories.
ROCStories
The ROCStories corpus is a collection of crowdsourced five-sentence everyday stories rich in causal and temporal relations.
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
The paper that introduces the ROCStories corpus and its accompanying Story Cloze evaluation.
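The cloze evaluation pairs a four-sentence context with two candidate endings, and systems are scored on choosing the coherent one. The item below is invented, and the field names are illustrative rather than the released column names.

```python
# Invented Story Cloze-style item: four context sentences plus two candidate
# endings; field names are illustrative, not the released column names.
cloze_item = {
    "context": [
        "Mia's old laptop finally stopped turning on.",
        "She set aside part of her paycheck for two months.",
        "On Saturday she went to the electronics store.",
        "She compared several models before making up her mind.",
    ],
    "endings": [
        "Mia bought a new laptop and took it home.",   # coherent ending
        "Mia decided she had no use for a computer.",  # incoherent ending
    ],
    "label": 0,  # index of the correct ending
}

def cloze_accuracy(predictions, items):
    """Fraction of items where the predicted ending index matches the label."""
    return sum(p == it["label"] for p, it in zip(predictions, items)) / len(items)
```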
The E2E dataset
The E2E dataset contains restaurant-domain meaning representations described by 8 attributes, including food type, price range, and customer rating, each paired with natural-language descriptions.
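The meaning representations are flat attribute[value] lists. The MR/reference pair below is invented and the exact slot names are an assumption (the released data defines the canonical set of 8 attributes); the sketch only shows how such an MR string can be parsed.

```python
import re

# Invented E2E-style MR/reference pair. The slot names here are assumptions;
# the released dataset defines the canonical inventory of 8 attributes.
mr = ("name[The Olive Grove], eatType[restaurant], food[Italian], "
      "priceRange[moderate], customerRating[4 out of 5], area[riverside], "
      "familyFriendly[yes], near[the city museum]")
reference = ("The Olive Grove is a moderately priced, family-friendly Italian "
             "restaurant on the riverside near the city museum, rated 4 out of 5.")

def parse_mr(mr_string: str) -> dict:
    """Parse an 'attribute[value], attribute[value]' MR string into a dict."""
    return dict(re.findall(r"(\w+)\[([^\]]*)\]", mr_string))

slots = parse_mr(mr)
assert slots["food"] == "Italian" and len(slots) == 8
```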
PersonaChat
Persona-Chat is sourced from conversations between human annotators who are randomly paired and asked to act out assigned persona profiles.
CLIP-GLaSS
In the text-to-image task, the context consists of 20 tokens, to which three fixed tokens representing the static context "the picture of" are concatenated.
Wikitext-2
The paper uses WikiText-2, a language-modeling corpus of roughly 2 million training tokens drawn from verified Good and Featured Wikipedia articles, for its text generation experiments.
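A minimal loading sketch, assuming the Hugging Face `datasets` library is used; the `wikitext-2-raw-v1` configuration keeps the raw, untokenized article text.

```python
# Sketch of loading WikiText-2 via the Hugging Face `datasets` library
# (assumes the library is installed; "wikitext-2-raw-v1" is the raw-text
# configuration, "wikitext-2-v1" the pre-tokenized one with <unk> tokens).
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")
print(wikitext)                        # train / validation / test splits
print(wikitext["train"][10]["text"])   # each record exposes a single "text" field
```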
Wizard of Wikipedia
Wizard of Wikipedia is a recent, large-scale dataset of multi-turn knowledge-grounded dialogues between an "apprentice" and a "wizard", who has access to knowledge retrieved from Wikipedia.