Dataset - LDM

Posterior Control of Blackbox Generation

Text generation often requires high-precision output that obeys task-specific rules. This fine-grained control is difficult to enforce with off-the-shelf deep learning models.
- Dataset
- JSON
Rotowire

The Rotowire dataset used in the paper.
- Dataset
- JSON
Diverse and Specific Clarification Question Generation with Keywords

Product descriptions on e-commerce websites often suffer from missing important aspects. Clarification question generation (CQ-Gen) can be a promising approach to help alleviate...
- Dataset
- JSON
DrawTextExt

The dataset is used to train the GlyphDraw model for visual text generation. It contains 792k images with 3.3M characters in images and more than 4.8k common unique Chinese...
- Dataset
- JSON
Linear-time minimum Bayes risk decoding with reference aggregation

Linear-time minimum Bayes risk decoding with reference aggregation
- Dataset
- JSON
BERTScore: Evaluating text generation with BERT

BERTScore: Evaluating text generation with BERT
- Dataset
- JSON
Improving Minimum Bayes Risk Decoding with Multi-Prompt

Multi-prompt decoding for conditional text generation
- Dataset
- JSON
TextLogo3K

TextLogo3K is a large-scale dataset of text logos, consisting of 3,470 text logo images in various styles, annotated with pixel-level segmentation, bounding boxes,...
- Dataset
- JSON
Reference Letter Dataset

Reference letter dataset generated under the Context-Based Generation (CBG) setting.
- Dataset
- JSON
AI Wiki

The AI Wiki dataset, used for testing the author-stylized text generation model.
- Dataset
- JSON
Mark Twain Books

A dataset of Mark Twain's books, used for testing the author-stylized text generation model.
- Dataset
- JSON
Opinosis Review Dataset

The Opinosis Review dataset, used for testing the author-stylized text generation model.
- Dataset
- JSON
Wikipedia Corpus

The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7,500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
- Dataset
- JSON
Gutenberg Corpus

A dataset of 2,857 books written by 141 authors, used for pre-training and fine-tuning a language model for author-stylized text generation.
- Dataset
- JSON
ChatGPT model data

ChatGPT model data, used to generate text
- Dataset
- JSON
BLIP2

A vision-language pre-training dataset, BLIP2, which consists of 100 million image-text pairs.
- Dataset
- JSON
TESS: Text-to-Text Self-Conditioned Simplex Diffusion

Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models...
- Dataset
- JSON
C4

The dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
Text-to-image generation via masked generative transformers

Text-to-image generation via masked generative transformers.
- Dataset
- JSON
OpenWebText Corpus

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

28 datasets found