Text Generation - Groups

DrawTextExt

The dataset is used to train the GlyphDraw model for visual text generation. It contains 792k images with 3.3M characters in images and more than 4.8k common unique Chinese...
- Dataset
- JSON
Linear-time minimum Bayes risk decoding with reference aggregation

Linear-time minimum Bayes risk decoding with reference aggregation
- Dataset
- JSON
BERTScore: Evaluating text generation with BERT

BERTScore: Evaluating text generation with BERT
- Dataset
- JSON
Improving Minimum Bayes Risk Decoding with Multi-Prompt

Multi-prompt decoding for conditional text generation
- Dataset
- JSON
TextLogo3K

The TextLogo3K dataset is a large-scale dataset of text logos, consisting of 3,470 text logo images in various styles, annotated with pixel-level segmentation, bounding boxes,...
- Dataset
- JSON
Reference Letter Dataset

Reference letter dataset generated under the Context-Based Generation (CBG) setting.
- Dataset
- JSON
AI Wiki

The AI Wiki dataset, used for testing the author-stylized text generation model.
- Dataset
- JSON
Mark Twain Books

A dataset of Mark Twain's books, used for testing the author-stylized text generation model.
- Dataset
- JSON
Opinosis Review Dataset

The Opinosis Review dataset, used for testing the author-stylized text generation model.
- Dataset
- JSON
Wikipedia Corpus

The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7,500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
- Dataset
- JSON
Gutenberg Corpus

A dataset of 2,857 books written by 141 authors, used for pre-training and fine-tuning a language model for author-stylized text generation.
- Dataset
- JSON
ChatGPT model data

ChatGPT model data, used to generate text.
- Dataset
- JSON
BLIP2

BLIP2, a vision-language pre-training dataset consisting of 100 million image-text pairs.
- Dataset
- JSON
TESS: Text-to-Text Self-Conditioned Simplex Diffusion

Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models...
- Dataset
- JSON
C4

The dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
CommonGen

Commonsense generation aims to generate a realistic sentence describing a daily scene under the given concepts, which is very challenging, since it requires models to have...
- Dataset
- JSON
SSD-LM

Semi-autoregressive simplex-based diffusion language model for text generation and modular control
- Dataset
- JSON
Wikitext-103

The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles.
- Dataset
- JSON
SeqDiffuSeq

The dataset used in the SeqDiffuSeq paper for sequence-to-sequence text generation.
- Dataset
- JSON
BookCorpus

The dataset used in this paper for unsupervised sentence representation learning, consisting of paragraphs from unlabeled text.
- Dataset
- JSON

21 datasets found