Text Generation - Groups

EGOODS

A large native one-to-many text dataset for text generation tasks, constructed to accelerate the research of diverse text generation.
- Dataset
- JSON
BLIP2

BLIP2 is a vision-language pre-training dataset consisting of 100 million image-text pairs.
- Dataset
- JSON
The E2E dataset

The E2E dataset contains restaurant reviews labeled by 8 fields including food type, price, and customer rating.
- Dataset
- JSON
MTTN: Multi-Pair Text to Text Narratives for Prompt Generation

A large-scale dataset for generating prompts that can be used in diffusion models for text-to-text generation tasks.
- Dataset
- JSON
CLIP-GLaSS

The dataset used for the text-to-image task consists of 20 context tokens concatenated with three fixed tokens that represent the static context "the picture of".
- Dataset
- JSON
Wikitext-2

The source paper does not describe its dataset explicitly, but it mentions that the authors used the Wikitext-2 dataset for text generation tasks.
- Dataset
- JSON
TESS: Text-to-Text Self-Conditioned Simplex Diffusion

Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models...
- Dataset
- JSON
MME

MME: A comprehensive evaluation benchmark for multimodal large language models
- Dataset
- JSON
MMBench

MMBench: Is your multi-modal model an all-around player?
- Dataset
- JSON
Language models are few-shot learners

A language model that demonstrates capabilities in processing and generating human-like text.
- Dataset
- JSON
MMICL

MMICL: Empowering vision-language model with multi-modal in-context learning
- Dataset
- JSON
Prompt Highlighter

Prompt Highlighter is a novel paradigm for user-model interactions in multi-modal LLMs, offering output control through a token-level highlighting mechanism.
- Dataset
- JSON
C4

A dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
CommonGen

Commonsense generation aims to generate a realistic sentence describing a daily scene under the given concepts, which is very challenging, since it requires models to have...
- Dataset
- JSON
SSD-LM

Semi-autoregressive simplex-based diffusion language model for text generation and modular control
- Dataset
- JSON
Text8

Text8 is a corpus of cleaned English Wikipedia text commonly used to train Word2Vec, a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
STC dataset

The STC dataset is a short text conversation dataset used for evaluating the performance of conversation response generation models.
- Dataset
- JSON
Wikitext-103

The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles.
- Dataset
- JSON
Synthetic Dataset

The dataset used in this work is a custom synthetic dataset generated using the liquid-dsp library, containing 600000 examples of each of 13.8 million examples, with SNRs...
- Dataset
- JSON
SeqDiffuSeq

The dataset used in the SeqDiffuSeq paper for sequence-to-sequence text generation.
- Dataset
- JSON

43 datasets found