Dataset - LDM

ChatGPT Dataset

The dataset used in this study is based on ChatGPT, a platform powered by a large language model (LLM).
- Dataset
- JSON
DailyDialog

The DailyDialog dataset is a large-scale multi-turn dialogue dataset, consisting of 10,000 conversations with 5 turns each.
- Dataset
- JSON
Krapivin

A dataset used in the paper on keyphrase generation with correlation constraints.
- Dataset
- JSON
NUS

A dataset used in the paper on keyphrase generation with correlation constraints.
- Dataset
- JSON
Inspec

A keyphrase generation dataset for scientific articles.
- Dataset
- JSON
KP20k

A dataset used in the paper on keyphrase generation with correlation constraints.
- Dataset
- JSON
SHP and HH

The datasets used in the paper are SHP and HH.
- Dataset
- JSON
Demonstration ITerated Task Optimization (DITTO)

The dataset used in the paper is a collection of emails and blog posts from 20 distinct authors, with a focus on few-shot alignment of large language models.
- Dataset
- JSON
DEMYSTIFYING CLIP DATA

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative...
- Dataset
- JSON
GLUCOSE

GLUCOSE is a large-scale dataset of implicit commonsense knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context.
- Dataset
- JSON
CodeSearchNet

The dataset used in the paper is CodeSearchNet, a natural language code search benchmark for six programming languages (Python, Java, JavaScript, Ruby, PHP, and Go).
- Dataset
- JSON
EmpatheticDialogues

The EmpatheticDialogues dataset is a text dataset for training empathetic AI chatbots, consisting of 25k conversations grounded in emotional situations with emotion labels.
- Dataset
- JSON
MR

A dataset used for sentiment analysis, question type classification, and subjectivity classification tasks.
- Dataset
- JSON
BookCorpus

A dataset used in this paper for unsupervised sentence representation learning, consisting of paragraphs of unlabeled text.
- Dataset
- JSON
PatentEval Dataset

The PatentEval dataset is a comprehensive dataset for evaluating patent text generation.
- Dataset
- JSON
Big Patent Dataset

The Big Patent dataset is a large-scale dataset for abstractive and coherent summarization.
- Dataset
- JSON
Harvard USPTO Patent Dataset

The Harvard USPTO Dataset is a large-scale, well-structured, and multi-purpose corpus of patent applications.
- Dataset
- JSON
Training Dataset

The training dataset is a collection of the publicly available Arabic corpora listed below: The unshuffled OSCAR corpus (Ortiz Suárez et al., 2020). The Arabic Wikipedia dump...
- Dataset
- JSON
RPC-Lex: A dictionary to measure German right-wing populist conspiracy discou...

A dictionary to measure German right-wing populist conspiracy discourse online.
- Dataset
- JSON
A Benchmark Dataset for Learning to Intervene in Online Hate Speech

A benchmark dataset for learning to intervene in online hate speech.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

530 datasets found