Dataset - LDM

Wikipedia Corpus

The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
- Dataset
- JSON
A general language assistant as a laboratory for alignment

A general language assistant for aligning language models with human users
- Dataset
- JSON
Realtoxicityprompts: Evaluating neural toxic degeneration in language models

A dataset for evaluating neural toxic degeneration in language models
- Dataset
- JSON
Alignment of language agents

A dataset for aligning language agents
- Dataset
- JSON
Text2Pos

Text2Pos for city-scale position localization based on textual descriptions. Given a point cloud that represents our surroundings and a query position description, Text2Pos...
- Dataset
- JSON
SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection

Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during...
- Dataset
- JSON
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese...

WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models.
- Dataset
- JSON
Wikipedia dataset

The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...
- Dataset
- JSON
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture

- Dataset
- JSON
Essay-BR

A large corpus of essays written by Brazilian high school students that were graded by experts following the evaluation criteria of the ENEM exam.
- Dataset
- JSON
ChatGPT model data

ChatGPT model data, used to generate text
- Dataset
- JSON
Adding A Filter Based on The Discriminator to Improve Unconditional Text Gene...

The dataset is used for unconditional text generation, and the authors propose a novel mechanism to improve the generator by adding a filter which has the same input as the...
- Dataset
- JSON
Textual Sports Commentary Dataset

The textual dataset is a collection of live sports commentaries scraped from various sources, including live score websites and YouTube.
- Dataset
- JSON
Sports Commentary Dataset

The dataset is a collection of live sports commentaries, including audio and textual data, used to train and evaluate machine learning models for event recognition and...
- Dataset
- JSON
CLEVR-Robot Environment

A benchmark for evaluating task compositionality and long-horizon tasks through object manipulation, with language serving as the mechanism for goal specification.
- Dataset
- JSON
Word2Vec

Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification
- Dataset
- JSON
PersonaChat dataset

The PersonaChat dataset is a large persona-conditioned chit-chat style dialogue dataset.
- Dataset
- JSON
TIMIT

The TIMIT corpus is a widely used benchmark for speech recognition tasks. It contains 3,696 training utterances from 462 speakers, excluding the SA sentences. The core test set...
- Dataset
- JSON
AdvBench

The dataset is used in the paper to test the Gradient Cuff method for detecting jailbreak attacks on large language models.
- Dataset
- JSON
OPT-66B and Llama2-70B

The paper uses two large language models, OPT-66B and Llama2-70B.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

420 datasets found