Improving Generalization in Language Model-Based Text-to-SQL
Two simple semantic boundary-based techniques to improve the generalization of language model-based text-to-SQL.
The Pile dataset
The Pile is a large-scale, 800GB dataset of diverse English text, widely used for language model pre-training.
LM-Extraction benchmark
The LM-Extraction benchmark is derived from The Pile (Gao et al., 2020) and contains 15,000 prefix-suffix pairs.
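A minimal sketch of how such prefix-suffix pairs are typically used in extraction evaluation: a model is prompted with the prefix, and an example counts as extracted if generation reproduces the ground-truth suffix. The field names and the exact-match criterion here are illustrative assumptions, not the benchmark's official schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExtractionExample:
    prefix: str   # context fed to the model
    suffix: str   # ground-truth continuation taken from The Pile

def is_extracted(example: ExtractionExample,
                 generate: Callable[[str], str]) -> bool:
    # Counts as extracted when generation from the prefix
    # reproduces the ground-truth suffix verbatim.
    return generate(example.prefix) == example.suffix

# Toy usage with a stand-in "model" that memorized its training text.
memorized = {"Call me Ishmael. Some years ago": " - never mind how long"}
ex = ExtractionExample(prefix="Call me Ishmael. Some years ago",
                       suffix=" - never mind how long")
print(is_extracted(ex, lambda p: memorized.get(p, "")))  # True
```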
Collective Constitutional AI
A platform for aligning a language model with public input.
Ultrafeedback
Ultrafeedback is a preference dataset of 63k preference pairs sampled from models other than the SFT model.
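A hedged sketch of the shape of one preference pair, of the kind used for preference optimization: a prompt with a preferred and a dispreferred response. The field names are illustrative assumptions, not the dataset's exact schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by the annotator
    rejected: str  # response judged worse

# Toy example; the actual pairs are sampled from multiple models.
pair = PreferencePair(
    prompt="Explain photosynthesis in one sentence.",
    chosen="Plants use light, water, and CO2 to make sugars and oxygen.",
    rejected="Photosynthesis is when plants eat sunlight.",
)
```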
Wikipedia Corpus
A subset of the Wikipedia corpus consisting of 7,500 English Wikipedia articles, each belonging to one of the following categories: People, Cities,...
Gutenberg Corpus
A dataset of 2,857 books written by 141 authors, used for pre-training and fine-tuning a language model for author-stylized text generation.
A general language assistant as a laboratory for alignment
A study that uses a general-purpose language assistant as a testbed for research on aligning language models with human users.
ZJUKLAB at SemEval-2021 task 4
The dataset used in the SemEval-2021 Task 4 paper on negative augmentation with a language model for reading comprehension of abstract meaning.
Language models are few-shot learners
The GPT-3 paper, demonstrating that large language models can perform new tasks from only a few in-context examples, without task-specific fine-tuning.
Self-Supervised Alignment with Mutual Information
A method for training a language model to follow behavioral principles without the use of preference labels, demonstrations, or human oversight.
AlpacaFarm
The AlpacaFarm dataset is a large-scale dataset for preference optimization, consisting of instructions paired with their corresponding responses.
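A minimal sketch of an instruction-response record in the common Alpaca-style format. The instruction/input/output field names are an assumption for illustration, not AlpacaFarm's documented schema.

```python
from typing import TypedDict

class InstructionRecord(TypedDict):
    instruction: str  # the task the model is asked to perform
    input: str        # optional extra context; empty string if unused
    output: str       # the response paired with the instruction

record: InstructionRecord = {
    "instruction": "Summarize the text below in one sentence.",
    "input": "AlpacaFarm pairs instructions with model responses...",
    "output": "AlpacaFarm provides instruction-response pairs for training.",
}
```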