Improving Generalization in Language Model-Based Text-to-SQL
Two simple semantic boundary-based techniques to improve the generalization of language model-based text-to-SQL.
The Pile dataset
The Pile is a large-scale, 800GB dataset of diverse English text, widely used for language model pre-training.
LM-Extraction benchmark
The LM-Extraction benchmark is derived from The Pile (Gao et al., 2020) and contains 15,000 prefix-suffix pairs.
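A minimal sketch of how such prefix-suffix pairs are typically used in extraction evaluation: a model is prompted with the prefix, and an example counts as extracted if generation reproduces the ground-truth suffix. The field names and the exact-match criterion here are illustrative assumptions, not the benchmark's official schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExtractionExample:
    prefix: str   # context fed to the model
    suffix: str   # ground-truth continuation taken from The Pile

def is_extracted(example: ExtractionExample,
                 generate: Callable[[str], str]) -> bool:
    # Counts as extracted when generation from the prefix
    # reproduces the ground-truth suffix verbatim.
    return generate(example.prefix) == example.suffix

# Toy usage with a stand-in "model" that memorized its training text.
memorized = {"Call me Ishmael. Some years ago": " - never mind how long"}
ex = ExtractionExample(prefix="Call me Ishmael. Some years ago",
                       suffix=" - never mind how long")
print(is_extracted(ex, lambda p: memorized.get(p, "")))  # True
```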
Collective Constitutional AI
A platform for aligning a language model with public input.
Ultrafeedback
Ultrafeedback is a preference dataset of 63k preference pairs sampled from models other than the SFT model.
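A hedged sketch of the shape of one preference pair, of the kind used for preference optimization: a prompt with a preferred and a dispreferred response. The field names are illustrative assumptions, not the dataset's exact schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by the annotator
    rejected: str  # response judged worse

# Toy example; the actual pairs are sampled from multiple models.
pair = PreferencePair(
    prompt="Explain photosynthesis in one sentence.",
    chosen="Plants use light, water, and CO2 to make sugars and oxygen.",
    rejected="Photosynthesis is when plants eat sunlight.",
)
```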
Wikipedia Corpus
A subset of the Wikipedia corpus consisting of 7,500 English Wikipedia articles, each belonging to one of the following categories: People, Cities,...
Gutenberg Corpus
A dataset of 2,857 books written by 141 authors, used for pre-training and fine-tuning a language model for author-stylized text generation.
A general language assistant as a laboratory for alignment
A study that uses a general-purpose language assistant as a testbed for research on aligning language models with human users.
ZJUKLAB at SemEval-2021 task 4
The dataset used in the SemEval-2021 Task 4 paper on negative augmentation with a language model for reading comprehension of abstract meaning.
Language models are few-shot learners
The GPT-3 paper, demonstrating that large language models can perform new tasks from only a few in-context examples, without task-specific fine-tuning.
Self-Supervised Alignment with Mutual Information
A method for training a language model to follow behavioral principles without the use of preference labels, demonstrations, or human oversight.
AlpacaFarm
The AlpacaFarm dataset is a large-scale dataset for preference optimization, consisting of instructions paired with their corresponding responses.
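A minimal sketch of an instruction-response record in the common Alpaca-style format. The instruction/input/output field names are an assumption for illustration, not AlpacaFarm's documented schema.

```python
from typing import TypedDict

class InstructionRecord(TypedDict):
    instruction: str  # the task the model is asked to perform
    input: str        # optional extra context; empty string if unused
    output: str       # the response paired with the instruction

record: InstructionRecord = {
    "instruction": "Summarize the text below in one sentence.",
    "input": "AlpacaFarm pairs instructions with model responses...",
    "output": "AlpacaFarm provides instruction-response pairs for training.",
}
```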