- Edit Distance Robust Watermarks for Language Models
  The dataset used in the paper is language model output: sequences of tokens generated by a language model.
- LLaMA: Open and efficient foundation language models
  LLaMA is a family of open, efficient foundation language models; the paper introduces the models themselves rather than a dataset.
- Fine-tuning Language Models with Advantage-Induced Policy Alignment
  The datasets used in the paper are the Anthropic Helpfulness and Harmlessness (HH) dataset and the StackExchange dataset.
- BERT: Pre-training of deep bidirectional transformers for language understanding
  This paper proposes BERT, a pre-trained deep bidirectional transformer for language understanding.
- SHP dataset
  The SHP (Stanford Human Preferences) dataset is used to evaluate the performance of the proposed Compositional Preference Models (CPMs).
- HH-RLHF dataset
  The HH-RLHF dataset is likewise used to evaluate the performance of the proposed Compositional Preference Models (CPMs).
- Training Language Models to Perform Tasks
  A dataset for training language models on tasks such as question answering and text classification.
- Interpreting Learned Feedback Patterns in Large Language Models
  The dataset used in the paper is not explicitly described; the authors mention using a condensed representation of LLM activations obtained from sparse...