Dataset - LDM

CamemBERT

A pretrained language model for French, trained on the OSCAR corpus.
- Dataset
- JSON
Phi-2: A Dataset for Language Model Evaluation

Phi-2 is a small language model from Microsoft Research (2.7 billion parameters); this entry covers its use in evaluating language models.
- Dataset
- JSON
MBPP: A Dataset for Language Model Evaluation

The MBPP (Mostly Basic Python Problems) dataset is a collection of basic Python programming problems used to evaluate the code-generation performance of language models.
- Dataset
- JSON
Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization

Dialogue summarization aims to generate a succinct summary while retaining essential information of the dialogue.
- Dataset
- JSON
LLaMA

The dataset used in the paper is LLaMA, a large language model.
- Dataset
- JSON
LLaMA-7B

A benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.
- Dataset
- JSON
A general language assistant as a laboratory for alignment

A general language assistant for aligning language models with human users
- Dataset
- JSON
RealToxicityPrompts: Evaluating neural toxic degeneration in language models

A dataset of prompts for evaluating neural toxic degeneration in language models.
- Dataset
- JSON
BERT

The dataset used in this paper is a pre-trained BERT model trained on English Wikipedia and the BooksCorpus dataset.
- Dataset
- JSON
Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
GPT-2 small

The dataset used in this paper is a large language model, GPT-2 small, and its residual stream activations.
- Dataset
- JSON
GPT-4

The dataset used in this paper is a large language model, GPT-4, and its residual stream activations.
- Dataset
- JSON
Direct preference optimization: Your language model is secretly a reward model

The dataset used in the paper is not explicitly described. However, it is mentioned that the authors used a language model to optimize the performance of a reinforcement...
- Dataset
- JSON
Falcon 7B

This dataset has no description
- Dataset
- JSON
RedPajama 3B

This dataset has no description
- Dataset
- JSON
RedPajama

The RedPajama dataset is an open-source recipe to reproduce the LLaMA training dataset.
- Dataset
- JSON
GPT-4 Dataset

The GPT-4 dataset used for fine-tuning the Qwen model.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

17 datasets found