Dataset - LDM

Latent Distance Guided Alignment Training for Large Language Models

Ensuring alignment with human preferences is a crucial characteristic of large language models (LLMs). Presently, the primary alignment methods, RLHF and DPO, require extensive...
- Dataset
- JSON
VisualBERT

VisualBERT is a pre-trained model for vision-and-language tasks, built on top of PyTorch.
- Dataset
- JSON
Task Driven Image Understanding Challenge (TDIUC)

The Task Driven Image Understanding Challenge (TDIUC) dataset is a large VQA dataset with 12 more fine-grained categories proposed to compensate for the bias in distribution of...
- Dataset
- JSON
ZESHEL dataset

The ZESHEL dataset was constructed by Logeswaran et al. (2019) from Wikia. The task of zero-shot entity linking involves linking entity mentions in text to an entity from a list...
- Dataset
- JSON
A general theoretical paradigm to understand learning from human preferences

The paper proposes a novel approach to aligning language models with human preferences, focusing on the use of preference optimization in reward-free RLHF.
- Dataset
- JSON
CLIMA-INS

CLIMA-INS is a dataset composed of semi-structured questionnaires from insurance companies. The dataset is used to train self-supervised models for climate question answering...
- Dataset
- JSON
CLIMA-CDP

CLIMA-CDP is a dataset composed of semi-structured questionnaires from corporations. The dataset is used to train self-supervised models for climate question answering tasks.
- Dataset
- JSON
Simplifying graph convolutional networks

Simplifying graph convolutional networks.
- Dataset
- JSON
StackLLaMA: An RL fine-tuned LLaMA model for Stack Exchange question and answ...

The dataset used in the paper is the StackExchange dataset.
- Dataset
- JSON
ACE 2005, WebNLG, CoNLL, NYT, and FB15k-237

The datasets used in the paper are ACE 2005, WebNLG, CoNLL, NYT, and FB15k-237. The ACE 2005 dataset is a collection of news articles, while WebNLG is a corpus used for natural...
- Dataset
- JSON
Open-Orca

A dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.
- Dataset
- JSON
Multimodal Visual Patterns (MMVP) Benchmark

The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
- Dataset
- JSON
VQA 1.0

The VQA 1.0 dataset is a large-scale dataset for visual question answering, containing 15,000 images with 50,000 questions.
- Dataset
- JSON
InstructBLIP

InstructBLIP is a vision-language model for comprehensive scene understanding and textual descriptions.
- Dataset
- JSON
LLaMA-7B

A benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.
- Dataset
- JSON
Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of ...

Dysca is a dynamic and scalable benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.
- Dataset
- JSON
Symbolic, Language Agnostic and Ontologically Grounded Large Language Models

The dataset is used in the paper to demonstrate the limitations of large language models (LLMs) in capturing inferential aspects of natural language.
- Dataset
- JSON
VQA

The VQA dataset is a large-scale visual question answering dataset that consists of pairs of images and questions that require natural language answers.
- Dataset
- JSON
A general language assistant as a laboratory for alignment

A general language assistant for aligning language models with human users.
- Dataset
- JSON
SimpleQuestion

The SimpleQuestion dataset is a question answering dataset consisting of 100,000 questions and 1,000,000 answers.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

416 datasets found