Buffer of Thoughts
Buffer of Thoughts is a novel, versatile thought-augmented reasoning approach for enhancing the accuracy, efficiency, and robustness of large language models (LLMs).
Towards Expert-Level Medical Question Answering with Large Language Models
A large-scale dataset for medical question answering with large language models, introduced in "Towards Expert-Level Medical Question Answering with Large Language Models".
ALCUNA: Large Language Models Meet New Knowledge
ALCUNA is a benchmark for evaluating the ability of large language models (LLMs) to handle new knowledge.
Mistral-7B-Instruct-v0.2
The dataset used in the paper is a benchmark-contamination detection dataset containing questions and answers drawn from various benchmarks.
Conceptual Inconsistencies in Large Language Models
The dataset consists of 119 clusters with a total of 584 questions, including 4 different linguistic forms per query, yielding approximately 146 semantically different...
AstroMLab 1: Who Wins Astronomy Jeopardy!?
A comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset.
TruthfulQA
TruthfulQA contains 817 questions designed to evaluate whether language models mimic common human falsehoods.
Multimodal Visual Patterns (MMVP) Benchmark
The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
PANDA (Pedantic ANswer-correctness Determination and Adjudication)
Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current answer-correctness...
PokeMQA: Programmable knowledge editing for Multi-hop Question Answering
Multi-hop question answering (MQA) is a challenging task for evaluating a machine's comprehension and reasoning abilities, where large language models (LLMs) have widely...
Universal and transferable adversarial attacks on aligned language models
AdvBench is a dataset for evaluating the safety of large language models.
TruthX: Alleviating Hallucinations by Editing Large Language Models
Conceptual Captions
The dataset used in the paper "Scaling Laws of Synthetic Images for Model Training", where it serves supervised image classification and zero-shot classification tasks.