TruthfulQA
The TruthfulQA dataset contains 817 questions designed to measure whether language models mimic common human falsehoods.
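A minimal sketch of scoring the benchmark's multiple-choice (MC1) track with the Hugging Face `datasets` library; `score_choice` is a hypothetical hook standing in for the model's likelihood scorer:

```python
from datasets import load_dataset

# TruthfulQA's multiple-choice configuration: "mc1_targets" holds one
# correct answer plus distractors for each question.
ds = load_dataset("truthful_qa", "multiple_choice")["validation"]

def mc1_accuracy(dataset, score_choice):
    """MC1 accuracy: the model must rank the single correct choice highest.
    `score_choice(question, choice)` is a hypothetical hook returning the
    model's log-likelihood for answering `choice` to `question`."""
    correct = 0
    for ex in dataset:
        choices = ex["mc1_targets"]["choices"]
        labels = ex["mc1_targets"]["labels"]  # 1 = true answer, 0 = falsehood
        scores = [score_choice(ex["question"], c) for c in choices]
        if labels[scores.index(max(scores))] == 1:
            correct += 1
    return correct / len(dataset)
```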
Evaluating Large Language Models Trained on Code
The paper presents OpenAI Codex and evaluates its ability to generate functionally correct Python code from docstrings.
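The paper measures functional correctness with the unbiased pass@k estimator: generate n samples per problem, count the c that pass the unit tests, and estimate 1 - C(n-c, k)/C(n, k). A direct transcription of the numerically stable form given in the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    1 - C(n-c, k) / C(n, k), in numerically stable product form.
    n: total samples generated; c: samples passing the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 31 pass -> estimated pass@1
print(pass_at_k(200, 31, 1))  # ≈ 0.155
```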
Confidence Calibration in Large Language Models
The dataset used in this study supports analysis of the self-assessment behavior of large language models.
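The entry does not specify the metric, but a standard way to quantify how well a model's stated confidence matches its accuracy is expected calibration error (ECE). A minimal sketch; the equal-width binning scheme is an assumption, not necessarily the study's protocol:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by stated confidence and compare
    each bin's mean confidence against its empirical accuracy.
    `confidences` in [0, 1]; `correct` is a 0/1 array of the same length."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidences, edges[1:-1])  # bin index in [0, n_bins-1]
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6], [1, 1, 0]))
```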
Proof-Pile-2
The dataset used for continual pre-training of large language models on mathematical and scientific text, with a focus on balancing the data distribution and mitigating overfitting.
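A sketch of streaming a subset for pre-training data inspection; the hub ID `EleutherAI/proof-pile-2`, the `arxiv` subset name, and the `text` field are assumptions to verify against the dataset card:

```python
from itertools import islice
from datasets import load_dataset

# Hub ID, subset name, and field name are assumptions; check the card.
ds = load_dataset("EleutherAI/proof-pile-2", "arxiv",
                  split="train", streaming=True)

for example in islice(ds, 3):
    print(example["text"][:200])  # assumes a "text" field per record
```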
Multi-XScience
The dataset is a multi-document summarization collection built from scientific articles, accompanied by human evaluators' ratings of existing summaries.
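Alongside human ratings, generated summaries are commonly scored against references with ROUGE; a minimal sketch with the `rouge_score` package, not necessarily the paper's exact protocol:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
reference = "Prior work studies multi-document summarization of scientific articles."
candidate = "Earlier studies address summarizing multiple scientific papers."

# score(target, prediction) returns per-metric precision/recall/F1.
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```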
Multimodal Visual Patterns (MMVP) Benchmark
The Multimodal Visual Patterns (MMVP) benchmark evaluates the visual question answering capabilities of multimodal large language models (MLLMs), using question pairs built from CLIP-blind image pairs.
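MMVP is scored at the pair level: a model gets credit only when it answers both questions about a CLIP-blind image pair correctly. A minimal sketch of that aggregation; the `results` format is an assumption:

```python
from collections import defaultdict

def mmvp_pair_accuracy(results):
    """Pair-level scoring as described for MMVP: credit only when both
    questions in a pair are answered correctly.
    `results` is a list of (pair_id, is_correct) tuples."""
    pairs = defaultdict(list)
    for pair_id, ok in results:
        pairs[pair_id].append(ok)
    return sum(all(v) for v in pairs.values()) / len(pairs)

# Two pairs; only the first is fully correct.
print(mmvp_pair_accuracy([(0, True), (0, True), (1, True), (1, False)]))  # 0.5
```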
PANDA (Pedantic ANswer-correctness Determination and Adjudication)
Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current answer correctness...
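The usual baseline such work targets is string-based exact match; a minimal sketch of normalized EM, assuming the common SQuAD-style normalization rules rather than anything specific to PANDA:

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """1.0 if the normalized prediction matches any normalized reference."""
    return float(any(normalize(prediction) == normalize(r) for r in references))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # 1.0
```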
Slot-VLM: SlowFast Slots for Video-Language Modeling
Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the...
Chatbot Arena
Chatbot Arena is an open platform that evaluates LLMs through crowdsourced, pairwise human preference votes, aggregated into model rankings.
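The arena turns pairwise votes into rankings; the original report used online Elo updates over the battle log (later refined with Bradley-Terry fitting). A minimal sketch with the commonly used constants (K=4, initial rating 1000):

```python
from collections import defaultdict

def elo_ratings(battles, k=4, scale=400, base=10, init=1000):
    """Online Elo over pairwise battles. `battles` yields
    (model_a, model_b, winner), winner in {"model_a", "model_b", "tie"}."""
    ratings = defaultdict(lambda: float(init))
    for a, b, winner in battles:
        ea = 1 / (1 + base ** ((ratings[b] - ratings[a]) / scale))
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[a] += k * (sa - ea)
        ratings[b] += k * ((1 - sa) - (1 - ea))
    return dict(ratings)

print(elo_ratings([("gpt-4", "llama-2", "model_a"),
                   ("gpt-4", "claude", "tie")]))
```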
Arena-Hard
Arena-Hard is a benchmark of challenging user prompts curated from Chatbot Arena data, used to evaluate LLMs automatically with an LLM judge against a fixed baseline.
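Arena-Hard reports a model's judged win rate against the baseline; a minimal sketch of the aggregation, counting ties as half a win (the judging prompt and any finer-grained verdict weighting follow the benchmark's own tooling and are omitted here):

```python
def win_rate(verdicts):
    """Aggregate per-prompt judge verdicts for the candidate model
    vs. a fixed baseline into a single score; ties count as half."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[v] for v in verdicts) / len(verdicts)

print(win_rate(["win", "tie", "loss", "win"]))  # 0.625
```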
LMSYS ChatBot Arena
A large-scale dataset of real-world LLM conversations collected on the LMSYS Chatbot Arena platform, used to study and evaluate LLMs on in-the-wild usage.
WizardArena
A large-scale conversational dataset used to train and evaluate the WizardLM-β model.
PokeMQA: Programmable knowledge editing for Multi-hop Question Answering
Multi-hop question answering (MQA) is one of the challenging tasks to evaluate machine’s comprehension and reasoning abilities, where large language models (LLMs) have widely...