Confidence Calibration in Large Language Models
The dataset is used in this study to analyze the self-assessment behavior of large language models.
Chatbot Arena
A large-scale dataset for evaluating LLMs, collected through the Chatbot Arena platform, where human users compare responses from pairs of anonymous models.
Arena-Hard
A benchmark of challenging user queries curated from Chatbot Arena data, used for automatic evaluation of LLMs.
LMSYS ChatBot Arena
A large-scale dataset of real-world conversations between users and LLMs, collected on the LMSYS Chatbot Arena platform; a loading sketch follows below.
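As a concrete illustration of what these conversation records look like, here is a minimal sketch that streams a few examples from the public LMSYS-Chat-1M release using the Hugging Face `datasets` library. The hub identifier `lmsys/lmsys-chat-1m` and the field names (`model`, `language`, `conversation`) follow that release and should be treated as assumptions; access to the dataset is gated, so the code assumes you have accepted its terms and authenticated.

```python
# Minimal sketch: peek at LMSYS-Chat-1M conversations (assumes gated access
# to lmsys/lmsys-chat-1m has been granted and you are logged in to the Hub).
from itertools import islice
from datasets import load_dataset

# Streaming avoids downloading the full ~1M-conversation corpus up front.
ds = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)

for example in islice(ds, 3):
    print(example["model"], example["language"])
    # "conversation" is a list of {"role", "content"} turns in this release.
    for turn in example["conversation"]:
        print(f'  {turn["role"]}: {turn["content"][:80]}')
```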
WizardArena
A large-scale conversational dataset used to train and evaluate the WizardLM-β model.
OpenAssistant Conversations – Democratizing Large Language Model Alignment
A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.
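Because OASST1 is distributed as flat message rows linked by parent IDs rather than as whole conversations, a short sketch may help; it rebuilds one conversation tree from the public `OpenAssistant/oasst1` release. The field names (`message_id`, `parent_id`, `role`, `text`) follow that release and should be treated as assumptions if you work from a different export.

```python
# Sketch: reconstruct a conversation tree from flat OASST1 message rows.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")

children = defaultdict(list)
roots = []
for msg in ds.select(range(2000)):  # small slice, for illustration only
    if msg["parent_id"] is None:
        roots.append(msg)           # a root message starts a new tree
    else:
        children[msg["parent_id"]].append(msg)

def show(msg, depth=0):
    # Roles alternate between "prompter" and "assistant" in this release.
    print("  " * depth + f'{msg["role"]}: {msg["text"][:60]}')
    for child in children[msg["message_id"]]:
        show(child, depth + 1)

show(roots[0])
```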
Sparse Watermarking in LLMs with Enhanced Text Quality
The paper does not introduce a dataset of its own; the authors evaluate on the ELI5, FinanceQA, MultiNews, and QMSum datasets.
Inducing Anxiety in Large Language Models Increases Exploration and Bias
A collection of anxiety-inducing scenarios used to study how emotion-induction prompts increase exploration and bias in large language models.
Xiezhi Benchmark
Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, accompanied by the Xiezhi-Specialty and Xiezhi-Interdiscipline subsets.
Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation
New Natural Language Processing (NLP) benchmarks are urgently needed to keep pace with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge.
GraphEval2000
GraphEval2000 is a graph dataset designed to evaluate the graph reasoning abilities of large language models (LLMs) through coding challenges.