Dataset for Hashtag Recommendation Evaluation
Dataset for evaluating hashtag recommendation methods -
nvBench-Rob(nlq,schema)
The nvBench-Rob(nlq,schema) dataset is a testing set from nvBench-Rob, containing both NLQ variants and data schema variants, specifically designed to test the robustness of... -
nvBench-Robschema
The nvBench-Robschema dataset is a testing set from nvBench-Rob, containing only data schema variants, specifically designed to test the robustness of models against data schema... -
nvBench-Robnlq
The nvBench-Robnlq dataset is a testing set from nvBench-Rob, containing only NLQ variants, specifically designed to test the robustness of models against NLQ variants. -
nvBench-Rob
The nvBench-Rob dataset is a comprehensive robustness evaluation dataset for text-to-vis models, containing diverse lexical and phrasal variations based on the original... -
Human-Centered IML Systems
A dataset for designing and evaluating human-centered IML systems -
GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks
Automatically evaluating vision-language tasks is challenging, particularly when the evaluation must reflect human judgments, because of limitations in accounting for fine-grained details. -
Expert Demonstrations
The expert demonstrations are generated according to the given optimal policy for the recovery. The length of each expert demonstration is 5-grid size trajectory length. Four... -
AI Quadrant Dataset
A dataset for evaluating the performance of AI software with different levels of smartness and automation. -
CYBERSECEVAL 2
A wide-ranging cybersecurity evaluation suite for large language models. -
GridWorld and BlockDude Domains
The GridWorld and BlockDude domains were used to evaluate the proposed task sequencing framework. -
Quantile Off-Policy Evaluation via Deep Conditional Generative Learning
The dataset used in this paper for quantile off-policy evaluation via deep conditional generative learning. -
TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task
TACRED revisited: A thorough evaluation of the TACRED relation extraction task. -
NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evalu...
The NATURE dataset is a set of simple spoken-language oriented transformations, applied to the evaluation set of datasets, to introduce human spoken language variations while... -
Pedestrian Detection: An Evaluation of the State of the Art
Pedestrian detection: An evaluation of the state of the art.