SMART-101 dataset
The dataset for the SMART-101 challenge consists of 101 unique puzzles, each requiring a mix of elementary skills such as arithmetic, algebra, and spatial reasoning.
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
Mutan: Multimodal Tucker Fusion for Visual Question Answering
The dataset used in the paper is the VQA dataset, which pairs natural images with open-ended questions and human-annotated answers; the paper's contribution is MUTAN, a Tucker-decomposition-based fusion of the visual and textual representations.
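To make the fusion scheme named in the title concrete, here is a minimal PyTorch sketch of rank-constrained bilinear (Tucker-style) fusion; the dimensions, the tanh nonlinearity, and the classifier head are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    """Rank-constrained bilinear fusion in the spirit of MUTAN;
    all dimensions here are illustrative, not the paper's settings."""

    def __init__(self, q_dim=2400, v_dim=2048, h_dim=510, n_answers=2000, rank=10):
        super().__init__()
        self.rank, self.h_dim = rank, h_dim
        self.q_proj = nn.Linear(q_dim, rank * h_dim)  # question-side factor
        self.v_proj = nn.Linear(v_dim, rank * h_dim)  # image-side factor
        self.classifier = nn.Linear(h_dim, n_answers)

    def forward(self, q, v):
        zq = self.q_proj(q).view(-1, self.rank, self.h_dim)
        zv = self.v_proj(v).view(-1, self.rank, self.h_dim)
        z = (zq * zv).sum(dim=1)              # sum of rank-1 bilinear interactions
        return self.classifier(torch.tanh(z))  # answer logits
```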
VQA-CPv1 and VQA-CPv2
VQA-CPv1 and VQA-CPv2 are re-organized splits of the VQA v1 and v2 datasets in which the distribution of answers for each question type differs between the train and test splits, so that models relying on language priors rather than the image are penalized.
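To see what "changing priors" means in practice, the small sketch below computes per-question-type answer distributions; comparing its output on the train and test annotations exposes the deliberate shift. Field names follow VQA-style annotation JSON and should be treated as assumptions.

```python
from collections import Counter, defaultdict

def answer_priors(annotations):
    """Per-question-type answer distributions from VQA-style annotations."""
    counts = defaultdict(Counter)
    for ann in annotations:
        counts[ann["question_type"]][ann["multiple_choice_answer"]] += 1
    priors = {}
    for qtype, c in counts.items():
        total = sum(c.values())
        priors[qtype] = {answer: n / total for answer, n in c.items()}
    return priors
```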
Object Attribute Matters in Visual Question Answering
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The approach proposed in this paper leverages object attributes to improve that joint comprehension.
SpatialSense
A dataset for visual spatial relationship classification (VSRC) covering nine well-defined spatial relations; the task is to decide whether a stated relation between two objects actually holds in the image.
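A sketch of what one VSRC example carries, assuming a SpatialSense-style record; the field names are illustrative, not the dataset's exact schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class VSRCExample:
    """Illustrative layout of a SpatialSense-style VSRC example."""
    image_path: str
    subject: str                              # e.g. "cat"
    predicate: str                            # one of the nine relations, e.g. "on"
    obj: str                                  # e.g. "table"
    subject_bbox: Tuple[int, int, int, int]   # (x, y, w, h)
    object_bbox: Tuple[int, int, int, int]
    label: bool                               # does the stated relation hold?
```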
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0) and (I1, C1); the two captions contain exactly the same words in a different order, and a model must match each caption to its correct image.
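The benchmark reports three scores per item, reproduced in the short sketch below; `sim` stands in for whatever caption-image similarity function the model under test provides.

```python
def winoground_item_scores(sim, i0, c0, i1, c1):
    """Winoground's text, image, and group scores for one item.
    `sim(caption, image)` is an assumed model-specific similarity."""
    # Text score: each image prefers its own caption.
    text_ok = sim(c0, i0) > sim(c1, i0) and sim(c1, i1) > sim(c0, i1)
    # Image score: each caption prefers its own image.
    image_ok = sim(c0, i0) > sim(c0, i1) and sim(c1, i1) > sim(c1, i0)
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}
```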
MovieQA, TVQA, AVSD, EQA (Embodied QA)
A collection of datasets for visual question answering, including MovieQA, TVQA, AVSD, and EQA (Embodied Question Answering).
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of 12 million image-text pairs collected from the web by relaxing the filtering pipeline of the original Conceptual Captions dataset, and is commonly used for vision-language pre-training.
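The pairs are typically distributed as a tab-separated file of image URLs and captions; the sketch below streams them, but the exact column layout is an assumption, so check the README of the copy you download.

```python
import csv

def iter_cc12m(tsv_path):
    """Stream (image_url, caption) pairs from a CC-12M-style TSV.
    Column order is an assumption; verify against the release notes."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            url, caption = row[0], row[1]
            yield url, caption
```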
Sort-of-CLEVR
Sort-of-CLEVR is a simplified, CLEVR-like visual question answering dataset: each 2D scene contains six objects (a randomly chosen square or circle, one per color), and the questions are split into relational and non-relational categories.
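A minimal generator for such a scene, assuming the six-color, two-shape layout described above; the canvas size and object radius are illustrative values.

```python
import random

COLORS = ["red", "green", "blue", "orange", "gray", "yellow"]

def make_scene(canvas=75, radius=5):
    """One Sort-of-CLEVR-style 2D scene: six objects, one per color,
    each a random square or circle at a random position."""
    return [
        {
            "color": color,
            "shape": random.choice(["square", "circle"]),
            "center": (
                random.randint(radius, canvas - radius),
                random.randint(radius, canvas - radius),
            ),
        }
        for color in COLORS
    ]
```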
CLEVR dataset
The CLEVR dataset is a diagnostic visual question answering dataset of rendered 3D scenes of simple geometric objects; each image is paired with compositional questions (and the functional programs that generate them) testing counting, attribute comparison, and spatial reasoning.
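The questions ship as one JSON file per split; a minimal loader follows, with the usual field names shown in the usage comment (the path assumes the standard CLEVR_v1.0 release layout).

```python
import json

def load_clevr_questions(path):
    """Load a CLEVR questions file. Each record carries the image filename,
    the question text, its answer, and (on train/val) the functional
    program that generated the question."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["questions"]

# q = load_clevr_questions("CLEVR_v1.0/questions/CLEVR_val_questions.json")[0]
# q["image_filename"], q["question"], q["answer"]
```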
Visual7W dataset
The Visual7W dataset is a visual question answering dataset built around seven W's (what, where, when, who, why, how, and which); each question comes with four multiple-choice answers, and the "which" questions are grounded to object-level regions in the image.
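Evaluation is standard multiple-choice accuracy; a sketch, assuming the released JSON's `answer` and `multiple_choices` fields and a model callback `choose` (both names should be treated as assumptions).

```python
def visual7w_accuracy(choose, qa_items):
    """Multiple-choice accuracy for Visual7W-style items: one correct
    answer plus three distractors per question. `choose(question,
    candidates)` is an assumed model callback returning its pick."""
    hits = 0
    for qa in qa_items:
        candidates = [qa["answer"]] + list(qa["multiple_choices"])
        hits += choose(qa["question"], candidates) == qa["answer"]
    return hits / len(qa_items)
```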
Conceptual Captions
Conceptual Captions is the dataset used in the paper "Scaling Laws of Synthetic Images for Model Training", where it supports both supervised image classification and zero-shot classification experiments.
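In the zero-shot setting, classification reduces to nearest text-prompt retrieval in a shared embedding space; the sketch below shows that generic recipe, not the paper's exact evaluation protocol.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Generic zero-shot classification: assign the class whose text-prompt
    embedding has the highest cosine similarity to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # index of the predicted class
```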