8 datasets found

Tags: image-text pairs

  • Visual Spatial Reasoning

    Visual Spatial Reasoning (VSR) is a controlled probing dataset for testing vision-language models' capabilities of recognizing and reasoning about spatial relations in natural images.
  • Visual7W dataset

    Visual7W is a visual question answering dataset of images paired with natural-language questions spanning seven categories (what, where, when, who, why, how, which), with answers grounded to object regions in the images.
  • Conceptual Captions

    Conceptual Captions is a large-scale image-captioning dataset of roughly 3.3 million image-caption pairs harvested from web alt-text, widely used for vision-language pre-training.
  • NLVR2

    NLVR2 (Natural Language for Visual Reasoning for Real) pairs each natural-language sentence with two photographs; the task is to decide whether the sentence is true of the image pair.
  • SBU Captions

    The SBU Captions dataset contains roughly one million Flickr images paired with user-written captions and is widely used for vision-language pre-training.
  • Visual Genome

    The Visual Genome dataset connects images with dense annotations: about 108,000 images labeled with objects, attributes, relationships, region descriptions, and question-answer pairs, supporting visual question answering and scene-graph tasks.
  • MS-COCO

    MS-COCO is a large-scale dataset for object detection, segmentation, and image captioning; each image is paired with five human-written captions, making it a standard benchmark for image-text tasks.
  • MSCOCO

    Human Pose Estimation (HPE) aims to estimate the position of each joint of the human body in a given image and underpins a wide range of downstream tasks; MSCOCO's keypoint annotations make it a standard HPE benchmark.
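
Most of these corpora can be loaded programmatically; below is a minimal sketch using the Hugging Face datasets library. The hub identifier "sbu_captions" and the "image_url"/"caption" field names are assumptions for illustration; check each dataset card for the exact identifiers and columns.

    # Minimal sketch: pulling image-text pairs with the Hugging Face datasets library.
    # The hub id "sbu_captions" and the "image_url"/"caption" field names are
    # assumptions; verify them against the dataset card before relying on them.
    from datasets import load_dataset

    ds = load_dataset("sbu_captions", split="train")
    for example in ds.select(range(3)):
        # SBU stores URL/caption pairs; other corpora may expose a decoded
        # "image" field instead of a URL.
        print(example["image_url"], "->", example["caption"])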