95 datasets found

  • Training CLIP models on Data from Scientific Papers

    Contrastive Language-Image Pretraining (CLIP) models are typically trained on datasets extracted from web crawls, which are large in quantity but limited in quality. This paper explores...
  • AVQA

    The AVQA dataset contains 57,015 videos and 57,335 question-and-answer pairs.
  • Music-AVQA

    The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs.
  • Audio-Visual Question Answering

    Audio-visual question answering (AVQA) requires reasoning over both video content and auditory information, correlating them with the question to predict the most precise answer.
  • Conceptual Captions

    An image-captioning dataset of web images paired with alt-text captions, used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.
  • EgoSchema

    EgoSchema is a diagnostic benchmark for assessing very long-form video-language understanding capabilities of modern multimodal systems.
  • Conceptual 12M

    Conceptual 12M is a dataset of roughly 12 million image-text pairs for automatic image captioning.
  • LLaVA-1.5

    LLaVA-1.5 is a multimodal large language model built on LLaMA (available in 7-billion- and 13-billion-parameter variants); its visual instruction-tuning data is used for multimodal...
  • MSR-VTT

    MSR-VTT is a large video description dataset for bridging video and language. It contains 10k video clips with length varying from 10 to...
  • VoxCeleb

    VoxCeleb is a large-scale speaker recognition dataset. The associated paper notes that speaker verification systems experience significant performance degradation on short-duration trial recordings; to address this challenge, a multi-scale feature...
  • YouTube-8M

    YouTube-8M is a large-scale video classification benchmark.
  • Video Captioning Dataset

    A video captioning dataset generated by pseudolabeling videos with image captioning models.
  • MNIST-SVHN-Text dataset

    The MNIST-SVHN-Text dataset is a multi-modal dataset consisting of images, text, and labels.
  • COCO

    COCO is a large-scale object detection, segmentation, and captioning dataset. Large-scale datasets [18, 17, 27, 6] have boosted text-conditional image generation quality; however, in some domains it can be difficult to build such datasets, and usually it could...
  • MSCOCO

    Human Pose Estimation (HPE) aims to estimate the position of each body joint of a human in a given image. HPE supports a wide range of downstream tasks such as...