Training CLIP models on Data from Scientific Papers
Contrastive Language-Image Pretraining (CLIP) models are trained on datasets extracted from web crawls, which are large in quantity but limited in quality. This paper explores...
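As background for the pretraining objective these datasets feed, here is a minimal sketch of CLIP's symmetric contrastive (InfoNCE) loss in PyTorch; the encoders producing the embeddings and the temperature value are placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) tensors from an image and a text
    encoder (placeholder inputs; any CLIP-style encoders would do).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```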
Music-AVQA
The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs, i.e., multiple questions per video.
Audio-Visual Question Answering
Audio-visual question answering (AVQA) requires a model to draw on both the visual content of a video and its accompanying audio, correlating the two with the question to predict the most precise answer.
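To make the task concrete, here is a deliberately minimal late-fusion baseline in PyTorch: one projection per modality, then a classifier over a fixed answer vocabulary. All dimensions, the pre-extracted features, and the vocabulary size are illustrative assumptions, not the setup of any particular AVQA paper.

```python
import torch
import torch.nn as nn

class LateFusionAVQA(nn.Module):
    """Toy late-fusion AVQA baseline over pre-extracted features."""

    def __init__(self, audio_dim=128, video_dim=512, question_dim=300,
                 hidden=256, num_answers=42):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.question_proj = nn.Linear(question_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(3 * hidden, num_answers))

    def forward(self, audio_feat, video_feat, question_feat):
        # Each input is a (batch, dim) feature vector from some
        # upstream extractor (assumed, not specified here).
        fused = torch.cat([
            self.audio_proj(audio_feat),
            self.video_proj(video_feat),
            self.question_proj(question_feat),
        ], dim=-1)
        return self.classifier(fused)  # logits over the answer vocabulary
```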
Conceptual Captions
An image-captioning dataset used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.
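Zero-shot classification with a caption-trained model typically means scoring an image against a text prompt per class name. A minimal sketch, assuming CLIP-style image and text embeddings are already computed (the prompt wording is an assumption):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding best matches the image.

    image_emb: (dim,) embedding of one image.
    class_text_embs: (num_classes, dim) embeddings of prompts such as
    "a photo of a {class name}" (assumed prompt template).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ image_emb   # cosine similarity per class
    return int(torch.argmax(sims))       # predicted class index
```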
Conceptual 12M
The Conceptual 12M dataset contains roughly 12 million image-text pairs for automatic image captioning.
YouTube-8M
YouTube-8M is a large-scale video classification benchmark of about 8 million YouTube videos annotated with machine-generated topic labels.
Video Captioning Dataset
A video captioning dataset generated by pseudo-labeling videos with image captioning models.
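One plausible way to implement this pseudo-labeling, sketched with an off-the-shelf image-captioning pipeline from Hugging Face transformers. The model checkpoint, frame-sampling rate, and majority-vote selection rule are all assumptions, not the authors' recipe.

```python
import cv2                      # pip install opencv-python
from collections import Counter
from PIL import Image
from transformers import pipeline  # pip install transformers

# Any image-captioning checkpoint works; this one is just an example.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

def pseudo_caption(video_path, num_frames=8):
    """Caption evenly spaced frames, then keep the most frequent caption."""
    video = cv2.VideoCapture(video_path)
    total = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    captions = []
    for i in range(num_frames):
        video.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = video.read()
        if not ok:
            continue
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        captions.append(captioner(image)[0]["generated_text"])
    video.release()
    # Majority vote is one simple selection rule; keeping all frame
    # captions or averaging embeddings are equally plausible choices.
    return Counter(captions).most_common(1)[0][0] if captions else None
```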
MNIST-SVHN-Text dataset
The MNIST-SVHN-Text dataset is a multi-modal dataset that pairs MNIST digit images, SVHN digit images, and text labels belonging to the same digit class.
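A common way to build such a dataset is to match samples across modalities by shared digit label; the sketch below does this with torchvision, using the spelled-out digit as the text modality. The pairing rule is an assumption about the construction, not a verified recipe.

```python
import random
from collections import defaultdict
from torchvision import datasets, transforms

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
svhn = datasets.SVHN("data", split="train", download=True,
                     transform=transforms.ToTensor())

# Index SVHN samples by digit label so MNIST digits can be matched to them.
svhn_by_label = defaultdict(list)
for idx in range(len(svhn)):
    svhn_by_label[int(svhn.labels[idx])].append(idx)

def make_triplet(i):
    """Return (mnist_image, svhn_image, text) sharing one digit class."""
    mnist_img, label = mnist[i]
    svhn_img, _ = svhn[random.choice(svhn_by_label[label])]
    return mnist_img, svhn_img, WORDS[label]
```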