95 datasets found

  • Training CLIP models on Data from Scientific Papers

    Contrastive Language-Image Pretraining (CLIP) models are typically trained on datasets extracted from web crawls, which are large in quantity but limited in quality. This paper explores...
  • AVQA

    The AVQA dataset contains 57,015 videos and 57,335 question-and-answer pairs.
  • Music-AVQA

    The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs.
  • Audio-Visual Question Answering

    Audio-visual question answering (AVQA) requires reasoning over both video content and auditory information, correlating them with the question to predict the most precise answer.
  • Conceptual Captions

    An image-captioning dataset of web images paired with alt-text captions, used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.
  • EgoSchema

    EgoSchema is a diagnostic benchmark for assessing very long-form video-language understanding capabilities of modern multimodal systems.
  • Conceptual 12M

    Conceptual 12M is a dataset of roughly 12 million image-text pairs for automatic image captioning.
  • LLaVA-1.5

    LLaVA-1.5 is a multimodal large language model built on LLaMA (available in 7-billion- and 13-billion-parameter variants); its visual instruction-tuning data is used for multimodal...
  • MSR-VTT

    MSR-VTT is a large video description dataset for bridging video and language. It contains 10k video clips with length varying from 10 to...
  • VoxCeleb

    VoxCeleb is a large-scale speaker recognition dataset. The associated paper notes that speaker verification systems experience significant performance degradation on short-duration trial recordings; to address this challenge, a multi-scale feature...
  • YouTube-8M

    YouTube-8M is a large-scale video classification benchmark.
  • Video Captioning Dataset

    A video captioning dataset generated by pseudolabeling videos with image captioning models.
  • MNIST-SVHN-Text dataset

    The MNIST-SVHN-Text dataset is a multi-modal dataset consisting of images, text, and labels.
  • COCO

    COCO is a large-scale object detection, segmentation, and captioning dataset. Large-scale datasets [18, 17, 27, 6] have boosted text-conditional image generation quality; however, in some domains it can be difficult to build such datasets, and usually it could...
  • MSCOCO

    Human Pose Estimation (HPE) aims to estimate the position of each body joint of a human in a given image. HPE supports a wide range of downstream tasks such as...