89 datasets found

Groups: Multimodal Learning · Formats: JSON

  • Conceptual 12M

    Conceptual 12M (CC12M) is a dataset of roughly 12 million image-text pairs for vision-and-language pre-training and automatic image captioning.
  • LLaVA-1.5

    The visual instruction-tuning data used to train LLaVA-1.5, a multimodal large language model built on LLaMA; the 7-billion-parameter figure refers to the model, not the dataset.
  • MSR-VTT

    MSR-VTT is a large-scale video description dataset for bridging video and language. It contains 10k web video clips, each paired with natural-language sentence descriptions.
  • VoxCeleb

    VoxCeleb is a large-scale audio-visual dataset of interview speech from celebrities, widely used for speaker identification and verification.
  • YouTube-8M

    YouTube-8M is a large-scale video classification benchmark of millions of YouTube videos annotated with machine-generated topic labels.
  • Video Captioning Dataset

    A video captioning dataset generated by pseudo-labeling videos with image captioning models.
  • MNIST-SVHN-Text dataset

    The MNIST-SVHN-Text dataset is a multi-modal dataset consisting of images, text, and labels.
  • COCO

    COCO (Common Objects in Context) is a large-scale dataset for object detection, segmentation, and image captioning.
  • MSCOCO

    MSCOCO is a large-scale dataset annotated with the positions of human body joints, supporting object detection, instance segmentation, keypoint-based human pose estimation, and image captioning.
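Since the results above are filtered to JSON-format datasets, a minimal sketch of filtering catalog entries by keyword, assuming hypothetical metadata records with `name` and `description` fields (these field names are illustrative, not a real API):

```python
import json

# Hypothetical JSON metadata mirroring a few entries from the listing above;
# the schema ("name", "description") is an assumption for illustration.
records_json = """[
  {"name": "MSR-VTT", "description": "large-scale video description dataset"},
  {"name": "YouTube-8M", "description": "large-scale video classification benchmark"},
  {"name": "COCO", "description": "object detection, segmentation, and captioning"}
]"""

records = json.loads(records_json)

def filter_by_keyword(records, keyword):
    """Return names of records whose description mentions the keyword (case-insensitive)."""
    return [r["name"] for r in records if keyword.lower() in r["description"].lower()]

print(filter_by_keyword(records, "video"))  # ['MSR-VTT', 'YouTube-8M']
```

A faceted search UI like the one above applies the same idea server-side, intersecting matches across facets such as group and format.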