- Multimodal Visual Patterns (MMVP) Benchmark
  The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
- Degree Datasets
  Degree datasets are constructed by gradually adjusting the degree of alignment between image and text.
- Multimodal Learning Task
  The dataset used in the paper is a multimodal learning task for robots.
- Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of ...
  Dysca is a dynamic and scalable benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.
- Multimodal C4 (mmc4)
  Multimodal C4 (mmc4) is a public, billion-scale corpus of images and text, constructed from public webpages contained in the cleaned English c4 corpus.
- TCGA-OMICS
  TCGA-OMICS: A comprehensive dataset of genomic, transcriptomic, and proteomic data from The Cancer Genome Atlas Program.
- MUGEN-GAME
  MUGEN-GAME: A large-scale and multimodal dataset for video-audio-text multimodal understanding and generation.
- Training transitive and commutative multimodal transformers with LoReTTa
  Training transitive and commutative multimodal transformers with LoReTTa.
- Towards Empathetic Open-Domain Conversation Models: A New Benchmark and Dataset
  A dialogue dataset for open-domain conversation models.
- Personalizing Dialogue Agents: I Have a Dog, Do You Have Pets Too?
  A dialogue dataset for personalizing dialogue agents.
- PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior
  A dialogue dataset with photo sharing behavior for joint image-text modeling.
- Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically...
  A multi-modal dialogue dataset created by replacing text with semantically relevant images.
- DialogCC: Large-Scale Multi-Modal Dialogue Dataset
  A large-scale multi-modal dialogue dataset created by leveraging the automatic pipeline with filtering using CLIP similarity.
- Sentiment-oriented Transformer-based Variational Autoencoder Network for Live...
  Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) for Live Video Commenting.
- InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and ...
  InternVid: A large-scale video-text dataset for multimodal understanding and generation.