- LLaVA 158k
  The LLaVA 158k dataset is a large-scale multimodal learning dataset used for training and testing multimodal large language models.
- Multimodal Robustness Benchmark
  The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and robustness against misleading questions, ensuring models truly leverage multimodal inputs...
- Twitter15 and Twitter17
  Twitter15 and Twitter17 are two English datasets for Target-oriented Multimodal Sentiment Classification (TMSC). The datasets contain text and image data, where the text data is...
- Degree Datasets
  Degree datasets are constructed by gradually adjusting the degree of alignment between image and text.
- Caption MNIST
  Caption MNIST is a synthetic image-text pair dataset built by filling in the missing colors, digits, and positions in the MNIST dataset.
- XD-Violence
  The XD-Violence dataset is a large-scale multimodal video dataset for violence detection. It consists of 4,754 untrimmed videos with a total duration of 217 hours, covering six...
- TCGA-OMICS
  A comprehensive dataset of genomic, transcriptomic, and proteomic data from The Cancer Genome Atlas Program.
- MUGEN-GAME
  A large-scale multimodal dataset for video-audio-text understanding and generation.
- InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and ...
  InternVid: A large-scale video-text dataset for multimodal understanding and generation.
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese...
  WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models.
- Crisscrossed Captions
  The Crisscrossed Captions (CxC) dataset is a multimodal learning dataset used for training and evaluation of the MURAL model.
- Wikipedia Image Text
  The Wikipedia Image Text (WIT) dataset is a large-scale multimodal learning dataset used for training and evaluation of the MURAL model.
- EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge
  EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge
- DeepSense 6G: Large-Scale Real-World Multimodal Sensing and Communication Dat...
  Development dataset for the multimodal beam prediction challenge.
- Multimodal Transformers for Wireless Communications: A Case Study in Beam Pre...
  A multimodal transformer deep learning framework for sensing-assisted beam prediction in wireless communications.
- Youtube2Text-QA
  A dataset for the video question answering task, which requires machines to answer questions about videos in natural language.