- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese...
WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models.
- Crisscrossed Captions
The Crisscrossed Captions (CxC) dataset is a multimodal learning dataset used for training and evaluating the MURAL model.
- Wikipedia Image Text
The Wikipedia Image Text (WIT) dataset is a large-scale multimodal learning dataset used for training and evaluating the MURAL model.
- EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge
EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge.
- Multimodal Learning (MLM) dataset
The MLM dataset is a collection of images and captions that represent different cultures from around the world.
- Stanford Large Movie, Games and Datasets Archive (SMLMDA)
The Stanford Large Movie, Games and Datasets Archive (SMLMDA) dataset is used for training and evaluation.
- DeepSense 6G: Large-Scale Real-World Multimodal Sensing and Communication Dat...
Development dataset for the multimodal beam prediction challenge.
- Multimodal Transformers for Wireless Communications: A Case Study in Beam Pre...
A multimodal transformer deep learning framework for sensing-assisted beam prediction in wireless communications.
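As a rough illustration of such an architecture (not the paper's actual design), the PyTorch sketch below assumes one feature vector per sensing modality (e.g., camera, GPS, radar), projects each to a common width, encodes the resulting tokens jointly with a transformer, and classifies over candidate beam indices; all modality names, dimensions, and layer choices are assumptions.

import torch
import torch.nn as nn

class BeamPredictor(nn.Module):
    # Hypothetical multimodal transformer for beam prediction: one token per
    # sensing modality, jointly encoded, then classified over beam indices.
    def __init__(self, modality_dims=(2048, 2, 256), d_model=128, num_beams=64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_beams)

    def forward(self, modalities):
        # modalities: list of (batch, dim_m) tensors, one per sensing modality
        tokens = torch.stack([p(m) for p, m in zip(self.proj, modalities)], dim=1)
        encoded = self.encoder(tokens)          # (batch, num_modalities, d_model)
        return self.head(encoded.mean(dim=1))   # logits over candidate beams

# Example with random features for a batch of 8 samples
model = BeamPredictor()
feats = [torch.randn(8, d) for d in (2048, 2, 256)]
print(model(feats).shape)  # torch.Size([8, 64])
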
- Multimodal Contrastive Learning
The dataset used in the paper is a collection of pairs of observations (x_i, x̃_i) from two modalities, where x_i ∈ R^{d_1} and x̃_i ∈ R^{d_2}. The dataset is used to evaluate the...
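To make the paired-modalities setup concrete, the sketch below implements a symmetric InfoNCE-style contrastive loss over such pairs, assuming each modality is mapped into a shared embedding space by illustrative linear projections; the projections, dimensions, and temperature are assumptions, not the paper's actual objective.

import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis
    m = logits.max(axis=axis, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=axis, keepdims=True))

def contrastive_loss(x, x_tilde, w1, w2, temperature=0.07):
    # x: (n, d1) batch from modality 1; x_tilde: (n, d2) paired batch from modality 2.
    # w1: (d1, d) and w2: (d2, d) stand in for learned encoders into a shared space.
    z1 = x @ w1
    z2 = x_tilde @ w2
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature    # (n, n) cosine similarities
    idx = np.arange(len(x))             # matched pairs sit on the diagonal
    loss_12 = -log_softmax(logits, axis=1)[idx, idx].mean()  # modality 1 -> 2
    loss_21 = -log_softmax(logits, axis=0)[idx, idx].mean()  # modality 2 -> 1
    return (loss_12 + loss_21) / 2

# Example with random data: n = 32 pairs, d1 = 64, d2 = 48, shared dimension 16
rng = np.random.default_rng(0)
x, x_tilde = rng.normal(size=(32, 64)), rng.normal(size=(32, 48))
w1, w2 = rng.normal(size=(64, 16)), rng.normal(size=(48, 16))
print(contrastive_loss(x, x_tilde, w1, w2))
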
- Youtube2Text-QA
A video question answering task that requires machines to answer questions about videos in natural language.
- RWTH-PHOENIX-Weather
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as its main evaluation metric.
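For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words; the minimal sketch below (with a made-up sentence pair, not data from the dataset) shows the computation.

def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level edit distance (substitutions + insertions + deletions)
    # divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion against a four-word reference -> WER = 0.5
print(word_error_rate("the weather is cold", "the weather warm"))
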
- AccidentBlip2
A multimodal large language model for accident detection with multi-view motion reasoning.
- RANKCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid...
- Kosmos-2: Grounding multimodal large language models to the world
Kosmos-2: Grounding multimodal large language models to the world.
- Visual instruction tuning
Visual instruction tuning.
- Flamingo: a visual language model for few-shot learning
Flamingo: a visual language model for few-shot learning.
- Audio-visual scene-aware dialog
Audio-visual scene-aware dialog.
- ChatBridge
ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, thinking, and interacting with humans in...
- Flickr30k entities: Collecting region-to-phrase correspondences for richer im...
A dataset for multimodal learning tasks, focusing on region-to-phrase correspondences for image-to-sentence models.