Stanford Large Movie, Games and Datasets Archive (SMLMDA)
The Stanford Large Movie, Games and Datasets Archive (SMLMDA) is used for training and evaluation.
DeepSense 6G: Large-Scale Real-World Multimodal Sensing and Communication Dataset
Development dataset for the multimodal beam prediction challenge.
Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction
A multimodal transformer deep-learning framework for sensing-assisted beam prediction in wireless communications.
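To make the idea of sensing-assisted beam prediction concrete, here is a minimal sketch, not the paper's implementation: per-modality features (camera and GPS are assumed here for illustration) are projected into a shared token space, fused by a small transformer encoder, and classified over a fixed beam codebook. All module names and dimensions are illustrative assumptions.

```python
# Sketch of a multimodal transformer for beam prediction (assumed design,
# not the published architecture). Modalities and sizes are illustrative.
import torch
import torch.nn as nn

class MultimodalBeamPredictor(nn.Module):
    def __init__(self, img_dim=512, pos_dim=2, d_model=128, n_beams=64):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)   # camera features -> token
        self.pos_proj = nn.Linear(pos_dim, d_model)   # GPS coordinates -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_beams)       # logits over beam indices

    def forward(self, img_feat, pos_feat):
        tokens = torch.stack(
            [self.img_proj(img_feat), self.pos_proj(pos_feat)], dim=1
        )                                             # (batch, 2 tokens, d_model)
        fused = self.fusion(tokens).mean(dim=1)       # pool over modality tokens
        return self.head(fused)                       # (batch, n_beams)

model = MultimodalBeamPredictor()
logits = model(torch.randn(8, 512), torch.randn(8, 2))
predicted_beam = logits.argmax(dim=-1)               # top-1 beam index per sample
```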
Multimodal Contrastive Learning
The dataset used in the paper is a collection of paired observations (x_i, x̃_i) from two modalities, where x_i ∈ R^{d_1} and x̃_i ∈ R^{d_2}. The dataset is used to evaluate the proposed method.
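A minimal sketch of this two-modality setup, assuming a CLIP-style symmetric InfoNCE objective (the paper's exact loss may differ): paired observations are embedded into a shared space and each pair is contrasted against the other pairs in the batch.

```python
# Symmetric contrastive loss over paired two-modality observations.
# The InfoNCE form and temperature value are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """z1: (n, d) embeddings of x_i; z2: (n, d) embeddings of x~_i."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (n, n) similarity matrix
    targets = torch.arange(z1.size(0))        # i-th row matches i-th column
    # Symmetric: modality 1 -> modality 2 and modality 2 -> modality 1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: project raw features of dimensions d1 and d2 into a shared space.
d1, d2, d, n = 32, 48, 16, 8
proj1, proj2 = torch.nn.Linear(d1, d), torch.nn.Linear(d2, d)
loss = contrastive_loss(proj1(torch.randn(n, d1)), proj2(torch.randn(n, d2)))
```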
Youtube2Text-QA
A video question answering task that requires machines to answer questions about videos in natural language.
RWTH-PHOENIX-Weather
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as its main evaluation metric.
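Since WER is defined here as an edit distance over words, a plain dynamic-programming implementation makes the metric concrete; this is a generic sketch, not tied to any particular SLR toolkit.

```python
# WER: word-level Levenshtein distance between hypothesis and reference,
# divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,                # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("wind strong in the north", "wind strong north"))  # 0.4
```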
AccidentBlip2
A multimodal large language model for accident detection with multi-view motion reasoning.
RANKCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one image-text mappings overlooks the more complex, many-to-many relationships between and within modalities; RANKCLIP addresses this by enforcing ranking consistency alongside the pairwise contrastive objective.
Kosmos-2: Grounding multimodal large language models to the world
A multimodal large language model that grounds language in the visual world by linking text spans to bounding-box regions in images.
Visual instruction tuning
Introduces instruction tuning for vision-language models using machine-generated multimodal instruction-following data (the LLaVA approach).
Flamingo: a visual language model for few-shot learning
A visual language model that processes interleaved image and text inputs and adapts to new tasks from only a few in-context examples.
Audio-visual scene-aware dialog
A dialog task and dataset in which systems answer questions about a video by drawing on both its audio and visual content.
ChatBridge
ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, thinking, and interacting with humans in natural language.
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models
A dataset for multimodal learning tasks, focusing on region-to-phrase correspondences for image-to-sentence models.
WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning
A multimodal, multilingual dataset of image-text pairs drawn from Wikipedia.
ShapeNeRF–Text
The ShapeNeRF–Text dataset consists of 40K paired NeRFs and language annotations for ShapeNet objects.
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
An instruction-tuned audio-visual language model for video understanding, trained on 100k videos with detailed captions.
VideoChat: Chat-centric video understanding
A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.