Stanford Large Movie, Games and Datasets Archive (SMLMDA)
The Stanford Large Movie, Games and Datasets Archive (SMLMDA) is used for training and evaluation.
DeepSense 6G: Large-Scale Real-World Multimodal Sensing and Communication Dataset
Development dataset for the multimodal beam prediction challenge.
Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction
A multimodal transformer deep-learning framework for sensing-assisted beam prediction in wireless communications.
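To make the idea of sensing-assisted beam prediction concrete, here is a minimal sketch, not the paper's implementation: per-modality features (camera and GPS are assumed here for illustration) are projected into a shared token space, fused by a small transformer encoder, and classified over a fixed beam codebook. All module names and dimensions are illustrative assumptions.

```python
# Sketch of a multimodal transformer for beam prediction (assumed design,
# not the published architecture). Modalities and sizes are illustrative.
import torch
import torch.nn as nn

class MultimodalBeamPredictor(nn.Module):
    def __init__(self, img_dim=512, pos_dim=2, d_model=128, n_beams=64):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)   # camera features -> token
        self.pos_proj = nn.Linear(pos_dim, d_model)   # GPS coordinates -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_beams)       # logits over beam indices

    def forward(self, img_feat, pos_feat):
        tokens = torch.stack(
            [self.img_proj(img_feat), self.pos_proj(pos_feat)], dim=1
        )                                             # (batch, 2 tokens, d_model)
        fused = self.fusion(tokens).mean(dim=1)       # pool over modality tokens
        return self.head(fused)                       # (batch, n_beams)

model = MultimodalBeamPredictor()
logits = model(torch.randn(8, 512), torch.randn(8, 2))
predicted_beam = logits.argmax(dim=-1)               # top-1 beam index per sample
```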
Multimodal Contrastive Learning
The dataset used in the paper is a collection of paired observations (x_i, x̃_i) from two modalities, where x_i ∈ R^{d_1} and x̃_i ∈ R^{d_2}. The dataset is used to evaluate the proposed method.
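A minimal sketch of this two-modality setup, assuming a CLIP-style symmetric InfoNCE objective (the paper's exact loss may differ): paired observations are embedded into a shared space and each pair is contrasted against the other pairs in the batch.

```python
# Symmetric contrastive loss over paired two-modality observations.
# The InfoNCE form and temperature value are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """z1: (n, d) embeddings of x_i; z2: (n, d) embeddings of x~_i."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (n, n) similarity matrix
    targets = torch.arange(z1.size(0))        # i-th row matches i-th column
    # Symmetric: modality 1 -> modality 2 and modality 2 -> modality 1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: project raw features of dimensions d1 and d2 into a shared space.
d1, d2, d, n = 32, 48, 16, 8
proj1, proj2 = torch.nn.Linear(d1, d), torch.nn.Linear(d2, d)
loss = contrastive_loss(proj1(torch.randn(n, d1)), proj2(torch.randn(n, d2)))
```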
Youtube2Text-QA
A video question answering task that requires machines to answer questions about videos in natural language.
RWTH-PHOENIX-Weather
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as its main evaluation metric.
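Since WER is defined here as an edit distance over words, a plain dynamic-programming implementation makes the metric concrete; this is a generic sketch, not tied to any particular SLR toolkit.

```python
# WER: word-level Levenshtein distance between hypothesis and reference,
# divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,                # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("wind strong in the north", "wind strong north"))  # 0.4
```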
AccidentBlip2
A multimodal large language model for accident detection with multi-view motion reasoning.
RANKCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one image-text mappings overlooks the more complex, many-to-many relationships between and within modalities; RANKCLIP addresses this by enforcing ranking consistency alongside the pairwise contrastive objective.
Kosmos-2: Grounding multimodal large language models to the world
A multimodal large language model that grounds language in the visual world by linking text spans to bounding-box regions in images.
Visual instruction tuning
Introduces instruction tuning for vision-language models using machine-generated multimodal instruction-following data (the LLaVA approach).
Flamingo: a visual language model for few-shot learning
A visual language model that processes interleaved image and text inputs and adapts to new tasks from only a few in-context examples.
Audio-visual scene-aware dialog
A dialog task and dataset in which systems answer questions about a video by drawing on both its audio and visual content.
ChatBridge
ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, thinking, and interacting with humans in natural language.
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models
A dataset for multimodal learning tasks, focusing on region-to-phrase correspondences for image-to-sentence models.
WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning
A multimodal, multilingual dataset of image-text pairs drawn from Wikipedia.
ShapeNeRF–Text
The ShapeNeRF–Text dataset consists of 40K paired NeRFs and language annotations for ShapeNet objects.
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
An instruction-tuned audio-visual language model for video understanding, trained on 100k videos with detailed captions.
VideoChat: Chat-centric video understanding
A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.