TimeIT: A Video-Centric Instruction-Tuning Dataset
TimeIT is a video-centric instruction-tuning dataset involving timestamps. It is composed of 6 diverse tasks drawn from 12 widely-used academic benchmarks, totaling 125K instances.
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. It incorporates two key architectural contributions: a timestamp-aware frame encoder that binds visual content to the timestamp of each frame, and a sliding video Q-Former that accommodates videos of varying lengths.
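As a rough illustration of the timestamp-binding idea, the sketch below fuses each frame's visual feature with an embedding of its sampling time. All names and shapes here are hypothetical; the actual model conditions a Q-Former on timestamp text rather than using this scheme.

```python
import torch
import torch.nn as nn

class TimestampAwareEncoder(nn.Module):
    """Toy sketch: fuse each frame's visual feature with an embedding of
    its sampling timestamp, so downstream layers can reason about *when*
    content appears. Hypothetical, not TimeChat's implementation."""

    def __init__(self, dim: int = 256, max_seconds: int = 1024):
        super().__init__()
        self.time_embed = nn.Embedding(max_seconds, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frame_feats, timestamps):
        # frame_feats: (num_frames, dim); timestamps: (num_frames,) in whole seconds
        t = self.time_embed(timestamps)
        return self.fuse(torch.cat([frame_feats, t], dim=-1))

frames = torch.randn(96, 256)   # features for 96 sampled frames
secs = torch.arange(96)         # one frame per second
time_aware = TimestampAwareEncoder()(frames, secs)  # (96, 256) time-aware features
```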
ActivityNet-QA
Video question answering (VideoQA) is an essential task in vision-language understanding that has attracted considerable research attention. Whereas existing benchmarks largely rely on automatically generated QA pairs over short clips, ActivityNet-QA provides 58,000 human-annotated QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset.
Multimodal Food Perception and Custom End-Effector Mount Designs
A dataset for multimodal food perception, together with custom end-effector mount designs, which we hope will expand the scope of assistive feeding research.
Generalized K-fan Multimodal Deep Model with Shared Representations
Multimodal learning with deep Boltzmann machines (DBMs) is a generative approach to fusing multimodal inputs, and it can learn a shared representation across modalities via Contrastive Divergence training.
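Contrastive Divergence itself is standard: compare data-driven and model-driven correlations after a short Gibbs chain. A minimal CD-1 update for a binary RBM (the building block of a DBM) might look like the NumPy sketch below; this is a generic illustration, not the paper's K-fan model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01):
    """One CD-1 update for a binary RBM with weights W, visible bias b,
    hidden bias c. v0 is a batch of binary visible vectors."""
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Update from the difference of data and model correlations.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

W = rng.normal(0, 0.01, (784, 128))
b = np.zeros(784)
c = np.zeros(128)
batch = (rng.random((32, 784)) < 0.5).astype(float)
W, b, c = cd1_step(batch, W, b, c)
```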
Voice Aging with Audio-Visual Style Transfer
Face aging techniques have used generative adversarial networks (GANs) and style transfer learning to transform one's appearance to look younger/older. Identity is maintained by...
LLaVA 158k
The LLaVA 158k dataset is a large-scale multimodal instruction-following dataset of 158K samples generated with GPT-4, used for training and evaluating multimodal large language models.
Multimodal Robustness Benchmark
The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and robustness against misleading questions, ensuring models truly leverage multimodal inputs...
Twitter15 and Twitter17
Twitter15 and Twitter17 are two English datasets for Target-oriented Multimodal Sentiment Classification (TMSC). The datasets contain text and image data, where the text data is...
Hateful Memes Dataset
The Hateful Memes Dataset consists of a training set of 8,500 images, a dev set of 500 images, and a test set of 1,000 images. The meme text is present on the images, but is also provided separately as raw text.
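The annotations for each split ship as JSON Lines. A minimal loader, assuming the commonly documented fields (`id`, `img`, `label`, `text`); verify the field names against your download.

```python
import json

def load_split(path):
    """Read one Hateful Memes split (e.g. train.jsonl). Each line is a
    JSON object; field names follow the commonly documented release
    format (id, img, label, text) -- check against your copy."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            examples.append({
                "id": ex["id"],
                "image_path": ex["img"],   # relative path to the meme image
                "label": ex.get("label"),  # 1 = hateful, 0 = not; absent in test
                "text": ex["text"],        # meme text, also overlaid on the image
            })
    return examples

train = load_split("hateful_memes/train.jsonl")
print(len(train), train[0]["text"])
```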
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
VLMixer performs unpaired vision-language pre-training via cross-modal CutMix, in which visually grounded words in unpaired sentences are replaced by semantically similar image patches to construct multimodal training inputs.
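Conceptually, cross-modal CutMix substitutes patch embeddings for the word embeddings they align with. The toy sketch below shows only that substitution step, with hypothetical tensors and alignments rather than VLMixer's actual pipeline.

```python
import torch

def cross_modal_cutmix(token_embs, patch_embs, grounded_idx, p=0.5):
    """Replace a random subset of grounded word embeddings with the
    image patch embeddings they align to, yielding a mixed 'multimodal
    sentence'. A conceptual sketch of the CutMix idea only."""
    mixed = token_embs.clone()
    for tok_i, patch_i in grounded_idx:
        if torch.rand(1).item() < p:
            mixed[tok_i] = patch_embs[patch_i]
    return mixed

tokens = torch.randn(12, 768)    # embeddings for a 12-token sentence
patches = torch.randn(50, 768)   # embeddings for 50 image patches
aligned = [(3, 17), (7, 42)]     # (token index, matching patch index) pairs
multimodal_sentence = cross_modal_cutmix(tokens, patches, aligned)
```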
PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. PowMix is a versatile embedding-space regularizer that mixes training samples to improve the robustness of MSA models.
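Mixing-based regularizers of this family interpolate pairs of training examples and their labels. The sketch below is generic embedding-space mixup under a Beta-distributed coefficient; PowMix's published formulation differs in the details of its mixing strategy.

```python
import torch

def embedding_mixup(feats, labels, alpha=0.2):
    """Interpolate pairs of fused multimodal embeddings and their labels
    with a Beta(alpha, alpha) coefficient. Generic feature-space mixup,
    not PowMix's exact scheme."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(feats.size(0))
    mixed_feats = lam * feats + (1 - lam) * feats[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_feats, mixed_labels

feats = torch.randn(16, 128)   # fused text+audio+video embeddings
labels = torch.rand(16, 1)     # continuous sentiment scores
mf, ml = embedding_mixup(feats, labels)
```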
FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks
This paper introduces FocusCLIP, an enhancement to CLIP pretraining that uses ROI heatmaps emulating human visual attention to provide subject-level guidance for zero-shot transfer on human-centric tasks.
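One simple way to approximate subject-level guidance is to weight patch features by a normalized ROI heatmap before pooling; the sketch below is only a hedged illustration of that idea, not FocusCLIP's published architecture.

```python
import torch

def roi_weighted_pool(patch_feats, heatmap):
    """Pool patch features with weights from a normalized ROI heatmap,
    biasing the image embedding toward the human subject. Illustrative
    only; FocusCLIP's actual mechanism may differ."""
    w = heatmap.flatten()
    w = w / (w.sum() + 1e-8)
    return (patch_feats * w.unsqueeze(-1)).sum(dim=0)

patch_feats = torch.randn(49, 512)   # 7x7 grid of ViT patch features
heatmap = torch.rand(7, 7)           # ROI heatmap over the same grid
image_emb = roi_weighted_pool(patch_feats, heatmap)
```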
Custom Dataset for Fine-tuning Open-sourced Models
The dataset used in this paper is a custom dataset generated for fine-tuning open-source models.
QVHighlights
QVHighlights is a dataset for query-based video highlight detection and moment retrieval. It consists of over 10,000 YouTube videos, each annotated with a human-written free-form text query, the relevant moments in the video, and five-point saliency scores for query-relevant clips.
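Each annotation pairs a query with its relevant moments and clip-level saliency ratings. The sketch below parses the annotations, assuming the JSON Lines fields used in the public release (`query`, `vid`, `relevant_windows`, `saliency_scores`) and an assumed filename; check both against your copy of the data.

```python
import json

def load_qvhighlights(path):
    """Parse QVHighlights annotations from a JSON Lines file. Field
    names follow the public release; verify against your download."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            ann = json.loads(line)
            yield {
                "query": ann["query"],
                "video_id": ann["vid"],
                "moments": ann["relevant_windows"],      # [[start_s, end_s], ...]
                "saliency": ann.get("saliency_scores"),  # per-clip ratings
            }

# Assumed filename for the training split.
for ex in load_qvhighlights("highlight_train_release.jsonl"):
    print(ex["query"], ex["moments"])
    break
```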