TimeIT: A Video-Centric Instruction-Tuning Dataset
TimeIT is a video-centric instruction-tuning dataset involving timestamps. It is composed of 6 diverse tasks drawn from 12 widely-used academic benchmarks, totaling 125K instances.
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. It incorporates two key architectural contributions: a timestamp-aware frame encoder that binds visual content to the timestamp of each frame, and a sliding video Q-Former that accommodates videos of varying lengths.
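As a rough illustration of the timestamp-binding idea, the sketch below fuses each frame's visual feature with an embedding of its sampling time. All names and shapes here are hypothetical; the actual model conditions a Q-Former on timestamp text rather than using this scheme.

```python
import torch
import torch.nn as nn

class TimestampAwareEncoder(nn.Module):
    """Toy sketch: fuse each frame's visual feature with an embedding of
    its sampling timestamp, so downstream layers can reason about *when*
    content appears. Hypothetical, not TimeChat's implementation."""

    def __init__(self, dim: int = 256, max_seconds: int = 1024):
        super().__init__()
        self.time_embed = nn.Embedding(max_seconds, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frame_feats, timestamps):
        # frame_feats: (num_frames, dim); timestamps: (num_frames,) in whole seconds
        t = self.time_embed(timestamps)
        return self.fuse(torch.cat([frame_feats, t], dim=-1))

frames = torch.randn(96, 256)   # features for 96 sampled frames
secs = torch.arange(96)         # one frame per second
time_aware = TimestampAwareEncoder()(frames, secs)  # (96, 256) time-aware features
```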
ActivityNet-QA
Video question answering (VideoQA) is an essential task in vision-language understanding that has attracted considerable research attention. Whereas existing benchmarks largely rely on automatically generated QA pairs over short clips, ActivityNet-QA provides 58,000 human-annotated QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset.
Multimodal Food Perception and Custom End-Effector Mount Designs
A dataset for multimodal food perception, together with custom end-effector mount designs, which we hope will expand the scope of assistive feeding research.
Generalized K-fan Multimodal Deep Model with Shared Representations
Multimodal learning with deep Boltzmann machines (DBMs) is a generative approach to fusing multimodal inputs, and it can learn a shared representation across modalities via Contrastive Divergence training.
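Contrastive Divergence itself is standard: compare data-driven and model-driven correlations after a short Gibbs chain. A minimal CD-1 update for a binary RBM (the building block of a DBM) might look like the NumPy sketch below; this is a generic illustration, not the paper's K-fan model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01):
    """One CD-1 update for a binary RBM with weights W, visible bias b,
    hidden bias c. v0 is a batch of binary visible vectors."""
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Update from the difference of data and model correlations.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

W = rng.normal(0, 0.01, (784, 128))
b = np.zeros(784)
c = np.zeros(128)
batch = (rng.random((32, 784)) < 0.5).astype(float)
W, b, c = cd1_step(batch, W, b, c)
```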
Voice Aging with Audio-Visual Style Transfer
Face aging techniques have used generative adversarial networks (GANs) and style transfer learning to transform one's appearance to look younger/older. Identity is maintained by...
LLaVA 158k
The LLaVA 158k dataset is a large-scale multimodal instruction-following dataset of 158K samples generated with GPT-4, used for training and evaluating multimodal large language models.
Multimodal Robustness Benchmark
The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and robustness against misleading questions, ensuring models truly leverage multimodal inputs...
Twitter15 and Twitter17
Twitter15 and Twitter17 are two English datasets for Target-oriented Multimodal Sentiment Classification (TMSC). The datasets contain text and image data, where the text data is...
Hateful Memes Dataset
The Hateful Memes Dataset consists of a training set of 8,500 images, a dev set of 500 images, and a test set of 1,000 images. The meme text is present on the images, but is also provided separately as raw text.
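The annotations for each split ship as JSON Lines. A minimal loader, assuming the commonly documented fields (`id`, `img`, `label`, `text`); verify the field names against your download.

```python
import json

def load_split(path):
    """Read one Hateful Memes split (e.g. train.jsonl). Each line is a
    JSON object; field names follow the commonly documented release
    format (id, img, label, text) -- check against your copy."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            examples.append({
                "id": ex["id"],
                "image_path": ex["img"],   # relative path to the meme image
                "label": ex.get("label"),  # 1 = hateful, 0 = not; absent in test
                "text": ex["text"],        # meme text, also overlaid on the image
            })
    return examples

train = load_split("hateful_memes/train.jsonl")
print(len(train), train[0]["text"])
```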
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
VLMixer performs unpaired vision-language pre-training via cross-modal CutMix, in which visually grounded words in unpaired sentences are replaced by semantically similar image patches to construct multimodal training inputs.
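Conceptually, cross-modal CutMix substitutes patch embeddings for the word embeddings they align with. The toy sketch below shows only that substitution step, with hypothetical tensors and alignments rather than VLMixer's actual pipeline.

```python
import torch

def cross_modal_cutmix(token_embs, patch_embs, grounded_idx, p=0.5):
    """Replace a random subset of grounded word embeddings with the
    image patch embeddings they align to, yielding a mixed 'multimodal
    sentence'. A conceptual sketch of the CutMix idea only."""
    mixed = token_embs.clone()
    for tok_i, patch_i in grounded_idx:
        if torch.rand(1).item() < p:
            mixed[tok_i] = patch_embs[patch_i]
    return mixed

tokens = torch.randn(12, 768)    # embeddings for a 12-token sentence
patches = torch.randn(50, 768)   # embeddings for 50 image patches
aligned = [(3, 17), (7, 42)]     # (token index, matching patch index) pairs
multimodal_sentence = cross_modal_cutmix(tokens, patches, aligned)
```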
PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. PowMix is a versatile embedding-space regularizer that mixes training samples to improve the robustness of MSA models.
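Mixing-based regularizers of this family interpolate pairs of training examples and their labels. The sketch below is generic embedding-space mixup under a Beta-distributed coefficient; PowMix's published formulation differs in the details of its mixing strategy.

```python
import torch

def embedding_mixup(feats, labels, alpha=0.2):
    """Interpolate pairs of fused multimodal embeddings and their labels
    with a Beta(alpha, alpha) coefficient. Generic feature-space mixup,
    not PowMix's exact scheme."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(feats.size(0))
    mixed_feats = lam * feats + (1 - lam) * feats[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_feats, mixed_labels

feats = torch.randn(16, 128)   # fused text+audio+video embeddings
labels = torch.rand(16, 1)     # continuous sentiment scores
mf, ml = embedding_mixup(feats, labels)
```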
FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks
This paper introduces FocusCLIP, an enhancement to CLIP pretraining that uses ROI heatmaps emulating human visual attention to provide subject-level guidance for zero-shot transfer on human-centric tasks.
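One simple way to approximate subject-level guidance is to weight patch features by a normalized ROI heatmap before pooling; the sketch below is only a hedged illustration of that idea, not FocusCLIP's published architecture.

```python
import torch

def roi_weighted_pool(patch_feats, heatmap):
    """Pool patch features with weights from a normalized ROI heatmap,
    biasing the image embedding toward the human subject. Illustrative
    only; FocusCLIP's actual mechanism may differ."""
    w = heatmap.flatten()
    w = w / (w.sum() + 1e-8)
    return (patch_feats * w.unsqueeze(-1)).sum(dim=0)

patch_feats = torch.randn(49, 512)   # 7x7 grid of ViT patch features
heatmap = torch.rand(7, 7)           # ROI heatmap over the same grid
image_emb = roi_weighted_pool(patch_feats, heatmap)
```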
Custom Dataset for Fine-tuning Open-sourced Models
The dataset used in this paper is a custom dataset generated for fine-tuning open-source models.
QVHighlights
QVHighlights is a dataset for query-based video highlight detection and moment retrieval. It consists of over 10,000 YouTube videos, each annotated with a human-written free-form text query, the relevant moments in the video, and five-point saliency scores for query-relevant clips.
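Each annotation pairs a query with its relevant moments and clip-level saliency ratings. The sketch below parses the annotations, assuming the JSON Lines fields used in the public release (`query`, `vid`, `relevant_windows`, `saliency_scores`) and an assumed filename; check both against your copy of the data.

```python
import json

def load_qvhighlights(path):
    """Parse QVHighlights annotations from a JSON Lines file. Field
    names follow the public release; verify against your download."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            ann = json.loads(line)
            yield {
                "query": ann["query"],
                "video_id": ann["vid"],
                "moments": ann["relevant_windows"],      # [[start_s, end_s], ...]
                "saliency": ann.get("saliency_scores"),  # per-clip ratings
            }

# Assumed filename for the training split.
for ex in load_qvhighlights("highlight_train_release.jsonl"):
    print(ex["query"], ex["moments"])
    break
```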