Voice Aging with Audio-Visual Style Transfer
Face aging techniques have used generative adversarial networks (GANs) and style transfer learning to transform a person's appearance to look younger or older. Identity is maintained by...
LLaVA 158k
The LLaVA 158k dataset is a large-scale multimodal dataset used for training and evaluating multimodal large language models.
Multimodal Robustness Benchmark
The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and robustness against misleading questions, ensuring models truly leverage multimodal inputs...
Twitter15 and Twitter17
Twitter15 and Twitter17 are two English datasets for Target-oriented Multimodal Sentiment Classification (TMSC). The datasets contain text and image data, where the text data is...
Hateful Memes Dataset
The Hateful Memes Dataset consists of a training set of 8,500 images, a dev set of 500 images, and a test set of 1,000 images. The meme text is present on the images, but also...
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Unpaired vision-language pre-training via cross-modal CutMix.
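For context, the paper's cross-modal variant generalizes the original image-space CutMix operation. A minimal sketch of that base operation (not the paper's cross-modal extension), assuming square-ish NumPy image arrays and soft labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def cutmix(img_a, img_b, label_a, label_b, alpha=1.0):
    """Paste a random rectangle from img_b into img_a and mix the labels
    in proportion to the area of img_a that survives."""
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)                       # target mixing ratio
    cut_h = int(h * np.sqrt(1 - lam))                  # box size so that
    cut_w = int(w * np.sqrt(1 - lam))                  # box area ≈ (1 - lam)
    cy, cx = rng.integers(h), rng.integers(w)          # random box centre
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    # recompute the ratio from the actual (clipped) box area
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (h * w)
    return mixed, lam_adj * label_a + (1 - lam_adj) * label_b
```

The cross-modal version in the paper applies the same cut-and-paste idea across modalities rather than between two images; the sketch above only illustrates the underlying mechanism.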
PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments.
FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-...
FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks. This paper introduces FocusCLIP, an enhancement for CLIP pretraining using a new ROI...
Custom Dataset for Fine-tuning Open-sourced Models
The dataset used in this paper is a custom dataset generated for fine-tuning open-source models.
QVHighlights
QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.
Multimodal Visual Patterns (MMVP) Benchmark
The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
Degree Datasets
Degree datasets are constructed by gradually adjusting the degree of alignment between image and text.
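One illustrative way such graded image-text pairs could be built (a sketch of the general idea, not the paper's actual procedure): start from matched image-caption pairs and replace a growing fraction of caption tokens with tokens from an unrelated caption, so that degree 0 keeps the caption fully aligned and degree 1 makes it fully unaligned. The function name and token-swap strategy below are hypothetical.

```python
import random

def degrade_alignment(caption, distractor, degree, seed=0):
    """Reduce the caption's alignment with its image by replacing a
    fraction `degree` (0..1) of its tokens with tokens drawn from an
    unrelated `distractor` caption."""
    rng = random.Random(seed)
    tokens = caption.split()
    pool = distractor.split()
    n_swap = round(degree * len(tokens))       # how many tokens to corrupt
    for i in rng.sample(range(len(tokens)), n_swap):
        tokens[i] = rng.choice(pool)           # swap in an unrelated token
    return " ".join(tokens)
```

Sweeping `degree` over, say, 0.0, 0.25, 0.5, 0.75, 1.0 then yields a spectrum of pairs from aligned to unaligned for the same image.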
Multimodal Learning Task
The dataset used in the paper supports a multimodal learning task for robots.
Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of ...
Dysca is a dynamic and scalable benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.