TimeIT: A Video-Centric Instruction-Tuning Dataset
TimeIT is a video-centric dataset designed for instruction tuning. It comprises 6 diverse tasks, 12 widely used academic benchmarks, and a total of 125K...
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. It incorporates two key architectural contributions: a...
Voice Aging with Audio-Visual Style Transfer
Face aging techniques have used generative adversarial networks (GANs) and style transfer learning to transform a person's appearance to look younger or older. Identity is maintained by...
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images
The dataset from SemEval-2021 Task 6: detection of persuasion techniques in texts and images, approached using CLIP features.
Reuters Video-Language News Dataset
The Reuters Video-Language News Dataset (ReutersViLNews) is a large-scale video-language understanding dataset containing 1,974 long-form news videos with an average video...
Hateful Memes Dataset
The Hateful Memes Dataset consists of a training set of 8,500 images, a dev set of 500 images, and a test set of 1,000 images. The meme text is present on the images, but also...
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Unpaired vision-language pre-training via cross-modal CutMix.
PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments.
FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks
This paper introduces FocusCLIP, an enhancement for CLIP pretraining using a new ROI...
QVHighlights
QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries. -
Multimodal Visual Patterns (MMVP) Benchmark
The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs). -
Multimodal C4 (mmc4)
Multimodal C4 (mmc4) is a public, billion-scale corpus of images and text, constructed from public webpages contained in the cleaned English c4 corpus. -
Multimodal Learning (MLM) dataset
The MLM dataset is a collection of images and captions that represent different cultures from around the world. -
Stanford Large Movie, Games and Datasets Archive (SMLMDA)
The SMLMDA dataset is used for training and evaluation.
Multimodal Contrastive Learning
The dataset used in the paper is a collection of paired observations (x_i, x̃_i) from two modalities, where x_i ∈ ℝ^{d_1} and x̃_i ∈ ℝ^{d_2}. The dataset is used to evaluate the...
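The paired-modality setup above is the standard input to a contrastive objective: each (x_i, x̃_i) pair is treated as a positive, and all other pairings in the batch as negatives. A minimal sketch of a symmetric InfoNCE loss over such pairs, assuming both modalities have already been projected to a shared k-dimensional space and L2-normalized (the specific encoders and temperature are illustrative, not from the paper):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE loss over n paired embeddings.

    z1: (n, k) embeddings of x_i; z2: (n, k) embeddings of x~_i.
    Positives sit on the diagonal of the similarity matrix.
    """
    logits = z1 @ z2.T / temperature          # (n, n) pairwise similarities
    idx = np.arange(len(z1))                  # positive index for each row

    def xent(l):
        # cross-entropy of the diagonal entries, numerically stabilized
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the two retrieval directions (modality 1 -> 2 and 2 -> 1)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly matched pairs should yield a lower loss than shuffled (mismatched) pairs, which is what such an evaluation typically measures.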