-
Valley: A Video Assistant with Large Language Model Enhanced Ability
A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs and 10k detail description... -
The Hateful Memes dataset
The Hateful Memes dataset aims to help develop models that more effectively detect multimodal hateful content. -
IMAGINE: An Imagination-Based Automatic Evaluation Metric for Natural Languag...
Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with the text references. This is different from... -
Hateful Memes Challenge
The Hateful Memes dataset is a multimodal dataset containing 10,000+ new examples of multimodal content. -
Uniter dataset
The Uniter dataset is a multimodal learning dataset consisting of images and their corresponding text. -
End-to-End Referring Video Object Segmentation with Multimodal Transformers
The referring video object segmentation (RVOS) task involves segmenting a text-referred object instance in the frames of a given video. -
Multimodal Variational Autoencoder for Cardiac Hemodynamics Instability Detec...
A multimodal variational autoencoder for low-cost cardiac hemodynamics instability detection from CXR and ECG. -
BioMedClip
BioMedClip: A CLIP model pretrained on image-text pairs extracted from the PubMed Central repository. -
Training CLIP models on Data from Scientific Papers
Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores... -
Music-AVQA
The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs. -
Audio-Visual Question Answering
Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer. -
Conceptual Captions
A dataset used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.