WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning
A large-scale multimodal, multilingual dataset of image-text examples extracted from Wikipedia articles, spanning more than 100 languages.
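If you want to poke at WIT directly, a minimal loading sketch is below. It assumes the wikimedia/wit_base mirror on the Hugging Face Hub (the repository id is an assumption, not something stated in this listing) and uses streaming to avoid downloading the full corpus.

```python
from itertools import islice
from datasets import load_dataset

# Assumed Hub mirror of WIT; verify the repository id before relying on it.
# Streaming mode iterates over records without a full local download.
wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

# Inspect the first few records' fields rather than assuming a schema.
for example in islice(wit, 3):
    print(sorted(example.keys()))
```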
ShapeNeRF–Text
The ShapeNeRF–Text dataset consists of 40K paired NeRFs and language annotations for ShapeNet objects.
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
A multimodal framework that equips a large language model with understanding of both the visual and auditory content of videos.
VideoChat: Chat-centric video understanding
A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.
Valley: A Video Assistant with Large Language Model Enhanced Ability
A large multimodal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs, and 10k detail description instruction pairs.
The Hateful Memes dataset
The Hateful Memes dataset aims to help develop models that more effectively detect multimodal hateful content.
IMAGINE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with the text references. This is different from human language processing, in which visual imagination often aids comprehension.
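For context, here is a minimal sketch of the embedding-level comparison that conventional NLG metrics rely on: embed the candidate and the reference, then score them by cosine similarity. The sentence-transformers model name and example strings are illustrative assumptions; this is not the IMAGINE metric itself.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding-level NLG scoring. The encoder choice is an
# assumption; any general-purpose sentence encoder would do here.
model = SentenceTransformer("all-MiniLM-L6-v2")

candidate = "a man rides a horse along the beach"
reference = "someone is riding a horse on the shore"

# Embed both sentences and compare them in embedding space.
emb = model.encode([candidate, reference], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"embedding-level similarity: {score:.3f}")
```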
Hateful Memes Challenge
The Hateful Memes challenge set is a multimodal dataset containing more than 10,000 new examples that combine images and text.
Uniter dataset
The Uniter dataset is a multimodal learning dataset of images paired with corresponding text, used to train the UNITER (UNiversal Image-TExt Representation) model.
End-to-End Referring Video Object Segmentation with Multimodal Transformers
The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video.
Multimodal Variational Autoencoder for Cardiac Hemodynamics Instability Detection
A multimodal variational autoencoder for low-cost detection of cardiac hemodynamic instability from chest X-ray (CXR) and electrocardiogram (ECG) data.
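To make the idea of a multimodal VAE concrete, below is a generic two-encoder sketch in PyTorch: separate encoders for CXR and ECG features, a shared latent space, and per-modality decoders. All layer sizes, the pre-extracted feature-vector inputs, and the late-fusion design are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalVAE(nn.Module):
    """Toy multimodal VAE: encode each modality, fuse, sample a shared
    latent, and reconstruct both modalities. Sizes are assumptions."""

    def __init__(self, cxr_dim=1024, ecg_dim=256, latent_dim=32):
        super().__init__()
        self.enc_cxr = nn.Sequential(nn.Linear(cxr_dim, 256), nn.ReLU())
        self.enc_ecg = nn.Sequential(nn.Linear(ecg_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.dec_cxr = nn.Linear(latent_dim, cxr_dim)
        self.dec_ecg = nn.Linear(latent_dim, ecg_dim)

    def forward(self, cxr, ecg):
        # Late fusion: concatenate per-modality encodings.
        h = torch.cat([self.enc_cxr(cxr), self.enc_ecg(ecg)], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec_cxr(z), self.dec_ecg(z), mu, logvar

def vae_loss(cxr, ecg, rec_cxr, rec_ecg, mu, logvar):
    # Reconstruction terms for both modalities plus the KL penalty.
    rec = F.mse_loss(rec_cxr, cxr) + F.mse_loss(rec_ecg, ecg)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# Smoke test with random feature vectors.
model = MultimodalVAE()
cxr, ecg = torch.randn(8, 1024), torch.randn(8, 256)
loss = vae_loss(cxr, ecg, *model(cxr, ecg))
print(loss.item())
```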
BiomedCLIP
BiomedCLIP: a CLIP-style model pretrained on image-text pairs extracted from the PubMed Central repository.
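A minimal zero-shot usage sketch follows; it assumes the open_clip library and the BiomedCLIP checkpoint published on the Hugging Face Hub. The hub id, local image file, and label prompts are assumptions to be checked against the model card.

```python
import torch
import open_clip
from PIL import Image

# Assumed Hub checkpoint id for BiomedCLIP; verify against the model card.
HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"

model, preprocess = open_clip.create_model_from_pretrained(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)

# Hypothetical local image; labels are illustrative zero-shot prompts.
image = preprocess(Image.open("example.png")).unsqueeze(0)
labels = ["chest X-ray", "brain MRI", "histopathology slide"]
texts = tokenizer([f"this is a photo of a {label}" for label in labels])

with torch.no_grad():
    # Embed image and texts, normalize, and compare by cosine similarity.
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```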