- WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models.
- Visual instruction tuning.
- Flamingo: a visual language model for few-shot learning.
- Audio-visual scene-aware dialog.
- ChatBridge
  ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, thinking, and interacting with humans in...
- ShapeNeRF–Text
  The ShapeNeRF–Text dataset consists of 40K paired NeRFs and language annotations for ShapeNet objects.
- Training CLIP models on Data from Scientific Papers
  Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores...