LLaVA-1.5

The model used in this paper is LLaVA-1.5, a multimodal large language model that pairs a LLaMA-family language backbone (Vicuna) with a CLIP vision encoder. The 7-billion-parameter variant is used for multimodal tasks such as image captioning and visual question answering.

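As a quick illustration of the visual question answering use case, here is a minimal sketch that runs LLaVA-1.5-7B through the Hugging Face `transformers` integration. The `llava-hf/llava-1.5-7b-hf` checkpoint name, the placeholder image URL, and the prompt wording are assumptions for illustration, not details taken from the paper.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed community checkpoint name for LLaVA-1.5-7B on the Hugging Face Hub.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works here; this URL is only a placeholder.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# LLaVA-1.5 uses a USER/ASSISTANT style prompt with an <image> placeholder token.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
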
BibTeX:

@article{liu2023improved,
  title={Improved Baselines with Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
  journal={arXiv preprint arXiv:2310.03744},
  year={2023}
}