7 datasets found

Groups: Image Captioning

  • MME

    MME: A comprehensive evaluation benchmark for multimodal large language models
  • MMBench

    MMBench: Is your multi-modal model an all-around player?
  • Language Models are Few-Shot Learners

    The paper introducing GPT-3, a large language model that demonstrates few-shot capabilities in processing and generating human-like text.
  • MMICL

    MMICL: Empowering vision-language model with multi-modal in-context learning
  • Prompt Highlighter

    Prompt Highlighter is a novel paradigm for user-model interactions in multi-modal LLMs, offering output control through a token-level highlighting mechanism.
  • Conceptual Captions

    An image-caption dataset used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.
  • Visual Genome

    The Visual Genome dataset is a large-scale dataset for visual question answering and scene understanding, containing over 100,000 images densely annotated with objects, attributes, and relationships.