2 datasets found

Groups: Multimodal Learning · Formats: JSON

  • InternLM2

    InternLM2 is a large vision-language model that supports images of any aspect ratio at resolutions from 336 pixels up to 4K HD, facilitating its deployment in real-world settings.
  • RANKCLIP: Ranking-Consistent Language-Image Pretraining

    Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid...
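For context on the CLIP-style pairing the truncated abstract refers to, the sketch below shows the standard symmetric contrastive (InfoNCE) objective used by CLIP-like models, where each image in a batch is matched only to its own caption and every other caption is treated as a negative. This is a minimal illustration of that one-to-one scheme, not RANKCLIP's method; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of the similarity matrix is matched only to column i (its own
    caption); all other captions in the batch are negatives, i.e. the rigid
    one-to-one mapping mentioned above.
    """
    # Normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix scaled by temperature.
    logits = image_embeds @ text_embeds.T / temperature

    # Ground-truth match for image i is caption i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```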