Automatically evaluating vision-language tasks is challenging: existing metrics often fail to reflect human judgments because they cannot account for fine-grained details.
InternLM2 is a large vision-language model that supports images of any aspect ratio, from 336 pixels up to 4K HD, which facilitates its deployment in real-world settings.
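One common way to handle inputs of arbitrary aspect ratio and resolution is to tile the image into fixed-size patches for the vision encoder. The sketch below is purely illustrative and hedged: the function name, the 336-pixel tile size, and the tile budget are assumptions for demonstration, not the model's actual preprocessing pipeline.

```python
import math

def plan_tiles(width: int, height: int, base: int = 336, max_tiles: int = 144) -> tuple[int, int]:
    """Hypothetical helper: pick a (cols, rows) grid of base-sized tiles that
    covers an image of arbitrary aspect ratio, capped at max_tiles.
    Illustrative sketch only, not the model's actual preprocessing."""
    cols = max(1, math.ceil(width / base))
    rows = max(1, math.ceil(height / base))
    # Shrink the longer side first until the grid fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

# Example: a 3840x2160 (4K) frame maps to a 12x7 grid of 336-px tiles.
print(plan_tiles(3840, 2160))
```

Under this scheme, a small thumbnail stays a single tile while a 4K frame expands into many, letting one encoder serve inputs of very different shapes.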