VILLA

The dataset used in the VILLA paper on vision-and-language representation learning.

BibTeX: