- Conceptual 12M
Conceptual 12M is a dataset for automatic image captioning.
- YouTube-8M
YouTube-8M is a large-scale video classification benchmark.
- Video Captioning Dataset
A video captioning dataset generated by pseudo-labeling videos with image captioning models.
- MNIST-SVHN-Text dataset
The MNIST-SVHN-Text dataset is a multi-modal dataset consisting of images, text, and labels.