- MSR Video to Text (MSR-VTT)
The MSR-VTT dataset is a large-scale video captioning benchmark that contains 10,000 video clips with 200,000 descriptions.
- Microsoft Video Description Corpus (MSVD)
The MSVD dataset is a public video captioning benchmark that contains 1,970 short video clips with 80,000 descriptions.