Learning to compose topic-aware mixture of experts for zero-shot video captio...
The following datasets are used for zero-shot video captioning.
MSR Video to Text (MSR-VTT)
The MSR-VTT dataset is a large-scale video captioning benchmark that contains 10,000 video clips with 200,000 descriptions.
Microsoft Video Description Corpus (MSVD)
The MSVD dataset is a public video captioning benchmark that contains 1,970 short video clips with 80,000 descriptions.
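The corpus sizes quoted above can be turned into a rough density figure: the average number of reference descriptions per clip. The sketch below is a minimal illustration that uses only the rounded totals given here. The helper `captions_per_clip` and the `DATASETS` table are hypothetical names introduced for this example. The result is an average only and says nothing about how descriptions are distributed across individual clips or splits.

```python
# Corpus sizes as quoted above: (number of clips, number of descriptions).
# These are the rounded totals from the dataset descriptions, not official split counts.
DATASETS = {
    "MSR-VTT": (10_000, 200_000),
    "MSVD": (1_970, 80_000),
}


def captions_per_clip(clips: int, descriptions: int) -> float:
    """Hypothetical helper: average number of descriptions per video clip."""
    if clips <= 0:
        raise ValueError("clips must be positive")
    return descriptions / clips


if __name__ == "__main__":
    for name, (clips, descriptions) in DATASETS.items():
        avg = captions_per_clip(clips, descriptions)
        print(f"{name}: {avg:.1f} descriptions per clip")
```

With these totals, MSR-VTT averages 20 descriptions per clip and MSVD roughly 40.6, so MSVD provides about twice as many references per clip despite having far fewer clips.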