- Discriminative Training: Learning to Describe Video with Sentences
The dataset used in the paper is a collection of video clips paired with sentential labels; the goal is to learn word meanings from complex, realistic video clips.
- Grounded Video Description
Grounded Video Description is a dataset for video description.
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
MSR-VTT is a large video description dataset for bridging video and language.
- ActivityNet Captions
ActivityNet Captions is a benchmark dataset proposed for dense video captioning. It contains 20K untrimmed videos in total, and each video has several annotated segments with...