Visual Storytelling Dataset (VIST)
The Visual Storytelling Dataset (VIST) consists of 10,117 Flickr albums and 210,819 unique images. Each sample is a sequence of 5 photos selected from the same album, paired with a single human-constructed story. Most stories contain one sentence per image.
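As a rough illustration of the sample structure described above, one sample could be held in a small container like the one below. This is a minimal sketch, not the dataset's actual schema: the class name `StorySample` and the fields `album_id`, `photo_ids`, and `sentences` are hypothetical names chosen for this example. Only the photo count is checked, since stories have mostly, but not always, one sentence per image.

```python
from dataclasses import dataclass
from typing import List

# From the description: each sample is a sequence of 5 photos.
PHOTOS_PER_STORY = 5


@dataclass
class StorySample:
    """Hypothetical container for one sample; field names are illustrative only."""

    album_id: str          # assumed identifier of the source album
    photo_ids: List[str]   # assumed identifiers of the 5 photos, in story order
    sentences: List[str]   # story sentences, mostly one per photo

    def __post_init__(self) -> None:
        # Enforce the fixed sequence length stated in the description.
        if len(self.photo_ids) != PHOTOS_PER_STORY:
            raise ValueError(
                f"expected {PHOTOS_PER_STORY} photos, got {len(self.photo_ids)}"
            )

    @property
    def story(self) -> str:
        # Join the sentences into the single story paired with the sequence.
        return " ".join(self.sentences)
```

The sentence count is deliberately left unchecked so that a story with fewer or more sentences than photos is still accepted.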
BibTeX: