Long Video Understanding Benchmark

Towards long-form video understanding: we propose a two-stream spatio-temporal attention network for long-video classification that combines the inductive bias of convolutions with the computational advantages of transformer networks.
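
As a rough illustration of the kind of model described above, here is a minimal PyTorch sketch that pairs a per-frame convolutional stem (supplying the convolutional inductive bias) with a transformer encoder attending over time (supplying long-range temporal attention). The module layout, layer sizes, and mean-pooled fusion are all hypothetical illustrations, not the architecture from the paper.

```python
# Hypothetical sketch of a two-stream spatio-temporal attention model:
# a convolutional stem provides inductive bias per frame, a transformer
# encoder provides attention over time. Sizes are illustrative only.
import torch
import torch.nn as nn


class TwoStreamSpatioTemporal(nn.Module):
    def __init__(self, num_classes=10, dim=128, num_heads=4, depth=2):
        super().__init__()
        # Spatial stream: 2D convolutions applied independently to each frame.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # collapse each frame to one token
        )
        # Temporal stream: self-attention over the sequence of frame tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        tokens = self.spatial(frames).flatten(1)  # (b*t, dim)
        tokens = tokens.reshape(b, t, -1)         # (b, t, dim)
        tokens = self.temporal(tokens)            # attention across frames
        return self.head(tokens.mean(dim=1))      # clip-level logits


if __name__ == "__main__":
    clip = torch.randn(2, 16, 3, 64, 64)  # 2 clips of 16 frames each
    print(TwoStreamSpatioTemporal()(clip).shape)  # torch.Size([2, 10])
```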

Cite this as

Chao-Yuan Wu, Philipp Krahenbuhl (2024). Dataset: Long Video Understanding Benchmark. https://doi.org/10.57702/plafsryv

DOI retrieved: December 16, 2024

Additional Info

Field         Value
Created       December 16, 2024
Last update   December 16, 2024
Author        Chao-Yuan Wu
More Authors  Philipp Krahenbuhl
Homepage      https://arxiv.org/abs/2106.02036