Long Video Understanding Benchmark
Towards long-form video understanding. We propose a two-stream spatio-temporal attention network for long video classification that combines the inductive bias of convolutions with the computational advantages of transformer networks.
BibTeX: