Long-term Leap Attention, Short-term Periodic Shift for Video Classification

Video transformer naturally incurs a heavier computation burden than a static vision transformer, as the former processes T times longer sequence than the latter under the current attention of quadratic complexity. The existing works treat the temporal axis as a simple extension of spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing without utilizing temporal redundancy.

BibTex: