- Moving MNIST
Moving MNIST is a benchmark data set for video recognition. It contains 10,000 samples: 8,000 for training and 2,000 for testing. Each sample consists of 20 sequential gray...
- Temporal-attentive Covariance Pooling Networks for Video Recognition
Video recognition aims to automatically analyze the contents of videos (e.g., events and actions), and has a wide range of applications, including intelligent surveillance,...
- Multi-Fiber Networks for Video Recognition
The proposed multi-fiber architecture is used for reducing the computational cost of spatio-temporal deep neural networks, making them run as fast as their 2D counterparts while...
- Multiscale Vision Transformers
Multiscale Vision Transformers (MViT) address video and image recognition by connecting the seminal idea of multiscale feature hierarchies with transformer models.