-
Two-level Group Convolution
The proposed two-level group convolution is suitable for distributed-memory computing and is robust to a large number of groups. -
Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated ...
The proposed Win Transformer consistently outperforms Swin Transformer on multiple computer vision tasks, including image recognition, semantic... -
ANTNets: Mobile Convolutional Neural Networks for Resource Efficient Image Cla...
Deep convolutional neural networks have achieved remarkable success in computer vision. However, deep neural networks require large computing resources to achieve high... -
Traffic Signs dataset
The Traffic Signs dataset contains 39,252 training images in 43 classes. -
Pose-Aware Video Transformers
Human perception of surroundings is often guided by the various poses present within the environment. Many computer vision tasks, such as human action recognition and robot... -
Cap3D dataset
The Cap3D dataset is a large-scale dataset of 3D models with captions. -
Objaverse-LVIS dataset
The Objaverse-LVIS dataset contains approximately 46,000 3D models in 1,156 categories. -
ImageNet-1000
This paper uses CNNs pre-trained on ImageNet-1000. -
Attentive Normalization
The proposed Attentive Normalization (AN) aims to harness the best of feature normalization and feature attention in a single lightweight module. -
Graph Edit Distance
Graph Edit Distance formulated as a quadratic assignment problem. -
Binarized MNIST
We use the preprocessed binarized MNIST dataset from [49], which has a 50k/10k/10k split. -
MNIST and CIFAR-10 datasets
The MNIST and CIFAR-10 datasets are used to test the theory suggesting the existence of many saddle points in high-dimensional functions. -
ImageNet, ImageNet ReaL, ImageNet V2, etc.
The paper does not explicitly describe its dataset, but it mentions using various benchmarks such as ImageNet, ImageNet ReaL, ImageNet V2, etc. -
VideoAttentionTarget
VideoAttentionTarget is a video-based gaze target dataset comprising 71,666 frames from 1,331 clips. -
GazeFollow
GazeFollow is a large-scale dataset consisting of 122,143 images with 130,339 annotations on head-target instances. -
GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
Gaze target detection aims to directly associate individuals and their gaze targets within a single image or across multiple video frames. -
DINOv2: Learning robust visual features without supervision
The authors propose a method for self-supervised representation learning using knowledge distillation and vision transformers. -
Diffusion Classifier
The authors propose a method for zero-shot classification that leverages conditional density estimates from text-to-image diffusion models.