- TerraIncognita
  The TerraIncognita dataset consists of 24,778 samples of wild animals photographed by camera traps at four locations (L100, L38, L43, L46), which serve as the four domains.
- ShiftAddViT: Towards Efficient Vision Transformers
  ShiftAddViT: a hardware-inspired, multiplication-reduced Vision Transformer model.
- TPC-ViT: Token Propagation Controller for Efficient Vision Transformers
  Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks; however, their quadratic complexity in the number of input tokens has limited...
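  A minimal sketch of the quadratic-complexity claim above (assuming PyTorch; illustrative only, not code from the TPC-ViT paper): the attention score matrix holds one entry per token pair, so quadrupling the token count multiplies its size by sixteen.

  ```python
  import torch

  def attention_map_size(num_tokens: int, head_dim: int = 64) -> int:
      # Toy single-head attention scores; shapes are illustrative assumptions.
      q = torch.randn(num_tokens, head_dim)  # queries
      k = torch.randn(num_tokens, head_dim)  # keys
      scores = q @ k.T                       # (num_tokens, num_tokens) score matrix
      return scores.numel()

  for n in (196, 784):                       # 14x14 vs. 28x28 patch grids
      print(n, attention_map_size(n))        # 38416 vs. 614656: 4x tokens -> 16x entries
  ```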
- Vision Transformers increase efficiency of 3D cardiac CT multi-label segmentation
  Two cardiac computed tomography (CT) datasets, one consisting of 760 volumes across the whole cardiac cycle from 39 patients and one of 60 volumes from 60 patients, were...
- Data Level Lottery Ticket Hypothesis for Vision Transformers
  The conventional lottery ticket hypothesis (LTH) claims that there exists a sparse subnetwork within a dense neural network and a proper random initialization method called the...
- From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting
  Time series forecasting using spectrograms and vision transformers.
- Patch Similarity
  Patch Similarity dataset.
- AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors
  Facial Action Units (AUs) are a vital concept in affective computing, and AU detection has long been a hot research topic.
- Vision Big Bird
  Vision Big Bird: random sparsification for full attention.
- Low-Resolution Self-Attention for Semantic Segmentation
  Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction.
- PartImageNet
  The PartImageNet dataset is a large-scale part segmentation dataset containing approximately 24,000 images across 158 classes from ImageNet.
- Position Embedding Needs an Independent Layer Normalization
  The paper does not describe a specific dataset; instead, the authors analyze the input and output of each encoder layer in Vision Transformers (VTs)...
- MOCA: Masked Online Codebook Assignments prediction
  Self-supervised representation learning for Vision Transformers (ViTs) that reduces their reliance on very large, fully annotated datasets.