-
ImageNet and YouTube-8M
The paper does not explicitly describe the dataset used, but it mentions that the authors used datasets such as ImageNet and YouTube-8M.
-
Structural Vision Transformer
Structural Vision Transformer (StructViT) is a vision transformer network that leverages structural self-attention (StructSA) to capture correlation structures in images and...