Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer

The Vision Transformer (ViT) has gained prominence for its superior relational modeling prowess. However, the quadratic complexity of its global attention mechanism imposes a substantial computational burden. A common remedy spatially groups tokens for self-attention, reducing computational requirements. Nonetheless, this strategy neglects the semantic information in tokens, possibly scattering semantically linked tokens across distinct groups and thus compromising the efficacy of self-attention intended for modeling inter-token dependencies. Motivated by these insights, we introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
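The abstract does not spell out the clustering procedure, but the description ("clusters tokens based on their global semantic relevance" into balanced groups) suggests the following sketch: score each token by its similarity to a global semantic anchor (here assumed to be the mean token), rank tokens by that score, and split the ranked sequence into equal-size clusters so that attention can run within each cluster. The anchor choice and function names below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def semantic_equitable_clustering(x, num_clusters):
    """Sketch of SEC-style balanced clustering.

    x            : (N, C) array of token features.
    num_clusters : number of equal-size groups; must divide N.

    Returns (clusters, order) where clusters is (num_clusters, N//num_clusters, C)
    and order is the semantic ranking permutation of the tokens.
    """
    N, C = x.shape
    assert N % num_clusters == 0, "balanced clustering needs N divisible by k"

    # Assumed global anchor: the mean token summarizes the whole sequence.
    anchor = x.mean(axis=0)

    # Cosine similarity of every token to the anchor = "global semantic relevance".
    scores = (x @ anchor) / (np.linalg.norm(x, axis=1) * np.linalg.norm(anchor) + 1e-8)

    # Rank tokens by relevance, then chunk the ranked sequence into
    # equal-size clusters -- semantically similar tokens land together.
    order = np.argsort(-scores)
    clusters = x[order].reshape(num_clusters, N // num_clusters, C)
    return clusters, order

# Example: 16 tokens of dimension 8 split into 4 clusters of 4 tokens each.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
clusters, order = semantic_equitable_clustering(tokens, 4)
```

Running self-attention within each of the `k` equal-size clusters instead of over all `N` tokens reduces attention cost from O(N^2) to O(N^2 / k), which is the computational saving the abstract refers to; sorting costs only O(N log N).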

Data and Resources

Cite this as

Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He (2024). Dataset: Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer. https://doi.org/10.57702/yharvnmc

DOI retrieved: December 16, 2024

Additional Info

Created: December 16, 2024
Last update: December 16, 2024
Author: Qihang Fan
More Authors: Huaibo Huang, Mingrui Chen, Ran He
Homepage: https://github.com/qhfan/SecViT