Structural Vision Transformer

Structural Vision Transformer (StructViT) is a vision transformer network that leverages structural self-attention (StructSA) to capture correlation structures in images and videos.

Cite this as

Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho, Manjin Kim (2024). Dataset: Structural Vision Transformer. https://doi.org/10.57702/ian4t1e5

DOI retrieved: December 2, 2024

Additional Info

Field         Value
Created       December 2, 2024
Last update   December 2, 2024
Author        Paul Hongsuck Seo
More Authors  Cordelia Schmid, Minsu Cho, Manjin Kim
Homepage      https://arxiv.org/abs/2103.15691