X-volution: On the Unification of Convolution and Self-attention

Convolution and self-attention are two fundamental building blocks in deep neural networks: the former extracts local image features in a linear way, while the latter non-locally encodes high-order contextual relationships.
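
To make the contrast concrete, here is a minimal pure-Python sketch on a 1D signal. It is illustrative only and is not the X-volution operator. The function names `conv1d` and `self_attention`, the scalar features, and the identity query/key/value projections are all assumptions made to keep the example short.

```python
import math


def conv1d(x, kernel):
    """Local and linear: each output is a fixed weighted sum over a small window.

    Uses zero padding so the output has the same length as the input.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]


def self_attention(x):
    """Non-local and input-dependent: each output mixes every position.

    Assumption: scalar features with identity query/key/value projections.
    The mixing weights are a softmax over products of the inputs, so the
    output is a high-order (non-linear) function of the input.
    """
    out = []
    for q in x:
        scores = [q * key for key in x]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum((e / z) * v for e, v in zip(exps, x)))
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.25, 0.5, 0.25]
far_changed = [1.0, 2.0, 3.0, 100.0]  # only the last element differs

conv_base = conv1d(signal, kernel)
conv_far = conv1d(far_changed, kernel)
attn_base = self_attention(signal)
attn_far = self_attention(far_changed)
```

In this sketch, changing the last element leaves the first convolution output untouched because it lies outside the kernel window, whereas the first attention output changes because every position contributes to it.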

BibTeX: