Neighborhood Attention Transformer

Hassani, Ali; Walton, Steven; Li, Jiachen; Li, Shen; Shi, Humphrey

arXiv:2204.07143 (cs)

[Submitted on 14 Apr 2022 (v1), last revised 16 May 2023 (this version, v5)]

Abstract: We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation that localizes self attention (SA) to each pixel's nearest neighbors, and therefore has linear time and space complexity, compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive: NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K, improvements of 1.9% in ImageNet accuracy, 1.0% in COCO mAP, and 2.6% in ADE20K mIoU over a Swin model of similar size. To support more research based on sliding-window attention, we open source our project and release our checkpoints at: this https URL.
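
To make the idea of localizing attention to a fixed-size neighborhood concrete, the following is a minimal single-head sketch in plain Python. It is not NATTEN's API or the authors' implementation: the function name `neighborhood_attention_1d` and its parameters are hypothetical, it works on a 1D sequence rather than a 2D feature map, it omits learned projections and positional biases, and it assumes that the window is clamped at the borders so that every token still attends to exactly `kernel_size` neighbors.

```python
import math


def neighborhood_attention_1d(q, k, v, kernel_size):
    """Illustrative 1D neighborhood attention (hypothetical helper, not NATTEN).

    q, k, v: lists of n vectors (lists of floats), each of dimension d.
    kernel_size: odd neighborhood size, at most n.
    """
    n, d = len(q), len(q[0])
    assert kernel_size % 2 == 1 and kernel_size <= n
    scale = 1.0 / math.sqrt(d)
    half = kernel_size // 2
    out = []
    for i in range(n):
        # Assumption: clamp the window at the borders so each token
        # always sees exactly kernel_size neighbors (no zero padding).
        start = min(max(i - half, 0), n - kernel_size)
        idx = range(start, start + kernel_size)
        # Scaled dot-product scores against the neighborhood only.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        # Numerically stable softmax over the neighborhood.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted sum of the neighboring value vectors.
        out.append(
            [sum(w * v[j][c] for w, j in zip(weights, idx)) / total for c in range(d)]
        )
    return out
```

Each token does work proportional to `kernel_size * d`, so for a fixed `kernel_size` the total cost grows linearly with the number of tokens. When `kernel_size` equals the sequence length, every window covers the whole sequence and the sketch reduces to ordinary self attention.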

Comments:	To appear in CVPR 2023. NATTEN is open-sourced at: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2204.07143 [cs.CV]
	(or arXiv:2204.07143v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.07143

Submission history

From: Ali Hassani
[v1] Thu, 14 Apr 2022 17:55:15 UTC (2,641 KB)
[v2] Sat, 9 Jul 2022 23:38:38 UTC (3,159 KB)
[v3] Mon, 7 Nov 2022 18:57:32 UTC (3,220 KB)
[v4] Thu, 10 Nov 2022 18:55:49 UTC (3,243 KB)
[v5] Tue, 16 May 2023 21:26:30 UTC (3,221 KB)
