Low-Resolution Self-Attention for Semantic Segmentation, by Yu-Huan Wu and 8 other authors
Abstract: Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often rely on high-resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost, i.e., FLOPs. Our approach computes self-attention in a fixed low-resolution space regardless of the input image's resolution, with additional 3×3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of LRSA by building LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models. The code is available at this https URL.
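The sketch below illustrates the LRSA idea as described in the abstract: self-attention is computed on a fixed low-resolution grid regardless of the input size, while a 3×3 depth-wise convolution handles fine details in the high-resolution space. This is a minimal PyTorch sketch, not the authors' implementation; the pooling operator, the fixed grid size (`pool_size`), the bilinear upsampling, and the residual merge are illustrative assumptions.

```python
# Minimal sketch of Low-Resolution Self-Attention (LRSA), assuming PyTorch.
# The pooled grid size, adaptive average pooling, bilinear upsampling, and
# residual merge are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LRSA(nn.Module):
    def __init__(self, dim, num_heads=8, pool_size=16):
        super().__init__()
        self.pool_size = pool_size  # fixed low-resolution grid, independent of input resolution
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 3x3 depth-wise convolution to capture fine details in the high-resolution space
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):  # x: (B, C, H, W) high-resolution features
        B, C, H, W = x.shape
        # Global context: self-attention on a fixed low-resolution grid,
        # so its cost does not grow with the input resolution.
        low = F.adaptive_avg_pool2d(x, self.pool_size)            # (B, C, P, P)
        tokens = low.flatten(2).transpose(1, 2)                   # (B, P*P, C)
        ctx, _ = self.attn(tokens, tokens, tokens)                # low-resolution self-attention
        ctx = ctx.transpose(1, 2).reshape(B, C, self.pool_size, self.pool_size)
        ctx = F.interpolate(ctx, size=(H, W), mode="bilinear", align_corners=False)
        # Local detail: depth-wise convolution applied at full resolution
        return x + ctx + self.dwconv(x)

# Usage: the output keeps the input's spatial resolution.
x = torch.randn(2, 64, 128, 128)
print(LRSA(dim=64)(x).shape)  # torch.Size([2, 64, 128, 128])
```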
Submission history
From: Yu-Huan Wu
[v1] Sun, 8 Oct 2023 06:10:09 UTC (1,851 KB)
[v2] Thu, 23 Jan 2025 04:45:10 UTC (1,964 KB)