Cross-Axis Transformer with 3D Rotary Positional Embeddings
by Lily Erickson
Abstract: Despite lagging behind their modal cousins in many respects, Vision Transformers have provided an interesting opportunity to bridge the gap between sequence modeling and image modeling. Until now, however, vision transformers have largely been held back by both computational inefficiency and a lack of proper handling of spatial dimensions. In this paper, we introduce the Cross-Axis Transformer. CAT is a model inspired by both Axial Transformers and Microsoft's recent Retentive Network that drastically reduces the number of floating point operations required to process an image, while simultaneously converging faster and more accurately than the Vision Transformers it replaces.
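The title names 3D rotary positional embeddings. For orientation, below is a minimal sketch of one common way to extend rotary embeddings (RoPE) to three spatial axes: the head dimension is split into one group per axis, and each group is rotated by angles derived from that axis's coordinate. The function names (`rope_angles`, `apply_rotation`, `rope_3d`), the even three-way split, and the PyTorch framing are illustrative assumptions, not taken from the paper, whose exact formulation may differ.

```python
# Hedged sketch: a common way to extend rotary positional embeddings (RoPE)
# to three axes (depth, height, width). Not the paper's exact formulation.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE frequencies for one axis; `dim` must be even."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    return positions.float()[:, None] * freqs[None, :]                     # (N, dim/2)

def apply_rotation(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def rope_3d(q: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """
    q:      (num_tokens, head_dim) queries or keys for one attention head
    coords: (num_tokens, 3) integer (z, y, x) position of each token
    The head dimension is split into three equal, even-sized groups, one per axis.
    """
    head_dim = q.shape[-1]
    group = head_dim // 3
    assert group % 2 == 0, "each axis group needs an even size"
    out = []
    for axis in range(3):
        chunk = q[..., axis * group:(axis + 1) * group]
        angles = rope_angles(coords[:, axis], group)
        out.append(apply_rotation(chunk, angles))
    return torch.cat(out, dim=-1)

# Example: a 4x4x4 voxel grid of tokens with a 48-dim head.
z, y, x = torch.meshgrid(torch.arange(4), torch.arange(4), torch.arange(4), indexing="ij")
coords = torch.stack((z, y, x), dim=-1).reshape(-1, 3)   # (64, 3)
q = torch.randn(64, 48)
print(rope_3d(q, coords).shape)  # torch.Size([64, 48])
```

Because rotations are applied per axis, relative positional information along each spatial dimension is preserved in the query-key dot products, which is the property that makes RoPE attractive for multi-axis attention schemes like the one the abstract describes.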
Submission history
From: Lily Erickson
[v1] Mon, 13 Nov 2023 09:19:14 UTC (180 KB)
[v2] Wed, 29 Nov 2023 17:01:00 UTC (181 KB)
[v3] Mon, 16 Dec 2024 21:43:18 UTC (181 KB)