Fast Gradient Computation for RoPE Attention in Almost Linear Time



By Yifang Chen and 5 other authors

Abstract: The Rotary Position Embedding (RoPE) mechanism has become a powerful enhancement to the Transformer architecture, enabling models to capture token relationships while encoding positional information. However, RoPE makes the attention computation more involved, which makes designing efficient algorithms challenging. Earlier research gave almost linear time algorithms, i.e., $n^{1+o(1)}$ where $n$ is the number of input tokens, for the forward computation under specific parameter settings; for other parameter regimes, however, a subquadratic time algorithm is impossible unless the widely accepted Strong Exponential Time Hypothesis (SETH) is disproven. In this work, we develop the first almost linear time algorithm for the backward computation of RoPE-based attention under bounded entries. Our approach builds on recent advances in fast RoPE attention computation, using a novel combination of the polynomial method and the Fast Fourier Transform. Furthermore, using lower bounds derived from SETH, we show that the bounded-entry condition is necessary for subquadratic performance.
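To make the setting concrete, below is a minimal NumPy sketch of the standard RoPE attention forward pass that the abstract refers to: queries and keys are rotated by position-dependent angles before the usual softmax attention, and materializing the $n \times n$ score matrix is the quadratic baseline that the paper's forward and backward algorithms avoid. The function names, rotation base, and shapes are illustrative assumptions, not taken from the paper, and the paper's fast algorithm itself (polynomial method plus FFT) is not reproduced here.

import numpy as np

def rope_rotate(x, base=10000.0):
    # Apply Rotary Position Embedding to x of shape (n, d), with d even:
    # each consecutive pair of channels is rotated by a position-dependent angle.
    n, d = x.shape
    pos = np.arange(n)[:, None]                      # positions, shape (n, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)        # per-pair frequencies, shape (d/2,)
    angles = pos * freqs                             # rotation angles, shape (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_attention(Q, K, V):
    # Naive O(n^2) RoPE attention forward pass: rotate Q and K, then softmax attention.
    Qr, Kr = rope_rotate(Q), rope_rotate(K)
    d = Q.shape[1]
    scores = Qr @ Kr.T / np.sqrt(d)                  # (n, n) score matrix -- quadratic in n
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

For example, with n = 512 tokens and head dimension d = 64, inputs Q, K, V of shape (512, 64) yield an output of the same shape, while the intermediate score matrix is 512 x 512; this quadratic intermediate is precisely what the almost linear time forward and backward algorithms discussed in the abstract sidestep under the bounded-entry assumption.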

Submission history

From: Zhenmei Shi
[v1] Mon, 23 Dec 2024 06:20:22 UTC (24 KB)
[v2] Tue, 31 Dec 2024 06:53:40 UTC (26 KB)


