Transformers need glasses! Information over-squashing in language tasks
by Federico Barbero and 7 other authors
Abstract: We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis — specifically, we analyse the representation of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways — leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions for ameliorating these issues.
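The collapse mechanism described in the abstract can be illustrated with a toy sketch (this is an illustration of the general idea, not the paper's construction or experiments). If attention is spread roughly uniformly over a long prefix, the last-token representation behaves like an average of the token embeddings, so the gap between the representations of "1 1 ... 1" and "1 1 ... 1 0" shrinks roughly like 1/n and, for large n, is typically rounded away entirely under bfloat16. The vocabulary, embedding dimension `d`, the `last_token_repr` averaging stand-in, and the sequence lengths below are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)

d = 32
# Toy embeddings for a two-token vocabulary {"0", "1"} (illustrative assumption).
emb = torch.randn(2, d)

def last_token_repr(token_ids):
    # Stand-in for the final-layer last-token representation when attention is
    # spread near-uniformly over the sequence: a plain average of embeddings.
    return emb[token_ids].mean(dim=0)

for n in (8, 256, 8192, 262144):
    seq_a = torch.ones(n, dtype=torch.long)                        # "1 1 ... 1"
    seq_b = torch.cat([seq_a, torch.zeros(1, dtype=torch.long)])   # "1 1 ... 1 0"
    r_a, r_b = last_token_repr(seq_a), last_token_repr(seq_b)

    # The fp32 gap decays like 1/n; after casting to bfloat16 the two
    # representations typically become bitwise identical once the gap falls
    # below the bf16 rounding granularity.
    gap_fp32 = (r_a - r_b).abs().max().item()
    gap_bf16 = (r_a.bfloat16() - r_b.bfloat16()).abs().max().item()
    print(f"n={n:6d}  max gap fp32={gap_fp32:.2e}  max gap bf16={gap_bf16:.2e}")
```

Once the two representations coincide at the precision the model computes in, no downstream head can map them to different outputs, which is the sense in which counting or copying over such sequences provably fails.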
Submission history
From: Federico Barbero
[v1] Thu, 6 Jun 2024 17:14:44 UTC (3,096 KB)
[v2] Thu, 24 Oct 2024 23:12:55 UTC (3,563 KB)