Dissecting Query-Key Interaction in Vision Transformers

arXiv:2405.14880v1 Announce Type: new
Abstract: Self-attention in vision transformers has been thought to perform perceptual grouping, where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features in an image. However, contextualization is also an important and necessary computation for processing signals. Contextualization potentially requires tokens to attend to dissimilar tokens, such as those corresponding to backgrounds or different objects, but this effect has not been reported in previous studies. In this study, we investigate whether self-attention in vision transformers exhibits a preference for attending to similar tokens or dissimilar tokens, providing evidence of perceptual grouping and contextualization, respectively. To study this question, we propose the use of singular value decomposition on the query-key matrix $\textbf{W}_q^T\textbf{W}_k$. Naturally, the left and right singular vectors are feature directions of the self-attention layer and can be analyzed in pairs to interpret the interaction between tokens. We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens. Moreover, many of these interactions between features represented by singular vectors are interpretable. We present a novel perspective on interpreting the attention mechanism, which may contribute to understanding how transformer models utilize context and salient features when processing images.




