[Submitted on 24 Dec 2024]
View a PDF of the paper titled TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models, by Pooyan Rahmanzadehgervi and Hung Huy Nguyen and Rosanne Liu and Long Mai and Anh Totti Nguyen
Abstract: Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to $\in [0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network and the vision-language model (VLM) defaults to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Across three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is superior at localizing changes and at identifying when no changes occur. TAB is the first architecture that allows users to intervene by editing attention, which often produces the expected VLM outputs.
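The abstract does not specify how the [0, 1] constraint on total patch attention is enforced; the sketch below is a minimal, hypothetical illustration of one way such a bottleneck could work, using a learnable "null" key whose softmax mass is discarded so that the remaining attention over patches sums to at most 1. The class name, signatures, and the null-token mechanism are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of a 1-head attention bottleneck (not the paper's code).
# A learnable "null" key competes with the image patches in the softmax; the
# mass assigned to it is dropped, so total patch attention lies in [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionBottleneck(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Learnable null key: attending to it means "ignore the image".
        self.null_k = nn.Parameter(torch.zeros(1, 1, dim))
        self.scale = dim ** -0.5

    def forward(self, query_tokens, patch_tokens):
        # query_tokens: (B, Tq, D), patch_tokens: (B, Tp, D)
        q = self.q(query_tokens)
        k = self.k(patch_tokens)
        v = self.v(patch_tokens)

        null_k = self.null_k.expand(q.shape[0], -1, -1)     # (B, 1, D)
        k_ext = torch.cat([null_k, k], dim=1)                # (B, 1+Tp, D)

        attn = F.softmax(q @ k_ext.transpose(1, 2) * self.scale, dim=-1)
        patch_attn = attn[..., 1:]                           # drop the null slot
        # patch_attn.sum(-1) is in [0, 1]; at 0 no visual information passes,
        # and a user can edit patch_attn directly to intervene.
        out = patch_attn @ v                                  # (B, Tq, D)
        return out, patch_attn
```

Under this assumed design, zeroing out patch_attn (or steering it toward specific patches) is what would let a user intervene and debug the model's visual grounding, matching the behavior the abstract describes.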
Submission history
From: Pooyan Rahmanzadehgervi
[v1]
Tue, 24 Dec 2024 20:28:07 UTC (23,674 KB)