The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
Seungwoo Son, Jegwang Ryu, Namhoon Lee, Jaeho Lee
Abstract: Knowledge distillation is an effective method for training lightweight vision models. However, acquiring teacher supervision for training samples is often costly, especially from large-scale models like vision transformers (ViTs). In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation: masking out a fraction of input tokens given to the teacher. By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture. We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy, while other masking criteria lead to suboptimal efficiency gains. Through in-depth analyses, we reveal that the student-guided masking provides a good curriculum to the student, making teacher supervision easier to follow during the early stage and more challenging in the later stage.
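The core idea can be sketched in a few lines of PyTorch-style pseudocode. This is a minimal illustration, not the authors' released implementation: the student interface returning (logits, per-patch attention scores), the teacher interface accepting a set of kept patch indices, the keep_ratio of 0.5, and the loss weighting are all assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F


def select_patches_by_student_attention(student_attn, keep_ratio=0.5):
    """Return indices of the patches with the highest student attention scores.

    student_attn: (B, N) per-patch score from the student (e.g. CLS-to-patch
                  attention averaged over heads); higher = more salient.
    """
    num_keep = max(1, int(student_attn.shape[1] * keep_ratio))
    return student_attn.topk(num_keep, dim=1).indices  # (B, num_keep)


def distillation_step(student, teacher, images, labels, tau=2.0, alpha=0.5):
    """One training step: the student sees the full image, while the teacher
    processes only the kept (high-student-attention) patches, so its FLOPs
    scale roughly with keep_ratio."""
    student_logits, student_attn = student(images)  # hypothetical student API
    with torch.no_grad():
        keep_idx = select_patches_by_student_attention(student_attn, keep_ratio=0.5)
        # Hypothetical teacher API: embeds and attends over the kept patches only,
        # skipping all computation for the masked-out tokens.
        teacher_logits = teacher(images, patch_indices=keep_idx)
    # Standard soft-label distillation loss plus cross-entropy on ground truth.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

Because the masking is driven by the student's own attention, the teacher's view tracks what the student currently finds salient, which is what yields the curriculum effect described above.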
Submission history
From: Seungwoo Son
[v1] Tue, 21 Feb 2023 07:48:34 UTC (5,837 KB)
[v2] Wed, 31 May 2023 04:50:46 UTC (3,084 KB)
[v3] Mon, 15 Jul 2024 06:37:04 UTC (8,169 KB)
[v4] Fri, 27 Sep 2024 14:50:23 UTC (8,169 KB)