Multi-matrix Factorization Attention
Jingcheng Hu and 7 other authors
Abstract: We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including state-of-the-art methods such as MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value through value-projection re-parameterization. MFA's design delivers strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits with only a minor performance trade-off. Notably, in our extensive, large-scale experiments, MFA and MFA-KR outperform MLA and perform comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
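To make the mechanism sketched in the abstract concrete, here is a minimal, hedged sketch of an MFA-style attention layer in PyTorch. The abstract only states that MFA factorizes the QK circuit with low-rank matrices and keeps the KV cache small; the specific scheme below (a shared low-rank query down-projection with a per-head up-projection, plus a single shared key/value head in the style of multi-query attention, and all dimensions such as `q_rank`) is an illustrative assumption, not the paper's verified design.

```python
# Hedged sketch of an MFA-style attention layer (assumed structure, see note above).
import math
import torch
import torch.nn as nn


class MFASketch(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, head_dim=128, q_rank=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Low-rank factorization of the query projection in the QK circuit:
        # d_model -> q_rank -> n_heads * head_dim.
        self.q_down = nn.Linear(d_model, q_rank, bias=False)
        self.q_up = nn.Linear(q_rank, n_heads * head_dim, bias=False)
        # A single shared key and value head keeps the KV cache at
        # 2 * head_dim values per token (assumption, in the spirit of MQA).
        self.k_proj = nn.Linear(d_model, head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        # Factorized queries, many heads: (B, n_heads, T, head_dim).
        q = self.q_up(self.q_down(x)).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Shared K/V head broadcast across all query heads; this is what would be cached.
        k = self.k_proj(x).view(B, T, 1, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, 1, self.head_dim).transpose(1, 2)
        # Plain scaled dot-product attention; causal masking omitted for brevity.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)


# Example: a single forward pass with random inputs.
layer = MFASketch()
y = layer(torch.randn(2, 8, 1024))
print(y.shape)  # torch.Size([2, 8, 1024])
```

Per the abstract, an MFA-KR variant would additionally reuse the cached key as the value through a re-parameterized value projection, cutting the cache roughly in half again; that re-parameterization is omitted here because its exact form is not given in the abstract.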
Submission history
From: Jingcheng Hu
[v1] Thu, 26 Dec 2024 15:45:45 UTC (7,174 KB)
[v2] Tue, 14 Jan 2025 05:48:07 UTC (7,174 KB)