Paper
3 January 2025 SHA: sparse adaptive head of attention
Haoyang Yuan
Proceedings Volume 13442, Fifth International Conference on Signal Processing and Computer Science (SPCS 2024); 134421M (2025) https://doi.org/10.1117/12.3053346
Event: Fifth International Conference on Signal Processing and Computer Science (SPCS 2024), 2024, Kaifeng, China
Abstract
As large language models grow deeper and wider, the limitations that key-value (KV) caches impose on LLM inference have become increasingly prominent. This study shows how different input samples change the correlations among attention heads during the computation of transformer attention. By quantifying this context-dependent correlation and dynamically assigning a weight to each attention head accordingly, we propose SHA. SHA identifies and prunes heads that contribute minimally to overall performance, accelerating inference with negligible loss in quality. Experimental results demonstrate that on Llama-7B we remove 30% of the attention heads, reduce KV cache memory requirements by 24.2%, and achieve a throughput improvement of up to 2.04 times.
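The abstract describes scoring attention heads by their context-dependent correlation and pruning the least useful ones. The sketch below is only an illustration of that general idea, not the paper's method: the scoring rule (treating a head's average correlation with the other heads' attention maps as redundancy), the function names, and the keep_ratio parameter are all assumptions made here for the example.

```python
import torch

def head_importance_from_attention(attn_weights):
    """Score attention heads for one input sample.

    attn_weights: tensor of shape (num_heads, seq_len, seq_len) holding the
    softmax attention maps of a single transformer layer. The rule below
    (low importance for heads that are highly correlated with, i.e. redundant
    to, the other heads) is an illustrative choice, not necessarily SHA's
    exact criterion.
    """
    num_heads = attn_weights.shape[0]
    flat = attn_weights.reshape(num_heads, -1)            # one row per head
    flat = flat - flat.mean(dim=1, keepdim=True)
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-8)
    corr = flat @ flat.T                                  # head-to-head correlation
    # A head that correlates strongly with the others carries little unique
    # information, so it receives a low importance weight.
    redundancy = (corr.sum(dim=1) - 1.0) / (num_heads - 1)
    return 1.0 - redundancy

def select_heads(importance, keep_ratio=0.7):
    """Keep the top keep_ratio fraction of heads (70% here, matching the 30%
    removal reported in the abstract) and return their indices."""
    k = max(1, int(round(keep_ratio * importance.numel())))
    return torch.topk(importance, k).indices.sort().values

# Example: 32 heads, sequence length 128, random attention maps.
attn = torch.softmax(torch.randn(32, 128, 128), dim=-1)
imp = head_importance_from_attention(attn)
kept = select_heads(imp, keep_ratio=0.7)
print(f"Keeping {kept.numel()} of {imp.numel()} heads:", kept.tolist())
```

In practice such scores would be computed per input (the abstract emphasizes that the correlations are context-dependent), and removing a head lets its key and value projections be skipped, which is where the reported KV cache savings come from.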
(2025) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Haoyang Yuan "SHA: sparse adaptive head of attention", Proc. SPIE 13442, Fifth International Conference on Signal Processing and Computer Science (SPCS 2024), 134421M (3 January 2025); https://doi.org/10.1117/12.3053346
KEYWORDS
Matrices, Mathematical optimization, Transformers, Singular value decomposition, Neural networks