Abstract
Grouped Query Attention (GQA) is a generalized form of multi-query attention, designed to reduce model size by letting groups of queries share key-value pairs, which in turn speeds up decoder inference. However, when dissimilar queries share a key-value pair, the disparities between them may introduce uncertainty into that pair and degrade the message propagation of the attention model. A potential solution is to cluster queries by proximity, so that each group sharing a key-value pair contains similar queries. This paper proposes a novel framework that enhances GQA by using a graph method to cluster similar queries. The framework adaptively guides the fusion of key-value pairs to reduce model parameters while preserving inference performance. Three clustering methods, namely k-means, the Hungarian algorithm, and the Blossom algorithm, are each used to instantiate the clustering-based GQA (CGQA). The approach is validated on two real-world problems, covering Natural Language Understanding and Natural Language Generation tasks, respectively. Comparative experiments demonstrate that the proposed approach outperforms existing GQA methods, reinforcing the effectiveness of the underlying attention models.
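The idea of grouping similar queries before fusing their key-value pairs can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the similarity measure or the fusion rule, so cosine similarity and mean pooling are assumptions here, and an exhaustive search over pairings stands in for a matching algorithm such as Blossom. All function names and the toy vectors are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def perfect_matchings(items):
    """Yield every way to split an even-sized list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def pair_heads(query_heads):
    """Return the pairing of head indices with maximum total similarity.

    Exhaustive search, so only viable for a handful of heads; a real
    matching algorithm would be needed at scale.
    """
    idx = list(range(len(query_heads)))
    return max(
        perfect_matchings(idx),
        key=lambda m: sum(cosine(query_heads[a], query_heads[b]) for a, b in m),
    )


def fuse(heads, groups):
    """Mean-pool the vectors of each group into one shared vector (assumed rule)."""
    dim = len(heads[0])
    return [
        [sum(heads[i][d] for i in g) / len(g) for d in range(dim)]
        for g in groups
    ]


# Hypothetical toy data: one flattened vector per attention head.
query_heads = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
key_heads = [[2.0, 0.0], [0.0, 2.0], [4.0, 0.0], [0.0, 4.0]]

groups = pair_heads(query_heads)        # similar query heads end up together
shared_keys = fuse(key_heads, groups)   # one fused key vector per group
```

In this toy case heads 0 and 2 point in nearly the same direction, as do heads 1 and 3, so they are paired and each pair ends up with a single fused key vector. Grouping by adjacent index instead would pair heads 0 and 1, whose queries are orthogonal.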
| Original language | English |
|---|---|
| Article number | 115311 |
| Journal | Knowledge-Based Systems |
| Volume | 336 |
| Publication status | Accepted/In press - 09 Jan 2026 |
Keywords
- Artificial neural networks
- Grouped query attention
- Large language model
- Model compression
- Query clustering
Fingerprint
Dive into the research topics of 'Grouped query attention supported with graph-based query clustering'. Together they form a unique fingerprint.