PAPER DIGEST
Most Influential EMNLP 2023 Paper · 2026-03 edition

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai

Venue
Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023
Recognition
Most Influential EMNLP 2023 Paper (Rank No. 3)
Edition
2026-03
Impact factor
8
Certificate ID
135f03a4b4c89f07

Abstract

Multi-query attention (MQA), which uses only a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of the original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention that uses an intermediate number of key-value heads (more than one, but fewer than the number of query heads). We show that uptrained GQA achieves quality close to multi-head attention with speed comparable to MQA.
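The head grouping described in the abstract can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the authors' code. The helper names (`attend`, `grouped_query_attention`, `mean_pool_heads`) are hypothetical, attention is unmasked and unbatched, and query heads are assigned to key-value heads in contiguous blocks. With H query heads and G key-value heads, G = H recovers multi-head attention and G = 1 recovers MQA. `mean_pool_heads` shows one way to build the G shared heads from a multi-head checkpoint: it averages the per-head key (or value) projection matrices within each group, which is the conversion the paper reports using before uptraining. The uptraining itself (continued pre-training on about 5% of the original compute) is not shown.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # one row per token position (or per weight row)


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Unmasked scaled dot-product attention for a single head."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out


def grouped_query_attention(queries: List[Matrix], keys: List[Matrix],
                            values: List[Matrix]) -> List[Matrix]:
    """H query heads share G key-value heads in contiguous groups of H // G.

    G == H gives multi-head attention; G == 1 gives multi-query attention.
    """
    num_q, num_kv = len(queries), len(keys)
    if num_kv == 0 or len(values) != num_kv or num_q % num_kv != 0:
        raise ValueError("key/value head counts must match and divide the query head count")
    group_size = num_q // num_kv
    # Query head h reads the key-value head of its group, h // group_size.
    return [attend(q, keys[h // group_size], values[h // group_size])
            for h, q in enumerate(queries)]


def mean_pool_heads(heads: List[Matrix], num_groups: int) -> List[Matrix]:
    """Average consecutive per-head matrices (e.g. key or value projection
    weights from a multi-head checkpoint) into num_groups shared heads."""
    if num_groups <= 0 or len(heads) % num_groups != 0:
        raise ValueError("number of heads must be a multiple of num_groups")
    size = len(heads) // num_groups
    pooled = []
    for g in range(num_groups):
        group = heads[g * size:(g + 1) * size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([[sum(m[r][c] for m in group) / size for c in range(cols)]
                       for r in range(rows)])
    return pooled


if __name__ == "__main__":
    # Toy sizes (hypothetical): 8 query heads, 2 key-value heads, head dim 2.
    q_heads = [[[float(h), 1.0]] for h in range(8)]
    kv_heads = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
    print(len(grouped_query_attention(q_heads, kv_heads, kv_heads)))  # 8

    # Collapse 8 per-head key projections into 2 grouped ones by mean pooling.
    mha_key_proj = [[[float(h), 1.0], [0.0, 1.0]] for h in range(8)]
    gqa_key_proj = mean_pool_heads(mha_key_proj, 2)
    print(len(gqa_key_proj), gqa_key_proj[0][0])  # 2 [1.5, 1.0]
```

In this layout a decoder caches only G key and value tensors per layer instead of H, which is the usual explanation for the inference speedup; choosing G between 1 and H trades that cache size against the quality of full multi-head attention.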
