Abstract
Despite the revolutionary advances made by the Transformer in Neural Machine Translation (NMT), inference efficiency remains an obstacle due to the heavy use of attention operations in auto-regressive decoding. We therefore propose a lightweight attention structure, the Attention Refinement Network (ARN), to speed up the Transformer. Specifically, we design a weighted residual network that reconstructs the attention by reusing features across layers. To further improve efficiency, we merge the self-attention and cross-attention components so they can be computed in parallel. Extensive experiments on ten WMT machine translation tasks show that the proposed model achieves an average speedup of 1.35× over the state-of-the-art inference implementation, with almost no decrease in BLEU.
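The abstract gives only a high-level description of the two techniques, so the PyTorch sketch below is a minimal illustration under assumptions, not the authors' implementation: the names `RefinedAttention`, `gate`, and `merged_attention` are hypothetical. The class shows a weighted residual that reuses the previous layer's attention map, and the function shows self- and cross-attention merged into one operation over a concatenated key/value source.

```python
# A minimal sketch, assuming a learned weighted residual over attention
# maps; names (RefinedAttention, gate, merged_attention) and details are
# illustrative, not the paper's actual ARN formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinedAttention(nn.Module):
    """Single-head attention that blends this layer's attention map
    with the map reused from the previous layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned gate balancing fresh vs. reused attention (assumed form).
        self.gate = nn.Parameter(torch.tensor(0.5))
        self.scale = d_model ** -0.5

    def forward(self, x, prev_attn=None):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        if prev_attn is not None:
            # Weighted residual: reuse the previous layer's attention map.
            attn = self.gate * attn + (1.0 - self.gate) * prev_attn
        return attn @ v, attn


def merged_attention(q, x_dec, x_enc, k_proj, v_proj, scale):
    """Self- and cross-attention merged into a single operation by
    concatenating decoder states and encoder memory as the K/V source."""
    kv = torch.cat([x_dec, x_enc], dim=1)        # one shared K/V source
    k, v = k_proj(kv), v_proj(kv)
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v


# Usage: thread the attention map through a small stack of layers.
x = torch.randn(2, 7, 64)                        # (batch, length, d_model)
layers = nn.ModuleList([RefinedAttention(64) for _ in range(3)])
attn = None
for layer in layers:
    x, attn = layer(x, attn)
print(x.shape)                                   # torch.Size([2, 7, 64])

# Merged attention over decoder states plus encoder memory.
dec, enc = torch.randn(2, 5, 64), torch.randn(2, 9, 64)
kp, vp, qp = (nn.Linear(64, 64) for _ in range(3))
out = merged_attention(qp(dec), dec, enc, kp, vp, 64 ** -0.5)
print(out.shape)                                 # torch.Size([2, 5, 64])
```

Note that this toy residual still computes a fresh attention map at every layer; the reported speedup presumably comes from using the reused map to avoid or cheapen that recomputation, and from the merged operation replacing two sequential attention calls with one.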
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 5109-5118 |
| Number of pages | 10 |
| Journal | Proceedings - International Conference on Computational Linguistics, COLING |
| Volume | 29 |
| Issue number | 1 |
| Publication status | Published - 2022 |
| Externally published | Yes |
| Event | 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea, 12-17 Oct 2022 |
Bibliographical note
Publisher Copyright: © 2022 Proceedings - International Conference on Computational Linguistics, COLING. All rights reserved.