In this thesis, we introduce Greenformers, a collection of methods for improving the efficiency of the recently renowned transformer models through low-rank approximation. The development trend of deep learning tends to result in increasingly complex and large models. Although this leads to better and more accurate predictions, the resulting models become ever more costly, requiring weeks of training with a huge amount of GPU resources. In particular, the size and computational cost of transformer-based models have grown tremendously since their debut in 2017, from ~100 million parameters to ~1.6 trillion parameters in early 2021. Such computationally hungry models also incur a substantial cost to the environment, reaching an alarming level of carbon footprint. Some of these models are so massive that they cannot even be run without a GPU cluster.

Greenformers improves the efficiency of transformer models by applying low-rank approximation approaches. Specifically, we propose a low-rank factorization approach called the Low-Rank Transformer (LRT), and we compare it with an existing low-rank approximation approach called the Linformer. Based on our analysis, the Low-Rank Transformer is suitable for improving both time and memory efficiency when processing short-sequence input data (≤ 512), while the Linformer is suitable for improving efficiency when processing long-sequence input data (≥ 512). We also show that the Low-Rank Transformer is more suitable for on-device deployment, as it significantly reduces the model size. Additionally, we estimate that applying LRT to the existing BERT-BASE model can reduce the computational, economical, and environmental costs of developing such models by more than 30% of the original costs.

Our Low-Rank Transformer significantly reduces computation time and memory usage on the speech recognition task. Specifically, it halves the model size and increases the speed by up to 1.35x on GPU and 1.25x on CPU while maintaining performance comparable to the original transformer model. Our findings suggest that transformer models tend to be over-parameterized, and our Low-Rank Transformer helps mitigate this over-parameterization, yielding a more efficient model with better generalization.

Additionally, we extend the low-rank approximation approach to a genomics study for Alzheimer's disease risk prediction. We apply sequence modeling techniques with the Linformer model to predict Alzheimer's disease in a Chinese cohort, formulating the task as a long-sequence classification problem with sequences of varying lengths up to ~33,000 nucleotides. Our results show that Linformer models with subword tokenization can process very long sequence data and boost the evaluation performance by up to ~5% AUC compared to the existing FDA-approved risk scoring model and other deep learning variants. Based on our analysis, we further conclude that the choice of tokenization approach can provide computation and memory efficiency gains as large as those from the efficient model approach, which makes the choice of tokenization a prominent consideration when developing a more efficient transformer model.
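To illustrate the core idea behind the low-rank factorization used in the Low-Rank Transformer, the sketch below shows a minimal, hypothetical PyTorch module (the class name `LowRankLinear` and the rank hyperparameter are assumptions, not the thesis implementation). It approximates a dense linear layer with weight W of size d_out × d_in by two thin matrices of rank r, reducing the parameter count and multiply-accumulate operations from d_out · d_in to r · (d_in + d_out).

```python
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Hypothetical sketch of a low-rank factorized linear layer.

    A dense weight W (d_out x d_in) is approximated as U @ V, where
    U is (d_out x r) and V is (r x d_in). For r << min(d_in, d_out),
    this cuts parameters from d_out * d_in to r * (d_in + d_out).
    """

    def __init__(self, d_in: int, d_out: int, rank: int, bias: bool = True):
        super().__init__()
        # Two thin projections replace the single dense projection.
        self.V = nn.Linear(d_in, rank, bias=False)   # compress input to rank-r space
        self.U = nn.Linear(rank, d_out, bias=bias)   # expand back to d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.U(self.V(x))


# Example: replacing a 512 -> 512 projection with rank 64 keeps roughly 25%
# of the original parameter count (2 * 64 * 512 vs. 512 * 512 weights).
layer = LowRankLinear(d_in=512, d_out=512, rank=64)
x = torch.randn(8, 100, 512)      # (batch, sequence length, hidden size)
y = layer(x)
print(y.shape)                    # torch.Size([8, 100, 512])
```

In a transformer, a factorization of this kind would typically be applied to the projection matrices inside the attention and feed-forward blocks; the rank r trades off compression (and speed) against approximation quality.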
| Date of Award | 2021 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
| Supervisor | Pascale Ngan FUNG (Supervisor) |
Greenformers: improving computation and memory efficiency in transformer models via low-rank approximation
CAHYAWIJAYA, S. (Author). 2021
Student thesis: Master's thesis