
Towards Efficient and Effective Inference for Large-Scale Models

  • Xijie HUANG

Student thesis: Doctoral thesis

Abstract

While large-scale deep learning models have achieved remarkable success in natural language and vision tasks, their growing computational demands and model sizes necessitate efficient inference, particularly on edge devices with memory-bandwidth constraints. To address this efficiency bottleneck, model compression techniques, such as quantization, pruning, knowledge distillation, and low-rank decomposition, have been extensively studied in the research community and widely adopted in various AI applications.

In this thesis, we begin with an introduction to the principles of model compression and inference acceleration for large-scale models, and a discussion of the associated challenges, covering Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), Large Language Models (LLMs), and Diffusion Models (DMs). In the following chapters, we present novel methods to enhance the efficiency and effectiveness of inference across these architectures.

First, we focus on the inference efficiency of CNNs and propose Stochastic Differentiable Quantization (SDQ). In our SDQ framework, the optimal mixed-precision strategy is learned by optimizing differentiable bitwidth parameters during stochastic quantization. Second, we turn to the challenges in ViTs’ inference efficiency. We propose an effective Variation-aware ViT Quantization (VVTQ), which includes module-dependent quantization and scaling, variation-aware knowledge distillation, and oscillation-aware bin regularization. Third, we improve the inference efficiency of LLMs by solving the activation outlier problem. We propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization. RoLoRA utilizes rotation for outlier elimination and proposes rotation-aware fine-tuning to preserve the outlier-free characteristics of rotated LLMs. Fourth, we improve both the reasoning efficiency and effectiveness of LLMs using a coarse-to-fine prompt pruner, named CoT-Influx. The CoT-Influx pruner first selects important Chain-of-Thought (CoT) candidates and then prunes uninformative tokens to fit the context window. Lastly, we build an efficient text-to-image (T2I) diffusion model, SnapGen, that generates high-resolution and high-quality images on mobile platforms. A cross-architecture knowledge distillation scheme is proposed to guide the training of SnapGen, and we also enable fewer-step generation by integrating adversarial distillation.
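The rotation idea behind RoLoRA can be illustrated in a few lines: because an orthogonal matrix R satisfies W x = (W R)(Rᵀ x), the rotation can be folded into the weights without changing the layer's output, while spreading any single-channel activation outlier across all channels and narrowing the dynamic range seen by the activation quantizer. The sketch below is a simplified, hypothetical illustration of this invariance using a normalized Hadamard rotation (not the thesis's actual implementation); all variable names are illustrative.

```python
import numpy as np

def hadamard(n):
    """Build a normalized (orthogonal) Hadamard matrix of size n (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)
x[3] = 50.0                     # inject an activation outlier in one channel
W = rng.normal(size=(d, d))

R = hadamard(d)                 # orthogonal: R @ R.T == I
x_rot = R.T @ x                 # rotate the activations
W_rot = W @ R                   # fold the same rotation into the weights

# The layer output is unchanged, but the peak activation magnitude shrinks,
# since the outlier's mass is spread across all d channels.
same_output = np.allclose(W @ x, W_rot @ x_rot)
outlier_reduced = np.abs(x_rot).max() < np.abs(x).max()
```

Quantizing `x_rot` instead of `x` then wastes far fewer quantization levels on a single extreme channel, which is the intuition behind rotation-based outlier elimination.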

Date of Award: 2025
Original language: English
Awarding Institution:
  • The Hong Kong University of Science and Technology
Supervisor: Kwang-Ting Tim CHENG
