Delta Decompression for MoE-based LLMs Compression

Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee*, Shengjie Sun, Wei Xue, Yike Guo*

*Corresponding author for this work

Research output: Contribution to journal › Conference article published in journal › peer-review

Abstract

Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present D2-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our D2-MoE successfully compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13% performance gains over other compressors on Mixtral, Phi-3.5, and DeepSeek/Qwen2 MoE LLMs at 40-60% compression rates. Code is available at https://github.com/lliai/D2MoE.
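The two core steps described in the abstract, merging expert weights into a Fisher-weighted shared base and compressing each expert's residual (delta) with a truncated SVD, can be illustrated with a minimal NumPy sketch. This is a toy illustration of the general idea, not the authors' implementation: the per-expert scalar Fisher scores, matrix sizes, and chosen rank below are illustrative assumptions, and the semi-dynamical pruning step is omitted.

```python
import numpy as np

def fisher_weighted_base(expert_weights, fisher_scores):
    # Merge expert weights into one shared base weight, weighting each
    # expert by its normalized Fisher score (illustrative scalar per expert).
    scores = np.asarray(fisher_scores, dtype=float)
    scores = scores / scores.sum()
    return sum(s * w for s, w in zip(scores, expert_weights))

def svd_compress_delta(delta, rank):
    # Rank-r approximation of an expert's delta weight via truncated SVD;
    # store only the two small factors instead of the full delta matrix.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

# Toy setup: 4 experts that are small perturbations of a common weight.
rng = np.random.default_rng(0)
common = rng.normal(size=(8, 8))
experts = [common + 0.01 * rng.normal(size=(8, 8)) for _ in range(4)]
fisher = [1.0, 2.0, 0.5, 1.5]  # hypothetical importance scores

base = fisher_weighted_base(experts, fisher)
factors = [svd_compress_delta(w - base, rank=2) for w in experts]
recon = [base + a @ b for a, b in factors]
err = max(np.linalg.norm(r - w) / np.linalg.norm(w)
          for r, w in zip(recon, experts))
print(f"max relative reconstruction error: {err:.4f}")
```

Because the experts share most of their mass with the base weight, the deltas are small, and a very low rank already reconstructs each expert closely, which is the low-rank property the abstract exploits.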

Original language: English
Pages (from-to): 20497-20514
Number of pages: 18
Journal: Proceedings of Machine Learning Research
Volume: 267
Publication status: Published - 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 - 19 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025, by the authors.
