Abstract
Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present D2-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress the delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, D2-MoE compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13% performance gains over other compressors on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs at 40%-60% compression ratios. Code is available at https://github.com/lliai/D2MoE.
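The decomposition described in the abstract (importance-weighted merging of experts into a shared base, then low-rank SVD compression of each expert's delta) can be sketched as follows. This is an illustrative toy example with made-up shapes and importance scores, not the authors' implementation; the Fisher information scores here are stand-in constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: four expert weight matrices and their (hypothetical)
# Fisher-information importance scores.
experts = [rng.standard_normal((64, 64)) for _ in range(4)]
fisher = np.array([1.0, 2.0, 0.5, 1.5])
w = fisher / fisher.sum()

# Shared base weight: importance-weighted average of the expert weights.
base = sum(wi * e for wi, e in zip(w, experts))

def truncated_svd(delta, rank):
    """Rank-`rank` approximation of a delta weight via truncated SVD."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank]

# Each expert is then stored as the shared base plus a low-rank delta,
# so only the small SVD factors per expert need to be kept.
rank = 8
compressed = [base + truncated_svd(e - base, rank) for e in experts]

# Relative reconstruction error for one expert (shrinks as rank grows).
err = np.linalg.norm(compressed[0] - experts[0]) / np.linalg.norm(experts[0])
```

Storing `u[:, :rank]`, `s[:rank]`, and `vt[:rank]` instead of each full delta is what yields the parameter savings; the base-weight pruning step from the abstract is omitted here for brevity.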
| Original language | English |
|---|---|
| Pages (from-to) | 20497-20514 |
| Number of pages | 18 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| Publication status | Published - 2025 |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada |
| Event duration | 13 Jul 2025 → 19 Jul 2025 |
Bibliographical note
Publisher Copyright: © 2025, by the authors.