
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs

  • Youhe Jiang
  • Fangcheng Fu
  • Xiaozhe Yao
  • Guoliang He
  • Xupeng Miao
  • Ana Klimovic
  • Bin Cui
  • Binhang Yuan*
  • Eiko Yoneki*

*Corresponding author for this work

Research output: Contribution to journal › Conference article published in journal › peer-review

Abstract

Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study of serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. Particularly, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via mixed-integer linear programming, aiming at deducing the most cost-efficient serving plan under the constraints of price budget and real-time GPU availability. Remarkably, our approach effectively outperforms homogeneous and heterogeneous baselines under a wide array of scenarios, covering diverse workload traces, varying GPU availabilities, and multi-model serving. This casts new light on more accessible and efficient LLM serving over heterogeneous cloud resources.
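
As a rough illustration of the kind of problem the abstract describes (choosing a GPU composition and workload assignment under a price budget and availability limits), the toy sketch below may help. It is not the authors' algorithm: exhaustive enumeration stands in for a mixed-integer linear programming solver, and every name and number (`GPUS`, `DEMAND`, `cheapest_plan`, the prices, availabilities, and throughputs) is a made-up assumption rather than data from the paper.

```python
from itertools import product

# Hypothetical catalogue (all values invented for illustration):
# hourly price, units currently available, and an assumed per-GPU
# throughput in requests/s for each workload type.
GPUS = {
    "gpu_a": {"price": 4.0, "avail": 2, "tput": {"short": 10.0, "long": 8.0}},
    "gpu_b": {"price": 1.0, "avail": 4, "tput": {"short": 4.0, "long": 1.0}},
}

# Hypothetical demand in requests/s per workload type.
DEMAND = {"short": 8.0, "long": 8.0}


def cheapest_plan(gpus, demand, budget):
    """Return (hourly_cost, assignment) for the cheapest feasible plan, or None.

    Brute-force search over integer GPU counts; a real system would hand
    the same constraints to a MILP solver instead of enumerating.
    """
    names = list(gpus)
    workloads = list(demand)

    def splits(avail):
        # Every way to dedicate at most `avail` GPUs of one type to the workloads.
        counts = product(range(avail + 1), repeat=len(workloads))
        return [c for c in counts if sum(c) <= avail]

    best = None
    for combo in product(*(splits(gpus[g]["avail"]) for g in names)):
        cost = sum(sum(c) * gpus[g]["price"] for g, c in zip(names, combo))
        if cost > budget:
            continue  # violates the price budget
        meets_demand = all(
            sum(c[i] * gpus[g]["tput"][w] for g, c in zip(names, combo)) >= demand[w]
            for i, w in enumerate(workloads)
        )
        if meets_demand and (best is None or cost < best[0]):
            assignment = {g: dict(zip(workloads, c)) for g, c in zip(names, combo)}
            best = (cost, assignment)
    return best


if __name__ == "__main__":
    print(cheapest_plan(GPUS, DEMAND, budget=10.0))
```

With these invented numbers, the cheapest plan mixes both GPU types (one `gpu_a` for long requests, two `gpu_b` for short ones) at 6.0 per hour, whereas serving the same demand on `gpu_a` alone would cost 8.0 per hour.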

Original language: English
Pages (from-to): 27534-27552
Number of pages: 19
Journal: Proceedings of Machine Learning Research
Volume: 267
Publication status: Published - 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 → 19 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025, ML Research Press. All rights reserved.
