Workload Consolidation in Alibaba Clusters: The Good, the Bad, and the Ugly

Yongkang Zhang, Yinghao Yu, Wei Wang, Qiukai Chen, Jie Wu, Zuowei Zhang, Jiang Zhong, Tianchen Ding, Qizhen Weng, Lingyun Yang, Cheng Wang, Jian He, Guodong Yang, Liping Zhang

Research output: Chapter in Book/Conference Proceeding/Report › Conference Paper published in a book › peer-review

19 Citations (Scopus)

Abstract

Web companies typically run latency-critical long-running services and resource-intensive, throughput-hungry batch jobs in a shared cluster for improved utilization and reduced cost. Despite many recent studies on workload consolidation, the production practice remains largely unknown. This paper describes our efforts to efficiently consolidate the two types of workloads in Alibaba clusters to support the company's e-commerce businesses. At the cluster level, the host and GPU memory are the bottleneck resources that limit the scale of consolidation. Our system proactively reclaims the idle host memory pages of service jobs and dynamically relinquishes their unused host and GPU memory following the predictable diurnal pattern of user traffic, a technique termed tidal scaling. Our system further performs node-level micro-management to ensure that the increased workload consolidation does not result in harmful resource contention. We briefly share our experience in handling the surging traffic with flash-crowd customers during the seasonal shopping festivals (e.g., November 11) using these "good" practices. We also discuss the limitations of our current solution (the "bad") and some practical engineering constraints (the "ugly") that make many prior research solutions inapplicable to our system.

Original language: English
Title of host publication: SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 210-225
Number of pages: 16
ISBN (Electronic): 9781450394147
Publication status: Published - 7 Nov 2022
Event: 13th Annual ACM Symposium on Cloud Computing, SoCC 2022 - San Francisco, United States
Duration: 7 Nov 2022 to 11 Nov 2022

Publication series

Name: SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing

Conference

Conference: 13th Annual ACM Symposium on Cloud Computing, SoCC 2022
Country/Territory: United States
City: San Francisco
Period: 7/11/22 to 11/11/22

Bibliographical note

Publisher Copyright:
© 2022 ACM.

Keywords

  • cluster management
  • scheduling
  • workload consolidation
