Abstract
Web companies typically run latency-critical long-running services and resource-intensive, throughput-hungry batch jobs in a shared cluster for improved utilization and reduced cost. Despite many recent studies on workload consolidation, the production practice remains largely unknown. This paper describes our efforts to efficiently consolidate the two types of workloads in Alibaba clusters to support the company's e-commerce businesses. At the cluster level, the host and GPU memory are the bottleneck resources that limit the scale of consolidation. Our system proactively reclaims the idle host memory pages of service jobs and dynamically relinquishes their unused host and GPU memory following the predictable diurnal pattern of user traffic, a technique termed tidal scaling. Our system further performs node-level micro-management to ensure that the increased workload consolidation does not result in harmful resource contention. We briefly share our experience in handling the surging traffic with flash-crowd customers during the seasonal shopping festivals (e.g., November 11) using these "good" practices. We also discuss the limitations of our current solution (the "bad") and some practical engineering constraints (the "ugly") that make many prior research solutions inapplicable to our system.
| Original language | English |
|---|---|
| Title of host publication | SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 210-225 |
| Number of pages | 16 |
| ISBN (Electronic) | 9781450394147 |
| DOIs | |
| Publication status | Published - 7 Nov 2022 |
| Event | 13th Annual ACM Symposium on Cloud Computing, SoCC 2022 - San Francisco, United States. Duration: 7 Nov 2022 → 11 Nov 2022 |
Publication series
| Name | SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing |
|---|---|
Conference
| Conference | 13th Annual ACM Symposium on Cloud Computing, SoCC 2022 |
|---|---|
| Country/Territory | United States |
| City | San Francisco |
| Period | 7 Nov 2022 → 11 Nov 2022 |
Bibliographical note
Publisher Copyright: © 2022 ACM.
Keywords
- cluster management
- scheduling
- workload consolidation