Abstract
Deep learning recommendation models (DLRMs) are widely used to support many online services. Typical DLRM training frameworks use parameter servers (PS) on CPU servers to maintain memory-intensive embedding tables, and leverage GPU workers with embedding caches to accelerate compute-intensive neural network computation and enable fast embedding lookups. However, such distributed systems suffer from significant communication overhead caused by embedding transmissions between workers and the PS. Prior work reduces the number of cached-embedding transmissions at the cost of model accuracy, for example by oversampling hot embeddings or applying staleness-tolerant updates. This paper reveals that many such transmissions can be avoided, because in-cache embedding accesses in distributed training are predictable and infrequent. Based on this observation, we explore a new direction to accelerate distributed DLRM training without compromising model accuracy: embedding scheduling. Its core idea is to proactively determine "where embeddings should be trained" and "which embeddings should be synchronized" in order to increase the cache hit rate and decrease unnecessary updates, thus achieving low communication overhead. To realize this idea, we design Herald, a real-time embedding scheduler consisting of two main components: an adaptive location-aware inputs allocator that determines where embeddings should be trained, and an optimal communication plan generator that determines which embeddings should be synchronized. Our experiments with real-world workloads show that Herald reduces embedding transmissions by 48%-89%, yielding up to 2.11× and up to 1.61× better end-to-end DLRM training performance with TCP and RDMA, respectively, over 100 Gbps Ethernet.
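To make the "where embeddings should be trained" idea concrete, the following is a minimal illustrative sketch and not Herald's actual algorithm, which the abstract does not specify. It assumes each input sample is a set of embedding IDs and each worker's cache is a set of IDs, and it greedily sends each sample to the worker whose cache already holds most of its embeddings, subject to an equal per-worker quota. The function names (`count_misses`, `round_robin`, `location_aware`) and the greedy rule are assumptions made for illustration. On a toy input such as two workers caching IDs {1, 2, 3} and {4, 5, 6}, a placement that follows cache contents can avoid misses that a round-robin placement incurs.

```python
from typing import Dict, List, Set

Assignment = Dict[int, List[Set[int]]]


def count_misses(assignment: Assignment, caches: Dict[int, Set[int]]) -> int:
    """Count embedding IDs that are absent from the assigned worker's cache.

    Each miss stands in for one embedding fetched from the parameter server.
    """
    return sum(
        len(sample - caches[worker])
        for worker, samples in assignment.items()
        for sample in samples
    )


def round_robin(samples: List[Set[int]], caches: Dict[int, Set[int]]) -> Assignment:
    """Baseline: hand out samples to workers in turn, ignoring cache contents."""
    workers = sorted(caches)
    assignment: Assignment = {w: [] for w in workers}
    for i, sample in enumerate(samples):
        assignment[workers[i % len(workers)]].append(sample)
    return assignment


def location_aware(samples: List[Set[int]], caches: Dict[int, Set[int]]) -> Assignment:
    """Hypothetical greedy allocator (an assumption, not the paper's method).

    Each sample goes to the worker with the largest cache overlap among the
    workers that have not yet reached their quota, keeping the load balanced.
    """
    workers = sorted(caches)
    quota = -(-len(samples) // len(workers))  # ceiling division
    assignment: Assignment = {w: [] for w in workers}
    for sample in samples:
        open_workers = [w for w in workers if len(assignment[w]) < quota]
        best = max(open_workers, key=lambda w: len(sample & caches[w]))
        assignment[best].append(sample)
    return assignment
```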
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 |
| Publisher | USENIX Association |
| Pages | 1141-1156 |
| Number of pages | 16 |
| ISBN (Electronic) | 9781939133397 |
| Publication status | Published - 2024 |
| Event | 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 - Santa Clara, United States; Duration: 16 Apr 2024 → 18 Apr 2024 |
Publication series
| Name | Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 |
|---|---|
Conference
| Conference | 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 |
|---|---|
| Country/Territory | United States |
| City | Santa Clara |
| Period | 16/04/24 → 18/04/24 |
Bibliographical note
Publisher Copyright: © 2024 Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. All rights reserved.