Accelerating Neural Recommendation Training with Embedding Scheduling

Chaoliang Zeng, Xudong Liao, Xiaodian Cheng, Han Tian, Xinchen Wan, Hao Wang, Kai Chen*

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report › Conference paper published in a book › peer-review

7 Citations (Scopus)

Abstract

Deep learning recommendation models (DLRM) are extensively adopted to support many online services. Typical DLRM training frameworks adopt the parameter server (PS) in CPU servers to maintain memory-intensive embedding tables, and leverage GPU workers with embedding caches to accelerate compute-intensive neural network computation and enable fast embedding lookups. However, such distributed systems suffer from significant communication overhead caused by embedding transmissions between workers and the PS. Prior work reduces the number of cache embedding transmissions by compromising model accuracy, e.g., by oversampling hot embeddings or applying staleness-tolerant updates. This paper reveals that many such transmissions can be avoided, given the predictable and infrequent nature of in-cache embedding accesses in distributed training. Based on this observation, we explore a new direction to accelerate distributed DLRM training without compromising model accuracy, i.e., embedding scheduling—with the core idea of proactively determining "where embeddings should be trained" and "which embeddings should be synchronized" to increase the cache hit rate and decrease unnecessary updates, thus achieving low communication overhead. To realize this idea, we design Herald, a real-time embedding scheduler consisting of two main components: an adaptive location-aware input allocator to determine where embeddings should be trained, and an optimal communication plan generator to determine which embeddings should be synchronized. Our experiments with real-world workloads show that Herald reduces 48%–89% of embedding transmissions, leading to up to 2.11× and up to 1.61× better end-to-end DLRM training performance with TCP and RDMA, respectively, over 100 Gbps Ethernet.
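The "where embeddings should be trained" decision described above can be illustrated with a minimal sketch of a location-aware input allocator: each training sample is routed to the worker whose embedding cache already holds the most of that sample's embedding IDs, raising the cache hit rate and so reducing worker–PS transmissions. All names here (`allocate_inputs`, `samples`, `worker_caches`) are illustrative assumptions, not Herald's actual API.

```python
from collections import defaultdict

def allocate_inputs(samples, worker_caches):
    """Assign each sample to the worker with the most cached embeddings.

    samples: list of sets of embedding IDs needed per training sample.
    worker_caches: list (one per worker) of sets of cached embedding IDs.
    Returns a dict mapping worker index -> list of assigned sample indices.
    Ties are broken toward the least-loaded worker to keep load balanced.
    (Hypothetical sketch; Herald's real allocator is adaptive and online.)
    """
    assignment = defaultdict(list)
    load = [0] * len(worker_caches)
    for i, ids in enumerate(samples):
        # Score each worker by cache hits; prefer lower current load on ties.
        best = max(range(len(worker_caches)),
                   key=lambda w: (len(ids & worker_caches[w]), -load[w]))
        assignment[best].append(i)
        load[best] += 1
    return dict(assignment)
```

For example, with worker 0 caching `{1, 2}` and worker 1 caching `{3, 4}`, a sample needing `{1, 2}` goes to worker 0 and one needing `{3, 4}` goes to worker 1, each served entirely from cache.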

Original language: English
Title of host publication: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
Publisher: USENIX Association
Pages: 1141-1156
Number of pages: 16
ISBN (Electronic): 9781939133397
Publication status: Published - 2024
Event: 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 - Santa Clara, United States
Duration: 16 Apr 2024 - 18 Apr 2024

Publication series

Name: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024

Conference

Conference: 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
Country/Territory: United States
City: Santa Clara
Period: 16/04/24 - 18/04/24

Bibliographical note

Publisher Copyright:
© 2024 Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. All rights reserved.
