Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, Xu Chen*

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report › Conference Paper published in a book › peer-reviewed

21 Citations (Scopus)

Abstract

On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage our observation that edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts hybrid pipeline parallelism to orchestrate distributed training, along with judicious parallelism planning for maximizing throughput under given resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame device-level dynamics for training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models; evaluations demonstrate up to 12.2× faster training than conventional parallelism methods and up to 2.1× faster training than state-of-the-art hybrid parallelism methods. Furthermore, Asteroid can recover the training pipeline 14× faster than baseline methods while preserving comparable throughput despite unexpected device exits and failures.
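The abstract's central idea — partitioning a model into pipeline stages sized to each device's capability so that throughput is maximized — can be illustrated with a small planning sketch. This is a hypothetical toy, not Asteroid's actual planning algorithm: it assumes per-layer compute costs and relative device speeds are known, and it chooses contiguous layer cuts that minimize the slowest stage, since a pipeline's steady-state throughput is bounded by its bottleneck stage.

```python
# Hypothetical sketch (not Asteroid's planner): split a model's layers
# into contiguous pipeline stages across heterogeneous devices so that
# the slowest stage (the pipeline bottleneck) is as fast as possible.
from functools import lru_cache

def plan_pipeline(layer_flops, device_speeds):
    """Return (bottleneck_time, stage_end_indices).

    layer_flops[i]  : compute cost of layer i (arbitrary units)
    device_speeds[d]: relative speed of device d (units per second)
    """
    n, m = len(layer_flops), len(device_speeds)
    prefix = [0.0]
    for f in layer_flops:                  # prefix sums for O(1) stage cost
        prefix.append(prefix[-1] + f)

    @lru_cache(maxsize=None)
    def best(i, d):
        # Minimal bottleneck when assigning layers i..n-1 to devices d..m-1.
        if d == m - 1:                     # last device takes all remaining layers
            return (prefix[n] - prefix[i]) / device_speeds[d], (n,)
        result = (float("inf"), ())
        for j in range(i + 1, n - m + d + 2):   # leave >= 1 layer per later device
            stage = (prefix[j] - prefix[i]) / device_speeds[d]
            rest, cuts = best(j, d + 1)
            cand = (max(stage, rest), (j,) + cuts)
            if cand[0] < result[0]:
                result = cand
        return result

    return best(0, 0)

# Example: 4 layers with costs [4, 4, 2, 2]; device 0 runs 2x faster
# than device 1, so it is assigned the two heavy layers.
bottleneck, cuts = plan_pipeline([4, 4, 2, 2], [2.0, 1.0])
```

Here the planner cuts after layer 2, giving both devices a 4-second stage; a naive even split of layers would instead bottleneck the slower device. A real planner must additionally account for inter-device communication and memory limits, which this sketch omits.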

Original language: English
Title of host publication: ACM MobiCom 2024 - Proceedings of the 30th International Conference on Mobile Computing and Networking
Publisher: Association for Computing Machinery, Inc.
Pages: 312-326
Number of pages: 15
ISBN (Electronic): 9798400704895
DOIs
Publication status: Published - 4 Dec 2024
Externally published: Yes
Event: 30th International Conference on Mobile Computing and Networking, ACM MobiCom 2024 - Washington, United States
Duration: 18 Nov 2024 - 22 Nov 2024

Publication series

Name: ACM MobiCom 2024 - Proceedings of the 30th International Conference on Mobile Computing and Networking

Conference

Conference: 30th International Conference on Mobile Computing and Networking, ACM MobiCom 2024
Country/Territory: United States
City: Washington
Period: 18/11/24 - 22/11/24

Bibliographical note

Publisher Copyright:
© 2024 Copyright held by the owner/author(s).

Keywords

  • data parallelism
  • distributed machine learning
  • edge intelligence
  • hybrid parallelism
  • pipeline parallelism
