Abstract
Distributed deep learning is widely used to train deep neural networks, especially large models on massive datasets. The Parameter Server (PS) architecture is the most popular distributed training framework because it allows flexible design of the global parameter-update scheme. However, when scaling to complex heterogeneous clusters, stragglers make it difficult for existing distributed paradigms on the PS framework to balance synchronous waiting against staleness, which sharply slows down model training. In this article, we propose the Grouping Stale Synchronous Parallel (GSSP) scheme, which groups workers with similar performance together. Group servers coordinate intra-group workers using Stale Synchronous Parallel while communicating with each other asynchronously, eliminating stragglers and refining the model weights. We further propose Grouping Dynamic Top-K Sparsification (GDTopK), which dynamically adjusts the upload ratio for each group so as to differentiate communication volume and mitigate the inter-group iteration-speed gap. We have conducted experiments with LeNet-5 on MNIST, ResNet-18 and VGG-19 on CIFAR-10, and Seq2Seq on Multi30k. Results show that GSSP accelerates training by 46% to 120%, with less than a 1% accuracy drop, and GDTopK recovers part of the lost accuracy.
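The Top-K sparsification that GDTopK builds on keeps only the largest-magnitude fraction of each worker's gradient before upload, and the per-group upload ratio is the knob GDTopK tunes. A minimal NumPy sketch of the underlying Top-K step (the function name and flat-array interface are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def topk_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Illustrative sketch of Top-K gradient sparsification; a real system
    would also accumulate the dropped residuals locally.
    """
    k = max(1, int(grad.size * ratio))
    flat = grad.ravel()
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape), idx
```

In GDTopK's setting, a slower group would be assigned a smaller `ratio`, shrinking its upload volume and narrowing the iteration-speed gap between groups.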
| Original language | English |
|---|---|
| Pages (from-to) | 2637-2648 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Cloud Computing |
| Volume | 10 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 1 Oct 2022 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Deep learning
- distributed training
- gradient compression
- parameter server
- Top-K sparsification
Fingerprint
Dive into the research topics of 'GSSP: Eliminating Stragglers Through Grouping Synchronous for Distributed Deep Learning in Heterogeneous Cluster'. Together they form a unique fingerprint.