GSSP: Eliminating Stragglers Through Grouping Synchronous for Distributed Deep Learning in Heterogeneous Cluster

Haifeng Sun, Zhiyi Gui, Song Guo, Qi Qi, Jingyu Wang*, Jianxin Liao

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Distributed deep learning is widely used to train deep neural networks, especially large models on massive datasets. The Parameter Server (PS) architecture is the most popular distributed training framework, as it allows flexible design of the global parameter update scheme. However, when scaling to complex heterogeneous clusters, stragglers make it difficult for existing distributed paradigms on the PS framework to balance synchronous waiting against staleness, which sharply slows down model training. In this article, we propose the Grouping Stale Synchronous Parallel (GSSP) scheme, which groups workers with similar performance together. Group servers coordinate intra-group workers using Stale Synchronous Parallel, while the groups communicate with each other asynchronously to eliminate stragglers and refine the model weights. We further propose Grouping Dynamic Top-K Sparsification (GDTopK), which dynamically adjusts the upload ratio for each group to differentiate communication volume and mitigate the inter-group iteration speed gap. We conducted experiments with LeNet-5 on MNIST, ResNet-18 and VGG-19 on CIFAR-10, and Seq2Seq on Multi30k. Results show that GSSP accelerates training by 46% ∼ 120% with less than one percent accuracy drop, and that GDTopK recovers part of the lost accuracy.
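The two mechanisms the abstract names can be illustrated with a minimal sketch: top-k gradient sparsification keeps only the largest-magnitude fraction of a gradient, and a per-group upload ratio can be scaled with group speed so slower groups send less. This is an assumption-laden toy illustration, not the paper's implementation; `topk_sparsify` and `group_upload_ratio` are hypothetical names, and the linear scaling rule is only one plausible instance of "dynamically adjusts the upload ratio".

```python
# Illustrative sketch of top-k sparsification with a per-group upload ratio
# (hypothetical helper names; not the GDTopK algorithm from the paper).

def topk_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Returns (indices, values): the sparse update a worker would upload
    instead of the full dense gradient.
    """
    k = max(1, int(len(grad) * ratio))
    # Rank entry indices by gradient magnitude, largest first.
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = ranked[:k]
    return kept, [grad[i] for i in kept]

def group_upload_ratio(group_speed, fastest_speed, base_ratio=0.01):
    """Toy dynamic rule: scale a group's upload ratio by its relative speed,
    so slower groups upload fewer entries and the inter-group iteration
    speed gap narrows."""
    return base_ratio * (group_speed / fastest_speed)
```

For example, with `grad = [0.5, -2.0, 0.1, 3.0, -0.05]` and `ratio = 0.4`, only the two largest-magnitude entries (3.0 at index 3 and -2.0 at index 1) would be uploaded.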

Original language: English
Pages (from-to): 2637-2648
Number of pages: 12
Journal: IEEE Transactions on Cloud Computing
Volume: 10
Issue number: 4
DOIs
Publication status: Published - 1 Oct 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

  • Deep learning
  • distributed training
  • gradient compression
  • parameter server
  • top-k sparsification

