Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems

Ning Lu*, Qian Xie, Hao Zhang, Wenyi Fang, Yang Zheng, Zheng Hu, Jiantao Ma

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report › Conference Paper published in a book › peer-review

Abstract

Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and substantial computing time, and the frequent failures that occur at this scale significantly increase training costs. Despite its importance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of the optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for the various types of failures encountered in practice.
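The metric itself is a single ratio; the sketch below writes it out and works through one hypothetical example. The symbols T_opt and T_obs (optimal and observed training time) and the example numbers are illustrative choices made here, not notation or data from the paper.

\documentclass{article}
\begin{document}
% Training Overhead Ratio (TOR): optimal training time divided by observed training time.
% T_opt and T_obs are illustrative symbol names, not the paper's notation.
\[
  \mathrm{TOR} = \frac{T_{\mathrm{opt}}}{T_{\mathrm{obs}}}, \qquad 0 < \mathrm{TOR} \le 1 .
\]
% Hypothetical example: if the failure-free optimum is 100 hours but the run takes
% 125 hours end to end, then TOR = 100/125 = 0.8, i.e. 20% of the wall-clock time
% is overhead from failures, recovery, and restarts.
\end{document}

A TOR of 1 corresponds to a system with no reliability overhead; lower values mean a larger share of wall-clock time is lost to failures and recovery.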

Original language: English
Title of host publication: Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 391-393
Number of pages: 3
ISBN (Electronic): 9798350367041
DOIs
Publication status: Published - 2024
Externally published: Yes
Event: 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024 - Tsukuba, Japan
Duration: 28 Oct 2024 - 31 Oct 2024

Publication series

Name: Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024

Conference

Conference: 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
Country/Territory: Japan
City: Tsukuba
Period: 28/10/24 - 31/10/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • fault-tolerant training system
  • large language models
  • reliability
