Abstract
Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its importance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
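The abstract defines TOR as the ratio of optimal training time to observed training time, so a value of 1.0 means no failure overhead and smaller values mean more time lost to faults and recovery. A minimal sketch of that definition (the function name, argument names, and validation are our own illustrative choices, not from the paper):

```python
def training_overhead_ratio(optimal_time: float, observed_time: float) -> float:
    """Training Overhead Ratio (TOR) as defined in the abstract:
    optimal training time divided by observed training time.

    TOR lies in (0, 1]; lower values indicate more overhead from failures.
    """
    if optimal_time <= 0 or observed_time < optimal_time:
        raise ValueError("expected 0 < optimal_time <= observed_time")
    return optimal_time / observed_time

# Example: a run that would ideally take 30 days but took 40 days
# due to failures, checkpointing, and restarts.
tor = training_overhead_ratio(30.0, 40.0)
print(f"TOR = {tor:.2f}")  # TOR = 0.75
```

Given such a ratio measured on a system, users can estimate actual training time as the optimal time divided by TOR.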
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 391-393 |
| Number of pages | 3 |
| ISBN (Electronic) | 9798350367041 |
| Publication status | Published - 2024 |
| Externally published | Yes |
| Event | 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024 - Tsukuba, Japan Duration: 28 Oct 2024 → 31 Oct 2024 |
Publication series
| Name | Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024 |
|---|---|
Conference
| Conference | 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024 |
|---|---|
| Country/Territory | Japan |
| City | Tsukuba |
| Period | 28/10/24 → 31/10/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- fault-tolerant training system
- large language models
- reliability
Title
Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems