Reliable Distributed Sorting Through the Application-Oriented Fault Tolerance Paradigm

Bruce M. McMillin

Research output: Contribution to journal › Journal Article › peer-review

12 Citations (Scopus)

Abstract

Reliability is an important concern in large-scale applications such as distributed sorting. An error caused by a hardware or software fault during application execution can corrupt the final result of the calculation, perhaps undetectably. Rather than implement the necessary hardware system-level reliability by such traditional (and expensive) methods as replication and masking, we appeal to Application-Oriented Fault Tolerance. The Application-Oriented Fault Tolerance Paradigm relies on properties of expected algorithm behavior to form a test for faulty behavior, and thus also detects software failures. The test is performed on the actual application's intermediate values by the peers of a particular tested processor. This paper describes the design and implementation of a reliable version of the distributed bitonic sorting algorithm using this paradigm on commercially available multicomputers.
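
The idea of testing intermediate values against expected algorithm behavior can be illustrated with a small single-process sketch. It is not the paper's implementation: the names `checked_bitonic_sort`, `FaultDetected`, and the `fault` hook are hypothetical, the "processors" are simulated as list slots, and the two checks used here (the pair of values is preserved and ends up in the expected order) are assumed stand-ins for whatever constraints the paper actually derives.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[int, int]


class FaultDetected(Exception):
    """Raised when a peer's check on an intermediate value fails (hypothetical name)."""


def checked_bitonic_sort(
    values: List[int],
    fault: Optional[Callable[[int, Pair], Pair]] = None,
) -> List[int]:
    """Bitonic sort over a power-of-two number of simulated processors.

    Each compare-exchange result is checked before it is accepted, standing in
    for the test a peer would run on a tested processor's intermediate values.
    `fault` is an illustrative hook that may corrupt the result of a given step.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    step = 0
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                  # distance between compare-exchange partners
            for i in range(n):
                partner = i ^ j
                if partner <= i:
                    continue           # handle each pair once, from its lower index
                ascending = (i & k) == 0
                old = (data[i], data[partner])
                lo, hi = min(old), max(old)
                new = (lo, hi) if ascending else (hi, lo)
                if fault is not None:
                    new = fault(step, new)   # simulated faulty processor
                # Assumed peer test: no value was created or lost, and the
                # pair is in the order this step of the algorithm requires.
                in_order = new[0] <= new[1] if ascending else new[0] >= new[1]
                if sorted(new) != sorted(old) or not in_order:
                    raise FaultDetected(f"step {step}: {old} -> {new}")
                data[i], data[partner] = new
                step += 1
            j //= 2
        k *= 2
    return data


def corrupt_step_3(step: int, pair: Pair) -> Pair:
    """Illustrative fault: alter one value during the fourth compare-exchange."""
    return (pair[0] + 1, pair[1]) if step == 3 else pair
```

Calling `checked_bitonic_sort([5, 3, 8, 1, 9, 2, 7, 4])` returns the sorted list, while passing `fault=corrupt_step_3` raises `FaultDetected` at the corrupted step instead of silently producing a wrong result. The check is cheap because it uses only values the partner already holds, which is the appeal of the approach over replication.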

Original language: English
Pages (from-to): 411-420
Number of pages: 10
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 3
Issue number: 4
Publication status: Published - Jul 1992
Externally published: Yes