Abstract
Reliability is an important concern in large-scale applications such as distributed sorting. An error caused by a hardware or software fault during application execution can corrupt, perhaps undetectably, the final result of the calculation. Rather than implement the necessary reliability at the hardware system level by such traditional (and expensive) methods as replication and masking, we appeal to Application-Oriented Fault Tolerance. The Application-Oriented Fault Tolerance paradigm relies on properties of expected algorithm behavior to form a test for faulty behavior, and thus also detects software failures. The test is performed on the application's actual intermediate values by the peers of a particular tested processor. This paper describes the design and implementation of a reliable version of the distributed bitonic sorting algorithm using this paradigm on commercially available multicomputers.
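The idea of testing intermediate values against expected algorithm behavior can be illustrated with a minimal, single-process sketch. It is not the paper's implementation: the distributed peer checking is replaced by in-line checks, and the names `checked_bitonic_sort`, `FaultDetected`, and the `fault` hook are hypothetical. The two checks used here (the data stays a permutation of the input, and after each merge stage every block is sorted in its expected direction) are assumed examples of properties that a bitonic sort is known to satisfy.

```python
from collections import Counter


class FaultDetected(RuntimeError):
    """Raised when an intermediate result violates an expected property."""


def is_sorted(block, ascending):
    pairs = list(zip(block, block[1:]))
    if ascending:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


def checked_bitonic_sort(values, fault=None):
    """Bitonic sort of a power-of-two-length list with per-stage checks.

    `fault` is a hypothetical hook, called as fault(stage_size, data),
    that may corrupt `data` in place to simulate a faulty processor.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    original = Counter(data)

    k = 2
    while k <= n:
        # One merge stage: compare-exchange at distances k/2, k/4, ..., 1.
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2

        if fault is not None:
            fault(k, data)

        # Check 1 (assumed property): no value was lost or invented.
        if Counter(data) != original:
            raise FaultDetected(f"stage {k}: data is not a permutation of the input")
        # Check 2 (assumed property): each block of size k is sorted,
        # alternating between ascending and descending.
        for start in range(0, n, k):
            if not is_sorted(data[start:start + k], (start & k) == 0):
                raise FaultDetected(f"stage {k}: block at index {start} is out of order")
        k *= 2
    return data
```

Without a fault hook, `checked_bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4])` returns the sorted list. A hook that swaps two distinct elements after a stage leaves the data a permutation of the input but breaks the ordering property, so the second check raises `FaultDetected` before the corrupted values reach the final result.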
| Original language | English |
|---|---|
| Pages (from-to) | 411-420 |
| Number of pages | 10 |
| Journal | IEEE Transactions on Parallel and Distributed Systems |
| Volume | 3 |
| Issue number | 4 |
| Publication status | Published - Jul 1992 |
| Externally published | Yes |