Susceptibility of Modern Systems and Software to Soft Errors
Messer, Alan; Bernadat, Philippe; Fu, Guangrui; Chen, Deqing; Dimitrijevic, Zoran; Lie, David; Mannaru, Durga Devi; Riska, Alma; Milojicic, Dejan
Abstract: It is widely understood that most downtime is accounted for by programming errors and administration time. However, recent work has indicated that an increasing share of downtime may stem from transient hardware errors caused by external factors, such as cosmic rays. Moving to denser semiconductor technologies at lower voltages will cause an increase in transient errors. We investigate the trends in transient errors and the susceptibility of operating systems and applications to them, and we introduce ideas regarding software transient error recoverability. We believe that if transient errors become a prominent problem, it will be possible to improve commodity system availability with simple software recovery. Results indicate that in the Linux kernel and a Java virtual machine few errors need to be fatal. We also propose two recovery examples that we believe indicate it is possible to increase error detection and recovery without the cost of a fail-over cluster.