Kelly, Terence; Karp, Alan H.; Stiegler, Marc; Close, Tyler; Cho, Hyoun Kyu
Keyword(s): fault tolerance, rollback recovery, distributed computing, Ken, Waterken, checkpointing, nonvolatile memory.
Abstract: Previous research has proposed and analyzed a broad spectrum of rollback-recovery protocols for fault-tolerant distributed computing, some of which are frequently used in specialized domains such as scientific computing. Adoption has been less widespread in commercial software development, however, because pragmatic engineering concerns and business requirements weigh against the properties of some existing protocols. Furthermore, a confluence of technology trends is casting doubt on some popular approaches to rollback-recovery while breathing new life into techniques less well suited to older technologies. This paper analyzes an application-transparent rollback-recovery protocol for crash/recover hosts and fair-loss links. The protocol, Ken, is abstracted from an open-source implementation, Waterken, designed to facilitate reliable distributed commercial application development. Ken unifies application state checkpointing with the logging required for reliable communication and is well suited to current technology and to the requirements of decentralized commercial software development. It preserves the main advantages of pessimistic logging, including simple local recovery and the need to maintain only one checkpoint per process. However, it relaxes the very strong correctness guarantee provided by previous log-based approaches, to avoid the increasing computational cost of providing that guarantee on current and foreseeable hardware. Ken provides a weaker yet still satisfactory guarantee, output validity: even in the presence of failures, the outside world sees outputs that could have resulted from failure-free operation.
External Posting Date: October 21, 2010 [Fulltext]. Approved for External Publication
Internal Posting Date: October 21, 2010 [Fulltext]
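
Illustrative sketch (not part of the report): the abstract says that Ken unifies application state checkpointing with the logging needed for reliable communication, keeps a single checkpoint per process, and guarantees output validity. The Python below is one hypothetical way to picture that idea, under assumptions the abstract does not state: messages are handled one at a time, a handler's outputs are buffered until the checkpoint is committed, and duplicate deliveries are detected by message id. The class `KenLikeProcess`, its methods, and the in-memory `durable` dictionary (standing in for an atomic write to nonvolatile storage) are invented for illustration and are not the Waterken API.

```python
import copy


class KenLikeProcess:
    """Hypothetical sketch of checkpoint-plus-log commit; not the Waterken API."""

    def __init__(self, handler, initial_state):
        self.handler = handler
        # The single checkpoint: application state and the log of
        # unacknowledged outgoing messages are stored together.
        self.durable = {
            "state": initial_state,
            "outbox": {},      # sequence number -> unacknowledged output
            "next_seq": 0,
            "seen": set(),     # ids of messages already processed
        }

    def deliver(self, msg_id, payload):
        """Process one incoming message as an all-or-nothing step."""
        if msg_id in self.durable["seen"]:
            return  # retransmitted duplicate: already processed, ignore
        work = copy.deepcopy(self.durable)  # work on a private copy
        outputs = []
        work["state"] = self.handler(work["state"], payload, outputs.append)
        for out in outputs:
            work["outbox"][work["next_seq"]] = out
            work["next_seq"] += 1
        work["seen"].add(msg_id)
        # Assumption: this one assignment models an atomic durable commit.
        # If the handler fails before this point, nothing above is visible.
        self.durable = work

    def pending(self):
        """Outputs that are committed and should be (re)transmitted."""
        return sorted(self.durable["outbox"].items())

    def ack(self, seq):
        """The receiver acknowledged output `seq`; stop retransmitting it."""
        self.durable["outbox"].pop(seq, None)


def counter(state, payload, send):
    """Example handler: keeps a running total and reports it."""
    if payload == "crash":
        send("partial output")            # buffered, never committed
        raise RuntimeError("simulated crash mid-handler")
    total = state + payload
    send(("total", total))
    return total


proc = KenLikeProcess(counter, 0)
proc.deliver("m1", 5)
proc.deliver("m1", 5)                     # duplicate delivery is ignored
try:
    proc.deliver("m2", "crash")           # fails before the commit
except RuntimeError:
    pass
proc.deliver("m2", 3)                     # sender retransmits after recovery
print(proc.pending())
```

In this sketch a failure before the commit leaves the checkpoint and the outbox untouched, so only outputs from completed handler runs are ever eligible for transmission. That is the intuition behind output validity as the abstract defines it; the protocol itself is specified in the full report.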