Dealing Efficiently with Data-Center Disasters
Frolund, Svend; Pedone, Fernando
Keyword(s): reliability; high-availability; disaster recovery; wide-area networks
Abstract: High-end, mission-critical computer systems commonly guard against disaster. Such systems are composed of data centers (i.e., local-area networks of failure-independent computers) in distributed geographical locations, connected through wide-area network links. Wide-area network links are a major source of overhead, and to build efficient disaster-resilient protocols, their use should be reduced without compromising the overall reliability of the system. This paper claims that efficient disaster-resilient protocols can be devised by adequately modeling wide-area distributed systems. To support our claim, we define a model for wide-area distributed systems that distinguishes between data-center disaster failures and computer failures, and develop a hierarchical Atomic Broadcast protocol for this model. The main idea behind a hierarchical protocol is to run a local sub-protocol within each local-area network, and then use a global protocol to orchestrate the communication between the local protocols across wide-area links. The hierarchical nature of the protocol and the accuracy of disaster detection allow us to achieve disaster resilience with few messages across wide-area links.