Design of VMS Volume Shadowing Phase II-Host-based Shadowing

By Scott H. Davis

Digital Technical Journal Vol. 3 No. 3 Summer 1991

Abstract

VMS Volume Shadowing Phase II is a fully distributed, clusterwide data availability product designed to replace the obsolete controller-based shadowing implementation. Phase II is intended to service current and future generations of storage architectures. In these architectures, there is no intelligent, multiunit controller that functions as a centralized gateway to the multiple drives in the shadow set. The new software makes many additional topologies suitable for shadowing, including DSSI drives, DSA drives, and shadowing across VMS MSCP servers. This last configuration allows shadow set members to be separated by any supported cluster interconnect, including FDDI. All essential shadowing functions are performed within the VMS operating system. New MSCP controllers and drives can optionally implement a set of shadowing performance assists, which Digital intends to support in a future release of the shadowing product.

Overview

Volume shadowing is a technique that provides data availability to computer systems by protecting against data loss from media deterioration, communication path failures, and controller or device failures. The process of volume shadowing entails maintaining multiple copies of the same data on two or more physical volumes. Up to three physical devices are bound together by the volume shadowing software and present a virtual device to the system. This device is referred to as a shadow set or a virtual unit. The volume shadowing software replicates data across the physical devices. All shadowing mechanisms are hidden from the users of the system, i.e., applications access the virtual unit as if it were a standard, physical disk. Figure 1 shows a VMS Volume Shadowing Phase II set for a Digital Storage Systems Interconnect (DSSI) configuration of two VAX host computers.
Product Goals

The VMS host-based shadowing project was undertaken because the original controller shadowing product is architecturally incompatible with many prospective storage devices and their connectivity requirements. Controller shadowing requires an intelligent, common controller to access all physical devices in a shadow set. Devices such as the RF-series integrated storage elements (ISEs) with DSSI adapters and the RZ-series small computer systems interface (SCSI) disks present configurations that conflict with this method of access.

To support the range of configurations required by our customers, the new product had to be capable of shadowing physical devices located anywhere within a VAXcluster system and of doing so in a controller-independent fashion. The VAXcluster I/O system provides parallel access to storage devices from all nodes in a cluster simultaneously. In order to meet its performance goals, our shadowing product had to preserve this semantic also. Figure 2 shows clusterwide shadow sets for a hierarchical storage controller (HSC) configuration with multiple computer interconnect (CI) buses. When compared to Figure 1, this figure shows a larger cluster containing several clusterwide shadow sets. Note that multiple nodes in the cluster have direct, writable access to the disks comprising the shadow sets.

In addition to providing highly available access to shadow sets from anywhere in a cluster, the new shadowing implementation had other requirements. Phase II had to deliver performance comparable to that of controller-based shadowing, maximize application I/O availability, and ensure data integrity for critical applications.

In designing the new product, we benefited from customer feedback about the existing implementation. This feedback had a positive impact on the design of the host-based shadowing implementation. Our goals to maximize application I/O availability during transient states, to provide customizable, event-driven design and fail-over, to enable all cluster nodes to manage the shadow sets, and to enhance system disk capabilities were all affected by customer feedback.

Technical Challenges

To provide volume shadowing in a VAXcluster environment running under the VMS operating system required that we solve complex, distributed systems problems.[1] This section describes the most significant technical challenges we encountered and the solutions we arrived at during the design and development of the product.

Membership Consistency. To ensure the level of integrity required for high availability systems, the shadowing design must guarantee that a shadow set has the same membership and states on all nodes in the cluster. A simple way to guarantee this property would have been a strict client-server implementation, where one VAX computer serves the shadow set to the remainder of the cluster. This approach, however, would have violated several design goals; the intermediate hop required by data transfers would decrease system performance, and any failure of the serving CPU would require a lengthy fail-over and rebuild operation, thus negatively impacting system availability.

To solve the problem of membership consistency, we used the VMS distributed lock manager through a new executive thread-level interface.[2,3] We designed a set of event-driven protocols that shadowing uses to guarantee membership consistency. These protocols allowed us to make the shadow set virtual unit a local device on all nodes in the cluster. Membership and state information about the shadow set is stored on all physical members in an on-disk data structure called the storage control block (SCB). One way that shadowing uses this SCB information is to automatically determine the most up-to-date shadow set member(s) when the set is created. In addition to distributed synchronization primitives, the VMS lock manager provides a capability for managing a distributed state variable called a lock value block. Shadowing uses the lock value block to define a disk that is guaranteed to be a current member of the shadow set. Whenever a membership change is made, all nodes take part in a protocol of lock operations; the value block and the on-disk SCB are the final arbiters of set constituency.

Sequential Commands. A sequential I/O command, i.e., a Mass Storage Control Protocol (MSCP) concept, forces all commands in progress to complete before the sequential command begins execution. While a sequential command is pending, all new I/O requests are stalled until that sequential command completes execution. Shadowing requires the capability to execute a clusterwide, sequential command during certain operations. This capability, although a simple design goal for a client-server implementation, is a complex one for a distributed access model. We chose an event-driven, request/response protocol to create the sequential command capability. Since sequential commands have a negative impact on performance, we limited the use of these commands to performing membership changes, mount/dismount operations, and bad block and merge difference repairs. Steady state processing never requires using sequential commands.

Full Copy. A full copy is the means by which a new member of the shadow set is made current with the rest of the set. The challenge is to make copy operations unintrusive; application I/Os must proceed with minimal impact so that the level of service provided by the system is both acceptable and predictable. VMS file I/O provides record-level sharing through the application-transparent locking provided by the VAX RMS software, Digital's record management services. Shadowing operates at the physical device level to handle a variety of low-level errors. Because shadowing has no knowledge of the higher-layer record locking, a copy operation must guarantee that the application I/Os and the copy operation itself generate the correct results and do so with minimal impact on application I/O performance.

Merge Operations. Merge operations are triggered when a CPU with write access to a shadow set fails. (Note that with controller shadowing, merge operations are copy operations that are triggered when an HSC fails.) Devices may still be valid members of the shadow set but may no longer be identical, due to outstanding writes in progress when the host CPU failed. The merge operation must detect and correct these differences, so that successive application reads for the same data produce consistent results. As for full copy operations, the challenge with merge processing is to generate consistent results with minimal impact on application I/O performance.

Booting and Crashing. System disk shadowing presents some special problems because the shadow set must be accessible to CPUs in the cluster when locking protocols and inter-CPU communication are disabled. In addition, crashing must ensure appropriate behavior for writing crash dumps through the primitive bootstrap driver, including how to propagate the dump to the shadow set. It was not practical to modify the bootstrap drivers because they are stored in read-only memory (ROM) on various CPU platforms that shadowing would support.
Error Processing. One major function of volume shadowing is to perform appropriate error processing for members of the shadow set, while maximizing data availability. To carry out this function, the software must prevent deadlocks between nodes and decide when to remove devices from the shadow set. We adopted a simple recovery ethic: a node that detects an error is responsible for fixing that error. Membership changes are serialized in the cluster, and a node only makes a membership change if the change is accompanied by improved access to the shadow set. A node never makes a change in membership without having access to some source members of the set.

Architecture

Phase II shadowing provides a local virtual unit on each node in the cluster with distributed control of that unit. Although the virtual unit is not served to the cluster, the underlying physical units that constitute a shadow set are served to the cluster using the standard VMS mechanisms. This scheme has many data availability advantages. The Phase II design

o Allows shadowing to use all the VMS controller fail-over mechanisms for physical devices. As a result, member fail-over approaches hardware speeds. Controller shadowing does not provide this capability.

o Allows each node in the cluster to perform error recovery based on access to physical data source members. The shadowing software treats communication failures between any cluster node and shadow set members as normal shadowing events with customer-definable recovery metrics.

Major Components

VMS Volume Shadowing Phase II consists of two major components: SHDRIVER and SHADOW_SERVER. SHDRIVER is the shadowing virtual unit driver. As a client of disk class drivers, SHDRIVER is responsible for handling all I/O operations that are directed to the virtual unit. SHDRIVER issues physical I/O operations to the disk class driver to satisfy the shadow set virtual unit I/O requests. SHDRIVER is also responsible for performing all distributed locking and for driving error recovery.

SHADOW_SERVER is a VMS ancillary control process (ACP) responsible for driving copy and merge operations performed on the local node. Only one optimal node is responsible for driving a copy or merge operation on a given shadow set, but when a failure occurs the operation will fail over and resume on another CPU. Several factors determine this optimal node, including the types of access paths and controllers for the members and the user-settable, per-node copy quotas.

Primitives

This section describes the locking protocols and error recovery processing functions that are used by the shadowing software. These primitives provide basic synchronization and recovery mechanisms for shadow sets in a VAXcluster system.

Locking Protocols. The shadowing software uses event-driven locking protocols to coordinate clusterwide activity. These request/response protocols provide maximum application I/O performance. A VMS executive interface to the distributed lock manager allows shadowing to make efficient use of locking directly from SHDRIVER.

One example of this use of locking protocols in VMS Volume Shadowing Phase II is the sequential command protocol. As mentioned in the Technical Challenges section, shadowing requires the sequential command capability but minimizes the use of this primitive. Phase II implements the capability by using several locks, as described in the following series of events.

A node that needs to execute a sequential command first stalls I/O locally and flushes operations in progress. The node then performs lock operations that ensure serialization and sends sequential stall requests to other nodes that have the shadow set mounted. This initiating thread waits until all other nodes in the cluster have flushed their I/Os and responded to the node requesting the sequential operation. Once all nodes have responded or left the cluster, the operations that compose the sequential command execute. When this process is complete, the locks are released, allowing asynchronous threads on the other nodes to proceed and automatically resume I/O operations. The local node resumes I/O as well.

Error Recovery Processing. Error recovery processing is triggered by either asynchronous notification of a communication failure or a failing I/O operation directed towards a physical member of the shadow set. Two major functions of error recovery are built into the virtual unit driver: active and passive volume processing.

Active volume processing is triggered directly by events that occur on a local node in the cluster. This type of volume processing uses a simple, localized ethic for error recovery from communication or controller failures. Shadow set membership decisions are made locally, based on accessibility. If no members of a shadow set are currently accessible from a node, then the membership does not change. If some but not all members of the set are accessible, the local node, after attempting fail-over, removes some members to allow application I/O to proceed. The system manager sets the time period during which members may attempt fail-over. The actual removal operation is a sequential command. The design allows for maximum flexibility and quick error recovery and implicitly avoids deadlock scenarios.

Passive volume processing responds to events that occur elsewhere in the cluster; messages from nodes other than the local one trigger the processing by means of the shadowing distributed locking protocols. This volume processing function is responsible for verifying the shadow set membership and state on the local node and for modifying this membership to reflect any changes made to the set by the cluster. To accomplish these operations, the shadowing software first reads the lock value block to find a disk guaranteed to still be in the shadow set. Then the recovery process retrieves the physical member's on-disk SCB data and uses this information to perform the relevant data structure updates on the local node.

Application I/O requests to the virtual unit are always stalled during volume processing. In the case of active volume processing, the stalling is necessary because many I/Os would fail until the error was corrected. In passive volume processing, the I/O requests are stalled because the membership of the set is in doubt, and correct processing of the request cannot be performed until the situation is corrected.

Steady State Processing

The shadowing virtual unit driver receives application read and write requests and must direct the I/O appropriately. This section describes these steady state operations.

Read Algorithms

The shadowing virtual unit driver receives application read requests and directs a physical I/O to an appropriate member of the set. SHDRIVER attempts to direct the I/O to the optimum device based on locally available data. This decision is based on (1) the access path, i.e., local or served by the VMS operating system, (2) the service queue lengths at the candidate controller, and (3) a round-robin algorithm among equal paths. Figure 3 shows a shadow set read operation. An application read to the shadow set causes a single physical read to be sent to an optimal member of the set. In Figure 3, there is one local and one remote member, so the read is sent to the local member.

Data repair operations caused by media defects are triggered by a read operation failing with an appropriate error, such as forced error or parity. The shadowing driver attempts this repair using another member of the shadow set.
The repair operation is performed with the synchronization of a sequential command. Sequential protection is required because a read operation is being converted into a write operation without explicit, RMS-layer synchronization.

Write Algorithms

The shadowing virtual unit driver receives application write requests and then issues, in parallel, write requests to the physical members of the set. The virtual unit write operation does not complete until all physical writes complete. A shadow set write operation is shown in Figure 4. Physical write operations to member units can fail or be timed out; either condition triggers the shadowing error recovery logic and can cause a fail-over or the removal of the erring device from the shadow set.

Transient State Processing

Shadowing performs a variety of operations in order to maintain consistency among the members of the set. These operations include full copy, merge, and data repair and recovery. This section describes these transient state operations.

Full Copy

Full copy operations are performed under direct system manager control. When a disk is added to the shadow set, copy operations take place to make the contents of this new set member identical to that of the other members. Copy operations are transparent to application processing. The new member of the shadow set does not provide any data availability protection until the copy completes.

There is no explicit gatekeeping during the copy operation. Thus, application read and write operations occur in parallel with copy thread reads and writes. As shown in Figure 5, correct results are accomplished by the following algorithm. During the full copy, the shadowing driver processes application write operations in two groups: first, those directed to all source members and second, writes to all full copy targets. The copy thread performs a sequence of read source, compare target, and write target operations on each logical block number (LBN) range until the compare operation succeeds. If an LBN range has such frequent activity that the compare fails many times, SHDRIVER performs a synchronized update. A distributed fence provides a clusterwide boundary between the copied and the uncopied areas of the new member. This fence is used to avoid performing the special full copy mechanisms on application writes to that area of the disk already processed by the copy thread.

This algorithm meets the goal of operational correctness (both the application and the copy thread achieve the proper results with regard to the contents of the shadow set members) and requires no synchronization with the copy thread. Thus, the algorithm achieves maximum application I/O availability during the transient state. Crucial to achieving this goal is the fact that, by design, the copy thread does not perform I/O optimization techniques such as double buffering. The copy operations receive the same level of service as application I/Os.

Merge Operations

The VMS Volume Shadowing Phase II merge algorithm meets the product goals of operational correctness, while maintaining high application I/O availability and minimal synchronization. A merge operation is required when a CPU crashes with the shadow set mounted for write operations. A merge is needed to correct for the possibility of partially completed writes that may have been outstanding to the physical set members when the failure occurred. The merge operation ensures that all members contain identical data, and thus the shadow set virtual unit behaves like a single, highly available disk. It does not matter which data is more recent, only that the members are the same.
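The property just stated, that a merge only has to make the members identical and not pick the newest data, can be illustrated with a small sketch. It is not the Phase II merge algorithm: members are modeled as in-memory lists of blocks, the function name merge_members is hypothetical, and the synchronization and throttling a real driver needs are left out.

```python
def merge_members(members, source_index=0):
    """Make every member identical to the chosen source; return repaired LBNs.

    Hypothetical sketch: each member is a list of blocks indexed by LBN.
    """
    source = members[source_index]
    if any(len(m) != len(source) for m in members):
        raise ValueError("all members must have the same number of blocks")

    repaired = []
    for lbn, block in enumerate(source):
        differs = False
        for i, member in enumerate(members):
            if i != source_index and member[lbn] != block:
                member[lbn] = block  # overwrite the difference with source data
                differs = True
        if differs:
            repaired.append(lbn)
    return repaired


# A write to LBN 1 reached member_a but not member_b before the CPU failed.
member_a = [b"blk0", b"blk1-new", b"blk2"]
member_b = [b"blk0", b"blk1-old", b"blk2"]

repaired = merge_members([member_a, member_b], source_index=1)
print(repaired, member_a == member_b)
```

In this run the member holding the old data was used as the source, so the interrupted write is lost on both members, yet the set is consistent. Choosing the other member as the source would be equally valid for this purpose.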
Making the members identical satisfies the purpose of shadowing, which is to provide data availability. But since the failure occurred while a write operation was in progress, this consistent shadow set can contain either old or new data. To make sure that the shadow set contains the most recent data, a data integrity technique such as journaling must be employed.

In Phase II shadowing, merge processing is distinctly different from copy processing. The shadow set provides full availability protection during the merge. As a result, merge processing is intentionally designed to be a background activity and to maximize application I/O throughput while the merge is progressing. The merge thread carefully monitors I/O rates and inserts a delay between its I/Os if it detects contention for shared system resources, such as adapters and interconnects.

In addition to maximizing I/O availability, the merge algorithm is designed to minimize synchronization with application I/Os and to identify and correct data inconsistencies. Synchronization takes place only when a rare difference is found. When an application read operation is issued to a shadow set in the merge state, the set executes the read with merge semantics. Thus, a read to a source and a parallel compare with the other members of the set are performed. Usually the compare matches and the operation is complete. If a mismatch is detected, a sequential repair operation begins. The merge thread scans the entire disk in the same manner as the read, looking for differences. A distributed fence is used to avoid performing merge mechanisms for application reads to that area of the disk already processed by the merge thread. Figure 6 illustrates the merge algorithm.

Note that controller shadowing performs an operation called a merge copy. Although this HSC merge copy operation is designed for the same purpose as the Phase II operation, the approaches differ greatly. An HSC merge copy is triggered when an HSC, not a shadow set, fails and performs a copy operation; the HSC merge copy does not detect differences.

Performance Assists

A future version of the shadowing product is intended to utilize controller performance assists to improve copy and merge operations. These assists will be used automatically, if supported by the controllers involved in accessing the physical members of a shadow set.

COPY_DATA is the ability of a host to control a direct disk-to-disk transfer without the data entering or leaving the host CPU I/O adapters and memory. This capability will be used by full copy processing to decrease the system impact, the bandwidth, and the time required for a full copy. The members of the set and/or their controllers must share a common interconnect in order to use this capability. The COPY_DATA operation performs specific shadowing around the active copy LBN range to ensure correctness. This operation involves LBN range-based gatekeeping in the copy target device controller.

Controller write logging is a future capability in controllers, such as HSCs, that will allow more efficient merge processing. Shadowing write operation messages will include information for the controller to log I/Os in its memory. These logs will then be used by the remaining host CPUs during merge processing to determine exactly which blocks contain outstanding write operations from a failed CPU. With such a performance assist, merge operations will take less time and will have less impact on the system.

Data Repair and Recovery

As discussed in the Primitives section, data repair operations are triggered by failing reads and are repaired as sequential commands. Digital Storage Architecture (DSA) devices support two primitive capabilities that are key to this repair mechanism. When a DSA controller detects a media error, the block in question is sometimes repaired automatically, thus requiring no shadowing intervention. When the controller cannot repair the data, a spare block is revectored to this LBN, and the contents of the block are marked with a forced error. This causes subsequent read operations to fail, since the contents of the block are lost.

The forced error returned on a read operation is the signal to the shadowing software to execute a repair operation. SHDRIVER attempts to read usable data from another source device. If such data is available, the software writes the data to the revectored block and then returns the data to the application. If no usable data source is available, the software performs write operations with a forced error to all set members and signals the application that this error condition has occurred. Note that a protected system buffer is used for this operation because the application reading the data may not have write access.

A future shadowing product is intended to support SCSI peripherals, which do not have the DSA primitives outlined above. There is no forced error indicator in the SCSI architecture, and the revector operation is nonatomic. To perform shadowing data repair on such devices, we will use the READL/WRITEL capability optionally supported by SCSI devices. These I/O functions allow blocks to be read and written with error correction code (ECC) data. Shadowing emulates forced error by writing data with an intentionally incorrect ECC. To circumvent the lack of atomicity on the revector operation, a device being repaired is temporarily marked as a full copy target until the conclusion of the repair operation. If the CPU fails in the middle of a repair operation, the repair target is now a full copy target, which preserves correctness in the presence of these nonatomic operations.

System Disk

System disk shadow sets presented some unique design problems. The system disk must be accessed through a single bootstrap driver and hence, a single controller type. This access takes place when multihost synchronization is not possible. These two access modes occur during system bootstrap and during a crash dump write.

Shadowed Booting

The system disk must be accessed by the system initialization code executing on the booting node prior to any host-to-host communication. Since the boot drivers on many processors reside in ROM, it was impractical to make boot driver modifications to support system disk processing. To solve this problem, the system disk operations performed prior to the controller initialization routine of the system device driver are read-only. It is safe to read data from a clusterwide, shared device without synchronization when there is little or no risk of the data being modified by another node in the cluster. At controller initialization time, shadowing builds a read-only shadow set that contains only the boot member. Once locking is enabled, shadowing performs a variety of checks on the system disk shadow set to determine whether or not the boot is valid. If the boot is valid, shadowing turns the single-member, read-only set into a multimember, writable set with preserved copy states. If this node is joining an existing cluster, the system disk shadow set uses the same set as the rest of the cluster.

Crash Dumps

The primitive boot driver uses the system disk to write crash dumps when a system failure occurs. This driver only knows how to access a single physical disk in the shadow set. But since a failing node automatically triggers a merge operation on shadow sets mounted for write, we can use the merge thread to process the dump file. The merge occurs either when the node leaves the cluster (if there are other nodes present) or later, when the set is reformed. As the source for merge difference repairs, the merge process attempts to use the member to which the dump file was written and propagates the dump file to the remainder of the set. The mechanism here for dump file propagation is best-effort, not guaranteed; but since writing the dump is always best-effort, this solution is considered acceptable.

Conclusion

VMS Volume Shadowing Phase II is a state-of-the-art implementation of distributed data availability. The project team arrived at innovative solutions to problems attributable to a set of complex, conflicting goals. Digital has applied for four patents on various aspects of this technology.

Acknowledgements

I would like to acknowledge the efforts and contributions of the other members of the VMS shadowing engineering team: Renee Culver, William Goleman, and Wai Yim. In addition, I would like to acknowledge Sandy Snaman for Fork Thread Locking, Ravindran Jagannathan for performance analysis, and David Thiel for general consulting.

References

1. N. Kronenberg, H. Levy, W. Strecker, and R. Merewood, "The VAXcluster Concept: An Overview of a Distributed System," Digital Technical Journal (September 1987): 7-21.

2. W. Snaman, Jr. and D. Thiel, "The VAX/VMS Distributed Lock Manager," Digital Technical Journal (September 1987): 29-44.

3. W. Snaman, Jr., "Application Design in a VAXcluster System," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 16-26.
=============================================================================
Copyright 1991 Digital Equipment Corporation. Forwarding and copying of this article is permitted for personal and educational purposes without fee provided that Digital Equipment Corporation's copyright is retained with the article and that the content is not modified. This article is not to be distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.
=============================================================================