Application Design in a VAXcluster System

By William E. Snaman, Jr.

Digital Technical Journal Vol. 3 No. 3 Summer 1991

Abstract

VAXcluster systems provide a flexible way to configure a computing system that can survive the failure of any component. In addition, these systems can grow with an organization and can be serviced without disruption to applications. These features make VAXcluster systems an ideal base for developing high-availability applications such as transaction processing systems, servers for network client-server applications, and data sharing applications. Understanding the basic design of VAXcluster systems and the possible configuration options can help application designers take advantage of the availability and growth characteristics of these systems.

Many organizations depend on near constant access to data and computing resources; interruption of these services results in the interruption of primary business functions. In addition, growing organizations face the need to increase the amount of computing power available to them over an extended period of time. VAXcluster systems provide solutions to these data availability and growth problems that modern organizations face.[1]

This paper begins with an overview of VAXcluster systems and application design in such systems and proceeds with a detailed discussion of VAXcluster design and implementation. The paper then focuses on how this information affects the design of applications that take advantage of the availability and growth characteristics of a VAXcluster system.

Overview of VAXcluster Systems

VAXcluster systems are loosely coupled multiprocessor configurations that allow the system designer to configure redundant hardware that can survive most types of equipment failures. These systems provide a way to add new processors and storage resources as required by the organization. This feature eliminates the need either to buy nonessential equipment initially or to experience painful upgrades and application
conversions as the systems are outgrown.

The VMS operating system, which runs on each processor node in a VAXcluster system, provides a high level of transparent data sharing and independent failure characteristics. The processors interact to form a cooperating distributed operating system. In this system, all disks and their stored files are accessible from any processor as if those files were connected to a single processor. Files can be shared transparently at the record level by application software.

To provide the features of a VAXcluster system, the VMS operating system was enhanced to facilitate this data sharing and the dynamic adjustment to changes in the underlying hardware configuration. These enhancements make it possible to dynamically add multiple processors, storage controllers, disks, and tapes to a VAXcluster system configuration. Thus, an organization can purchase a small system initially and expand as needed. The addition of computing and storage resources to the existing configuration requires no software modifications or application conversions and can be accomplished without shutting down the system. The ability to use redundant devices virtually eliminates single points of failure.

Application Design in a VAXcluster Environment

Application design in a VAXcluster environment involves making some basic choices. These choices concern the type of application to be designed and the method used to synchronize the events that occur during the execution of the application. The designer must also consider application communication within a VAXcluster system. A discussion of these issues follows.

General Choices for Application Design

This section briefly describes the general choices available to application designers in the areas of client-server computing and data access.

Client-server Computing. The VAXcluster environment provides a fine base for client-server computing. Application designers can construct server applications that run on each node and accept requests from clients running on nodes in the VAXcluster system or elsewhere in a wider network.

If the node running a server application fails, the clients of that server can switch to another server running on a surviving node. The new server can access the same data on disk or tape that was being accessed by the server that failed. In addition, the redundancy offered by the VMS Volume Shadowing Phase II software eliminates data unavailability in the event of a disk controller or media failure.[2] The system is thus very available from the perspective of the client applications.

Access to Storage Devices. Many application design questions involve how to best access the data stored on disk. One major advantage of the VAXcluster system design is that disk storage devices can be accessed from all nodes in an identical manner. The application designer can choose whether the access is simultaneous from multiple nodes or from one node at a time. Consequently, applications can be designed using either partitioned data access or shared data access.

Using a partitioned data model, the application designer can construct an application that limits data access to a single node or subset of the nodes. The application runs as a server on a single node and accepts requests from other nodes in the cluster and in the network. And because the application runs on a single node, there is no need to synchronize data access with other nodes. Eliminating this source of communication latencies can improve performance in many applications. Also, if synchronization is not required, the designer can make the best use of local buffer caches and can aggregate larger amounts of data for write operations, thus minimizing I/O activity.

An application that uses partitioned data access lends itself to many types of high-performance database and transaction processing environments. VAXcluster systems provide such an application with the advantage of having a storage medium that is available to all nodes even when they are not actively accessing the data files. Thus, if the server node fails, another server running on a surviving node can assume the work and be able to access the same files. For this type of application design, VAXcluster systems offer the performance advantages of a partitioned data model without the problems associated with the failure of a single server.

Using a shared data model, the application designer can create an application that runs simultaneously on multiple VAXcluster nodes, which naturally share data in a file. This type of application can prevent the bottlenecks associated with a single server and take advantage of opportunities for parallelism on multiple processors. The VAX RMS software can transparently share files between multiple nodes in a VAXcluster system. Also, Digital's database products, such as Rdb/VMS and VAX DBMS software, provide the same data-sharing capabilities. Servers running on multiple nodes of a VAXcluster system can accept requests from clients in the network and access the same files or databases. Because there are multiple servers, the application continues to function in the event that a single server node fails.

Application Synchronization Methods

The application designer must also consider how to synchronize events that take place on multiple nodes of a VAXcluster system. Two main methods can be used to accomplish this: the VMS lock manager and the DECdtm services that provide VMS transaction processing support.

VMS Lock Manager. The VMS lock manager provides services that are flexible enough to be used by cooperating processes for mutual exclusion, synchronization, and event notification.[3] An application uses these services either directly or indirectly through components of the system such as the VAX RMS software.

DECdtm Services. The VMS operating system provides a set of services to facilitate transaction processing.[4] These DECdtm services enable the application designer to implement atomic transactions either directly or indirectly. The services use a two-phase commit protocol. A transaction may span multiple nodes of a cluster or network. The support provided allows multiple resource managers, such as the VAX DBMS, Rdb/VMS, and VAX RMS software products, to be combined in a single transaction. The DECdtm transaction processing services take advantage of the guarantees against partitioning, the distributed lock manager, and the data availability features, all provided by VAXcluster systems.

VAXcluster and Networkwide Communication Services

Application communication between different processors in a VAXcluster system is generally accomplished using DECnet task-to-task communication services or other networking software such as the transmission control protocol (TCP) and the internet protocol (IP). Client-server applications or peer-to-peer applications are easy to develop with these services. The services allow processes to locate or start remote servers and then to exchange messages.

Since the individual nodes of a VAXcluster system exist as separate entities in a wider communication network, application communication inside a VAXcluster system can rely on general network interfaces. Thus, no special-purpose communication services were developed.
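Because clients reach servers through general network interfaces, the failover behavior described earlier (a client switching to a server on a surviving node) can be expressed with ordinary networking code. The following Python sketch is an illustration only, not code from the VMS software: the function names, the node names in the usage note, and the port number 5000 are all hypothetical.

```python
import socket
from typing import Callable, Iterable, Tuple, TypeVar

Conn = TypeVar("Conn")


def connect_with_failover(
    nodes: Iterable[str], connect: Callable[[str], Conn]
) -> Tuple[str, Conn]:
    """Try each candidate server node in order.

    Returns (node, connection) for the first node that accepts the
    connection. Raises ConnectionError if no node can be reached.
    """
    candidates = list(nodes)
    last_error = None
    for node in candidates:
        try:
            return node, connect(node)
        except OSError as exc:  # refused, unreachable, timed out, ...
            last_error = exc
    raise ConnectionError(f"no server reachable among {candidates}") from last_error


def tcp_connect(node: str, port: int = 5000, timeout: float = 2.0) -> socket.socket:
    """Open a TCP connection to a server node (the port is a hypothetical value)."""
    return socket.create_connection((node, port), timeout=timeout)
```

A client might call `connect_with_failover(["NODEA", "NODEB"], tcp_connect)` and repeat the call whenever an established connection breaks. Because every server can reach the same disks, the client does not need to care which node answers.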
Applications are simpler to design when they can communicate within the cluster in the same manner in which they communicate with nodes located outside the VAXcluster system.

A DECnet feature known as cluster alias provides a collective name for the nodes in a VAXcluster system. Application software can connect to a node in the cluster using the cluster alias name rather than a specific node name. This feature frees the application from keeping track of individual nodes in the VAXcluster system and results in design simplification and configuration flexibility.

VAXcluster Design and Implementation Details

To understand how the design and implementation of a VAXcluster system affects application design, one must be familiar with the basic architecture of such a system, as shown in Figure 1. This section describes the layers, which range from the communication mechanisms to the users of the system.

Port Layer

The port layer consists of the lowest levels of the architecture, including a choice of communication ports and physical paths (buses). The VAXcluster software requires at least one logical communication pathway between each pair of processor nodes in the VAXcluster system. Several of the ports utilize multiple physical communication paths, which appear as a single logical path to the VAXcluster software. This redundancy provides better communication throughput and higher availability. If multiple logical paths exist between a pair of nodes, the VAXcluster software generally selects one for active use and relies on the remaining paths for backup in the event of failure.

The port layer can contain any of the following interconnects:

o Computer Interconnect (CI) bus
o Ethernet
o Fiber distributed data interface (FDDI)
o Digital Storage Systems Interconnect (DSSI) bus

Each bus is accessed by a port (also called an adapter) that connects to the processor node. For example, the CI bus is accessed by way of a CI port. The various buses provide a wide spectrum of choices in terms of wire and adapter capacity, number of nodes that can be attached, distance between nodes, and cost.[5]

The CI bus was designed for access to storage and for reliable host-to-host communications. Each CI port connects to two redundant, high-speed physical paths. The CI port dynamically selects one of the two paths for each transmitted message. Messages are received on either path. Thus, two nodes can communicate on one path at the same time that two other nodes communicate on the other. If one physical path fails, the port simply uses the remaining path. The existence of the two physical paths is hidden from the software that uses the CI port services. From the standpoint of the cluster software, each port represents a single logical path to a remote node. Multiple CI ports can be used to provide multiple logical paths between pairs of nodes. An automatic load-sharing feature distributes the load between pairs of ports.

The DSSI bus was primarily designed for access to disk and tape storage. However, the bus has proven an excellent way to connect small numbers of processors using the VAXcluster protocols. Each DSSI port connects to a single high-speed physical path. As in the case of the CI bus, several DSSI ports may be connected to a node to provide redundant paths. (Note that the KFQSA DSSI port is for storage access only and provides no general communication service between nodes.)

Ethernet and FDDI are open local area networks, generally shared by a wide variety of consumers. Consequently, the VAXcluster software was designed to use the Ethernet and FDDI ports and buses simultaneously with the DECnet or TCP/IP protocols. This is accomplished by allowing the Ethernet data link software to control the hardware port. This software provides a multiplexing function such that the cluster protocols are simply another user of a shared hardware resource. Each Ethernet and FDDI port connects to a single physical path. There may be more than one port on each processor node. This means that there may be many separate paths between any pair of nodes when multiple ports are used. The port driver software combines the multiple Ethernet and FDDI paths into a single logical path between any pair of nodes. The load is automatically distributed among the various possible physical paths by an algorithm that chooses the best path in terms of adapter capacity and path latency.[6]

System Communications Services Layer

The system communications services (SCS) layer of the VAXcluster architecture is implemented in a combination of hardware and software or software only, depending upon the type of port. The SCS layer manages a logical path between each pair of nodes in the VAXcluster system. This logical path consists of a virtual circuit (VC) between each pair of SCS ports and a set of SCS connections that are multiplexed on that virtual circuit. The SCS provides basic connection management and communication services in the form of datagrams, messages, and block transfers over each logical path.

The datagram is a best-effort delivery service which offers no guarantees regarding loss, duplication, or ordering of datagram packets. This service requires no connection between the communicating nodes. In general, the VAXcluster software makes minimal use of the datagram service.

The message and block transfer services take place over an SCS connection. Consumers of SCS services communicate with their counterparts on remote nodes using these connections. Multiple connections are multiplexed on the logical path provided between each pair of nodes in the VAXcluster system.

The message service is reliable and guarantees that there will be no loss, duplication, or permutation of message sequence on a given connection. The connection will break rather than allow the consumer of the service to perceive such errors.

The block transfer service provides a way to transfer quantities of data directly from the memory of one node to that of another. For CI ports, the port hardware accomplishes the block transfer, thus freeing the host processor to perform other tasks. Some DSSI ports use hardware to copy data and others rely on software to perform this function. Depending on the exact model of an Ethernet or FDDI port, the port software, rather than the hardware, moves the data.

System Applications

The next higher layer in the VAXcluster architecture consists of multiple system applications (SYSAPs). These applications provide, for example, access to disks and tapes and cluster membership control. The following sections describe some SYSAPs.

Connection Manager. The connection manager serves three major functions. First, the connection manager knows which processor nodes are active members of the VAXcluster system and which are not. This is accomplished through a concept of cluster "membership." Nodes are explicitly added to and removed from the active set of nodes by a distributed software algorithm. In a VAXcluster system, every processor node must have an open SCS connection to all other processor nodes. Once a booting node establishes connections to all other nodes currently in the VAXcluster system, this node can request admission to the system. When one node is no longer able to communicate with another node, one of the two nodes
must be removed from the VAXcluster system.

In a VAXcluster system, all nodes have a consistent view of the cluster membership in the presence of permanent and temporary communication failures. This consistency is accomplished by using a two-phase commit protocol to form the cluster, add new nodes, and remove failed nodes.

The second function provided by the connection manager is an extension of the SCS message service. This extension guarantees that the service will (1) deliver a message to a remote node or (2) remove either the sending node or the receiving node from the cluster. The strong notion of cluster membership provided by the connection manager makes this guarantee possible. The service attempts to deliver the queued messages to remote nodes. If a connection breaks, the service attempts to reestablish communication to the remote node and resend the message. After a period of time specified by the system manager, the service declares the connection irrevocably broken and removes either the sending or the receiving node from the VAXcluster membership. Thus, the service hides all temporary communication failures from its client.

This message service allows users to construct efficient protocols that do not require acknowledgment of messages. The service proved to be a very powerful tool in the design of the VMS lock manager. The delivery guarantees inherent in the service minimize the number of messages required to perform any given locking function, resulting in a corresponding increase in performance. The ability to hide failures by updating cluster membership further simplified the lock manager design and increased performance; this capability enabled the removal of logic used to handle changes in VAXcluster configurations and communication errors from all main lock manager code paths.

The third function of the connection manager is to prevent partitioning of the possible cluster members. Partitioning of a system exists when separate processing elements function independently. If a system allows data sharing, completely independent processing can result in uncoordinated access to shared resources and lead to data corruption.

In a VAXcluster system, processors communicate and coordinate access to resources by means of a voting algorithm. The system manager assigns a number of votes to each processor node based on the importance of that node. The system manager also informs each node of the total number of possible votes. The algorithm requires that more than half of these votes be present in a VAXcluster system for nodes to function. When the sum of all votes contributed by the members of a VAXcluster system falls below this quorum, the VMS software blocks I/O to mounted devices and prevents the scheduling of processes. As nodes join the cluster, votes are added. Activity resumes once a quorum is reached.

In practice, the connection manager uses two measurements of the number of votes: static and dynamic. The static count of votes is the globally agreed on number of votes contributed by cluster members. This count is created ignoring the state of connections between nodes. The value of the static quorum changes only at the completion of two-phase commit operations, which accomplish a user-requested quorum adjustment in addition to performing the other activities mentioned earlier in this Connection Manager section.

Each node independently maintains the dynamic count. This count represents the sum of all votes contributed by VAXcluster members with which the tallying node has a functional connection. Changes in the dynamic quorum, and not the static quorum, initiate the blockage of process and I/O activity.

To provide configurations with a small number of nodes, e.g., two nodes, the concept of a quorum disk was invented. The system manager assigns a disk to contribute votes to the cluster. A node must be able to access a file on the disk in order to include the votes assigned to that disk in the node's own total. Consequently, a special algorithm is used to access the file. This algorithm ensures that two unrelated nodes cannot both count the quorum disk votes. Doing so could result in partitioned operation.

Mass Storage Control Protocol Server. The Mass Storage Control Protocol (MSCP) server allows disks that are attached to one or more VAX processors to be accessed by other processors in the VAXcluster system. Thus, a VAXcluster processor may emulate a multihost disk controller by accepting and processing I/O requests from other nodes and accessing the disk indicated by the request. The server can process multiple commands simultaneously and also performs fragmentation of commands if there is not enough system buffer space to accommodate the entire amount of data at one time.

Hierarchical Storage Controllers, Local Controllers, and RF-series Integrated Storage Elements. Hierarchical storage controller (HSC) servers are specialized devices that perform MSCP serving of RA-series disk drives and TA-series tape drives in a VAXcluster system. HSC servers connect directly to the CI bus. In addition to providing the host with access to the storage media, HSC servers accomplish performance optimizations such as seek-ordering and request fragmentation based on real-time head position information. The local disk controllers attached to the RA- and TA-series storage devices perform the same function for a single host processor. The RF-series integrated storage elements (ISEs) attach to a DSSI bus. Each of these disk storage devices performs its own command queuing and optimization without using a dedicated controller.

Disk Class Driver. The disk class driver allows access to disks served by an MSCP server, an HSC controller, a local Digital storage architecture (DSA) controller, or attached to a DSSI bus. This driver provides a command queuing function that allows a disk controller to have multiple outstanding commands which can be used to provide seek, rotation, and other performance optimizations. To handle temporary communication interruptions, the driver restarts commands as needed.

VAXcluster systems can be configured so that all disks are accessed by way of redundant paths for increased availability. The way in which this is accomplished depends on the type of disk and the disk controller.

RF-series disks contain integrated controllers that connect to a single DSSI storage bus. This bus can be accessed by up to two VAX processors. Each VAX processor can then serve the disks to all other nodes in the VAXcluster system. Thus, two paths are provided to each disk.

RA-series disks connect to up to two storage controllers. These controllers can be either (1) local adapters attached directly to a single processor node or (2) HSC controllers located on the CI bus. Disks connected to local adapters can be served to other nodes of the VAXcluster system. Disks located on an HSC controller can be directly accessed by processors that are not on that bus. Thus, the use of multiple controllers when combined with disk serving provides at least two paths to a disk from every node in the VAXcluster system.

Since many paths exist to gain access to a disk, the disk class driver chooses which path to use when a disk is initially mounted by a node. If the path to the disk becomes
If the interruptions, the driver path to the disk becomes 10 Digital Technical Journal Vol. 3 No. 3 Summer 1991 Application Design in a VAXcluster System inoperative, the disk class Lock Manager. The VMS lock driver locates another manager is a system service path and begins to use that provides a distributed it. Server load and type synchronization function of path, i.e., local or used by many components of remote, are considered when the VMS operating system, selecting the new path. including volume shadowing, This reconfiguration is the file system, VAX RMS totally transparent to the software, and the batch end user of the disk I/O /print system. Application service. programs can also use the Tape Class Driver. The lock manager directly. tape class driver performs The lock manager provides functions in a VAXcluster a name space that is truly system similar to those of clusterwide. Cooperating the disk class driver by processes can request providing access to tapes locks on a specific located on HSC controllers, resource name. The lock local controllers, and DSSI manager either grants or buses. denies these requests. VMS Components Layered on Processes can also queue Top of SYSAPs requests. The lock manager services allow processes The SYSAPs provide basic to coordinate the means services that other VMS of access to physical components use to provide resources or simply a wide range of VAXcluster provide a communication features. pathway between processes. Volume Shadowing. The Processes can use the volume shadowing product service for such tasks allows multiple disks as mutual exclusion, event to be utilized as a notification, and server single, highly available failure detection.[2,7] disk. 
Volume shadowing The lock manager uses provides transparent the communication service access to the data in provided by the connection the event of disk media manager to minimize the or controller failures, message count for a media degradation, and given operation and to communication failures.[2] simplify the design by The shadowing layer works eliminating the need to in conjunction with the consider changes in cluster disk class driver to membership from all main accomplish this task. With paths of operation. the advent of VMS Volume Process Control Services. Shadowing Phase II, disk The VMS process control shadowing is extended to system services take many new configurations. advantage of VAXcluster systems. Applications can use these services to alter process states on remote nodes and to collect Digital Technical Journal Vol. 3 No. 3 Summer 1991 11 Application Design in a VAXcluster System information about those down into various phases processes. In the future, such as fetch sources, it is likely that other compile, and link. The services will be extended phases must execute in to make optimal use of a given order but are VAXcluster capabilities. otherwise independent. File System. The VMS file Each phase can be restarted system (XQP) allows disk from the beginning if devices to be accessed there is an error. Each by multiple nodes in a major component of the VAXcluster system. The VMS operating system is file system uses the lock processed separately during manager to coordinate disk each of the phases. All space allocation, buffer sources reside on a shared caches, modification disk to which all nodes of file headers, and of the VAXcluster system changes to the directory have access; the output structure.[8] disk is shared by all nodes also. A master data file Record Management Services. describes the phases and The VAX RMS software allows the components. 
For a the sharing of file data given phase, the actions by processes running on the required for each component same or multiple nodes. are fed into a generic The software uses the batch queue. This queue lock manager to coordinate feeds the jobs into work access to files, to record queues on multiple nodes, data within files, and to resulting in the execution global buffers. of many jobs in parallel. Batch/Print System. The When all jobs of a phase batch/print system allows have completed, the next users to submit batch or phase starts. If a node print jobs on one node and fails during the execution run them on another. This of a job, that job is system provides a form of restarted automatically load distribution, i.e., on another node either from generic batch queues can the beginning or from a feed executor queues on checkpoint in the job. This each node. Jobs running use of shared disks and on a failed node can be batch queues provides great restarted automatically parallelism and reliability on another node in the in the VMS build process. VAXcluster system. An Application Constructed The Impact of VAXcluster Using VAXcluster Mechanisms Design and Implementation on Applications The VMS software build This section discusses process is an example how multiple communication of how these mechanisms paths, membership changes, can be used to benefit disk location and application design. The VMS availability, controller software build is broken selection, disk and tape 12 Digital Technical Journal Vol. 3 No. 3 Summer 1991 Application Design in a VAXcluster System path changes, and disk Removing a node is more failure impact application complicated because both design. failure detection and Multiple Communication reconfiguration must take Paths place. In many cases, there may be multiple VAXcluster software simultaneous failures of components are able to nodes and communication take advantage of multiple paths. 
The view of what communication paths nodes are members and which between nodes. For greatest paths are functional may availability, there should be very different from be at least two physical each node. Additionally, paths between each pair new failures may occur of nodes in a VAXcluster while the cluster is being system.[6] reconfigured. Membership Changes The initial phase involves VAXcluster membership the detection of a node changes involve several failure. A node may cease distinct phases with slight processing, but other variations depending upon cluster members may not whether a node is being be aware of this fact. The added or removed. Adding communication components a node to a VAXcluster generally exchange messages system is the simplest periodically to determine case because it involves whether other nodes are reconfiguration. There is a functioning. The first further simplification in indication of a failure may that nodes are only added be the lack of response to one at a time. A booting these messages. However, node petitions a member a minimum period of time of an existing cluster for must elapse before the membership. This member connection is declared then describes the booting inoperative. This set node to all other member delay prevents breaking nodes and vice versa. In connections when the this way, it is determined network or remote system that the booting node is is unable to respond due in communication with all to a heavy load. Once the members of the cluster. communication failure is The connection manager then detected, the connection adds the new node to the manager is notified by the cluster using a two-phase SCS communication layer. commit protocol to ensure a The connection manager consistent membership view attempts to restore the from all nodes. connection for a time interval defined by the system manager using a system control parameter known as RECNXINTERVAL. 
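The detection sequence can be pictured as a small state machine: a connection stays open while messages arrive, enters a reconnection window after a set period of silence, and is given up only when that window (RECNXINTERVAL) passes without contact. The Python sketch below is a simplified illustration under stated assumptions, not the VMS implementation: the class, the state names, and the `silence_limit` field are hypothetical, and only the name RECNXINTERVAL comes from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LinkState(Enum):
    OPEN = "open"
    RECONNECTING = "reconnecting"
    BROKEN = "broken"


@dataclass
class ConnectionMonitor:
    silence_limit: float   # hypothetical: silence (seconds) before a failure is suspected
    recnxinterval: float   # reconnection window, named after the system parameter
    last_heard: float = 0.0
    state: LinkState = LinkState.OPEN
    failure_detected_at: Optional[float] = None

    def heard(self, now: float) -> None:
        """Record a message from the remote node at time `now`."""
        if self.state is LinkState.BROKEN:
            return  # a removed node must rejoin; the old connection is not reused
        self.last_heard = now
        self.state = LinkState.OPEN
        self.failure_detected_at = None

    def tick(self, now: float) -> LinkState:
        """Re-evaluate the connection state at time `now`."""
        if self.state is LinkState.OPEN:
            if now - self.last_heard >= self.silence_limit:
                self.state = LinkState.RECONNECTING
                self.failure_detected_at = now
        elif self.state is LinkState.RECONNECTING:
            if now - self.failure_detected_at >= self.recnxinterval:
                self.state = LinkState.BROKEN
        return self.state
```

A heavily loaded node that answers late but within the reconnection window returns to the open state, which is the reason the text gives for the set delay.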
Once the RECNXINTERVAL interval has expired, the connection and hence the remote node is declared inoperative. The connection manager then begins a reconfiguration.

Multiple nodes may attempt the reconfiguration at the same time. A distributed election algorithm is used to select a node to propose the new configuration. The elected node proposes to all other nodes that it can communicate with a new cluster configuration that consists of the "best" set of nodes that have connections between each other. "Best" is determined by the greatest number of possible votes. If multiple configurations are possible with the same number of votes, the configuration with the most nodes is selected.

Any node that receives the proposal and can describe a better cluster rejects the proposal. The proposing node then withdraws the proposal and the election process begins again. This cycle continues until all nodes accept the proposal. The cluster membership is then altered using a two-phase commit protocol, removing nodes as required.

Even when one considers the worst case of a continual failure situation, convergence on a solution is guaranteed because the connection manager does not add new nodes during a reconfiguration and connections that fail are never used again. Thus, conditions cannot oscillate between good and bad during the reconfiguration because of nodes rebooting or because failed connections are restored. Conditions can only get worse, i.e., simpler, until failures cease to happen long enough for the reconfiguration to complete.

However, this worst-case condition is atypical; most reconfigurations are very simple. A node that is removed, as a result of a planned shutdown or because it fails, attempts to send a "last gasp" datagram to all VAXcluster members. This datagram indicates that the node is about to cease functioning. The delay present during the failure detection phase is bypassed completely, and the connection manager configures a new VAXcluster system in considerably less than one second.

Normally, the impact on an application of a node joining a VAXcluster system is minimal. For some configurations, there is no blockage of locking. In other cases, the distributed directory portion of the lock database must be rebuilt. This process may block locking for up to a small number of seconds, depending on the number of nodes, number of directory entries, and type of communication buses in use.

Application delays can result when an improperly dismounted disk is mounted by a booting node. Failure to properly dismount the disk, e.g., because of a node failure, results in the temporary loss of some preallocated resources such as disk blocks and header blocks. An application can recover these resources when the disk is mounted, but the I/O is blocked to the disk during the mounting operation. This I/O blocking has a potentially detrimental impact on applications that are attempting to allocate space on the disk. The answer to this problem is to mount disks so that the recovery of the preallocated resources is deferred. For all disks except the system disk, disk mounting is accomplished with the MOUNT/NOREBUILD command. Because a system disk is implicitly mounted during a system boot, the system parameter ACP_REBLDSYSD must be set to the value 0 to defer rebuilds. The application can recover the resources at a more opportune time by issuing a SET VOLUME/REBUILD command.

The impact on a VAXcluster system of removing a node varies depending on what resources the application needs. During the failure detection phase, messages to a failed node may be queued pending discovery that there actually is a failure. If the application needs a response based on one of these messages, the application is blocked. Otherwise, the failure does not affect the application. Once the reconfiguration starts, locking is blocked. An application using the lock manager may experience a delay, but as long as there are sufficient votes present in the cluster to constitute a quorum, the I/O is not blocked during the reconfiguration. If the number of votes drops below a quorum, I/O and process activity are blocked to prevent partitioning and possible data corruption.

Another aspect of node removal is the need to ensure that all I/O requests initiated by the removed node complete prior to the initiation of new I/O requests to the same disks. To enhance disk performance, many disk controllers can reduce head movements by altering the order of simultaneously outstanding commands. This command reordering is not a problem during normal operation; applications initiating I/O requests coordinate with each other using the lock manager, for instance, so that multiple writes, or multiple reads and writes, to the same disk location are never outstanding at the same time. However, when a node fails, all locks held by processes running on that node are released. Releasing these locks allows the granting of locks that are waiting and the initiation of new I/O requests. If new locks are granted, a disk controller may move the new I/O requests (issued under the new locks) in front of old I/O requests. To prevent this reordering, a special MSCP command is issued by the connection manager to each disk before new locks are granted. This command creates a barrier for each disk that ensures that all old commands complete prior to the initiation of new commands.

Physical Location and Availability of Disks

The application designer does not generally have to be concerned with the physical location of a disk in a VAXcluster system. Disks located on HSC storage controllers are directly available to VAX processors on the same CI bus. These disks can then be MSCP-served to any VAX processor that is not connected to that bus. Similarly, disks accessed by way of a local disk controller on a VAX processor can be MSCP-served to all other nodes. This flexibility allows an application to access a disk regardless of physical location. The only differences that the application can detect are varying transfer rates and latencies, which depend on the exact path to the disk and the type of controllers involved.

To provide the best application availability, the following guidelines should be considered:

1. VMS Volume Shadowing Phase II should be used to shadow disks, thus allowing operations to continue transparently in the event that a single disk fails.

2. Multiple paths should exist to any given disk. A disk should be dual-pathed between multiple controllers. Dual pathing allows the disk to survive controller failures.

3. Members of the same shadow set should be connected to different controllers or buses as determined by the type of disk.

4. Multiple servers should be used whenever serving disks to a cluster in order to provide continued disk access in the event of a server failure.

Selection of Controllers

Using static load balancing, the VMS software attempts to select the optimal MSCP server for a disk unit when that unit is initially brought on line. The load information provided by the MSCP server is considered in this decision. The HSC controllers do not participate in this algorithm. In addition, the VMS software selects a local controller in preference to a remote MSCP server, where possible. If a remote server is in use and the disk becomes available by way of a local controller, the software begins to access the disk through the local controller. This feature is known as local fail-back.

An advanced development effort in the VMS operating system is demonstrating the viability of dynamic load balancing across MSCP servers. Load balancing considers server loading dynamically and moves disk paths between servers to balance the load among the servers.

Disk and Tape Path Changes

Path failures are initially detected by the low-level communication software, i.e., the SCS or port layers. The communications software then notifies the disk or tape class driver of the failure. The driver then transparently blocks the initiation of new I/O requests to the device, prepares to restart outstanding I/O operations, and begins a search for a new path to the device. Static load balancing information is considered when attempting to find a new path. The path search is accomplished by sending an MSCP GET UNIT STATUS command to any known disk controller or MSCP server capable of serving the device. Some consideration is given to selecting the optimal controller; for example, the driver interrogates local controllers before remote controllers.

Once a new path is discovered or the old path reestablished, the VMS system checks the volume label to ensure that the disk or tape volume has not been changed on the device. This verification prevents data corruption in the event that someone substitutes the storage medium without dismounting and remounting the device. After a successful check, the software restarts incomplete I/O requests and allows stalled I/O requests to proceed. In the case of tapes, the tape must be repositioned to the correct location before restarting I/O requests.

If the label check determines that the original medium is no longer on the disk or tape unit, then I/O requests continue to be stalled and a message is sent to the operator requesting manual intervention to correct the problem. Attempts to reestablish the correct operation of a disk or tape continue for an interval determined by the system parameter MVTIMOUT (mount verification time-out). Once the time-out period expires, further attempts to restore are abandoned and pending requests are returned to the application with an error status. Thus, the software handles temporary disk path failures in such a transparent fashion that the application program, e.g., the user application, VAX RMS software, or the VMS file system, is unaware that an interruption occurred.

Disk Failures

If a disk fails completely when VMS Volume Shadowing Phase II software is used, the software removes the failed disk from the shadow set and satisfies all further I/O requests using a surviving disk. If a block of data cannot be recovered from a disk in a shadow set, the software recovers the data from the corresponding block on another disk, returns the data to the user, and places the data on the bad disk so that subsequent reads will obtain the good data.[2]

Summary

VAXcluster systems continue to provide a unique base for building highly available distributed systems that span a wide range of configurations and usages. In addition, VAXcluster computer systems can grow with an organization. The availability, flexibility, and growth potential of VAXcluster systems result from the ability to add or remove storage and processing components without affecting normal operations.

References

1. N. Kronenberg, H. Levy, and W. Strecker, "VAXclusters: A Closely-coupled Distributed System," ACM Transactions on Computer Systems, vol. 4, no. 2 (May 1986): 130-146.

2. S. Davis, "Design of VMS Volume Shadowing Phase II-Host-based Shadowing," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 7-15.

3. W. Snaman, Jr. and D. Thiel, "The VAX/VMS Distributed Lock Manager," Digital Technical Journal, no. 5 (September 1987): 29-44.

4. W. Laing, J. Johnson, and R. Landau, "Transaction Management Support in the VMS Operating System Kernel," Digital Technical Journal, vol. 3, no. 1 (Winter 1991): 33-44.

5. Guidelines for VAXcluster System Configurations (Maynard: Digital Equipment Corporation, Order No. EK-VAXCS-CG-004, 1990).

6. L. Leahy, "New Availability Features of Local Area VAXcluster Systems," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 27-35.

7. T.K. Rengarajan, P. Spiro, W. Wright, "High Availability Mechanisms of VAX DBMS Software," Digital Technical Journal, no. 8 (February 1989): 88-98.

8. A. Goldstein, "The Design and Implementation of a Distributed File System," Digital Technical Journal, no. 5 (September 1987): 45-55.

=============================================================================
Copyright 1991 Digital Equipment Corporation. Forwarding and copying of this article is permitted for personal and educational purposes without fee provided that Digital Equipment Corporation's copyright is retained with the article and that the content is not modified. This article is not to be distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.
=============================================================================