Verification of the First Fault-tolerant VAX System

By William F. Bruckert, Carlos Alonso and James M. Melvin

Digital Technical Journal Vol. 3 No. 1 Winter 1991

Abstract

The fault-tolerant character of the VAXft 3000 system required that plans be made early in the development stages for the verification and test of the system. To ensure proper test coverage of the fault-tolerant features, engineers would build fault-insertion points directly into the system hardware. In the VAXft 3000 verification process, test engineers would use hardware and software fault insertion in directed and random test forms. A four-phase verification strategy was devised that would ensure the VAXft system hardware and software were fully tested for error recovery that is transparent to applications on the system.

Introduction

The VAXft 3000 system provides transparent fault tolerance for applications that run on the system. Because the 3000 includes fault-tolerant features, verification of the system was unlike that ordinarily conducted on VAX systems. To facilitate system test, the verification strategy outlined a four-phase approach which would require hardware to be built into the system specifically for test purposes.

This paper presents a brief overview of the VAXft system architecture and then describes the methods used to verify the system's fault tolerance.

Architectural Overview

The VAXft fault-tolerant system is designed to recover from any single point of hardware failure. Fault tolerance is provided transparently for all applications running under the VMS operating system. This section reviews the implementation of the system to provide background for the main discussion of the verification process.

The system comprises two duplicate systems, called zones. Each zone is a fully functional computer with enough elements to run an operating system. These two zones, referred to as zone A and zone B, are shown in Figure 1, which illustrates the duplication of the system components.

The two independent zones are connected by duplicate cross-link cables. The cabinet of each zone also includes battery, power regulator, cooling fans, and AC power input. Each zone's hardware has sufficient error checking to detect all single faults within that zone.

Figure 2 is a block diagram of a single zone with one I/O adapter. Note the portions of the zone labeled dual-rail and single-rail. The dual-rail portions of the system have two independent sets of hardware performing the same operations. Correct operation is verified by comparison. The fault-detection mechanism for the single-rail I/O modules combines checking codes and communication protocols.

The system performs I/O operations by sending and receiving message packets. The packets are exchanged between the CPU and various servers that include disks, Ethernet, and synchronous lines. These message packets are formed and interpreted in the dual-rail portion of the system. They are protected in the single-rail portion of the machine by check codes which are generated and checked in the dual-rail portion of the machine. Corrupted packets can be retransmitted through the same or alternate paths.

In the normal mode of fault-tolerant operation, both zones execute the same instruction at the same time. The four processors (two in each zone) appear to the operating system as a single logical CPU. The hardware supplies the detection and recovery facilities for faults detected in the CPU and memory portions of the system. A defective CPU module and its memory are automatically removed from service by the hardware, and the remaining CPU and memory continue processing.

Error handling for the I/O interconnections is managed differently. The paths to and from I/O adapters are duplicated for checking purposes. If a fault is detected, the hardware retries the operation. If the retry is successful, the error is logged, and operation continues without software assistance. If the retry is unsuccessful, the Fault-tolerant System Services (FTSS) software performs error recovery. FTSS is a layered software product that is utilized with every VAXft 3000. It provides the software necessary to complete system error recovery. For system recovery from a failed I/O device, an alternate path or device is used. All recoverable faults have an associated maximum threshold value. If this threshold is exceeded, FTSS performs appropriate device reconfiguration.

Verification of a Fault-tolerant VAX System

This section discusses the types of system tests and the fault-insertion techniques used to ensure the correct operation of the VAXft system. In addition, the four-phase verification strategy and the procedures involved in each phase are reviewed.

There are two types of system tests: directed and random. Directed tests, which test specific hardware or software features, are used most frequently in computer system verification and follow a strict test sequence. Complex systems, however, cannot be completely verified in a directed fashion.[1] As a case in point, an operating system running on a processor has innumerable states. Directed tests verify functional operation under a particular set of conditions. They may not, however, be used to verify that same functionality under all possible system conditions.

In comparison, random testing allows multiple test processes to interact in a pseudo-random or random fashion. In random testing, test coverage is increased with additional run-time. Thus, once the proper test processes are in place, the need to develop additional tests in order to increase coverage is eliminated. This type of testing also reduces the effects of the biases of the engineers generating the tests. While directed testing can provide only a limited level of coverage, this coverage level can be well understood. Random testing offers a potentially unbounded level of coverage; however, quantifying this coverage is difficult if not impossible.

To achieve the proper level of verification, the VAXft verification utilized a balance of directed and random testing. Directed testing was used to achieve a certain base level of functionality, and random testing was used to expand the level of coverage.

To permit testing of system fault tolerance in a practical amount of time, some form of fault insertion is required. The reliability of components used in computer systems has been improving, and more importantly, the number of components used to implement any function has been dramatically decreasing. These factors have produced a corresponding reduction in system failure rates. Given the high reliability of today's machines, it is not practical from a verification standpoint to verify a system by letting it run until failures occur.

Conceptually, faults can be inserted in two ways. First, memory locations and registers can be corrupted to mimic the results of gate-level faults (software fault insertion). Second, gate-level faults may be inserted directly into the hardware (hardware fault insertion). There are advantages to both techniques. One advantage of software-implemented fault insertion is that no embedded hardware support is required.[2] The advantage of hardware fault insertion, on the other hand, is that faults are more representative of actual hardware failures and can reveal unanticipated side effects from a gate-level failure.

To utilize hardware fault insertion, a mechanism must either be designed into the system, or an external insertion device must be developed once the hardware is available. Given the physical feature size of the components used today, it is virtually impossible to achieve adequate fault-insertion coverage through an external fault-insertion mechanism.

The error detection and recovery mechanism determines which fault-insertion technique is suitable for each component. Some examples illustrate this point. For the lockstep portion of the VAXft 3000 CPUs, software fault insertion is not suitable because the lockstep functionality prevents corruption of memory or registers when faults occur. Therefore, hardware faults cannot be mimicked by modifying memory contents. However, the software fault-insertion technique was suitable to test the I/O adapters since the system handles faults in the adapters by detecting the corruption of data. Hardware fault insertion was not suitable because the I/O adapters were implemented with standard components that did not support hardware fault insertion.

Because the verification strategy for the 3000 was considered a fundamental part of the system development effort, fault-insertion points were built directly into the system hardware. The amount of logic necessary to implement fault insertion is relatively small. The goals of the fault-insertion hardware were to

o Eliminate any corruption of the environment under test that could result from fault insertion. For example, if a certain type of system write operation is required to insert a fault, then every test case will be done on a system that is in a "post fault-insertion" state.

o Enable the user to distribute faults randomly across the system

o Allow insertion of faults during system operation

o Enable testing of transient and solid faults

The fault-insertion points are accessed through a separate serial interface bus that is isolated from the operating hardware. This separate interface ensures that the environment under test is unbiased by fault insertion.

Even with hardware support for fault insertion, only a small number of fault-insertion points can be implemented relative to the total number possible. Where the number of fault-insertion points is small, the selection of the fault-insertion points is important to achieve a random distribution. Fault-insertion points were designed into most of the custom chips in the VAXft system. When choosing the fault-insertion points, a single bit of a data path was considered sufficient for data path coverage. Since a significant portion of the chip area is consumed by data paths, a high level of coverage of each chip was achieved with relatively few fault-insertion points. The remaining fault-insertion points could then be applied to the control logic. Coverage of this logic was important because control logic faults result in error modes that are more unpredictable than data path failures.

The effect that a given fault has on the system depends on the current system operation and when in that operation the fault was inserted. In the 3000, for example, a failure of bit 3 in a data path will have significantly different behavior depending upon whether the data bit was incorrect during the address transmission portion of a cycle or during the succeeding data portion. Therefore, the timing of the fault insertion was pseudo-random. The choice of pseudo-random insertion was based on the fact that the fault-insertion hardware operated asynchronously to the system under test. This meant that faults could be inserted at any time, without correlation to the activity of the system under test.

Faults may be transient or solid in nature. For design purposes, a solid fault was defined as a failure that will be present on retry of an operation. A transient fault was defined as a fault that will not be present on retry of the operation. Transient faults do not require the removal of the device that experienced the fault; solid faults do require device removal. Since the system reacts differently to transient and solid (hard) faults, both types of faults had to be verified in the VAXft system. Therefore, it was required that the fault-insertion hardware be capable of inserting solid or transient faults. Solid faults were inserted by continually applying the fault-insertion signal. Transient faults were inserted by applying the fault-insertion signal only until the machine detected an error.

As noted earlier, the verification strategy utilized both hardware and software fault insertion. The hardware fault-insertion mechanisms allowed faults to be inserted into any system environment, including diagnostics, exercisers, and the VMS operating system. As such, they were used for initial verification as well as regression testing of the system.

The verification strategy for the 3000 involved a multiphase effort. Each of the four verification phases built upon the previous phases.

1. Hardware verification under simulation

2. Hardware verification with system exerciser and fault insertion

3. System software verification with fault insertion

4. System application verification with fault insertion

Figure 3 shows the functional layers of the VAXft 3000 system in relation to the verification phases. The numbered brackets to the right of the diagram correlate to the testing coverage of each layer. For example, the system software verification, phase 3, verified the VMS system, Fault-tolerant System Services (FTSS), and the hardware platform.

The following sections briefly describe the four phases of the VAXft verification.

Hardware Verification under Simulation

Functional design verification using software simulation is inherently slow in a design as large as the VAXft 3000 system. To use resources most efficiently, a verification effort must incorporate a number of different modeling levels, which means trading off detail to achieve other goals such as speed.[3]

VAXft 3000 simulation occurred at two levels: the module level and the system level. Module-level simulation verified the base functionality of each module. Once this verification was complete, a system-level model was produced to validate the intermodule functionality. The system-level model consisted of a full dual-rail, dual-zone system with an I/O adapter in each zone. At the final stage, full system testing was performed.

Over 500 directed error test cases were developed for gate-level system simulation. For each test, the test environment was set up on a fully operational system model and then the fault was inserted. A simulation controller was developed to coordinate the system operations in the simulation environment. The simulation controller provided the following control over the testing:

o Initialization of all memory elements and certain system registers to reduce test time

o Setup of all memory data buffers to be used in testing

o Automated test execution

o Automated checking of test results

o Log of test results

For each test case, the test environment was selected from the following: memory testing, I/O register access, direct memory access (DMA) traffic, and interrupt cycles. In any given test case, any number of the previous tests could be run. These environments could be run with or without faults inserted. In addition, each environment consisted of multiple test cases. In an error handling test case, the proper system environment required for the test was set, and then the fault was inserted into the system.

The logic simulator used was designed to verify logic design. When an illegal logic condition was detected, it produced an error response. When a fault insertion resulted in an illegal logic condition, the simulator responded by invalidating the test. Because of this, a great deal of time was spent to ensure that faults were inserted in a way that would not generate illegal conditions. Each test case was considered successful only when the system error registers contained the correct data and the system had the ability to continue operation after the fault.

Hardware Verification with System Exerciser and Fault Insertion

After the prototypes were available, the verification effort shifted from simulation to fault insertion on the hardware. The goal was to insert faults using an exerciser that induced stressful, reproducible hardware activity and that allowed us to analyze and debug the fault easily.

Exerciser test cases were developed to stress the various hardware functions. The tests were designed to create maximum interrupt and data transfer activity between the CPU and the I/O adapters. These functions could be tested individually or simultaneously. The exerciser scheduler provided a degree of randomness such that the interaction of functions was representative of a real operating system. The fault-insertion hardware was used to achieve a random distribution of fault cases across the system. Because it was possible to insert initial faults while specific functions were performed, a great degree of reproducibility was achieved that aided debug efforts. Once the full suite of tests worked correctly, fault insertion was performed while the system continually switched between all functions. This testing was more representative of actual faults in customer environments, but was less reproducible.

As previously mentioned, the hardware fault-insertion tool allowed the insertion of both transient and solid failures. The VAXft 3000 hardware recovers from transient failures and utilizes software recovery for hard failures. Since the goal of phase 2 testing was to verify the hardware, the focus was on transient fault insertion. Two criteria for each error case determined the success of the test. First and foremost, the system must continue to run and to produce correct results. Second, the error data that the system captures must be correct based on the fault that was inserted. Correct error data is important because it is used to identify the failing component both for software recovery and for servicing.

Although the simulation environment of phase 1 was substantially slower than phase 2, it provided the designers with more information. Therefore, when problems were discovered on the prototypes used in phase 2, the failing case was transferred to the simulator for further debug. The hardware verification also validated the models and test procedures used in the simulation environment.

System Software Verification with Fault Insertion

In parallel with hardware verification, the VAXft 3000 system software error handling capabilities were tested. This phase represented the next higher level of testing. The goal was to verify the VAX functionality of the 3000 as well as the software recovery mechanisms.

Digital has produced various test packages to verify VAX functionality. Since the VAXft 3000 system incorporates a VAX chip set used in the VAX 6000 series, it was possible to use several standard test packages that had been used to verify that system.[1]

Fault-tolerant verification, however, was not addressed by any of the existing test packages. Therefore, additional tests were developed by combining the existing functional test suite with the hardware fault-insertion tool and software fault-insertion routines. Test cases used included cache failure, clock failure, memory failure, interconnect failures, and disk failures. These failures were applied to the system during various system operations. In addition, servicing errors were also tested by removing cables and modules while the system was running. The completion criteria for tests included the following:

o Detection of the fault

o Isolation of the failed hardware

o Continuation of the test processes without interruption

System Application Verification with Fault Insertion

The goal for the final phase of the VAXft 3000 verification was to run an application with fault insertion and to demonstrate that any system fault recovery action had no effect on the process integrity and data integrity of the application. The application used in the testing was based on the standard DebitCredit banking benchmark and was implemented using the DECintact layered product. The bank has 10 branches, 100 tellers, and 3,600 customer accounts (10 tellers and 360 accounts per branch). Traffic on the system was simulated using terminal emulation process (VAX RTE) scripts representing bank teller activity. The transaction rate was initially 1 transaction per second (TPS) and was varied up to the maximum TPS rate to stress the system load.

The general test process can be described as follows:

1. Started application execution. The terminal emulation processes emulating the bank tellers were started and continued until the system was operating at the desired TPS rating.

2. Invoked fault insertion. A fault was selected at random from a table of hardware and software faults. The terminal emulation process submitted stimuli to the application before, during, and after fault insertion.

3. Stopped terminal emulation process. The application was run until a quiescent state was reached.

4. Performed result validation. The process integrity and data integrity of the application were validated.

All of the meaningful events were logged and time-stamped during the experiments. Process integrity was proved by verifying continuity of transaction processing through failures. The time stamps on the transaction executions and the system error logs allowed these two independent processes to be correlated.

The proof of data integrity consisted of using the following consistency rules for transactions:

1. The sum of the account balances is equal to the sum of the teller balances, which is equal to the sum of the branch balances.

2. For each branch, the sum of the teller balances is equal to the branch balance.

3. For each transaction processed, a new record must be added to the history file.

Application verification under fault insertion served as the final level of fault-tolerant validation. Whereas the previous phases ensured that the various components required for fault tolerance operated properly, the system application verification demonstrated that these components could operate together to provide a fully fault-tolerant system.

Conclusions

The process of verifying fault tolerance requires a well-architected test plan. This plan must be developed early in the design cycle because hardware support for testing may be required. The verification plan must demonstrate cognizance of the capabilities and limitations at each phase of the development cycle. For example, the speed of simulation prohibits verification of software error recovery in a simulation environment. Also, when a system is implemented with VLSI technology, the ability to physically insert faults into the system by means of an external mechanical mechanism may not be adequate to properly verify the correct system error recovery. These and other issues must be addressed before the chips are fabricated, or adequate error recovery verification may not be possible. Inadequate error recovery verification directly increases the risk of real, unrecoverable faults resulting in system outages.

The verification plan for the VAXft 3000 system consisted of the following phases and objectives:

o Hardware simulation with fault insertion verified error detection, hardware recovery, and error data capture.

o System exerciser with fault insertion enhanced the coverage of the hardware simulation effort.

o System software verification with fault insertion verified software error recovery and reporting.

o System application verification with fault insertion verified the transparency of the system error recovery to the application running on the system.

The test of any fault-tolerant system is to survive a real fault while running a customer application. Although pulling a module out of a machine may seem impressive, machines rarely fail as a result of modules falling out of the backplane. The initial test effort of the VAXft 3000 showed that the system survived most of the faults introduced. However, problems were found which would have resulted in system outages if left uncorrected. System enhancements were made both in the area of system recovery actions and the repair call out. While some of the problems were simple coding errors, others were errors in carefully reviewed and documented algorithms. Simply put, the collective wisdom of the designers was not always sufficient to reach the degree of accuracy desired for this fault-tolerant system.

As the VAXft product family evolves, performance and functional enhancements will be available. The test processes described in this paper will remain in use, so that every future release of software will be better than the previous version. The combination of hardware and software fault insertion, coupled with physical system disruption, allows testing to occur at such a greatly accelerated rate that all testing performed will be repeated for every new release.

References

1. J. Croll, L. Camilli, and A. Vaccaro, "Test and Qualification of the VAX 6000 Model 400 System," Digital Technical Journal, vol. 2, no. 2 (Spring 1990): 73-83.

2. J. Barton, E. Czeck, Z. Segall, and D. Siewiorek, "Fault Injection Experiments Using FIAT (Fault Injection-based Automated Testing)," IEEE Transactions on Computers, vol. 39, no. 4 (April 1990).

3. R. Calcagni and W. Sherwood, "VAX 6000 Model 400 CPU Chip Set Functional Design Verification," Digital Technical Journal, vol. 2, no. 2 (Spring 1990): 64-72.

=============================================================================
Copyright 1991 Digital Equipment Corporation. Forwarding and copying of this article is permitted for personal and educational purposes without fee provided that Digital Equipment Corporation's copyright is retained with the article and that the content is not modified. This article is not to be distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.
=============================================================================
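
Illustrative sketch: data-integrity consistency rules

The three consistency rules used as the proof of data integrity in the system application verification phase lend themselves to a small worked example. The sketch is not the DECintact application or the validation tooling used in the testing; the dictionary-based record layout and the names apply_transaction and check_consistency are assumptions made for illustration. Rule 3 is approximated by comparing the length of the history file with the number of transactions processed.

```python
def apply_transaction(branches, tellers, accounts, history,
                      account_id, teller_id, delta):
    """Apply one DebitCredit-style transaction (hypothetical record layout).

    branches: {branch_id: balance}
    tellers:  {teller_id: (branch_id, balance)}
    accounts: {account_id: (branch_id, balance)}
    history:  list with one record appended per transaction
    """
    acct_branch, acct_balance = accounts[account_id]
    accounts[account_id] = (acct_branch, acct_balance + delta)
    teller_branch, teller_balance = tellers[teller_id]
    tellers[teller_id] = (teller_branch, teller_balance + delta)
    # The branch that owns the teller absorbs the same delta.
    branches[teller_branch] += delta
    history.append((account_id, teller_id, teller_branch, delta))


def check_consistency(branches, tellers, accounts, history, processed):
    """Return the list of violated rules; an empty list means consistent."""
    violations = []

    # Rule 1: sum of accounts == sum of tellers == sum of branches.
    account_total = sum(balance for _, balance in accounts.values())
    teller_total = sum(balance for _, balance in tellers.values())
    branch_total = sum(branches.values())
    if not (account_total == teller_total == branch_total):
        violations.append("rule 1")

    # Rule 2: for each branch, its tellers sum to the branch balance.
    per_branch = {branch_id: 0 for branch_id in branches}
    for branch_id, balance in tellers.values():
        per_branch[branch_id] = per_branch.get(branch_id, 0) + balance
    if any(per_branch[b] != balance for b, balance in branches.items()):
        violations.append("rule 2")

    # Rule 3 (approximation): one history record per processed transaction.
    if len(history) != processed:
        violations.append("rule 3")

    return violations
```

In a run of the kind described in the general test process, a check like this would be made in step 4, after the application has reached a quiescent state, so that in-flight transactions do not produce false violations.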
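
Illustrative sketch: transient and solid faults under retry

The definitions of transient and solid faults, the retry behavior described for the I/O interconnections, and the fault threshold that triggers reconfiguration can be modeled in a few lines. This is a toy software model, not the VAXft fault-insertion hardware or FTSS. The class and function names, the single retry, and the reading of "exceeded" as a count strictly greater than the threshold are all assumptions made for illustration.

```python
class FaultInsertionPoint:
    """Toy model of one fault-insertion point driven by an insertion signal."""

    def __init__(self, solid):
        self.solid = solid      # True: signal is applied continually
        self.signal = False

    def arm(self):
        """Start applying the fault-insertion signal."""
        self.signal = True

    def operation_fails(self):
        """Run one operation; return True if the inserted fault corrupted it."""
        failed = self.signal
        if failed and not self.solid:
            # Transient fault: the signal is applied only until the
            # machine detects an error, then it is released.
            self.signal = False
        return failed


def run_with_retry(point, error_log):
    """Perform an operation with one retry (the retry count is an assumption)."""
    if not point.operation_fails():
        return "ok"
    error_log.append("fault detected")
    if not point.operation_fails():
        return "recovered by retry"     # fault absent on retry: transient
    return "software recovery"          # fault present on retry: solid


class ThresholdMonitor:
    """Counts recoverable faults and reports when the threshold is exceeded."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def record_fault(self):
        """Return True when reconfiguration would be requested."""
        self.count += 1
        return self.count > self.threshold
```

With this model, an armed transient point yields "recovered by retry" and leaves one entry in the error log, while an armed solid point yields "software recovery", which mirrors the distinction the verification had to exercise.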
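
Illustrative sketch: check-coded packets and alternate paths

The architectural overview states that message packets are protected by check codes while they cross the single-rail portion of the machine, and that corrupted packets can be retransmitted through the same or alternate paths. The sketch shows that idea in miniature. The article does not say which check code the VAXft 3000 uses; CRC-32 from the Python standard library is a stand-in chosen for the example, and the function names and the modeling of a path as a callable are hypothetical.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte check code (CRC-32 here, as a stand-in) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload if the check code matches, otherwise None."""
    payload, code = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != code:
        return None
    return payload


def send_with_retry(payload: bytes, paths):
    """Try each path in order until a packet arrives with a valid check code.

    Each path is a callable that takes the packet bytes and returns the
    bytes as received at the far end.
    """
    packet = make_packet(payload)
    for path in paths:
        received = check_packet(path(packet))
        if received is not None:
            return received
    raise IOError("packet corrupted on every available path")


def clean_path(packet: bytes) -> bytes:
    """A path that delivers the packet unchanged."""
    return packet


def faulty_path(packet: bytes) -> bytes:
    """A path with a stuck fault that flips bit 3 of the first byte."""
    return bytes([packet[0] ^ 0x08]) + packet[1:]
```

Calling send_with_retry(b"debit 100", [faulty_path, clean_path]) rejects the copy corrupted on the first path and delivers the payload over the second, which is the kind of recovery that software fault insertion on the I/O adapters was intended to exercise.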