Architecture at HP: Two Decades of Innovation
Microprocessor Forum, October 14, 1997, San Jose, California
by Joel Birnbaum
Director of Hewlett-Packard Laboratories, Senior Vice President of Research and Development
I was quite pleased when Peter Christy asked me to present a short retrospective of the architecture work at HP from the early '80s until our decision in 1994 to form the IA-64 alliance with Intel. He asked me to describe the research precursors to our alliance, why we had decided to do it, and how we hoped to benefit from it. During this period, HP grew from a small niche player in the computer industry to the second largest U.S. computer company, and I like to think that our architectural innovations had a lot to do with our rapid growth.
After I present a brief summary of our major motivations and innovations during this period, I'll close with some observations about the dramatic changes we foresee for the next decade and the architecture and systems research we are doing at HP to address them.
In the late '70s and early '80s, HP supported three disparate but individually successful computer architecture families. The HP-3000 was a 16-bit stack architecture, an early distributed commercial transaction processing system; the HP-1000 was a leading process control and data acquisition system; and the 9000 series of workstations and controllers was based on the Motorola 68000 and used primarily for technical computing. All had proprietary operating systems, compilers, I/O, and networks. The development cost and complexity were fast becoming unmanageable; and so in 1981, a challenge was given to HP Labs, the company's central research organization, to consolidate these architectures and implementations into a single, scalable framework.
The goals were price/performance as well as performance leadership across a broad range of applications. The Spectrum program, Precision Architecture, and HP's subsequent development and extension of PA-RISC technology were the results, and this, together with our good intuition about the trend to open, client/server, UNIX-based computing, led to HP's emergence as a major player.
In the late '80s and early '90s, as we sought to extend PA-RISC by adding concurrency, we became concerned that we were approaching a level of design complexity that would prove limiting. And so, with the firm requirement of compatibility with the now-large installed base of PA-RISC systems, HP Labs was again chartered to explore new approaches to scalable parallelism beyond those achievable by the superscalar, vector, and VLIW machines of the day. The result of that research, known internally as Wide-Word and then as SP-PA, Super-Parallel Processor Architecture, served as the starting point for the Intel alliance. We think the refinements and compatibility features developed during our collaboration provide the basis for architectural differentiation at the systems level in the next decade, a point I'll address later.
In retrospect, HP has benefited greatly from unusual stability over almost two decades in terms of the core of architects, engineers, and technologists, in both HP Labs and the product divisions. In particular, Bill Worley of HP Labs has headed the stages of both the Precision Architecture and Wide-Word efforts, and his ingenuity is evident throughout. Many other people have also participated in the creation of both architectures and have added continuity as the architecture evolved. Still others have joined us and added perspectives gained at other companies on related technology issues. I wish I could acknowledge them all, but let me mention particularly Rajiv Gupta, who was HP Labs' technical lead in our collaboration with Intel. The contributions of the people of HPL and their architectural colleagues in the product divisions more than ever lie at the heart of our competitive differentiation for the future.
Perhaps the most important thing we did in the early years of the Spectrum program was to articulate a set of basic principles, and they have served us without much change for the new architecture as well. They helped us to make difficult tradeoffs, and the continuity of personnel ensured consistent taste over a long period of time. In the architecture stage, all of our work has always been done with teams, including hardware, software, and technology experts, often sitting side-by-side. At the heart of our principles is the synergy between the compiler and the hardware, with the compiler relied upon to help avoid hardware bottlenecks and critical paths, and the architectural hardware mechanisms developed to reduce stalls, delays, and critical paths in the code. Like William of Occam, we have believed that the simplest solution is best but, like Albert Einstein, we also believed that everything should be as simple as possible, but not simpler. Hewlett-Packard is the world's largest measurement company, which perhaps explains why we have been paranoid about measuring everything that might be relevant, requiring that any new instruction or feature be able to demonstrate commensurate throughput gains on application workloads of interest. Our guiding principle has been the most work in the least time – a focus on the speedometer and not the tachometer. Scalability has been our watchword, compatibility our constraint; together they are the sine qua non of our philosophy. We depended on the technology independence of the framework and a variety of attached processors to deliver an unprecedented dynamic range of price and performance. I'll say more about compatibility in a few moments.
Technology enablers that exist, or can be projected to exist, in the timeframe of interest must back up principles and philosophy. The progress in VLSI enabled us to make an investment in high-speed registers and large cache memories. The advances in compiler technology had reached the point where compilers could produce programs almost as good as the best hand code. Surprisingly, it became clear that the simple RISC instructions were a better target for compilers than the more complicated predecessors. This suggested that it did not make sense to hide the cache and led to our decision to create a non-Von Neumann architecture with separate instruction and data caches. The progress in VLSI and globally optimizing compilers drove the RISC revolution, but it wouldn't have happened if the technology to measure and understand the performance limiters for specific workloads and to evaluate architectural alternatives hadn't evolved concurrently.
It is interesting that updated versions of these same three factors are what we believe will drive the EPIC architecture revolution. Of course, the workloads are different and, as you've heard, the mechanisms for concurrency are new, but at the heart of the innovation are the compiler/hardware tradeoffs and the ability to evaluate them.
This slide addresses the two major challenges that dominated our attention in the RISC era. The technical press and competitors amused themselves spelling RISC with a "K" instead of a "C" and speculated widely on its inability to address COBOL-based commercial computing and real-time, among other purported limitations. However, the accuracy and reliability of the compiler and our ability to migrate legacy code, much of it in object format, were what kept us up late at night. Not surprisingly, these are still the major challenges for the next generation, but we are as confident today as we were 15 years ago that we have solutions in hand.
We had to do a good deal of innovation in PA-RISC to achieve our goals, and this slide lists some of the major inventions that went beyond conventional RISC. As we traced billions of instructions across very diverse workloads, we found several instructions that almost always appeared together as pairs – this led to the creation of powerful, efficient compound instructions. To minimize the penalties caused by branching, we experimented with branch elimination through conditional instruction nullification; we also included features for branch prediction, as well as delayed branch instructions to minimize memory latency. We found that all these greatly enhanced performance at low cost. To address the crucial migration issues, we did several things. Many customers needed to run existing object code; binary code translation, an old and largely discarded idea because of path length explosion, was revived and made practical through sophisticated compiler optimization techniques. We found that an all-software translation could usually achieve a 1:1 performance goal, and often did better. Creation of optimized pre-coded instruction clusters, software we dubbed millicode, helped greatly. It worked like microcode, but at a higher level and with no hardware overhead. In effect, the system's storage hierarchy had replaced the writable control store of the CISC era. We also built a network of migration centers across the United States, and eventually across the globe, which would profile existing applications to indicate which parts of routines or applications would most benefit from recompilation and recoding. We also provided a set of conversion tools which, when mature, often accomplished migration to the new architecture in considerably less than a week, sometimes even within a day.
Although we didn't publicize it appropriately, PA-RISC had segmented 64-bit addressing from the outset and was among the very first to extend the instruction set with specific multimedia capabilities, being the first to achieve all-software MPEG decoding in real time.
Toward the late '80s, we observed an almost insatiable demand for more performance from our customers and noted with concern that the widening speed gap between processor and memory was driving us towards more bandwidth, more registers, and more complex schemes for overlapping memory latencies. We became convinced that we needed a new architectural approach to get more instructions per cycle without paying the crushing price of exploding microprocessor design complexity.
We surveyed our own and the industry's solutions. We found that superscalar RISC did improve Instruction Level Parallelism (ILP), but with significant hardware complexity and without compiler transparency; since scheduling was done in hardware at run time, a gain much beyond 1.5-3 instructions per cycle seemed unlikely. Furthermore, the idea of superscalar violated one of our fundamental architectural principles: we were using complex hardware to do a job much more appropriately performed by the compiler. Similarly, VLIW and vector architectures, in both of which we had considerable expertise and investment, could produce higher ILP by using parallelizing compilers that could exploit explicit parallelism. But, reverting to our enduring architectural principles, we saw no way for them to satisfy scalability in performance and application while also addressing compatibility. These again drove the need for a new architecture.
I think this slide dramatically shows that since 1993, when the shaded data was compiled by IEEE Spectrum, architecture innovations have not kept pace with VLSI advances. Until that point, they contributed about equally to microprocessor performance; but in the last 4-5 years, even being very conservative on clock speeds and a little aggressive on instructions per cycle, architecture has contributed about 2X while VLSI has contributed 5X.
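As a rough check on those figures, the sketch below assumes that performance is the product of instructions per cycle (the architecture term) and clock rate (the VLSI term), so the two gains multiply. That multiplicative model and the function names are illustrative assumptions, not something taken from the slide.

```python
import math

def combined_speedup(architecture_gain, vlsi_gain):
    # Assumption: performance ~ (instructions per cycle) x (cycles per second),
    # so independent gains in each factor multiply.
    return architecture_gain * vlsi_gain

def share_of_total(gain, total):
    # Fraction of the overall improvement attributable to one factor,
    # measured on a logarithmic scale so that the shares add up to 1.
    return math.log(gain) / math.log(total)

arch_gain, vlsi_gain = 2.0, 5.0            # the 2X and 5X quoted in the talk
total = combined_speedup(arch_gain, vlsi_gain)
arch_share = share_of_total(arch_gain, total)
vlsi_share = share_of_total(vlsi_gain, total)

print(f"combined speedup: {total:.0f}X")
print(f"architecture share: {arch_share:.0%}, VLSI share: {vlsi_share:.0%}")
```

Under that assumption the overall gain is about 10X, with architecture accounting for roughly 30 percent of it on a log scale, compared with the roughly equal split described for the earlier period.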
This has led us to the inescapable conclusion that making substantial progress in Instruction Level Parallelism requires a new architectural approach. Our work at HP Labs in 1989-90 convinced us that high Instruction Level Parallelism requires that the code be explicitly scheduled and that the scheduling is best done in software by the compiler, and not in fast hardware at run time. We feel that the architecture must expose the parallelism and we take as a major requirement, just as we did with PA-RISC, that scalability across implementations and applications is essential.
Another way of looking at the situation for scalable architectures is shown on this slide in which we can observe the enhancements from the RISC development, which got us to something less than or equal to 1 instruction per cycle. The improvements that came with superscalar, through the addition of redundant functional units and hardware-based scheduling, are reaching an asymptote somewhere between 2 and 3 instructions per cycle. Starting in 1990, HP Labs began to design an architecture with innovations to take advantage of all the information known to the compiler. Traditional architectures only allow the exploitation of certainty; we experimented with statistical parallelism and speculation. We developed innovations in generalized predication to eliminate branches and to minimize both branch mispredictions and the effects of branches on performance. Looking to the future, we researched mechanisms to enable the number and speed of functional units to scale. Very importantly, we worked to ensure that features, such as speculation to hide memory latency and predication to remove branches, were synergistic – that is, the existence of one helps expose opportunities available to the other, particularly as we issue more instructions per cycle. We developed mechanisms that allowed the compiler to communicate decisions to the hardware and compilers that not only exposed parallelism, but also enhanced and exploited it. In short, by the early '90s, we felt confident that we had a winning, compatible successor to the PA-RISC architecture.
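To make the idea of predication concrete, here is a minimal sketch in Python. It is only a model of the concept, not the IA-64 instruction set: the helper predicated_move and both function names are hypothetical. The compare produces a predicate and its complement, both guarded operations are issued, and only the one whose predicate is true takes effect, so the selection involves no branch to mispredict.

```python
def predicated_move(predicate, value, old):
    # Models a predicated instruction: it is always issued, but its write
    # is nullified (the old value is kept) when the predicate is false.
    return value if predicate else old

def max_branching(a, b):
    # Conventional form: control flow decides which assignment runs.
    if a > b:
        return a
    return b

def max_predicated(a, b):
    # If-converted form: one compare sets complementary predicates,
    # then both moves are issued unconditionally.
    p = a > b
    not_p = not p
    result = 0
    result = predicated_move(p, a, result)
    result = predicated_move(not_p, b, result)
    return result

pairs = [(3, 7), (7, 3), (5, 5), (-2, -9)]
results = [max_predicated(a, b) for a, b in pairs]
print(results)  # [7, 7, 5, -2]
```

In this toy form the two guarded moves do not depend on each other, which is the property that lets a compiler schedule them side by side in a wide machine.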
Why, then, the Intel alliance? Our reluctant conclusion was that working by ourselves on another proprietary architecture would most likely fail for two reasons. With the chip volumes that we could forecast, we didn't think that we would be able to afford the fabs for the sub-micron technologies needed to achieve the chip density the new architecture required. Even more importantly, we doubted that software vendors would be willing to develop many applications for architectures beyond x86 and the PowerPC. We were not willing to take the chance that a third architecture could survive. By teaming with Intel, we have brought together the architecture, design, and fabrication expertise of Intel with the architecture, design, and systems excellence of HP. We think that this gives the industry and us a scalable common hardware platform for UNIX and NT: a new Industry-Standard Architecture for the open systems of the future.
This alliance creates a flat playing field at the microprocessor platform level for all who choose to buy it from Intel. HP's challenge, then, is to add value and differentiation to the rest of the system. Returning to our architectural principles, we have been engaged since the start of the alliance in a program of research to understand trends in workloads at the system and application levels and to design appropriate system level innovations around the IA-64 processor.
Our intuition that putting a Ferrari engine into a Volkswagen body wouldn't produce a fast car is confirmed by simple experiments and analyses. This figure graphs the performance of a Web benchmark (SPECweb) with increasing processor performance (as measured in SPECint95 terms). We see that the application performance (as measured in thousands of connections per second) levels out very early and we gain no further benefit from a faster processor if the I/O bandwidth of the system is low. However, a commensurately high system I/O bandwidth allows for a near-linear improvement in application-level performance with increasing processor performance.
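The shape of that curve can be reproduced with a very simple bottleneck model, sketched below. It assumes each connection needs a fixed amount of processor work and a fixed amount of I/O, so throughput is capped by whichever resource runs out first. The function name, parameters, and numbers are all hypothetical and are not HP Labs measurements.

```python
def connections_per_second(cpu_perf, io_bandwidth,
                           cpu_work_per_conn=1.0, io_per_conn=1.0):
    # Assumption: throughput is limited by the scarcer of two resources.
    cpu_limit = cpu_perf / cpu_work_per_conn
    io_limit = io_bandwidth / io_per_conn
    return min(cpu_limit, io_limit)

cpu_levels = [2, 4, 8, 16]   # increasing processor performance (arbitrary units)

# Low I/O bandwidth: throughput flattens once the processor outruns the I/O.
low_io = [connections_per_second(c, io_bandwidth=4) for c in cpu_levels]

# High I/O bandwidth: throughput keeps tracking processor performance.
high_io = [connections_per_second(c, io_bandwidth=100) for c in cpu_levels]

print(low_io)   # [2.0, 4.0, 4.0, 4.0]
print(high_io)  # [2.0, 4.0, 8.0, 16.0]
```

Real systems saturate more gradually than a hard minimum suggests, but even this crude model shows why a faster processor alone stops paying off once the I/O limit is reached.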
The emergence of the Internet, and Web applications built upon it, has major consequences for system design, as shown in this slide, derived from measurements at HP Labs. We can expect at least order-of-magnitude changes in system loads and bandwidth requirements as more and more of the Web content goes from static to dynamic, as e-commerce becomes commonplace, and as multimedia connection drives the need for I/O bandwidth much more rapidly than processor power.
From a system architecture and design perspective, not only are the component technologies and workloads changing, but so are the cost-value expectations of our customers. We all know only too well the low performance predictability and reliability of the Web today. The graphs show early results of our research and suggest that great improvements in Web predictability and reliability are achievable. Similarly, we can see the stress that encryption and security will place on our systems and, again, the large improvements achievable through appropriate system design.
We think that a new system design paradigm is needed to address these changing needs. We are working on systems with new control points, moving from "processor-centric" systems to systems with intelligence distributed into semi-autonomous subsystems. This systems disaggregation drives the necessary innovations in the memory, communication, and I/O subsystems, which then become the new system control points, together with the technologies of the interconnection.
Our company philosophy demands that these new control points be created within an open industry standard framework, one that engenders, rather than stifles, innovation. The industry needs to define open, high-level interfaces between the semi-autonomous subsystems in this new system architecture. Specifying "what" at the interface rather than "how" allows companies large and small to add value and accelerate competition. The Intelligent I/O initiative (I2O), an HP Labs project that we helped drive to an industry standard, will allow systems and technology vendors, both hardware and software, to create value-added systems (for example, ones with high-speed compression and/or encryption of selective I/O) without requiring changes to the OS or to other subsystems.
Let me summarize. The new challenges for the industry are to exploit computing components like the EPIC processors and other new component technologies to create the disaggregated computing systems that will satisfy the workloads, usage models, and cost-value requirements of the future. At a broader level, the challenges are to create the infrastructures for deploying distributed heterogeneous computing systems to create computing services. These information and computing "utilities," analogous to the electric and water utilities, will transform computing from a captive investment to a competitive service.
These utilities, in turn, will impose new system requirements, will enable new types of interactions and usage paradigms, and will therefore motivate new types of systems, such as servers to power the utilities and appliances with new form factors and capabilities enabled by the utilities. The utility-enabling and utility-enabled systems will themselves motivate new computing components. As examples, we will see new types of embedded processors, combinations of system capabilities-on-a-chip, and functionalities to power the new control points.
At HP and HP Labs, we are excited at the promise of the EPIC-class processors, and we see them as key building blocks in the systems of the next century. We are convinced that architecture and system design will be more important than ever before and that a new era of competition through systems-level differentiation is about to begin. We can't wait!