Researchers Developing New Ways to Predict System Reliability

July 2003

HP Labs


The aim is to take techniques that work with 20,000 or even 200,000 processors, and make them available for HP's large enterprise customers.

Keeping supercomputers up to date is a constant race because of the steady advance of microprocessors.

"What has always happened in the past," says Richard Taylor, a researcher in HP Labs Bristol, "is that someone designs a supercomputing system with fantastic technology. And within one, maybe two generations, someone like Intel has caught up. That's a real problem!"

The answer is to build supercomputers out of those same microprocessors -- commodity components working together -- so the supercomputer can keep pace. But that solution brings a new problem that HP researchers are now working to solve: how to build and predict the reliability of supercomputers made from as many as 100,000 processors, running as a tightly coupled package.

Helping Enterprise Customers

Supercomputers are used, primarily by scientists, to work on some of the biggest, most compute-intensive calculations necessary for such fields as space exploration, geophysics and genetics.

Researchers hope their work will eventually benefit HP's enterprise customers as well, particularly those running applications such as stock trading or banking, which demand high reliability.

In a recent instance, scientists at Los Alamos National Laboratory required 1.39 million hours of supercomputer calculations to determine the impact of a comet hitting the Yucatan Peninsula 65 million years ago.

In such cases, if even one processor fails during a large calculation, the entire process must be restarted -- unless researchers can accurately predict and prepare for the failure before it takes place. And with 20,000 processors running together, says Taylor, "the probability that one component is going to fail becomes very high indeed."
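
A rough sense of why the odds climb so quickly comes from the standard formula for independent failures: if each of n components fails with probability p during a run, the chance that at least one fails is 1 - (1 - p)^n. The sketch below assumes independent, identical components. The function name and the per-processor figure in the note that follows are illustrative assumptions, not numbers from HP.

```python
import math


def prob_any_failure(n_components: int, p_single: float) -> float:
    """Probability that at least one of n independent components fails.

    Assumes every component fails independently with the same
    probability p_single over the period of interest.
    """
    if n_components < 0 or not 0.0 <= p_single <= 1.0:
        raise ValueError("need n_components >= 0 and 0 <= p_single <= 1")
    if p_single == 1.0:
        return 1.0 if n_components > 0 else 0.0
    # 1 - (1 - p)^n, written with log1p/expm1 so that very small
    # values of p do not lose precision.
    return -math.expm1(n_components * math.log1p(-p_single))
```

With a hypothetical one-in-10,000 chance of failure per processor per run, `prob_any_failure(20_000, 1e-4)` comes out at roughly 0.86, so a failure somewhere in the machine is the expected outcome rather than the exception.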

Approaches to Failure

There are several approaches to take to failure, says Taylor, who is principal scientist in HP Labs' Performance Engineering Group. One is to replace the part and rerun all the calculations from the beginning. "That's not a particularly useful way to do things," he says.

Another is to run software backups at regular intervals, or to write out intermediate results, and then restart the calculations from the recovered or intermediate data. The problem with those methods, says Taylor, is that they add compute hours -- and cost -- to the calculation.
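
To make that cost concrete, the overhead can be expressed as a share of the whole run. The sketch below plugs in the figures Taylor cites for a calculation of this scale: two million total hours for 1.3 million hours of useful work. The function and key names are ours, chosen for illustration.

```python
def overhead_summary(total_hours: float, useful_hours: float) -> dict:
    """Break a run into useful work and backup/restart overhead."""
    if useful_hours <= 0 or total_hours < useful_hours:
        raise ValueError("need 0 < useful_hours <= total_hours")
    overhead = total_hours - useful_hours
    return {
        "overhead_hours": overhead,
        # Fraction of all machine time spent on overhead.
        "share_of_total": overhead / total_hours,
        # Extra time paid on top of the useful calculation.
        "extra_over_useful": overhead / useful_hours,
    }


# Figures quoted by Taylor: 2 million hours in total, 1.3 million useful.
summary = overhead_summary(2_000_000, 1_300_000)
```

On those figures the overhead is 700,000 hours: 35 percent of total machine time, or about 54 percent on top of the useful calculation.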

"It may take you a total of two million hours of computation, 700,000 of which are overhead, to do a 1.3-million-hour calculation," he says.

Calculate, Don't Simulate

The traditional way to predict crashes is through simulation -- creating a computer program to mimic how the system will run. But computer simulations have been both empowering and devastating, according to the Performance Engineering Group's Chris Tofts. He and other HP researchers are looking at new ways to predict the reliability of supercomputers and other systems. On one hand, says Tofts, a simulation allows people to model complex things; on the other, it allows them not to give the models much thought.

"And that's a very bad thing," says Tofts. His approach is very different -- to calculate, rather than simulate, system reliability.

"Basically, you do some hard sums," he says. Tofts has spent the last 13 years pursuing the mathematics of modeling complex systems that would traditionally be simulated.

"Simulation is effectively an experimental technology," he says. "And is exposed to all of the well-known problems of an experiment. For instance, you cannot prove a negative. With a mathematical proof or an analytical calculation, if you don't get it, then it isn't there."

How It Works

So far, Tofts has found four different ways to arrive at the same number for judging system reliability -- and he's still counting.

"I don't place a lot of faith in my own code unless I can do it at least two or three different ways," he says. "So if I'm going to be rude about other people's code as having potential errors in it, I ought to be aware of the plank in my own eye."

Rather than creating an analog simulation of how the system will work, Tofts calculates the probability of failure from the specific properties of the system he's examining.
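
The article does not describe Tofts's actual models or methods. As a loose illustration of arriving at the same reliability number by independent routes, the sketch below takes a hypothetical system that stays up as long as at least k of its n independent components are working, each with probability p. It computes the system reliability three ways: a closed-form binomial sum, a dynamic-programming pass, and brute-force enumeration of every state. All names and the example parameters are assumptions made for illustration.

```python
from itertools import product
from math import comb


def reliability_binomial(n: int, k: int, p: float) -> float:
    """P(at least k of n components work), via the binomial formula."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def reliability_dp(n: int, k: int, p: float) -> float:
    """Same quantity, built up one component at a time."""
    dist = [1.0]  # dist[j] = P(exactly j components working so far)
    for _ in range(n):
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1 - p)   # the new component is down
            nxt[j + 1] += q * p     # the new component is up
        dist = nxt
    return sum(dist[k:])


def reliability_enumerate(n: int, k: int, p: float) -> float:
    """Same quantity by listing all 2**n up/down states (small n only)."""
    total = 0.0
    for state in product((0, 1), repeat=n):
        up = sum(state)
        if up >= k:
            total += p**up * (1 - p) ** (n - up)
    return total


# Hypothetical example: 10 components, at least 8 must work, p = 0.9 each.
results = [f(10, 8, 0.9) for f in
           (reliability_binomial, reliability_dp, reliability_enumerate)]
```

If the three values in `results` disagree beyond floating-point rounding, at least one of the routines has a bug, which is the point of doing the sum more than one way.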

More Problems with Simulations

One of the problems he's currently working on is how to keep trivial errors from creeping into the models. Most of the test models are generated randomly to give Tofts' calculations the most rigorous workout possible.

"The question is, how do you stop silly errors from creeping in?" Tofts says. "Because one of the things that people forget is that it tends to be the silly errors that hit you."

Another problem with simulation, Taylor adds, is that people usually overcomplicate the models of the systems they want to simulate. And when the number of processors is "pushed to the limits" needed for building a supercomputer, scientists have to work with what are described as "rare" or "near-rare" events. At that level, he says, the systems being simulated are so complex, it's almost impossible to infer from the results what is significant and what is just noise.
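
One way to see why rare events are hard on simulation is through textbook sampling statistics. This is a simplified view, not Taylor's own analysis. Assume each simulation run is an independent trial that either shows the event or does not. If the event has probability p and N runs are made, the estimated probability has a relative standard error of sqrt((1 - p) / (N p)), so the number of runs needed grows as the event gets rarer. The function names below are ours.

```python
import math


def relative_std_error(p: float, n_runs: int) -> float:
    """Relative standard error of a simulated estimate of probability p."""
    if not 0.0 < p <= 1.0 or n_runs <= 0:
        raise ValueError("need 0 < p <= 1 and n_runs > 0")
    return math.sqrt((1 - p) / (n_runs * p))


def runs_needed(p: float, target_rel_error: float) -> int:
    """Independent runs needed to reach a target relative standard error."""
    if not 0.0 < p <= 1.0 or target_rel_error <= 0:
        raise ValueError("need 0 < p <= 1 and target_rel_error > 0")
    return math.ceil((1 - p) / (p * target_rel_error**2))
```

For a hypothetical one-in-a-million event, `runs_needed(1e-6, 0.1)` is on the order of 100 million runs just to pin the probability down to within about 10 percent. With far fewer runs, the handful of events observed is hard to tell apart from noise.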

"Unfortunately, because simulation appears to be so easy to do, it is very rarely done properly," says Taylor.

Giving Customers More Reliable Systems

As a result, he says, many institutions "over provision," or buy more redundancy and fault tolerance in their computer systems than they might actually need. By coming up with more reliable ways to predict system failure, Tofts and Taylor hope to help customers choose the right systems to do the job they need done.

In addition, these methods for predicting reliability can trickle down from supercomputers to smaller systems used by a wide range of HP customers in business, government and academia. Just as supercomputer design technologies such as superpipelining and Very Long Instruction Word (VLIW) architectures have migrated to commercial and consumer devices, HP expects that the analysis and service technologies being deployed for supercomputers will also find their way out of the rarefied atmosphere of technical supercomputing and into high-reliability commercial and academic computing systems.

"The aim of the research," says Taylor, "is to take techniques that work with 20,000, maybe even 100,000 or 200,000 processors, and make them available for all of HP's large enterprise computing needs and enterprise computing customers."

Researchers Chris Tofts and Richard Taylor
© 2009 Hewlett-Packard Development Company, L.P.