
Keeping supercomputers up to date is a constant race because
of the steady advance of microprocessors.
"What has always happened in the past," says Richard
Taylor, a researcher in HP Labs Bristol, "is that someone
designs a supercomputing system with fantastic technology.
And within one, maybe two generations, someone like Intel
has caught up. That's a real problem!"
The answer is to build supercomputers out of those same microprocessors
-- commodity components working together -- so the supercomputer
can keep pace. But that solution brings a new problem that
HP researchers are now working to solve: how to build and
predict the reliability of supercomputers made from as many
as 100,000 processors, running as a tightly coupled package.
Helping Enterprise Customers
Supercomputers are used, primarily by scientists, to work
on some of the biggest, most compute-intensive calculations
necessary for such fields as space exploration, geophysics
and genetics.
Researchers hope their work will eventually benefit HP's
enterprise customers as well, particularly those running applications
such as stock trading or banking, which demand high reliability.
In a recent instance, scientists at Los Alamos National Labs
required 1.39 million hours of supercomputer calculations
to determine the impact of a comet hitting the Yucatan Peninsula
65 million years ago.
In such cases, if even one processor fails during the calculation,
the entire process must be restarted -- unless researchers
can accurately predict and prepare for the failure before
it takes place. And with 20,000 processors running together,
says Taylor, "the probability that one component is going
to fail becomes very high indeed."
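A back-of-the-envelope sketch shows why. The per-processor failure
rate below is an assumed figure, not an HP one; the job size is the
1.39-million-hour Los Alamos calculation, spread perfectly across
20,000 processors.

```python
import math

# Back-of-the-envelope only -- the MTBF figure is an assumption, not HP data.
processors = 20_000            # processors running the job together
mtbf_hours = 100_000           # assumed mean time between failures per processor
job_cpu_hours = 1_390_000      # processor-hours of work (the Los Alamos figure)

wall_clock_hours = job_cpu_hours / processors        # ~70 hours if perfectly parallel
expected_failures = processors * wall_clock_hours / mtbf_hours
p_at_least_one = 1.0 - math.exp(-expected_failures)  # Poisson failure model

print(f"expected failures during the run: {expected_failures:.1f}")
print(f"probability of at least one failure: {p_at_least_one:.10f}")
```

Under those assumptions, more than a dozen processors can be expected
to fail before the job finishes, and a failure-free run is all but
impossible.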
Approaches to Failure
There are several ways to deal with failure, says Taylor,
who is principal scientist in HP Labs' Performance Engineering
Group. One is to replace the failed part and rerun all the calculations
from the beginning. "That's not a particularly useful
way to do things," he says.
Another is to run software backups at regular intervals, or to
write out intermediate results, and then restart the calculations
from the recovered or intermediate data. The problem with
those methods, says Taylor, is that they add compute
hours -- and cost -- to the calculation.
"It may take you a total of two million hours of computation,
700,000 of which are overhead, to do a 1.3-million-hour calculation,"
he says.
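A simplified cost model shows how that overhead accumulates: checkpoints
cost time to write, and each failure still throws away the work done
since the last one. Both the model and every input below are illustrative
assumptions, not HP Labs' actual analysis.

```python
# A simplified cost model of checkpoint/restart overhead.  Both the model and
# every number below are illustrative assumptions, not HP Labs' actual analysis.
def machine_hours(useful_cpu_hours, processors, system_mtbf_h,
                  checkpoint_interval_h, checkpoint_cost_h):
    """Estimate total processor-hours for a checkpointed parallel job.

    useful_cpu_hours      -- processor-hours of real work the job needs
    processors            -- processors running the job in parallel
    system_mtbf_h         -- mean wall-clock hours between failures of the
                             whole machine (this shrinks as processors are added)
    checkpoint_interval_h -- wall-clock hours of work between checkpoints
    checkpoint_cost_h     -- wall-clock hours spent writing each checkpoint
    """
    useful_wall = useful_cpu_hours / processors
    write_wall = (useful_wall / checkpoint_interval_h) * checkpoint_cost_h
    # Expected failures during the run; each one throws away, on average,
    # about half a checkpoint interval of work (recovery time is ignored).
    failures = (useful_wall + write_wall) / system_mtbf_h
    rework_wall = failures * (checkpoint_interval_h / 2)
    return (useful_wall + write_wall + rework_wall) * processors

# A 1.3-million-processor-hour job on 20,000 processors, checkpointing every
# hour of work, with a 30-minute checkpoint and a 12-hour system MTBF:
print(machine_hours(1_300_000, 20_000, 12, 1.0, 0.5))
```

With these invented inputs, the 1.3-million-hour job consumes just over
two million processor-hours -- overhead of the same order as the figures
Taylor cites.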
Calculate, Don't Simulate
The traditional way to predict crashes is through simulation
-- creating a computer program to mimic how the system will
run. But simulation has been both empowering and devastating,
according to the Performance Engineering Group's
Chris Tofts, who with other HP researchers is looking at new
ways to predict the reliability of supercomputers and other
systems. On one hand, says Tofts, a simulation allows people
to model complex things; on the other, it allows them not to
give those models much thought.
"And that's a very bad thing," says Tofts. His approach
is very different -- to calculate, rather than simulate, system
reliability.
"Basically, you do some hard sums," he says. Tofts
has spent the last 13 years pursuing the mathematics of modeling
complex systems that would traditionally be simulated.
"Simulation is effectively an experimental technology,"
he says. "And is exposed to all of the well-known problems
of an experiment. For instance, you cannot prove a negative.
With a mathematical proof or an analytical calculation, if
you don't get it, then it isn't there."
How It Works
So far, Tofts has found four different ways to arrive at the
same number for judging system reliability -- and he's still
counting.
"I don't place a lot of faith in my own code unless
I can do it at least two or three different ways," he
says. "So if I'm going to be rude about other people's
code as having potential errors in it, I ought to be aware
of the plank in my own eye."
Rather than building a simulation that mimics how the system
will work, Tofts calculates the probability of failure directly
from the specific properties of the system he's examining.
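A toy contrast makes the distinction concrete. The sketch below (all
parameters invented, and not Tofts' actual mathematics) works out the
chance that a job finishes on a machine that can tolerate a handful of
processor failures -- once as a closed-form calculation, and once by
Monte Carlo simulation. Getting the same number both ways also echoes
Tofts' habit of checking his own code against independent methods.

```python
import math
import random

# A toy version of "calculate, don't simulate": the chance that a job finishes
# on a machine of n processors that can tolerate at most k failures during the
# run.  The model (independent failures, constant rates) and every number are
# illustrative assumptions, not Tofts' actual mathematics.

def p_complete_analytic(n, k, per_cpu_rate, hours):
    """Exact P(at most k of n processors fail during the run)."""
    p = 1.0 - math.exp(-per_cpu_rate * hours)   # one processor's failure prob.
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def p_complete_simulated(n, k, per_cpu_rate, hours, trials=10_000):
    """The same quantity estimated by Monte Carlo, for cross-checking."""
    p = 1.0 - math.exp(-per_cpu_rate * hours)
    survived = sum(1 for _ in range(trials)
                   if sum(random.random() < p for _ in range(n)) <= k)
    return survived / trials

print(p_complete_analytic(1000, 2, 2e-5, 70))   # exact answer, about 0.83
print(p_complete_simulated(1000, 2, 2e-5, 70))  # noisy estimate of the same
```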
More Problems with Simulations
One of the problems he's currently working on is how to keep
trivial errors from creeping into the models. Most of the test
models are generated randomly, to give Tofts' calculations the
most rigorous workout possible.
"The question is, how do you stop silly errors from
creeping in?" Tofts says. "Because one of the things
that people forget is that it tends to be the silly errors
that hit you."
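One way to picture that randomized checking is to generate many small
reliability models at random and require that two independent calculations
of the same quantity agree on every one of them. The sketch below
illustrates the principle only; it is not HP's actual test harness.

```python
import math
import random

# Randomly generated test models, used to catch "silly errors": build many
# small reliability models at random and insist that two independent ways of
# computing the same answer agree on all of them.  A sketch of the principle,
# not the test harness used at HP Labs.

def series_reliability_product(rates, hours):
    """Series system: the run survives only if every component survives."""
    return math.prod(math.exp(-rate * hours) for rate in rates)

def series_reliability_pooled(rates, hours):
    """The same quantity computed from the pooled failure rate."""
    return math.exp(-sum(rates) * hours)

random.seed(1)
for _ in range(1_000):
    rates = [random.uniform(1e-7, 1e-4) for _ in range(random.randint(1, 50))]
    hours = random.uniform(1.0, 500.0)
    a = series_reliability_product(rates, hours)
    b = series_reliability_pooled(rates, hours)
    assert math.isclose(a, b, rel_tol=1e-9), (rates, hours)
print("1,000 random models checked; both methods agree")
```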
Another problem with simulation, Taylor adds, is that people
usually overcomplicate the models of the systems they want
to simulate. And when the number of processors is "pushed
to the limits" needed for building a supercomputer, scientists
have to work with what are described as "rare" or
"near-rare" events. At that level, he says, the
systems being simulated are so complex, it's almost impossible
to infer from the results what is significant and what is
just noise.
"Unfortunately, because simulation appears to be so
easy to do, it is very rarely done properly," says Taylor.
Giving Customers More Reliable Systems
As a result, he says, many institutions "over-provision,"
or buy more redundancy and fault tolerance in their computer
systems than they might actually need. By coming up with more
reliable ways to predict system failure, Tofts and Taylor
hope to help customers choose the right systems to do the
job they need done.
In addition, these methods for predicting reliability can
trickle down from supercomputers to smaller systems used by
a wide range of HP customers in business, government and academia.
Just as supercomputer design technologies such as superpipelining
and Very Long Instruction Word (VLIW) architectures have migrated
to commercial and consumer devices, HP expects that the analysis
and service technologies being deployed for supercomputers
will also find their way out of the rarefied atmosphere of
technical supercomputing and into high-reliability commercial
and academic computing systems.
"The aim of the research," says Taylor, "is
to take techniques that work with 20,000, maybe even 100,000
or 200,000 processors, and make them available for all of
HP's large enterprise computing needs and enterprise computing
customers."