by Steve Towns
Editor, HP Government Solutions magazine
Research under way at HP Labs may one day help pinpoint performance bottlenecks
in complex computing systems.
Using network-monitoring technology and sophisticated algorithms, HP researchers
are attempting to trace the path of a request as it travels through the maze
of software that constitutes many IT networks. The technique
may reduce finger-pointing on multivendor IT projects and yield better performance
for applications.
The research attacks a vexing problem for businesses, government agencies,
universities and others operating large enterprise systems, particularly
Web-based applications built of multiple components from different vendors.
Not only are these systems
notoriously hard to debug, they're becoming more common as institutions
rely more heavily on Web-based services that allow people to buy books, run auctions, request IT help and perform many other kinds of transactions online.
"These days, people construct applications out of pieces
from various manufacturers. They're built from a bunch of
computers talking to each other over a network of some sort," said
Jeffrey Mogul, an HP Fellow in the company's
Internet Systems and Storage Lab. "It's often hard
to get these things to work, and you can have a devil of a time
figuring out where the problem lies, especially if it's
a performance problem.”
Such systems often string together "black boxes" --
widely distributed servers, storage
arrays, etc. -- that are difficult or impossible for IT managers
to examine closely. Mogul is part of a team of HP Labs researchers
developing a method to locate problems and
performance issues in complex systems without delving into sophisticated
components or scouring source code.
The research is part of an overall effort at HP to maximize the
agility, efficiency and reliability of enterprise computing resources. Advanced techniques
being explored in HP Labs may result in better performance for heterogeneous computing
environments common to large organizations.
Today, isolating performance problems in intricate applications
is extremely
time-consuming and can demand integration expertise that's
both expensive and hard to
find. That's because distributed systems may include front-end
Web servers, Web application servers, ERP systems, credit card
authorization systems and other technologies. What's more,
these separate parts may come from different, perhaps competing,
manufacturers.
"You’re probably not going to have source code for most of those things," said Mogul.
"Even if you did, it would be too much for any one person to understand."
The diagnostic technique being developed by HP Labs traces the
route of network messages as they travel through distributed
systems and measures the speed of various tasks performed along
the way.
"You have this path of messages through the system, where
each message is causing a successor," Mogul said. For example,
a message from a client arrives at a Web server, which sends a
message to a back-end applications server, which then might
interact with an authentication server.
The idea is to map this chain of events and spot operations that take longer than they should.
"Our hypothesis is that if we can point you to someplace
where there often is a lot of
latency, then that's the box you should open up to figure
out what's going wrong inside,"
Mogul said. "It doesn't tell you how to fix the problem,
but it tells you where to look."
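To make that idea concrete, here is a minimal sketch in Python of how one request's chain of messages might be represented and how an unusually slow hop could be flagged. The message records, node names and latency threshold are hypothetical illustrations; the article does not describe HP Labs' actual data structures or cutoffs.

```python
from dataclasses import dataclass

@dataclass
class Hop:
    """One message in a causal chain: sender, receiver, and when it was seen."""
    sender: str
    receiver: str
    timestamp: float  # seconds

# Hypothetical trace of one request: client -> Web server -> app server -> auth server.
path = [
    Hop("client",      "web-server",  0.000),
    Hop("web-server",  "app-server",  0.004),
    Hop("app-server",  "auth-server", 0.006),
    Hop("auth-server", "app-server",  0.950),  # suspiciously long gap
    Hop("app-server",  "web-server",  0.955),
    Hop("web-server",  "client",      0.958),
]

THRESHOLD = 0.5  # made-up cutoff for "takes longer than it should"

# Walk the chain and report any hop whose gap from the previous message is large.
for prev, curr in zip(path, path[1:]):
    gap = curr.timestamp - prev.timestamp
    if gap > THRESHOLD:
        print(f"Look inside {prev.receiver}: {gap:.3f}s between "
              f"{prev.sender}->{prev.receiver} and {curr.sender}->{curr.receiver}")
```

Run against this sample path, the loop points at the authentication server as the box to open up, which is exactly the kind of answer Mogul describes: where to look, not how to fix it.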
Similar diagnostic tools already are available for homogeneous
systems, such as all-Java or all-.NET applications. It's
been more difficult to develop a tool for more complex environments,
in part because support and documentation for multivendor components
may be difficult to compile. So researchers tried to create something
that requires neither support from vendors nor extensive knowledge
of system components.
"We decided we needed to do this as noninvasively as possible," Mogul said. "We don’t need to know anything about the application ahead of time, and we don’t inject our own
traffic into the system."
They start by looking at the traffic carried by network switching equipment.
Modern network switches include a feature called port monitoring, which allows
a monitoring
system to see a copy of each packet on the network.
Researchers use that information to trace network traffic over a period of
time -- anywhere from minutes to hours, depending on how changeable the application
is -- saving for each packet only a time-stamp and information about the sender
and receiver. They avoid saving the full packet data, to protect data privacy.
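As an illustration of that capture step, the sketch below reads a packet capture produced by port monitoring and keeps only a timestamp plus sender and receiver addresses for each packet, discarding the payload. It uses Python with the dpkt library purely as an assumption; the article does not say what tooling the HP Labs team uses, and the file name is made up.

```python
import socket
import dpkt  # assumed third-party pcap-parsing library; not named in the article

def summarize_trace(pcap_path):
    """Reduce a raw packet capture to (timestamp, sender, receiver) records only."""
    records = []
    with open(pcap_path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if not isinstance(ip, dpkt.ip.IP):
                continue  # ignore non-IP traffic
            # Keep only the metadata; the packet contents are never stored.
            records.append((ts, socket.inet_ntoa(ip.src), socket.inet_ntoa(ip.dst)))
    return records

# Hypothetical usage: trace = summarize_trace("mirror-port.pcap")
```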
Then researchers apply algorithms to the packet trace that allow them to sketch out
relationships between network components and spot time lags between operations.
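The article does not detail those algorithms. As a loose illustration only, the sketch below applies a much simpler heuristic to the (timestamp, sender, receiver) records: whenever a message into a node is followed shortly by a message out of that node, treat the pair as potentially causal, record the delay, and report the average delay seen at each node. The window size and example data are invented, and the actual HP Labs analysis is considerably more sophisticated.

```python
from collections import defaultdict

def node_delays(records, window=1.0):
    """Crude causal-pairing heuristic: for each node, measure the gap between an
    incoming message and the next outgoing message within a time window."""
    records = sorted(records)
    delays = defaultdict(list)
    for i, (t_in, _, node) in enumerate(records):
        # Find the first later message sent *by* this node within the window.
        for t_out, sender, _ in records[i + 1:]:
            if t_out - t_in > window:
                break
            if sender == node:
                delays[node].append(t_out - t_in)
                break
    # Average per-node delay; large values suggest where the bottleneck may sit.
    return {node: sum(d) / len(d) for node, d in delays.items()}

# Hypothetical usage with a tiny hand-made trace:
# print(node_delays([(0.00, "client", "web"), (0.40, "web", "db"),
#                    (0.90, "db", "web"), (0.95, "web", "client")]))
```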
The approach eventually may reduce the time needed to deploy and debug complex
systems. It also could give IT professionals better insight into system behavior,
allowing them to create more reliable and responsive applications.
Limited tests of the technique show promise. “Preliminary results suggest
we’re on the right track,” Mogul said. “But we haven’t gotten to the point
in our research where we’ve taken a complex live system and found out
something the owners of the system didn’t already know.”
Their work gained attention in the research community late last year when a scientific
paper based on the project was published at the 19th Association for Computing
Machinery Symposium on Operating Systems Principles. The event is the world’s
premier forum for researchers and developers working on operating system technology.
“In artificial traces, we’ve been able to reconstruct pictures
of network relationships fairly
accurately,” said Mogul, adding that he hopes to give the tool a real-world
test by unleashing it on some of HP’s internal applications.
Products based on this research eventually may reduce the programming
expertise needed to isolate performance problems in complex network systems. Or
they may allow highly skilled systems integrators to work more
efficiently by helping them locate technical glitches quickly and
precisely.
That means faster and less expensive resolutions of some of the
technology industry’s toughest challenges, and possibly an
end to a huge headache for IT managers: the blame game played by
multiple vendors involved in complex systems when something goes
wrong.
“You have this complicated system, and you’ve basically
got it running in the sense that most of the time it’s giving
you the right answers. But it’s not fast enough, and you’re
trying to figure out why,” Mogul said. “By creating
tools that isolate performance problems in complex systems, we
hope to help solve the finger-pointing problem.”