

To call Pedro Moreno and his HP Labs colleagues ambitious
is putting it mildly.
In late 1999, they launched a prototype for SpeechBot, the
first Web search engine that finds and indexes spoken audio
rather than text. Later, they expanded the technology to
search for video and music as well.
By early June 2003, the seven-member group, based at HP's
Cambridge Research Laboratory in Cambridge, Mass. (USA), had
catalogued more than 17,000 hours of multimedia content,
making SpeechBot the largest multimedia index in the world.
Multimedia Search Engine
While still experimental, SpeechBot (a contraction of "speech
robot") reflects the Internet's evolution beyond a text-based
medium.
"There are huge amounts of audio and video available,"
says Moreno, a senior researcher at CRL, based in a high-rise
office building on the edge of the Massachusetts Institute
of Technology campus. "This focuses on technologies to search
for them."
Using speech-recognition technology, SpeechBot finds and
indexes multimedia content from selected sources ranging from
the U.S. government to National Public Radio to something
called Scuba Radio, "the world's first radio show devoted
to diving."
Internet users can search for specific content using SpeechBot's
public site, https://www.speechbot.com, which looks and works
pretty much like any other Web-based search engine.
Using SpeechBot
The home page directory lists what's available, including
National Public Radio's "Car Talk" automotive-advice
show, White House videos of presidential press briefings and
speeches, and InternetNews Radio (interactive content from
Internet.com, Jupitermedia Corp.'s online technology-news
service).
Visitors simply type in keywords: for instance, "suicide
bombing" or "Iraq reconstruction" for Middle
East news, or "Martin Sheen" for interviews with
the star of TV's "The West Wing." SpeechBot returns
a directory of hits matching the query, with information about
the source of the media (such as "U.S. Department of Defense
Briefings"), a time stamp and a brief excerpt from a text
transcript. Results can be sorted or limited by source, date
or relevance.
Users can then play the audio or video, or they can click
on "show me more." That option returns a larger
chunk of text transcription and a timeline illustrating the
length of the selection and how often and when the keyword
appears. Meanwhile, the site displays a text transcript of
the same material.
Finding the Needle in the Haystack
Here's an example, using the "Iraq reconstruction"
query. The "show me more" option for National Public
Radio's "On Point" program of May 6, 2003, indicates
that speakers uttered the word "reconstruction"
20 times in 51 minutes. Each reference appears as a spot along
the timeline, and clicking on a spot plays the referenced
excerpt.
Thanks to that capability, SpeechBot helps solve a long-standing
problem with streaming audio or video: finding, say, a particular
one- or two-minute segment of a much longer piece. Now, says
senior researcher Jean-Manuel Van Thong, "when you have
an hour or more of content, we can get you to the precise
point that you want."
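
To make the timeline idea concrete, here is a minimal sketch of a
time-stamped word index. It is an illustration only, not SpeechBot's
actual code: the function names, the (word, start time) transcript
format and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def build_timed_index(transcript):
    """Map each word to the sorted list of times (in seconds) it was spoken.

    `transcript` is a list of (word, start_seconds) pairs. This format is
    an assumption for the sketch, not SpeechBot's documented output.
    """
    index = defaultdict(list)
    for word, start in transcript:
        index[word.lower()].append(start)
    for times in index.values():
        times.sort()
    return dict(index)


def timeline(index, keyword, total_seconds, slots=10):
    """Count keyword hits in each of `slots` equal slices of the recording."""
    counts = [0] * slots
    for t in index.get(keyword.lower(), []):
        # Clamp so a hit at the very end still lands in the last slot.
        slot = min(int(t / total_seconds * slots), slots - 1)
        counts[slot] += 1
    return counts


# Hypothetical, hand-made fragment of a 51-minute program.
sample = [
    ("the", 12.0), ("reconstruction", 12.4), ("of", 13.1), ("iraq", 13.3),
    ("reconstruction", 95.0), ("costs", 95.8), ("reconstruction", 2900.5),
]

index = build_timed_index(sample)
hits = index["reconstruction"]  # playback offsets a player could seek to
bars = timeline(index, "reconstruction", total_seconds=51 * 60, slots=6)
```

Each entry in `hits` is an offset that a player could jump to, which is
the effect of clicking a spot on the timeline; `bars` is the kind of
per-slice count a timeline display could be drawn from.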
SpeechBot, which can handle up to 1 million page views daily,
doesn't actually archive the multimedia material, a function
that would create technical and copyright nightmares. Instead,
like Google or other search engines, it finds and indexes
data for users.
"We don't hijack the content," Van Thong says.
"It's served up from the original site."
Audio to Text
However, SpeechBot does transform the audio into text, and
those transcripts appear on users' screens as the selection
plays. Why bother with written words in a research project emphasizing sound and motion?
"It helps the user to have a little bit of transcript
to eyeball," Moreno explains. "The human interface
is still faster at processing visually than by listening.
So we tried to give them as much visual information as possible."
Most transcripts consist of raw verbiage. SpeechBot doesn't
yet punctuate, capitalize or distinguish between speakers,
so excerpts tend to appear as one long lower-case sentence. And the transcription isn't always perfect: "Iraq"
sometimes appears as "rock," "a rock"
or "the rock," for instance. But the text does align
quite precisely with the audio version, making it easier for
users to scan for exactly what they want or to follow along
as the clip plays.
Error Rates
The transcripts illustrate both the progress and the limitations
of speech-recognition technology. When dealing with high-quality
content, such as audio from an in-studio radio show, SpeechBot
transcribes material with a 20 percent error rate,
Moreno says. Speakers with unusual accents or verbal tics,
such as constant repetition of "um" or "er,"
can push the error rate to 30 percent. Throw in a telephone
interview or audio recorded in a noisy environment, and the
error rate rises to 50 percent.
Still, given what SpeechBot does recognize, and the increasing,
continually updated volume of indexed material, odds are good
that any SpeechBot search will turn up something useful. As Moreno puts it: "The chance that
you would miss all the places with a particular keyword is
very low."
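
The figures above can be put in rough numerical terms. The sketch below
is not the team's method. It computes a standard word error rate (word
level edit distance divided by the length of the reference) and then,
under a loudly simplifying assumption, estimates the chance of missing
every occurrence of a keyword. The assumption is that each occurrence
is missed independently, with probability equal to the error rate. The
function names and sample sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def chance_of_missing_all(error_rate, occurrences):
    """Probability that every occurrence is missed.

    Assumes independent misses at `error_rate` each, which is a
    simplification made for this sketch, not a claim from the article.
    """
    return error_rate ** occurrences


# Invented example echoing the "Iraq" / "a rock" confusion.
wer = word_error_rate("the reconstruction of iraq will take years",
                      "the reconstruction of a rock will take years")

# Worst case quoted in the article (50 percent) with 20 spoken occurrences.
miss_all = chance_of_missing_all(0.5, 20)
```

Here `wer` comes to 2/7, since "a rock" costs one substitution and one
insertion against seven reference words. Under the independence
assumption, `miss_all` is below one in a million even at a 50 percent
error rate, which matches the spirit of Moreno's remark.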
Indexing Music by Content
The group's latest work, BoogeeBot, involves seeking and indexing
music archives. The real breakthrough: using technologies similar
to SpeechBot's to recognize other sound patterns, such as rhythm
or use of a particular instrument.
"The technology provides a way of organizing music based
on content," rather than by artist, album, song title,
or even genre, says researcher Beth Logan.
Thus, Aerosmith's classic rocker "Walk This Way"
sounds similar, at least to BoogeeBot, to songs
by the Eagles, the Blues Brothers, Meat Loaf and Marilyn
Manson. The common element might be something as simple as
crowds clapping rhythmically during each band's live performance.
In another example, BoogeeBot catalogs traditional Greek music
with tunes by Jethro Tull and Jimmy Buffett, probably because all
use stringed instruments.
"It's subjective," Moreno acknowledges.
BoogeeBot can search among 18,000 songs in styles from Cajun
to classical. But that capability isn't yet available on the
public Web site. For now, because of strict music copyright
laws, the team can demonstrate music indexing only inside
the research lab.
Commercial Applications
Currently, SpeechBot focuses largely on indexing public news
sites. But the team believes the technology has plenty of
commercial potential as well.
Among the possibilities:
- Businesses could use the technology in corporate intranets
or to archive audiotapes of meetings.
- College professors could record their lectures, then store
them online for student reference.
- Customer-service centers might index recorded telephone
calls, analyzing them to find complaint patterns or to evaluate
their agents' performances.
- Medical personnel could use SpeechBot for content analysis
and indexing in the life-sciences domain.
Despite the technology's current limitations, it offers an
unparalleled way to access, catalog and analyze the Web's
ever-expanding wealth of multimedia resources, its creators
say. "Most of the time, when you keep a recording, you
do nothing with it," Moreno says. "It's kept in
a tomb that's never opened."
SpeechBot lets users access those vaults and find those
treasures.
by Anne Stuart