SpeechBot: A Search Engine for Sound

Cambridge team's creation seeks spoken words on the Web

June 2003

SpeechBot offers an unparalleled way to access, catalog, and analyze the Web's wealth of multimedia resources.

To call Pedro Moreno and his HP Labs colleagues ambitious is putting it mildly.

In late 1999, they launched a prototype for SpeechBot, the first Web search engine that finds and indexes spoken audio rather than text. Later, they expanded the technology to search for video and music as well.

By early June 2003, the seven-member group, based at HP's Cambridge Research Laboratory in Cambridge, Mass. (USA), had catalogued more than 17,000 hours of multimedia content – making SpeechBot the largest multimedia index in the world.

Multimedia Search Engine

While still experimental, SpeechBot (a contraction of "speech robot") reflects the Internet's evolution beyond a text-based medium.

"There are huge amounts of audio and video available," says Moreno, a senior researcher at CRL, based in a high-rise office building on the edge of the Massachusetts Institute of Technology campus. "This focuses on technologies to search for them."

Using speech-recognition technology, SpeechBot finds and indexes multimedia content from selected sources ranging from the U.S. government to National Public Radio to something called Scuba Radio, "the world's first radio show devoted to diving."

Internet users can search for specific content using SpeechBot's public site, https://www.speechbot.com, which looks and works much like any other Web-based search engine.

Using SpeechBot

The home page directory lists what's available, including National Public Radio's "Car Talk" automotive-advice show, White House videos of presidential press briefings and speeches, and InternetNews Radio (interactive content from Internet.com, Jupitermedia Corp.'s online technology-news service).

Visitors simply type in keywords – for instance, "suicide bombing" or "Iraq reconstruction" for Middle East news, or "Martin Sheen" for interviews with the star of TV's "The West Wing." SpeechBot returns a directory of hits matching the query, with information about the source of the media such as "U.S. Department of Defense Briefings," a time stamp and a brief excerpt from a text transcript. Choices can be sorted or limited by source, date or relevance.

Users can then play the audio or video, or they can click on "show me more." That option returns a larger chunk of text transcription and a timeline illustrating the length of the selection and how often and when the keyword appears. Meanwhile, the site displays a text transcript of the same material.

Finding the Needle in the Haystack

Here's an example, using the "Iraq reconstruction" query. The "show me more" option for National Public Radio's "On Point" program of May 6, 2003, indicates that speakers uttered the word "reconstruction" 20 times in 51 minutes. Each reference appears as a spot along the timeline, and clicking on a spot plays the referenced excerpt.
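The article does not describe how SpeechBot stores its index, but the timeline behavior can be illustrated with a minimal sketch. It assumes the recognizer emits (word, start time) pairs; the function names `build_index` and `timeline` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(transcript):
    """Map each word to the times (in seconds) at which it was spoken.

    `transcript` is a list of (word, start_seconds) pairs. This input
    format is an assumption, not SpeechBot's documented output.
    """
    index = defaultdict(list)
    for word, start in transcript:
        index[word.lower()].append(start)
    return index


def timeline(index, keyword):
    """Return the sorted playback offsets for one keyword."""
    return sorted(index.get(keyword.lower(), []))


# Made-up excerpt for illustration; not real SpeechBot data.
sample = [
    ("the", 0.0), ("reconstruction", 0.4), ("of", 1.1), ("iraq", 1.3),
    ("will", 62.0), ("take", 62.3), ("years", 62.6),
    ("Reconstruction", 125.5), ("costs", 126.2),
]
index = build_index(sample)
hits = timeline(index, "reconstruction")
print(len(hits), hits)  # 2 [0.4, 125.5]
```

Each offset in `hits` corresponds to one spot on the timeline; a player could seek to that offset to start playback at the matching excerpt.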

Thanks to that capability, SpeechBot helps solve a long-standing problem with streaming audio or video: finding, say, a particular one- or two-minute segment of a much longer piece. Now, says senior researcher Jean-Manuel Van Thong, "when you have an hour or more of content, we can get you to the precise point that you want."

SpeechBot, which can handle up to 1 million page views daily, doesn't actually archive the multimedia material, a function that would create technical and copyright nightmares. Instead, like Google or other search engines, it finds and indexes data for users.

"We don't hijack the content," Van Thong says. "It's served up from the original site."

Audio to Text

However, SpeechBot does transform the audio into text, and those transcripts appear on users' screens as the selection plays. Why bother with written words in a research project emphasizing sound and motion?

"It helps the user to have a little bit of transcript to eyeball," Moreno explains. "The human interface is still faster at processing visually than by listening. So we tried to give them as much visual information as possible."

Most transcripts consist of raw verbiage. SpeechBot doesn't yet punctuate, capitalize or distinguish between speakers, so excerpts tend to appear as one long lower-case sentence. And the transcription isn't always perfect: "Iraq" sometimes appears as "rock," "a rock" or "the rock," for instance. But the text does align quite precisely with the audio version, making it easier for users to scan for exactly what they want or to follow along as the clip plays.

Error Rates

The transcripts illustrate both the progress and the limitations of speech-recognition technology. With high-quality content, such as audio from an in-studio radio show, SpeechBot transcribes material with an error rate of 20 percent, Moreno says. Speakers with unusual accents or verbal tics such as constant repetition of "um" or "er" can push the error rate to 30 percent. Throw in a telephone interview or audio recorded in a noisy environment, and the error rate rises to 50 percent.
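The article does not say how these error rates are measured. A common metric in speech recognition is word error rate: the number of word substitutions, insertions and deletions needed to turn the recognizer's output into the reference transcript, divided by the reference length. The sketch below assumes that metric; the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word dropped by recognizer
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# "iraq" heard as "a rock": one substitution plus one insertion.
wer = word_error_rate(
    "the reconstruction of iraq will take years",
    "the reconstruction of a rock will take years",
)
print(round(wer, 3))  # 0.286
```

Two errors against a seven-word reference give a rate of roughly 29 percent, which shows how a single misheard name can weigh heavily on a short excerpt.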

Still, given what SpeechBot does recognize, and the increasing, continually updated volume of indexed material, odds are good that any SpeechBot search will turn up something useful. As Moreno puts it: "The chance that you would miss all the places with a particular keyword is very low."
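Moreno's point can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes, as a simplification, that each spoken occurrence of a keyword is misrecognized independently with a probability equal to the error rate. Real recognition errors are likely correlated (same speaker, same recording conditions), so treat the numbers as illustrative only.

```python
def chance_of_missing_all(error_rate, occurrences):
    """Probability that every occurrence of a keyword is misrecognized,
    assuming independent errors (a simplifying assumption)."""
    return error_rate ** occurrences


# Error rates quoted in the article, for a keyword spoken three times.
for rate in (0.2, 0.3, 0.5):
    print(f"{rate:.0%} error rate: {chance_of_missing_all(rate, 3):.3f}")
```

Even at a 50 percent error rate, a keyword spoken three times would be missed entirely only about one time in eight under this model, and the probability shrinks quickly as the keyword is repeated.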

Indexing Music by Content

The group's latest work, BoogeeBot, involves seeking and indexing music archives. The real breakthrough: using technologies similar to SpeechBot's to recognize other sound patterns, such as rhythm or use of a particular instrument.

"The technology provides a way of organizing music based on content," rather than by artist, album, song title, or even genre, says researcher Beth Logan.

Thus, Aerosmith's classic rocker "Walk This Way" sounds similar – at least to BoogeeBot – to songs by the Eagles, the Blues Brothers, Meat Loaf and Marilyn Manson. The common element might be something as simple as crowds clapping rhythmically during each band's live performance. In another example, BoogeeBot catalogs traditional Greek music with tunes by Jethro Tull and Jimmy Buffett – probably because all use stringed instruments.

"It's subjective," Moreno acknowledges.

BoogeeBot can search among 18,000 songs in styles from Cajun to classical. But that capability isn't yet available on the public Web site. For now, because of strict music copyright laws, the team can demonstrate music indexing only inside the research lab.
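The article does not explain which audio features BoogeeBot extracts or how it compares songs. As a purely illustrative sketch of content-based grouping, the code below assumes each song has already been reduced to a small numeric feature vector and ranks songs by cosine similarity. The feature values, song names and function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, library):
    """Rank the other songs in `library` by similarity to `query`."""
    target = library[query]
    scored = [
        (name, cosine_similarity(target, features))
        for name, features in library.items()
        if name != query
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up vectors (for example: tempo, percussiveness, string content).
library = {
    "song_a": [0.9, 0.8, 0.1],
    "song_b": [0.8, 0.9, 0.2],
    "song_c": [0.1, 0.2, 0.9],
}
ranking = most_similar("song_a", library)
print([name for name, _ in ranking])  # ['song_b', 'song_c']
```

Under this kind of scheme, two songs from different genres end up neighbors whenever their feature vectors happen to align, which is consistent with the surprising pairings described above.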

Commercial Applications

Currently, SpeechBot focuses largely on indexing public news sites. But the team believes the technology has plenty of commercial potential as well.

Among the possibilities:

  • Businesses could use the technology in corporate intranets or to archive audiotapes of meetings.
  • College professors could record their lectures, then store them online for student reference.
  • Customer-service centers might index recorded telephone calls, analyzing them to find complaint patterns or to evaluate their agents' performances.
  • Medical personnel could use SpeechBot for content analysis and indexing in the life-sciences domain.

Despite the technology's current limitations, it offers an unparalleled way to access, catalog and analyze the Web's ever-expanding wealth of multimedia resources, its creators say. "Most of the time, when you keep a recording, you do nothing with it," Moreno says. "It's kept in a tomb that's never opened."

SpeechBot lets users access those vaults and find those

by Anne Stuart

Researchers, left to right: Pedro Moreno, Beth Logan and Jean-Manuel Van Thong.
© 2009 Hewlett-Packard Development Company, L.P.