Speechless

Google Voice Search for the iPhone launched today. I’m not convinced that it’s a particularly useful addition to the existing Google app (you have to use the touchscreen in order to launch the app, and by that time surely it’s just as quick to type the word in) but it’s certainly an interesting demonstration of the technology — and very entertaining to test.

The interface is simple: just lift the phone to your ear (a small bleep lets you know that the motion has been detected and it’s ready to start) and speak your search terms. A nice touch is that the ‘soundwave’ icon displayed while it’s processing the input really does change with each search; in the picture below, the search being performed is “parrot sketch” (not the previous search, which is still displayed in the search box at the top of the screen), and the icon is a reasonably plausible shape for that phrase:

Voice search shows the shape of your words


It only very occasionally concedes defeat altogether, with a laconic “didn’t get that” (not to be confused with “didn’t go through”, which seems to be a momentary failure to connect):

Google Voice Search failure modes


So how accurate are the results when it does find something? Once I’d tried a few different searches, including my name (an awkward pileup of consonants at the best of times, but it made a valiant attempt) and the name of this blog (reliably recognised, I’m pleased to say!), I decided to try something a bit more systematic. I was hoping to find some kind of list of words used to calibrate speech recognition software, but eventually found a spondee list for Speech Reception Threshold testing (Stanley A. Gelfand, Essentials of audiology, New York : Thieme, 2001. Appendix B). Recipients of the test are expected to be familiar with these words/phrases already, but if that’s the case, Google should be familiar with them too; and if not, then it’s as good as any other arbitrary selection. I tried each word twice, and recorded the results:

word/phrase first guess second guess
airplane sam
armchair comcast amtrak
backbone experian experian
baseball
birthday sta spa
blackboard
cookbook cooks
cowboy calpoly
doormat doormats
drawbridge corporate old bridge
duck pond
eardrum income its
earthquake escalate s clinic
eyebrow
greyhound
hardware holland flag hot
headlight
horseshoe ocean
hotdog pa
ice cream
inkwell
mousetrap myspac
mushroom machine schwinn
northwest southwest
nutmeg nuts mag netflix
oatmeal betrayal israel
outside
padlock hotchalk adult
pancake
playground
railroad nile virus male names
stairway skyway amway
sunset chat
toothbrush flash
whitewash white phlox squash
woodwork wood flat flat

NB: the application does warn that “Voice Search only works in English, and works best for North American English accents”; I didn’t attempt to fake a North American English accent, but I did try to speak clearly and minimise background noise.
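For anyone wanting to repeat the experiment, a tally like this is easy to automate. The snippet below is only a minimal sketch: the `results` dictionary is a hand-picked handful of rows from the table, the `tally` function is a made-up helper, and it assumes that a blank cell in the table means the attempt was recognised correctly (represented here as `None`).

```python
# A few rows transcribed from the table above, purely for illustration.
# Assumption: None stands for a blank cell, i.e. a correct recognition.
results = {
    "armchair": ["comcast", "amtrak"],
    "backbone": ["experian", "experian"],
    "baseball": [None, None],
    "birthday": ["sta", "spa"],
    "pancake": [None, None],
    "stairway": ["skyway", "amway"],
}


def tally(results):
    """Return (correct_attempts, total_attempts) for a results mapping."""
    attempts = sum(len(guesses) for guesses in results.values())
    correct = sum(
        guess is None for guesses in results.values() for guess in guesses
    )
    return correct, attempts


correct, attempts = tally(results)
print(f"{correct} of {attempts} attempts recognised correctly")
```

Extending `results` to the full word list would give an overall hit rate, though with only two attempts per word it would be a rough figure at best.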

It’s interesting how many of the incorrect results were company or brand names: Comcast, Amtrak, Experian, Schwinn, Netflix, Hotchalk, Amway. They don’t all get more hits on Google than the corresponding correct word, either (there are more armchairs on Google than Comcasts, and more backbones than Experians), though perhaps they do get more hits than other incorrect guesses which the voice recognition rejects.

In most cases, the incorrect result is similar in shape to the search word: it’s easy to see how one gets from “drawbridge” to “corporate”, from “horseshoe” to “ocean”, from “mushroom” to “machine”, or even from “railroad” to “male names”. I would say that some of the incorrect guesses have more syllables than the original words, but syllable counting is notoriously difficult; and when allowing for the difference between British English and North American English accents as well, all bets are off.

However, there are some really baffling guesses: “hotdog” only shares at most one vowel with Google’s guess of “pa”, and “birthday” is a lost cause — only the second half of the word seems to come through, with the ‘thday’ /TteI/ being rendered as ‘sta’ (/steI/) and ‘spa’ (/speI/). That’s my best guess, anyway. And I really can’t see how you get from “airplane” to “sam”.
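One crude way to put a number on “similar in shape” is to compare spellings. The sketch below uses `SequenceMatcher` from Python’s standard `difflib` module; the `spelling_similarity` helper is my own invention, and spelling is only a loose stand-in for sound (a pair like “horseshoe” and “ocean” sounds closer than it looks on paper), so treat the scores as a rough indication rather than a phonetic measure.

```python
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Return a score from 0 to 1 based on shared runs of letters.

    This compares spelling only; it knows nothing about pronunciation.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Search word and Google's guess, taken from the table above.
pairs = [
    ("northwest", "southwest"),
    ("mushroom", "machine"),
    ("horseshoe", "ocean"),
    ("airplane", "sam"),
]

for spoken, guessed in pairs:
    score = spelling_similarity(spoken, guessed)
    print(f"{spoken!r} -> {guessed!r}: {score:.2f}")
```

A proper comparison would work on phonemic transcriptions rather than letters, which is exactly where the accent differences mentioned above would start to matter.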

The one really frustrating thing, though, is not being able to ‘teach’ the search. There’s no way to train the application on what your voice sounds like with a series of reference words, and there’s no way to tell Google what you were really searching for, not even the usual “did you mean…” option (though it’s possible that they use clickthroughs from searches as a rough indicator of success). If Google recorded each search and allowed users to ‘transcribe’ their searches at the same time, they could amass a vast corpus of spoken English words and their written forms; in fact, this is apparently what they intended to do with the previous incarnation of Google Voice Search. But the privacy implications are problematic, particularly given that the iPhone Google app has to be downloaded via iTunes and hence via a personal and extremely trackable account.

Incidentally, the title of this post is what I got when using Voice Search to search for “speech recognition” — Google Voice Search is not quite speechless, but it’s also not quite there yet.

One Response to Speechless

  1. Art says:

    A lot of the errors I saw were fixable with an American accent, as you say. Brings back memories of trying to drive Apple’s PlainTalk speech recognition back in the nineties. “Open Word!” “Open Worrrrrrd!” “Gee, open Worrrrrd, dude!”
