Google Voice Search for the iPhone launched today. I’m not convinced that it’s a particularly useful addition to the existing Google app (you have to use the touchscreen in order to launch the app, and by that time surely it’s just as quick to type the word in) but it’s certainly an interesting demonstration of the technology — and very entertaining to test.
The interface is simple: just lift the phone to your ear (a small bleep lets you know that the motion has been detected and it’s ready to start) and speak your search terms. A nice touch is that the ‘soundwave’ icon displayed while it’s processing the input really does change with each search; in the picture below, the search being performed is “parrot sketch” (not the previous search, which is still displayed in the search box at the top of the screen), and it’s a reasonably plausible shape for that phrase:
It only very occasionally concedes defeat altogether, with a laconic “didn’t get that” (not to be confused with “didn’t go through”, which seems to be a momentary failure to connect):
So how accurate are the results when it does find something? Once I’d tried a few different searches, including my name (an awkward pileup of consonants at the best of times, but it made a valiant attempt) and the name of this blog (reliably recognised, I’m pleased to say!), I decided to try something a bit more systematic. I was hoping to find some kind of list of words used to calibrate speech recognition software; what I eventually found instead was a spondee list (two-syllable words or phrases with equal stress on each syllable) for Speech Reception Threshold testing (Stanley A. Gelfand, Essentials of Audiology, New York: Thieme, 2001, Appendix B). Recipients of the test are expected to be familiar with these words/phrases already, but if that’s the case, Google should be familiar with them too; and if not, then it’s as good as any other arbitrary selection. I tried each word twice, and recorded the results:
word/phrase | first guess | second guess |
---|---|---|
airplane | ☓ | sam |
armchair | comcast | amtrak |
backbone | experian | experian |
baseball | ✓ | ✓ |
birthday | sta | spa |
blackboard | ✓ | ✓ |
cookbook | cooks | ✓ |
cowboy | calpoly | ✓ |
doormat | doormats | ✓ |
drawbridge | corporate | old bridge |
duck pond | ✓ | ✓ |
eardrum | income | its |
earthquake | escalate | s clinic |
eyebrow | ✓ | ✓ |
greyhound | ✓ | ✓ |
hardware | holland flag | hot |
headlight | ✓ | ✓ |
horseshoe | ocean | ✓ |
hotdog | ✓ | pa |
ice cream | ✓ | ✓ |
inkwell | ✓ | ✓ |
mousetrap | ✓ | myspac |
mushroom | machine | schwinn |
northwest | ✓ | southwest |
nutmeg | nuts mag | netflix |
oatmeal | betrayal | israel |
outside | ✓ | ✓ |
padlock | hotchalk | adult |
pancake | ✓ | ✓ |
playground | ✓ | ✓ |
railroad | nile virus | male names |
stairway | skyway | amway |
sunset | ✓ | chat |
toothbrush | ✓ | flash |
whitewash | white phlox | squash |
woodwork | wood flat | flat |
NB: the application does warn that “Voice Search only works in English, and works best for North American English accents”; I didn’t attempt to fake a North American English accent, but I did try to speak clearly and minimise background noise.
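For a quick tally of the table above, here is a minimal Python sketch. The rows are transcribed by hand from the table; the names `RESULTS`, `parse` and `tally` are my own, and counting anything other than ✓ as a miss (including near-misses such as “doormats”) is an assumption about how to score it.

```python
# Results transcribed from the table: word | first guess | second guess.
# "✓" marks a correct recognition; anything else is counted as a miss.
RESULTS = """
airplane | ☓ | sam
armchair | comcast | amtrak
backbone | experian | experian
baseball | ✓ | ✓
birthday | sta | spa
blackboard | ✓ | ✓
cookbook | cooks | ✓
cowboy | calpoly | ✓
doormat | doormats | ✓
drawbridge | corporate | old bridge
duck pond | ✓ | ✓
eardrum | income | its
earthquake | escalate | s clinic
eyebrow | ✓ | ✓
greyhound | ✓ | ✓
hardware | holland flag | hot
headlight | ✓ | ✓
horseshoe | ocean | ✓
hotdog | ✓ | pa
ice cream | ✓ | ✓
inkwell | ✓ | ✓
mousetrap | ✓ | myspac
mushroom | machine | schwinn
northwest | ✓ | southwest
nutmeg | nuts mag | netflix
oatmeal | betrayal | israel
outside | ✓ | ✓
padlock | hotchalk | adult
pancake | ✓ | ✓
playground | ✓ | ✓
railroad | nile virus | male names
stairway | skyway | amway
sunset | ✓ | chat
toothbrush | ✓ | flash
whitewash | white phlox | squash
woodwork | wood flat | flat
"""

CORRECT = "✓"


def parse(text):
    """Turn the pipe-separated rows into (word, first, second) tuples."""
    rows = []
    for line in text.strip().splitlines():
        word, first, second = (cell.strip() for cell in line.split("|"))
        rows.append((word, first, second))
    return rows


def tally(rows):
    """Count correct first guesses, second guesses, and their overlaps."""
    return {
        "words": len(rows),
        "first": sum(1 for _, f, _ in rows if f == CORRECT),
        "second": sum(1 for _, _, s in rows if s == CORRECT),
        "both": sum(1 for _, f, s in rows if f == CORRECT and s == CORRECT),
        "neither": sum(1 for _, f, s in rows if f != CORRECT and s != CORRECT),
    }


if __name__ == "__main__":
    for key, value in tally(parse(RESULTS)).items():
        print(f"{key}: {value}")
```

Run against the table as transcribed here, it reports 16 of the 36 words recognised on the first attempt and 15 on the second, with 11 recognised both times and 16 never recognised at all.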
It’s interesting how many of the incorrect results were company or brand names: Comcast, Amtrak, Experian, Schwinn, Netflix, Hotchalk, Amway. They don’t all get more hits on Google than the corresponding correct word, either (there are more armchairs on Google than Comcasts, and more backbones than Experians), though perhaps they do get more hits than other incorrect guesses which the voice recognition rejects.
In most cases, the incorrect result is similar in shape to the search word: it’s easy to see how one gets from “drawbridge” to “corporate”, from “horseshoe” to “ocean”, from “mushroom” to “machine”, or even from “railroad” to “male names”. I would say that some of the incorrect guesses have more syllables than the original words, but syllable counting is notoriously difficult; and when allowing for the difference between British English and North American English accents as well, all bets are off.
However, there are some really baffling guesses: “hotdog” shares at most one vowel with Google’s guess of “pa”, and “birthday” is a lost cause, since only the second half of the word seems to come through, with the ‘thday’ /TteI/ being rendered as ‘sta’ (/steI/) and ‘spa’ (/speI/). That’s my best guess, anyway. And I really can’t see how you get from “airplane” to “sam”.
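As a very rough way to put a number on “similar in shape”, the sketch below compares spellings using Python’s standard `difflib.SequenceMatcher`. This is purely my own illustration and has nothing to do with how Google’s recognition works: it measures overlap between written letters, not sounds, so it will understate any similarity that only shows up in pronunciation. The `PAIRS` list and the `similarity` helper are hypothetical names.

```python
from difflib import SequenceMatcher

# A few (spoken word, Google's guess) pairs taken from the table.
PAIRS = [
    ("doormat", "doormats"),
    ("stairway", "skyway"),
    ("mushroom", "machine"),
    ("railroad", "male names"),
    ("hotdog", "pa"),
    ("airplane", "sam"),
]


def similarity(a, b):
    """Return a ratio from 0 to 1 of how much the two spellings overlap."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Print the pairs from most to least similar in spelling.
    ranked = sorted(PAIRS, key=lambda pair: similarity(*pair), reverse=True)
    for spoken, guess in ranked:
        print(f"{spoken:10} -> {guess:12} {similarity(spoken, guess):.2f}")
```

A phonetic comparison would be a fairer test of the “shape” idea, but even a spelling-based score like this should separate the near-misses from the guesses that share almost nothing with the original word.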
The one really frustrating thing, though, is not being able to ‘teach’ the search. There’s no way to train the application on what your voice sounds like with a series of reference words, and there’s no way to tell Google what you were really searching for, not even the usual “did you mean…” option (though it’s possible that they use clickthroughs from searches as a rough indicator of success). If Google recorded each search and allowed users to ‘transcribe’ their searches at the same time, they could amass a vast corpus of spoken English words and their written forms; in fact, this is apparently what they intended to do with the previous incarnation of Google Voice Search. But the privacy implications of this are problematic, particularly given that the iPhone Google app has to be downloaded via iTunes and hence via a personal and extremely trackable account.
Incidentally, the title of this post is what I got when using Voice Search to search for “speech recognition” — Google Voice Search is not quite speechless, but it’s also not quite there yet.
A lot of the errors I saw were fixable with an American accent, as you say. Brings back memories of trying to drive Apple’s Plaintalk speech recognition back in the nineties. “Open Word!” “Open Worrrrrrd!” “Gee, open Worrrrrd, dude!”