Speechless

November 19, 2008

Google Voice Search for the iPhone launched today. I’m not convinced that it’s a particularly useful addition to the existing Google app (you have to use the touchscreen in order to launch the app, and by that time surely it’s just as quick to type the word in) but it’s certainly an interesting demonstration of the technology — and very entertaining to test.

The interface is simple: just lift the phone to your ear (a small bleep lets you know that the motion has been detected and it’s ready to start) and speak your search terms. A nice touch is that the ‘soundwave’ icon displayed while it’s processing the input really does change with each search; in the picture below, the search being performed is “parrot sketch” (not the previous search, which is still displayed in the search box at the top of the screen), and it’s a reasonably plausible shape for that phrase:

Voice search shows the shape of your words

It only very occasionally concedes defeat altogether, with a laconic “didn’t get that” (not to be confused with “didn’t go through”, which seems to be a momentary failure to connect):

Google Voice Search failure modes

So how accurate are the results when it does find something? Once I’d tried a few different searches, including my name (an awkward pileup of consonants at the best of times, but it made a valiant attempt) and the name of this blog (reliably recognised, I’m pleased to say!), I decided to try something a bit more systematic. I was hoping to find some kind of list of words used to calibrate speech recognition software, but what I eventually found was a spondee list for Speech Reception Threshold testing (Stanley A. Gelfand, Essentials of Audiology, New York: Thieme, 2001, Appendix B). Recipients of the test are expected to be familiar with these words/phrases already, but if that’s the case, Google should be familiar with them too; and if not, then it’s as good as any other arbitrary selection. I tried each word twice, and recorded the results:

word/phrase first guess second guess
airplane sam
armchair comcast amtrak
backbone experian experian
baseball
birthday sta spa
blackboard
cookbook cooks
cowboy calpoly
doormat doormats
drawbridge corporate old bridge
duck pond
eardrum income its
earthquake escalate s clinic
eyebrow
greyhound
hardware holland flag hot
headlight
horseshoe ocean
hotdog pa
ice cream
inkwell
mousetrap myspac
mushroom machine schwinn
northwest southwest
nutmeg nuts mag netflix
oatmeal betrayal israel
outside
padlock hotchalk adult
pancake
playground
railroad nile virus male names
stairway skyway amway
sunset chat
toothbrush flash
whitewash white phlox squash
woodwork wood flat flat

NB: the application does warn that “Voice Search only works in English, and works best for North American English accents”; I didn’t attempt to fake a North American English accent, but I did try to speak clearly and minimise background noise.

It’s interesting how many of the incorrect results were company or brand names: Comcast, Amtrak, Experian, Schwinn, Netflix, Hotchalk, Amway. They don’t all get more hits on Google than the corresponding correct word, either (there are more armchairs on Google than Comcasts, and more backbones than Experians), though perhaps they do get more hits than other incorrect guesses which the voice recognition rejects.

In most cases, the incorrect result is similar in shape to the search word: it’s easy to see how one gets from “drawbridge” to “corporate”, from “horseshoe” to “ocean”, from “mushroom” to “machine”, or even from “railroad” to “male names”. I would say that some of the incorrect guesses have more syllables than the original words, but syllable counting is notoriously difficult; and when allowing for the difference between British English and North American English accents as well, all bets are off.
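As a rough illustration of why syllable counting is so slippery, here is a minimal Python sketch that counts runs of vowel letters in a few of the word pairs quoted above. The function name `naive_syllables` is hypothetical, and the heuristic works on spelling rather than sound, so it says nothing about how Google's recogniser actually compares words.

```python
import re


def naive_syllables(phrase: str) -> int:
    """Crude estimate: count runs of vowel letters (a, e, i, o, u, y).

    This is a spelling-based assumption, not a phonetic analysis.
    """
    return len(re.findall(r"[aeiouy]+", phrase.lower()))


# (search word, one of Google's guesses) pairs quoted in the post
pairs = [
    ("drawbridge", "corporate"),
    ("horseshoe", "ocean"),
    ("mushroom", "machine"),
    ("railroad", "male names"),
]

for spoken, guess in pairs:
    print(f"{spoken}: {naive_syllables(spoken)}   {guess}: {naive_syllables(guess)}")
```

The heuristic already goes wrong on the first word: it counts three vowel runs in "drawbridge" because of the silent final "e", although the spoken word has two syllables. Any serious comparison would need phonetic transcriptions, and those differ between accents.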

However, there are some really baffling guesses: “hotdog” shares at most one vowel with Google’s guess of “pa”, and “birthday” is a lost cause, since only the second half of the word seems to come through, with the ‘thday’ /TteI/ being rendered as ‘sta’ (/steI/) and ‘spa’ (/speI/). That’s my best guess, anyway. And I really can’t see how you get from “airplane” to “sam”.

The one really frustrating thing, though, is not being able to ‘teach’ the search: there’s no way to teach the application what your voice sounds like with a series of reference words; and there’s no way to tell Google what you were really searching for, not even the usual “did you mean…” option — though it’s possible that they use clickthroughs from searches as a rough indicator of success. Google could, if they recorded each search and allowed users to ‘transcribe’ their searches at the same time, amass a vast corpus of spoken English words and their written forms — in fact, this is apparently what they intended to do with the previous incarnation of Google Voice Search — but the privacy implications of this are problematic, particularly given that the iPhone Google app has to be downloaded via iTunes and hence via a personal and extremely trackable account.

Incidentally, the title of this post is what I got when using Voice Search to search for “speech recognition” — Google Voice Search is not quite speechless, but it’s also not quite there yet.


Well-formed

November 17, 2008

The TEI Members’ Meeting earlier this month gave me a perfect opportunity to show off my XML earrings:

<head> </head>

They were simple to make; I just printed out the tags on ordinary white paper, cut the paper around them to a triangular template (based on some earrings I already had, so I knew they wouldn’t be too big to wear) and laminated the result (leaving a reasonable margin, partly to prevent the laminated layers from separating and partly to leave room to attach the actual earring hooks). The laminating plastic is thin enough that I could just use a needle to punch through that ‘margin’ to insert the hooks.

The problem with using XML tags for decorative purposes is that anything requiring symmetry is always thwarted by the fact that the closing tag is always one character longer than its opening counterpart: there’s no way to make your <tag> and </tag> line up exactly. I’d already encountered this problem when making Christmas cards for colleagues last year, too:

First drafts of an XML-ish Christmas card

While hand-lettering makes it easy to compensate for the asymmetry with creative kerning, the result doesn’t quite look like XML any more.
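The one-character asymmetry between opening and closing tags is easy to check in code. The following is a small Python sketch; the helper names `opening_tag` and `closing_tag` are made up for illustration.

```python
def opening_tag(name: str) -> str:
    """Build an opening tag such as <head>."""
    return f"<{name}>"


def closing_tag(name: str) -> str:
    """Build the matching closing tag such as </head>."""
    return f"</{name}>"


for name in ("head", "body", "tag"):
    start, end = opening_tag(name), closing_tag(name)
    # The closing tag carries one extra character: the slash.
    print(start, len(start), end, len(end), len(end) - len(start))
```

Whatever the element name, the difference in length is always the single slash, so in a monospace font the pair can never be exactly the same width without extra padding.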

There is something iconic about markup, though, beyond the punning potential of <head> tags on earrings or hats, and <body> tags (or perhaps <front/> and <back/>?) on T-shirts, and so on. Maybe it’s just the retro cool of monospace text; or maybe it’s more that it appeals to our desire to name things, to label them, to impose on them our interpretation of them.

Whatever the reason, I’m pleased to say that the earrings got a good reception! I’m happy to make more for other people on a best-effort basis, but equally happy for other people to copy the idea, and I note that someone else is selling a much more robust-looking version over on Etsy. XML: the iconic designer brand that anybody can use!