The business value of language technologies: Voice Control is coming back … will it succeed this time? With or without dialogue? With or without error-recovery mechanism?

The idea of Voice Control (also called Voice Command, or VC) is not new.

I recall 1998 or 1999 … It was the beginning of the Automatic Speech Recognition market, a time when we dreamed of being able to browse the web using speech. I met several times with a German company in Munich. They wanted to speech-enable the whole web automatically, without any human being in between … so, taking the HTML code of a webpage and automatically generating a voice presentation of that webpage for a person who would be on the phone and not in front of his PC.

The idea was straightforward: detect the entry fields on a web page and speech-enable them by automatically pronouncing (prompting) the name of each field while, at the same time, automatically loading the speech recognizer with the vocabulary expected by that field. For example, take an entry field asking for the destination airport on the flight-booking webpage of an airline: the name of this field is “Destination Airport” and it contains all destination airports and cities of that airline. The voice browser would take the name of that field and ask with a synthesized voice “please enter the destination airport”. Meanwhile, the recognizer would be loaded with the names of all airports and cities covered by the airline. The user pronounces the name of a city and the voice browser goes to the next entry field. Simple, isn't it? This Munich-based company even developed an extension of HTML for that purpose.
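To make the mechanism concrete, here is a minimal sketch in Python of what such a voice browser would derive from a page. It is my own illustration, not the Munich company's HTML extension: the sample page, the class name FieldExtractor and the function voice_plan are all hypothetical, and only drop-down (select) fields are handled.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects each <select> field with its spoken name and option texts."""

    def __init__(self):
        super().__init__()
        self.fields = []         # one dict per <select> found on the page
        self._current = None     # the <select> currently being read
        self._in_option = False  # True while inside an <option> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            # Assumption: the human-readable field name is in "title",
            # with the technical "name" attribute as a fallback.
            name = attrs.get("title") or attrs.get("name") or ""
            self._current = {"name": name, "vocabulary": []}
        elif tag == "option" and self._current is not None:
            self._in_option = True

    def handle_data(self, data):
        if self._in_option and data.strip():
            self._current["vocabulary"].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False
        elif tag == "select" and self._current is not None:
            self.fields.append(self._current)
            self._current = None


def voice_plan(html):
    """Returns, per field, the prompt to synthesize and the words to load."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return [
        {
            "prompt": f"Please enter the {field['name'].lower()}",
            "vocabulary": field["vocabulary"],
        }
        for field in parser.fields
    ]


# Hypothetical booking page with a single entry field.
PAGE = """
<form>
  <select name="destination" title="Destination Airport">
    <option>New York</option>
    <option>Munich</option>
    <option>Paris</option>
  </select>
</form>
"""

plan = voice_plan(PAGE)
```

Each entry of plan pairs one prompt with the vocabulary the recognizer would be loaded with before listening. Everything that makes the real problem hard (free-text fields, dependencies between fields, errors) is deliberately left out.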

Now the thing is that on the booking web page of an airline, you typically have several fields like that to fill in: date, time, departure and arrival cities at least (and by the way, the time, was it arrival or departure time?). This means the voice browser would need to present each of these fields, one after the other. It also means automatic dialogue definition on the fly, with error-recovery strategies at each step … a horror scenario for “blind” over-the-phone access to the web page, even today, 12 years later, with all the experience we have in dialogue handling for automated phone systems.

Well, you can guess … it did not fly … it could not fly. The company eventually got bought by BurdaDigital … and I lost track of them.

The idea could not fly for the simple reason that even today, we have just enough understanding to build by hand, let’s be positive, a nice human-machine over-the-phone dialogue component. Automatically generating human-machine phone dialogues (as opposed to manually defining them), which was the idea of the Munich company, is still not solved today. And then add the question of presenting the content of a webpage to a caller on the phone, deciding which details of the webpage are interesting for the caller and which are not, and summarizing the interesting part in such a way that the caller stays on the phone … A lot of work still has to be done here!

Now the idea of the Munich-based company could have flown … on the web, not on the phone. On the web, that is, having the web page in front of you: speech in, graphics out. On the web, that is, replacing the keyboard of your internet-connected device with your voice. In 1998 or 1999 the typical internet connection had a two-digit kbit/s rate, far too slow to handle speech. So they could not even try the idea.

12 years later, the internet runs at megabit rates. Even mobile internet. We have smartphones. And we have many more smartphones than there were PCs at that time. We also have 12 years of experience in speech recognition and dialogue handling. So you may think we should go for it, shouldn't we? The answer is, well, we are cautious. Very cautious!

The best examples are given by 2 big names, Apple and Google.

Apple introduced Voice Control on iPhones and iPods in 2009. You can use Voice Control to place a call or to control your music library (iPod). It works in 17 languages, plus variants of some of these languages (e.g. US English as a variant of UK English). No dialogue … No error recovery: in case of error, you pronounce your command again. The decision not to offer error recovery is important: the user tries his luck and, if it does not work, he does not want to get involved in error correction. That is Apple’s design concept.

Google goes a little bit further on Android phones. They call it Voice Actions, essentially to differentiate it from Apple's offering. In addition to Apple's voice commands, Voice Actions lets you dictate a text message (or a mail, or a note to yourself), and it integrates with a map and navigation application. Furthermore, it includes Voice Search, that is, searching the web with your voice instead of typing keywords on a keyboard. Google's Voice Actions understands only US English. No dialogue … Error recovery here is simple: either you pronounce your command again, or you select from a list of proposed alternatives, or you correct the entry with the keyboard.
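The three recovery paths described here (say it again, pick from a list, correct by keyboard) can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Google's implementation: the function recover, the confidence scores attached to each hypothesis and the 0.8 acceptance threshold are all hypothetical.

```python
def recover(hypotheses, pick=None, typed=None, threshold=0.8, max_alternatives=3):
    """Decides what to do with the recognizer's guesses for one command.

    hypotheses: list of (text, confidence) pairs, in any order.
    pick:       index chosen by the user in the list of alternatives, if any.
    typed:      keyboard correction entered by the user, if any.
    Returns (text, how), where how is "accepted", "keyboard", "list" or "repeat".
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)

    # Confident enough (assumed threshold): take the best guess as it is.
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0], "accepted"

    # A keyboard correction overrides anything the recognizer proposed.
    if typed is not None:
        return typed, "keyboard"

    # Otherwise the user may have selected one of the top alternatives.
    alternatives = [text for text, _ in ranked[:max_alternatives]]
    if pick is not None and 0 <= pick < len(alternatives):
        return alternatives[pick], "list"

    # Nothing usable: the user has to pronounce the command again.
    return None, "repeat"


guesses = [("call Ann", 0.50), ("call Dan", 0.45)]
from_list = recover(guesses, pick=1)
from_keyboard = recover(guesses, typed="call Stan")
say_again = recover(guesses)
```

Note that there is still no dialogue in this scheme: each path ends after a single user action, which is exactly what keeps it simple.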

Apple is weaker on the number of commands/actions (2 against 5 for Google), but stronger on language coverage (17 against 1 for Google). Neither Apple nor Google offers dialogues. We are still far away from the dream of the Munich company at the end of the last century.

But will the dream of dialoguing out of the blue with any website come true?
The answer is: Yes and no.

Yes! In Apple's and Google's sense ... that it is now possible to automatically generate a speaker-independent, good-quality voice-enabled input field together with a text-enabled entry field.

No! In the sense of dialoguing with any website. At first, if both Apple and Google want to be successful, they will keep it at the command/action level and not introduce dialogues, at least until the general public has got used to a voice-enabled user interface.

What is the next step for both companies? To enlarge the number of commands/actions and the number of languages covered. Will it be 3 or 5 new commands/actions, will it be 10? No. It will be an unlimited number. It will be generic and dynamic. Whatever app a user selects, each entry field of the app will be keyboard-enabled AND voice-enabled. It will be possible to type or to speak the content of each and every entry field of an app. This step sounds simple but it is a big step forward … because the recognizer has to be configured automatically, at run time, in the context of the entry field, with all the variations that human beings produce. In the case of an airport destination field, for example, it has to accept users saying “hum yes. To New York … Kennedy Airport not Newark in New Jersey”.
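A toy sketch in Python shows why this is harder than matching one word against a list. It is purely illustrative and works on the recognized text, not on audio: the vocabulary table, the airport codes assigned to each phrase and the function interpret are my own assumptions, and the handling of “not” is deliberately naive.

```python
import re

# Hypothetical vocabulary for one "destination airport" entry field:
# each phrase the user may say maps to the airports it can refer to.
FIELD_VOCABULARY = {
    "new york": {"JFK", "LGA", "EWR"},
    "kennedy": {"JFK"},
    "la guardia": {"LGA"},
    "newark": {"EWR"},
    "munich": {"MUC"},
}


def interpret(utterance, vocabulary=FIELD_VOCABULARY):
    """Returns the set of airport codes still possible after the utterance."""
    # Keep only lowercase words; fillers and punctuation simply match nothing.
    text = " ".join(re.findall(r"[a-z]+", utterance.lower()))

    mentions = []
    for phrase, codes in vocabulary.items():
        for match in re.finditer(rf"\b{re.escape(phrase)}\b", text):
            preceding = text[:match.start()].split()
            # Naive rule: a phrase directly preceded by "not" is a rejection.
            negated = bool(preceding) and preceding[-1] == "not"
            mentions.append((match.start(), negated, codes))

    candidates = None
    excluded = set()
    for _, negated, codes in sorted(mentions, key=lambda m: m[0]):
        if negated:
            excluded |= codes
        elif candidates is None:
            candidates = set(codes)
        else:
            candidates &= codes  # each further mention narrows the choice

    return set() if candidates is None else candidates - excluded


result = interpret("hum yes. To New York … Kennedy Airport not Newark in New Jersey")
```

Here “New York” opens three candidates, “Kennedy” narrows them to one, and “not Newark” is recorded as an exclusion. A real recognizer would have to build the equivalent of this table on the fly for every entry field, which is the big step.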

So, Voice Control is back, no matter what you call it, Voice Command or Voice Actions. It is back in a reduced form. With no dialogue, just as an input mechanism to replace the keyboard. In terms of features, a very simple next step … in terms of technology, a big next step. Customers will soon have 2 input alternatives on their smartphones … each with its inherent problems, either keyboard typos or speech-recognition errors. And then we will see from there. Automatic typo recovery is improving rapidly. Speech-recognition error recovery will combine speech and keyboard input. We will need to experiment and proceed by trial and error in order to understand which customers prefer to use what, and how.

And what about dialogue handling that oversees a complete website and guides users through it? This is somewhat further off. But before taking that step, we need to ask ourselves an important question. Do we really need dialogues? My answer today is no: it is too complex for users. Will we ever need dialogue? If yes, in what form, and what for? Sure, I have my idea. Let's put it this way: the perceived relationship we have with a computer is still a master-slave relationship. Whatever communication model we build, we cannot forget this simple fact.

If you are asking yourself when and where to communicate personally with your customers, and when and where to automate your customer communication, contact me any time: dugast@tech2biz.eu.

16 May 2011
