
23 May 2011

What is easier: talking to a dumb person or to an agile one?

Subtitle: The impact of cognitive load on the acceptance of new technologies

I love language technologies. Whenever I can, I try to use them, on the phone and on the web. On the phone, for example, when an automatic system (IVR) asks me for my phone number, I will always pronounce it; I will never key it in on the phone’s keypad. In our family, we were among the first to have a very well-featured telephone at home. It keeps most of the phone numbers we dial more than once in a nice directory. And of course, it allows for voice dialing.

However, each time my wife wanted to call her father, she keyed in the old 10-digit family phone number manually. I made her aware of the nice features of our phone, and that her father’s number was in the directory. She did not care; she still went through the ten strokes on the small keypad to call her father. So I asked why: using the stored address book, she would need to press only 3 or 4 keys. Or even better, speaking the name, she would have to press only one button. She said she was aware of all this … but she preferred to type in the whole number anyway.

Another day, she was dialing her father’s number and at the same time explaining to me what she wanted to ask him. She was dialing and talking to me. I know, women are multi-tasking enabled and men aren’t. But that much multi-tasking?

When I call her father … I use voice dialing.
OK, even my wife would not be able to talk to the phone AND to me simultaneously.

A few days later, a friend who works as a physicist explained to me one of the reasons why elderly people recall what happened in their childhood much better than what they experienced a few days ago: the brain energy needed to recall long-term memories is much lower than the energy needed to access short-term memory.

What a relief! I felt much better … at last, my wife is not so much more multi-tasking enabled than I am: her dad’s phone number simply belongs to her long-term memory. She needs nearly no “CPU” to dial it. I tried it with my mother’s phone number: it came straight out of my fingers, I had nothing to do!

When dialing by voice or from the address book, the brain is much more active: the information we get back from the phone, and to which we need to react, is always new, and often it is surprising (“did it really mix up Müller with Miller? I thought I did not have a Miller in my directory”) … we need brain activity, we need much more “CPU” to handle it.

In other words, the cognitive load of voice dialing is higher than the cognitive load of "finger" dialing.

Is this a problem specific to voice dialing?

Let’s take an example from the kitchen. Do you love a really good sauce hollandaise? I do...
For a good sauce hollandaise, you need the correct temperature (between 50°C and 60°C, never, never above 60°C!) and the right speed when pouring the butter into the sauce (very slowly at the beginning; after 1-2 minutes you can be quick).

It’s not very difficult, but it needs some attention. And it tastes so much better than the ready-made sauce hollandaise you can buy in every supermarket. And what do people typically do? They buy the ready-made sauce. Their kitchen is the most expensive room in their house, they love cooking (see the growing number of cooking shows on TV, already 43 per week in Germany!), they will complain about a bad dish in a restaurant … but they will buy ready-made sauce hollandaise, even if it tastes less creamy and smooth than a handmade one.

This is intrinsic to human beings. Quality sells only if it is a no-brainer. A human being will always prefer a low-cognitive-load solution to a high-cognitive-load one, even if there is a difference in quality. Actually, it is worse: if the user perceives that the new solution requires a higher cognitive load than the one he knows, his preference will go to the old habit. In other words, as soon as some cognitive load is required, customers will seriously consider competing methods with no cognitive load, even if those achieve poorer results.

Apply this to a voice dialing application. Because a speech recognizer does not have a 100% recognition rate, the customer (the speaker) has to be ready to react to any misinterpretation by the recognizer. The customer is anxious about something unpredictable coming soon. He still does not know when or how, but he knows it is coming. What he is quite sure of is that his cognitive load will jump one level higher at some point in the dialogue. If he knows a way to avoid this, he will go for it. He will choose a competing method requiring less cerebral activity. Conversely, a solution demanding a higher cognitive load will be considered only if the competing methods are perceived as more difficult, e.g. dialing while driving a car.

Apply this to any voice application. It is obvious that the cognitive load required of the user grows with the number of voice interactions needed to get through the application. Add to this that, as soon as the customer is surprised by the reaction of the system he is interacting with, his cognitive load jumps one level higher.
In general, the customer will prefer any alternative method with less cognitive load, even if the end experience is less smooth.
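
To picture the argument, here is a toy back-of-the-envelope model, entirely my own illustration with invented weights, not a result from cognitive science: score each input method by its expected load and assume the user simply picks the minimum.

    # Toy model of expected cognitive load; all weights are invented.
    def expected_load(n_turns, load_per_turn, p_surprise, surprise_jump):
        # Load grows with the number of interaction turns; every likely
        # surprise (e.g. a misrecognition) adds a jump on top.
        return n_turns * (load_per_turn + p_surprise * surprise_jump)

    # Finger-dialing a long-known number: one well-practiced, surprise-free turn.
    finger = expected_load(n_turns=1, load_per_turn=0.1, p_surprise=0.0, surprise_jump=2.0)

    # Voice dialing: less motor effort per turn, but a real chance of surprise.
    voice = expected_load(n_turns=1, load_per_turn=0.3, p_surprise=0.2, surprise_jump=2.0)

    # The user goes for the lower expected load, quality differences aside.
    preferred = "finger" if finger <= voice else "voice"  # -> "finger"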

Actually, this is not a surprise. We all know that speech, but also written text, is open to various interpretations, even between human beings.
  • If the person you are talking to is dumb, you have to be very precise and concise in your formulation: the whole cognitive load is on your side.
  • If the person you are talking to is agile, you can expect a reaction from his side as soon as you are imprecise in your formulation: the cognitive load is shared between the two of you.

The higher the cognitive load required to use your voice application, the lower its market reach.


So the big question is how to keep the cognitive load imposed by a voice application as low as possible … This is a combination of:

  • Addressing long-term memory
  • Reducing the number of possible surprises
  • Streamlining
  • Thinking in use cases
  • Mastering processes within and across channels
  • And quite a bit more … in order to make the machine more intelligent, or at least better at understanding, more agile
Talk to us if you feel your customer communication needs to be improved: dugast@tech2biz.eu

16 May 2011

Voice Control is coming back … will it succeed this time? With or without dialogue? With or without an error-recovery mechanism?

The idea of Voice Control (also called Voice Command, or VC) is not new.

I recall 1998 or 1999 … It was the beginning of the automatic speech recognition market, a time when we dreamed of being able to browse the web using speech. I met several times with a German company in Munich. They wanted to automatically speech-enable the whole web, without any human being in between: taking the HTML code of a webpage and automatically generating a voice presentation of that webpage for a person on the phone rather than in front of a PC.

The idea was straightforward: detect entry fields on a web page and speech-enable them by automatically pronouncing (prompting) the name of the field while at the same time loading the speech recognizer with the vocabulary expected by the field. For example, take an entry field asking for the destination airport on the flight-booking webpage of an airline: the name of this field is “Destination Airport” and it accepts all destination airports and cities of that airline. The voice browser would take the name of that field and ask with a synthesized voice, “please enter the destination airport”. Meanwhile, the recognizer would be loaded with the names of all airports and cities covered by the airline. The user pronounces the name of a city, and the voice browser moves on to the next entry field. Simple, isn’t it? The Munich-based company even developed an extension of HTML for this purpose.
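
In rough Python, the per-field idea might have looked like the sketch below. This is my own reconstruction for illustration, not the company’s actual code; the tts and asr objects and their methods are hypothetical stand-ins.

    # Sketch of automatically voice-enabling a web form, field by field.
    # `tts` (text-to-speech) and `asr` (speech recognizer) are hypothetical.
    def voice_enable_field(field, tts, asr):
        # Prompt the caller with the field's label, e.g. "Destination Airport".
        tts.say(f"Please enter the {field.label}")
        # Constrain the recognizer to the values this field accepts,
        # e.g. all airports and cities served by the airline.
        asr.load_vocabulary(field.allowed_values)
        # Take whatever the caller says as the field's value.
        field.value = asr.listen().text

    def voice_browse(form, tts, asr):
        # The voice browser walks through the page's entry fields in order.
        for field in form.fields:
            voice_enable_field(field, tts, asr)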

Now the thing is that on the booking webpage of an airline, you typically have several fields like that to fill in: date, time, departure and arrival cities at least (and by the way, was that time the arrival or the departure time?). This means the voice browser would need to present each of these fields, one after the other. It also means defining the dialogue automatically, on the fly, with error-recovery strategies at each step … a horror scenario for “blind” over-the-phone access to a webpage, even today, 12 years later, with all the experience we have in dialogue handling for automated phone systems.
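
To see why that explodes, extend the sketch with even the most naive per-field error recovery (same hypothetical interfaces; the confidence threshold and retry limit are invented numbers):

    # Naive error recovery turns every single field into a little dialogue.
    def fill_field_with_recovery(field, tts, asr, max_retries=3):
        for _ in range(max_retries):
            tts.say(f"Please enter the {field.label}")
            asr.load_vocabulary(field.allowed_values)
            hypothesis = asr.listen()
            if hypothesis.confidence < 0.5:
                tts.say("Sorry, I did not understand.")
                continue  # re-prompt and try again
            # Explicit confirmation costs yet another recognition turn.
            tts.say(f"I understood {hypothesis.text}. Is that correct?")
            asr.load_vocabulary(["yes", "no"])
            if asr.listen().text == "yes":
                field.value = hypothesis.text
                return True
        return False  # gave up after max_retries ... and then what?

Multiply this by every field on the page, add cross-field logic (arrival versus departure time), and the automatically generated dialogue is already unmanageable.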

Well, you can guess … it did not fly; it could not fly. The company was eventually bought by BurdaDigital … and I lost track of them.

The idea could not fly for the simple reason that even today, we have just enough understanding to build by hand (let’s be positive) a nice human-machine over-the-phone dialogue component. Automatically generating human-machine phone dialogues (as opposed to defining them manually), which was the Munich company’s idea, is still not solved today. And then add the question of presenting the content of a webpage to a caller over the phone: deciding which details of the webpage are interesting for the caller and which are not, and summarizing the interesting part in such a way that the caller stays on the line … A lot of work still has to be done here!

Now, the idea of the Munich-based company could have flown … on the web, not on the phone. On the web, that is, with the webpage in front of you: speech in, graphics out. On the web, that is, replacing the keyboard of your internet-connected device with your voice. In 1998 or 1999, a typical internet connection ran at a two-digit kbit/s rate, far too slow to handle speech. So they could not even try the idea.

12 years later, the internet runs at megabit to gigabit rates, even mobile internet. We have smartphones, and far more smartphones than there were PCs back then. We also have 12 years of experience in speech recognition and dialogue handling. So you may think we should run for it, shouldn’t we? The answer is: well, we are cautious. Very cautious!

The best examples are given by two big names, Apple and Google.

Apple introduced Voice Control in 2009 on iPhones and iPods. You can use Voice Control to place a call or to control your music library (iPod). It comes in 17 languages plus some variants of these languages (e.g. US English as a variant of UK English). No dialogue … no error recovery: in case of an error, you pronounce your command again. The decision not to offer error recovery is important: the user tries his luck, and if it does not work, he does not want to get involved in error correcting. Such is Apple’s design concept.
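
That design decision fits in a few lines; here is a sketch, again with a hypothetical asr interface and an invented confidence threshold:

    # Apple-style single-shot command handling: no dialogue, no recovery.
    # On a miss, the system does nothing and the user simply speaks again.
    def handle_voice_command(asr, commands):
        asr.load_vocabulary(commands.keys())
        hypothesis = asr.listen()
        if hypothesis.confidence >= 0.7 and hypothesis.text in commands:
            commands[hypothesis.text]()  # e.g. "call dad" -> place the call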

Google goes a little further on Android phones. They call it Voice Actions, essentially to differentiate it from Apple. In addition to Apple’s voice commands, Voice Actions lets you dictate a text message (or an e-mail, or a note to yourself), and it integrates with a map and navigation application. Furthermore, it includes Voice Search: searching the web with your voice instead of typing keywords on a keyboard. Google’s Voice Actions understands only US English. No dialogue … error recovery here is simple: either you pronounce your command again, or you select from a list of proposed alternatives, or you correct the entry with the keyboard.
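
Google’s variant essentially exposes the recognizer’s n-best list. A sketch, where the asr and ui calls are hypothetical stand-ins:

    # Google-style recovery: offer ranked alternatives, fall back to typing.
    def handle_with_alternatives(asr, ui):
        hypotheses = asr.listen_nbest(n=5)   # ranked recognition alternatives
        best = hypotheses[0]
        if best.confidence >= 0.9:
            return best.text                 # confident: just take it
        # Unsure: let the user tap the right alternative ...
        choice = ui.pick_one([h.text for h in hypotheses])
        # ... or fix the best guess on the keyboard (or speak again).
        return choice if choice is not None else ui.edit_with_keyboard(best.text)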

Apple is weaker on the number of command actions (2 against 5 for Google), but stronger on language coverage (17 against 1 for Google). Neither Apple nor Google proposes dialogues. We are still far away from the dream of the Munich company at the end of the last century.

But will the dream of dialoguing out of the blue with any website ever come true?
The answer is: yes and no.

Yes! In Apple’s and Google’s sense: it is now possible to automatically generate a speaker-independent, good-quality voice-enabled input field alongside a text-enabled entry field.

No! In the sense of dialoguing with any website. At first, if Apple and Google want to be successful, they will keep it at the command/action level and not introduce dialogues, at least until the general public gets used to a voice-enabled user interface.

What is the next step for both companies? To enlarge the number of commands/actions and the number of languages covered. Will it be 3 or 5 new commands/actions, will it be 10? No. It will be an infinite number. It will be generic and dynamic. Whatever app a user selects, every entry field of the app will be keyboard-enabled AND voice-enabled. It will be possible to type or to speak the content of each and every entry field of an app. This step sounds simple, but it is a big step forward … because the recognizer has to be configured automatically, at run time, in the context of the entry field, with all the extensions human beings expect, accepting, in the case of an airport-destination field, that users say for example “hum yes. To New York … Kennedy Airport, not Newark in New Jersey”.
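
Here is a sketch of what such a run-time configuration could look like; the field object, the FILLERS and ALIASES data and the asr calls are all invented for illustration:

    # Configure the recognizer for one entry field at run time, covering the
    # natural extensions people actually say. All data here is illustrative.
    FILLERS = {"hum", "yes", "to", "please"}          # ignorable lead-ins
    ALIASES = {"kennedy airport": "New York JFK",     # spoken variants
               "new york": "New York JFK"}

    def configure_for_field(asr, field):
        # The vocabulary must cover the field's legal values, their spoken
        # aliases, and tolerate fillers around them -- generated automatically.
        asr.load_vocabulary(set(field.allowed_values) | set(ALIASES),
                            optional_fillers=FILLERS)

    def interpret(utterance, field):
        # Map what was said back onto a legal field value; a real system would
        # also have to handle corrections like "... not Newark in New Jersey".
        said = utterance.lower().strip()
        if said in ALIASES:
            return ALIASES[said]
        return said if said in field.allowed_values else None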

So, Voice Control is back, no matter what you call it, Voice Command or Voice Actions. It is back in a reduced form: with no dialogue, just as an input mechanism to replace the keyboard. In terms of features, a very simple next step … in terms of technology, a big one. Customers will soon have two input alternatives on their smartphones … each with its inherent problems: keyboard typos or speech-recognition errors. And then we will see from there. Automatic typo recovery is improving rapidly. Speech-recognition error recovery will combine speech and keyboard input. We will need to experiment, by trial and error, to understand what customers prefer to use, for what, and how.

And what about dialogue handling that oversees a complete website and guides users through it? That is somewhat further off. But before taking that step, we need to ask ourselves an important question: do we really need dialogues? My answer today is no: they are too complex for users. Will we ever need dialogue? If yes, in what form, and what for? Sure, I have my own idea. Let’s put it this way: the perceived relationship we have with a computer is still a master-slave relationship. Whatever communication model we build, we cannot forget this simple fact.


If you are asking yourself when and where to communicate personally with your customers,
and when and where to automate your customer communication,
contact me any time: dugast@tech2biz.eu.