2 Feb 2011

Speech-IVR is dead - Long live speech command?

Jay Wilpon, Executive Director for Speech Services Research at AT&T, stated at the Mobile Voice Conference in San José (California) last week:
“Until now we forced speech into where we wanted to have it (i.e. IVR); we are now bringing it to where it belongs (i.e. mobile).”

Marcello Typrin, Vice President of Products at Yap Inc., a company that provides voice-to-text SMS services, began his talk with the statement “The Phone Call is Dead”, citing, among others, an article from TechCrunch.

These statements (plus a few others in the same spirit, all made during the Mobile Voice Conference 2011) are based on the fact that the number of voice calls is steadily going down while data traffic on phone networks already exceeds voice traffic and keeps increasing. In 2008, the average number of SMS messages per user matched the average number of voice calls for the first time; by the end of 2009 it was twice as high. And the gap is growing.

Did we err all those years developing voice-enabled IVR applications? Does this mean we have to stop all IVR voice applications? Certainly not, but we do not need to be obsessed with voice IVR. As several speakers mentioned, there are good reasons why the number of calls is going down: speech is linear, formal, emotional, slow, and implicitly puts time pressure on callers. As long as the caller can keep up with the pressure it is fine; as soon as the pressure is reversed (i.e. flows from the IVR to the caller), the caller will prefer written communication (e.g. SMS, email ...). On the other hand, speech-enabled apps or voice search on smart phones are more fault tolerant: multi-modality allows for easy error correction, as the sketch below illustrates. There is certainly more to think and talk about. I will blog more on this during the coming days.
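To make that last point concrete, here is a minimal sketch of multi-modal error correction. Everything in it is hypothetical (the recognizer, the confidence threshold and the n-best list are invented stand-ins, not any vendor's API): when confidence is low, the phone shows the alternatives on screen and the user taps the right one instead of having to repeat the utterance.

```python
# Hypothetical sketch of multi-modal error correction; the recognizer,
# threshold, and n-best values are invented stand-ins, not a real SDK.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # 0.0 .. 1.0

def recognize(audio: bytes) -> list:
    """Stand-in for a recognizer returning an n-best list, best first."""
    return [Hypothesis("call home", 0.55),
            Hypothesis("call Holmes", 0.30),
            Hypothesis("call Rome", 0.15)]

def handle_utterance(audio: bytes) -> str:
    hyps = recognize(audio)
    if hyps[0].confidence >= 0.75:
        return hyps[0].text  # confident enough: act immediately
    # Low confidence: display the alternatives and let the user tap
    # the right one -- correcting by touch instead of re-speaking.
    for i, h in enumerate(hyps, 1):
        print(f"{i}. {h.text}")
    choice = int(input("Pick the correct option: "))
    return hyps[choice - 1].text
```

On an IVR, the only way out of a misrecognition is to speak again under time pressure; on a screen, the correction costs one tap.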

Further, with the Cloud now well established, most mobile applications have their recognition engine in the cloud. This means the results of the recognition engine are directly accessible to speech recognition vendors. The results ... and the spoken data. Vendors can now access the spoken data, look for recognition errors, even for application design errors, and retrain using that real data. Ilya Bukchteyn from Microsoft-Tellme said they can learn from data 350 times a second. Vlad Sejnoha from Nuance Communications reported 1.2 million transactions per day with their Dragon Dictate and Dragon Search apps on the iPhone. This huge amount of data can now be used in near real time to train and adapt technologies, to adapt voice user interfaces, and to see immediately the consequences of corrections made. This is what vendors were always looking for. Vendors will experience exponential growth of their learning curve, which should improve the user experience of mobile applications.
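Sketched below is what such a cloud-side feedback loop might look like. The names and the JSON-lines log are my own invention, not Microsoft's or Nuance's actual pipeline; the point is simply that every served request leaves behind a hypothesis plus any user correction, and a periodic job harvests the corrected ones as fresh training material.

```python
# Hypothetical cloud-side feedback loop (invented names, not a vendor API):
# each recognition request is logged together with any user correction,
# and corrected transactions become (audio id, ground truth) training pairs.
import json
import time

def log_transaction(path, audio_id, hypothesis, correction=None):
    """Called once per recognition request served from the cloud."""
    with open(path, "a") as log:
        log.write(json.dumps({
            "id": audio_id,
            "ts": time.time(),
            "hypothesis": hypothesis,   # what the engine returned
            "correction": correction,   # what the user fixed it to, if anything
        }) + "\n")

def harvest_training_pairs(path):
    """Periodic job: keep only transactions the user corrected --
    these expose genuine recognition (or design) errors."""
    with open(path) as log:
        for line in log:
            rec = json.loads(line)
            if rec["correction"] and rec["correction"] != rec["hypothesis"]:
                yield rec["id"], rec["correction"]

log_transaction("rec.jsonl", "utt-001", "call Holmes", "call home")
for audio_id, truth in harvest_training_pairs("rec.jsonl"):
    print(audio_id, "->", truth)   # feed into the next retraining cycle
```

At volumes like 1.2 million transactions a day, even a small correction rate yields a large, continuously refreshed training set, which is exactly the near real-time adaptation loop described above.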

I note that during the conference not much attention was devoted to embedded speech recognition. With smart phones now able to run a powerful recognizer offline (i.e. a large-vocabulary recognizer can now run on the phone itself), the rules of the game have changed here too: the recognizer can be speaker-dependent in every respect, e.g. acoustically, in vocabulary, in semantics, in application scope ... to name just a few restrictions that can easily be built in and that dramatically improve the usability of spoken commands. I am sure a few things will happen here, for TV and for cars as much as for smart phones.
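As an illustration of how far such restrictions can go, here is a toy on-device command matcher limited to one user's personal phrase list. The matching is deliberately crude (fuzzy string matching over decoder output) and every name is hypothetical; the point is that a vocabulary of a few dozen personal commands is a much easier problem than open dictation.

```python
# Toy sketch of a speaker-dependent embedded command recognizer
# (hypothetical, not a real engine): the vocabulary is restricted to
# phrases this one user has registered, which shrinks the search space.
import difflib

class PersonalCommandRecognizer:
    def __init__(self):
        self.commands = {}  # phrase -> action id

    def learn(self, phrase, action):
        """Grow the per-user vocabulary; nothing outside it is accepted."""
        self.commands[phrase.lower()] = action

    def recognize(self, decoded_text):
        """Fuzzy-match the decoder output against the personal phrases."""
        match = difflib.get_close_matches(
            decoded_text.lower(), list(self.commands), n=1, cutoff=0.7)
        return self.commands[match[0]] if match else None

rec = PersonalCommandRecognizer()
rec.learn("call mum", "dial:mum")
rec.learn("open maps", "launch:maps")
print(rec.recognize("call mom"))    # close enough -> "dial:mum"
print(rec.recognize("buy shares"))  # outside the vocabulary -> None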

To summarize, here are the most interesting statements that came out of the conference:
  • Speech Recognition is moving away from IVR, essentially because the volume of voice calls is going down
  • Speech Recognition goes mobile, essentially with Smart Phones but also in the car (multi-modal application interaction)
  • Speech Recognition is well accepted for web search on Smart Phones
  • Business Models will change, but we do not know how yet
  • Speech Recognition is better accepted now, essentially because it is used in applications where recognition errors are acceptable/correctable
  • In terms of technology needs, understanding the meaning of a spoken sentence has to be improved
  • The Cloud allows for exponential growth of the learning curve (Microsoft reports 350 learning opportunities per second)