Tinhat

On the Digital Age

Speech wreck ignition

How many people do you know who use speech recognition on their computer or phone? Yes, very few. And yet it would be very convenient – if only it worked properly. Unfortunately it doesn't and that's not about to change soon.

Speech recognition is the holy grail of software. But the grail hasn't yet been found and it's proving as tricky as anything buried in a mountain in eastern Turkey (just a guess). Technically there are massive challenges in speech recognition. Here's a simple example: "ate him", "A-team", "ATM". The human mind hears the sequence of sounds (called phonemes) and then works out which expression matches the context. Are we talking about cannibalism, classic TV shows or cash? It's an exercise in intelligence. And for software it's pretty much the same thing. Speech recognition and artificial intelligence go hand in hand.
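To make the "ate him" / "A-team" / "ATM" point concrete, here is a deliberately tiny sketch of picking between phrases that sound alike by looking at the surrounding words. Everything in it is an assumption made up for illustration: the cue-word lists, the `CANDIDATES` table and the `pick_phrase` function are hypothetical, and real systems use far richer language models than a word count.

```python
# Toy illustration only: the cue words below are invented for this sketch.
# Three phrases that share (roughly) the same sequence of sounds.
CANDIDATES = {
    "ate him": {"cannibal", "shark", "lion", "swallowed"},
    "A-team": {"tv", "show", "episode", "van"},
    "ATM": {"cash", "bank", "card", "withdraw"},
}


def pick_phrase(context: str) -> str:
    """Return the candidate whose cue words overlap most with the context."""
    words = set(context.lower().split())
    scores = {phrase: len(words & cues) for phrase, cues in CANDIDATES.items()}
    # Ties fall back to the first candidate in the table.
    return max(scores, key=scores.get)


print(pick_phrase("I need to withdraw some cash from the"))
print(pick_phrase("we watched an old tv show called the"))
```

The sketch only works because somebody typed in the cue words by hand. Knowing which words matter for which meaning, without being told, is the hard part.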

Back in the 1980s and 1990s there were great hopes for speech recognition. Many of the early technical issues were solved: digitising the sound, comparing it to existing patterns, and compensating for slow and fast speech. And then came the realisation that even when you've correctly identified the sounds, you can only get the right words if you understand the context. And that's infinitely more difficult.
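One classic way of comparing a sound to a stored pattern while allowing for slow and fast speech is dynamic time warping. The sketch below is an assumption about the kind of technique involved, not a description of any particular product. The function name `dtw_distance` and the number sequences are made up, and real systems compare acoustic feature vectors rather than single numbers.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Either sequence may be stretched in time, so a slowly spoken word
    can still line up with a stored pattern spoken at normal speed.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a moves on, b repeats
                cost[i][j - 1],      # b moves on, a repeats
                cost[i - 1][j - 1],  # both move on together
            )
    return cost[n][m]


pattern = [1, 3, 4, 3, 1]                # stored pattern (invented numbers)
slow = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]    # same shape at half speed
other = [4, 1, 1, 4, 1]                  # a different shape

print(dtw_distance(pattern, slow))
print(dtw_distance(pattern, other))
```

The slowed-down version matches the pattern exactly despite being twice as long, while the different shape does not. That gets you the right sounds, and no further.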

The situation isn't helped by marketing departments claiming the solution has arrived, only for poor users to discover it's yet more hyperbole and the systems still aren't good enough. Reviewers should take some stick here too. YouTube and blogs are full of "we're finally there" reviews that don't match reality. Indeed, we've heard "wolf" cried so often that when the real wolf arrives it may take a while before we recognise it. Or we might even confuse it with "Wool, ph…"

Progress has been painfully slow. I wrote about speech recognition back in the 1990s and felt sure the solution was just around the corner. Yet all that's happened is that error counts have been chipped away in an unpleasant pattern of diminishing returns. Most of the basic stuff was solved by around 2001, and since then there have been plenty of refinements but still no ultimate solution. And that's because we're still waiting for real artificial intelligence. Only when we have decent AI can we solve the problem of context.

Not surprisingly, the software behind the intelligence element is heavy. The overall processing requirement is way beyond a mobile phone and a struggle for most home computers. So it's better to ship the sounds off to a remote server farm somewhere near the Arctic Circle (for efficient cooling) where multiple massive servers can give it a try. And that leads to privacy problems, as our words are now travelling to distant destinations to be interpreted, and possibly recorded.

It is, though, the way forward. The first real successes in speech recognition are going to be through the cloud and remote computing. Already in 2015 I tried a system (which I won't name) that gave good results as long as the remote servers weren't too busy. But the processing requirements were obviously too great for it to grab enough resources all the time. Unfortunately users are not going to make full use of remote processing because of privacy issues. Its main significance will be as a milepost. When we see excellent remote processing then we'll know a good local version should be along soon.

The other milepost will be excellence in dedicated systems, for example automated call centres that never make a mistake when hearing numbers. That too is pretty close. Perhaps we're already there.

As for timescales, I am not at all optimistic. Barring some paradigm shift in AI technology we're stuck with the pattern of diminishing returns that we've had since 2001. It's taken far too long to crack something as simple as listening to numbers – where we don't even need to think about context. If we keep going at that pace it's going to take decades. And indeed that's my guess. By 2025 still only one or two percent of our total communication with machines will be spoken. By 2035 maybe ten percent. That's pretty dismal. As for full conversations in line with the sci-fi movies, that simply won't happen as a progression of the technology we have now. It needs a breakthrough in artificial intelligence.

On the commercial front, the company Nuance dominates proceedings (was that "Nuance", "knew once" or "new aunts"?). It's responsible for Dragon NaturallySpeaking and is also behind Apple's Siri and Samsung S-Voice. Around half its revenues come from healthcare transcriptions. Google is following an independent path with greater emphasis on the AI element. Wildcard entries could come from Korea or Japan. Both the Japanese and Chinese (less so the Koreans) have a great incentive for cracking speech recognition, because their written character sets are so massive they can't fit on a regular keyboard. The keyboard has to be layered on multiple virtual levels, or characters written partially and then completed as a second stage. Speech recognition would solve a lot of their problems.

It will also drag Westerners away from their screens, or at least their keyboards, which will be a great relief. Indeed it will be revolutionary and start a brand new phase of the Digital Age. Eventually.