In 1999, while traveling through Eastern Europe, I was robbed in a relatively remote area of the Czech Republic.
Someone called the police, but we quickly realized we couldn’t communicate: they didn’t speak English, and I spoke no Czech. Even the local high school English teacher, who offered to help, didn’t speak English well enough for me to communicate effectively with the police.
This was well before smartphones, and I didn’t realize that technologists were already hard at work on innovations that would one day play a vital role in situations like mine.
In 1994, several influential computer scientists at Microsoft, led by Xuedong Huang, began laying the groundwork to tackle our global language barrier through technology. Microsoft was building a new voice recognition team, one of the first in the world.
In its early days, voice recognition was imperfect. We measure its accuracy with something called the Word Error Rate (WER): the percentage of words a system interprets incorrectly. If I say five words and the system understands four of them correctly but misses one, the WER is 20 percent. Back in the 1990s, the WER was nearly 100 percent. Almost every word spoken was incorrectly “heard” by these computer systems.
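For those curious about the arithmetic, here is a minimal sketch of how a WER might be computed. The standard formulation counts substitutions, deletions and insertions against a reference transcript using word-level edit distance; the function name and example phrases below are my own illustration, not any production system’s code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five, as in the example above:
print(word_error_rate("turn on the bedroom lights",
                      "turn on the living lights"))  # 0.2
```

Run on a five-word phrase with one substituted word, it prints 0.2, the 20 percent figure above.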
But Huang and his team kept working, and slowly but surely the technology improved. By 2013, the WER had dropped to roughly 25 percent: an improvement, to be sure, but still not sufficient to be truly helpful.
A WER of 25 percent might not sound bad on paper, but imagine the frustration a user feels in a home automation environment when they say, “turn on the bedroom lights,” and the living room lights go on. Or imagine dictating a document and having to correct a quarter of it after the fact. The long-promised productivity gains simply hadn’t materialized after decades of effort.
And then the magic of innovation and technology began to kick in.
Over the last three years, the WER has dropped from roughly 25 percent to around five percent.
The team at Microsoft recently declared it had achieved “human parity”: the technology is now as good at interpreting human speech as humans are. We have seen more progress in the last 30 months than we saw in the first 30 years.
Many of us have experienced the seeming magic that voice recognition has become. If you’ve used these platforms in recent years, you’ve likely watched transcribed words update and change as additional words are spoken.
Speech recognition is moving beyond recognizing individual words to account for context and grammar as well. Network effects are kicking in, and the application of big data is enabling the technology to advance at a pace unseen in its history.
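To make that concrete, here is a toy sketch of how a language model can use context to choose between acoustically similar transcriptions. Everything here, including the probabilities, is invented for illustration; real systems learn these statistics from enormous corpora and use far richer models.

```python
import math

# Toy bigram log-probabilities. All values are made up for illustration.
BIGRAM_LOGPROB = {
    ("turn", "on"): math.log(0.30),
    ("on", "the"): math.log(0.40),
    ("the", "bedroom"): math.log(0.05),
    ("bedroom", "lights"): math.log(0.20),
    ("the", "bad"): math.log(0.001),
    ("bad", "room"): math.log(0.002),
    ("room", "lights"): math.log(0.05),
}

def sentence_score(words):
    """Sum of bigram log-probabilities; higher means more plausible."""
    return sum(BIGRAM_LOGPROB.get(pair, math.log(1e-6))
               for pair in zip(words, words[1:]))

# Two hypotheses a recognizer might consider for the same audio:
hyp_a = "turn on the bedroom lights".split()
hyp_b = "turn on the bad room lights".split()

best = max([hyp_a, hyp_b], key=sentence_score)
print(" ".join(best))  # -> "turn on the bedroom lights"
```

The grammatical hypothesis scores higher even though the two sound nearly identical, which is the essence of using context to resolve ambiguity.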
Today, we talk to computers on an increasingly regular basis. While packing for a trip to Singapore, I “talked” with Google Home’s voice-activated digital assistant, going back and forth on everything from the weather and history to the religious breakdown of the city-state.
Similarly, Amazon’s Alexa will order you an Uber or a pizza, read off your Fitbit stats or update you on the balance in your bank account. Alexa can help around the house, too, if you ask – dishing you the daily news while you’re in the kitchen or reading you an audiobook before bed. And paired with the right hardware, she’ll lock your front door, turn off your lights, or adjust the temperature in your home.
To be sure, the technology has a long way to go before it is omnipresent. But it is beginning to be deployed in new and interesting ways. At CES 2017, voice recognition was one of the clear winners, permeating every corner of the show floor. From Ford and Volkswagen to Martian Watches and LG refrigerators, voice integration transcended every category. Voice is becoming the common OS, stitching together diverse systems across a myriad of user applications.
Now that we have made these astronomical improvements in voice accuracy, I foresee two important directions voice will take from here.
First, digitization and connectivity will beget personalization. In the future, it won’t be enough that we can talk to the connected objects around us. Each member of a household or office can and will have a unique relationship with voice-enabled objects. Google has started to push Google Home in this direction.
Second, remember that voice is the user interface layer on top of a much richer computing environment. Siri, Cortana, Alexa, Google Home and others are bringing individuals face to face with an AI-infused computing experience. For many daily tasks where we might use voice today, our phones or other devices may still be more efficient because we can see extra information at a glance. But the role of AI in these voice systems will begin to transform the user experience.
Context is the next dimension for voice-optimized platforms. For example, when I can open my refrigerator, read off a series of ingredients I have on hand and get back recipe suggestions, I’ll have accomplished something with my voice that would be cumbersome in other computing environments. Context is king, and voice will make it more apparent and readily accessible than ever.
I sometimes think back to that incident in Eastern Europe when even the local English teacher couldn’t communicate with me. Today, I could speak to her mobile phone and get a relevant reply in return. The technology now available to us would have changed my experience. And likewise, this technology will forever change how we interact with computing – and with each other.