Yesterday, I outlined how LLM AI companies big and small, including OpenAI with GPT-4 ‘Omni’, Google with Astra, Apple with Siri, Meta with Meta AI, and others, are leaning into Voice as a key AI user interface for mainstream audiences in the billions. We’re at the stage in these early days of the AI Tech Wave when there is a need to speedily on-ramp billions of people into the ‘magical’ capabilities of AI-augmented applications. The OpenAI voice interface above may become as iconic as Microsoft Windows is to us today.
And Voice speaks directly to our emotional brains. It likely accelerates one of the biggest perils of AI that I’ve outlined, the propensity for humans to easily “Anthropomorphize the AIs”. And they will. Until they learn better.
But there’s also a lot of promise. Just watch this video with Khan Academy founder Sal Khan using OpenAI’s just-released GPT-4o to tutor his son in math. One’s perception of AI’s promise changes in three minutes.
Or watch this video where LinkedIn founder and Microsoft board member Reid Hoffman interviews a ‘digital twin’ AI of himself, created by Voice AI company ElevenLabs. It even translates his words into Klingon and back to English. It’s a riveting 14 minutes. If you want more, there’s ten more minutes where the digital twin and Reid take viewer questions.
And both these examples are the worst these technologies will ever be. Just like these earlier ‘voice assistants’ were just a few years ago. They’re now all being revamped with LLM/Generative AI.
The Voice AI toothpaste is out of the tube. And uncounted companies are racing to add Voice to their AI applications and services.
For a change of pace, here’s AI Voice unicorn ElevenLabs giving its take on OpenAI’s Voice-augmented GPT-4o, before its official introduction on Monday:
“OpenAI has been expanding its portfolio with new products, and one of the most talked about is their Voice Assistant technology. It's set to revolutionize how we interact with machines using voice, yet much about its broad deployment remains under wraps.”
“Allegedly, OpenAI is developing a technology that integrates audio, text, and image recognition capabilities into a single product.”
“What is OpenAI's Voice Assistant?”
“The rumoured Voice Assistant is designed to naturally interact with users through speech. It leverages advancements in Automatic Speech Recognition (ASR), Large Language Models (LLMs), and Text to Speech (TTS) systems. The integration of these technologies allows the Voice Assistant to understand spoken input, process the information contextually, and respond in a natural, human-like voice.”
They then go on to give a high-level intro to the underlying technical wizardry involved, and how more than just technical wizardry is needed to excel:
“Almost all voice AI systems follow three steps:”
“Speech Recognition ("ASR"): This converts spoken audio to text. An example technology is Whisper.”
“Language Model Processing: Here, a language model determines the appropriate response, transforming the initial text to a response text.”
“Speech Synthesis ("TTS"): This step converts the response text back into spoken audio, with technologies like ElevenLabs or VALL-E as examples.”
“Adhering strictly to these three stages can lead to significant delays. If users have to wait five seconds for each response, the interaction becomes cumbersome and unnatural, diminishing the user experience even if the audio sounds realistic.”
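To make the point about delays concrete, here is a minimal Python sketch of a strictly sequential ASR, language model, and TTS chain. Everything in it is an assumption for illustration: the `fake_asr`, `fake_llm`, and `fake_tts` functions are stand-ins rather than real model calls, and the per-stage latencies are made-up numbers chosen to add up to the five-second wait mentioned in the quote.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]  # stand-in for a real model call
    latency_s: float           # made-up latency, for illustration only


def fake_asr(audio: str) -> str:
    # Hypothetical recognizer: a real system would transcribe audio here.
    return audio.replace("<audio>", "").strip()


def fake_llm(text: str) -> str:
    # Hypothetical language model: turns the transcript into a reply.
    return f"You said: {text}"


def fake_tts(text: str) -> str:
    # Hypothetical synthesizer: wraps the reply as "audio".
    return f"<audio>{text}</audio>"


PIPELINE: List[Stage] = [
    Stage("ASR", fake_asr, 1.5),
    Stage("LLM", fake_llm, 2.5),
    Stage("TTS", fake_tts, 1.0),
]


def run_sequential(audio_in: str, stages: List[Stage]) -> Tuple[str, float]:
    """Run each stage only after the previous one has fully finished."""
    data, waited = audio_in, 0.0
    for stage in stages:
        data = stage.run(data)
        waited += stage.latency_s  # delays add up because nothing overlaps
    return data, waited


if __name__ == "__main__":
    reply, waited = run_sequential("<audio>hello there", PIPELINE)
    print(reply)
    print(f"user waited {waited:.1f}s before hearing anything")
```

The user hears nothing until the last stage is done, so the wait is simply the sum of all three stages.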
“Effective natural dialogue doesn't operate sequentially:”
“We think, listen, and speak simultaneously.
We naturally interject affirmations like "yes" or "hmm."
We anticipate when someone will finish talking and respond immediately.
We can interrupt or talk over someone in a non-offensive way.
We handle interruptions smoothly.
We can engage in conversations involving multiple people effortlessly.”
“Enhancing real-time dialogue isn't just about speeding up each neural network process; it requires a fundamental redesign of the entire system. We need to maximize the overlap of these components and learn to make real-time adjustments effectively.”
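One way to picture "maximizing the overlap" is to let each stage hand off small chunks as soon as they are ready, instead of waiting for the whole utterance. The Python sketch below is a toy timing model, not a description of how OpenAI or ElevenLabs actually build their systems. It assumes each stage can work chunk by chunk, and the per-chunk latencies and chunk count are invented numbers.

```python
from typing import List


def sequential_first_audio(chunk_latencies: List[float], n_chunks: int) -> float:
    """Time until any audio plays when each stage finishes ALL chunks first."""
    return sum(lat * n_chunks for lat in chunk_latencies)


def pipelined_finish_times(chunk_latencies: List[float], n_chunks: int) -> List[float]:
    """Finish time of each chunk at the last stage when stages overlap.

    A stage starts a chunk as soon as (a) the previous stage has produced it
    and (b) the stage itself is done with the chunk before it.
    """
    ready = [0.0] * n_chunks  # when each chunk becomes available to a stage
    for lat in chunk_latencies:
        free_at = 0.0  # when this stage is next free
        for i in range(n_chunks):
            start = max(ready[i], free_at)
            free_at = start + lat
            ready[i] = free_at
    return ready


if __name__ == "__main__":
    # Hypothetical per-chunk latencies for ASR, LLM, and TTS, in seconds.
    latencies = [0.3, 0.5, 0.2]
    chunks = 5

    finish = pipelined_finish_times(latencies, chunks)
    print(f"sequential: first audio after {sequential_first_audio(latencies, chunks):.1f}s")
    print(f"overlapped: first audio after {finish[0]:.1f}s, all audio by {finish[-1]:.1f}s")
```

With these invented numbers, the first audio arrives after about one second instead of five, even though no individual stage got faster. That is the sense in which redesigning the system matters more than speeding up each neural network on its own.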
And then, of course, they explain how ElevenLabs does it differently:
“ElevenLabs Voice AI”
“One thing that is certain to feature in any advanced voice assistant is cutting-edge voice AI. ElevenLabs models combine proprietary methods for context awareness and high compression to deliver ultra-realistic, lifelike speech across a range of emotions and languages. Our contextual text to speech model is built to understand word relationships and adjusts delivery based on context. It also has no hardcoded features, meaning it can dynamically predict thousands of voice characteristics while generating speech. Our models are optimised for particular applications, such as long-form and multilingual speech generation or latency-sensitive tasks.”
As I mentioned a few days ago, Apple is rumored to have completed a deal to incorporate OpenAI’s GPT technologies into Siri, its pioneering ‘voice assistant’. That would bring OpenAI technologies to over two billion Apple devices. That, combined with the existing $20 billion a year that Google pays Apple to make Google Search the default on Apple devices, and Google announcing Gemini AI integration into Google Search soon, means Apple will have both types of AI services for its users shortly.
As I said a few months ago, Voice AIs have their pros and cons. And as I outlined yesterday, they’re likely the next big way we get to try AI in a mainstream way.
The ultimate potential here goes far beyond asking questions and hearing answers. This ‘multimodal’ ‘Smart Agent and Agentic’ Voice AI transition is likely to change us and our societies, mostly for the better. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)
Voice AI may help people learn a language. How about a Japanese speaker learning Hindi, an Italian speaker learning Standard Arabic, an English speaker learning Urdu, a Russian speaker learning Portuguese, etc.?
Human student learns from AI bot. The combos (student/teacher) could be extensive. Learn at your own pace. Stop and start when time allows. Pick up where you left off.