
Watch Out, Siri and Alexa: Voice Is the Latest AI Battleground

The artificial intelligence arms race is transforming industries and fundamentally changing how we interact with technology.

The days of tapping out prompts into a text box and waiting for an AI’s reply are rapidly fading. In their place, real-time voice and video interactions are emerging as the new frontier for AI-powered applications, offering fluid, human-like exchanges far beyond the limitations of static text. These new communication modes are poised to unseat the long-dominant players in voice-first AI like Apple, Google, and Amazon.

While chatbots and text-based AI interfaces have revolutionized how we interact with machines, they still create an artificial barrier between human and computer. The need to type out our thoughts forces us to translate our natural communication instincts into a more structured, less spontaneous format. To make matters worse, if you want an AI to reason about something visual, text-based systems require you to describe it in words. This translation process, however slight, disrupts the flow of human-computer interaction and limits the potential for truly natural exchanges.

Text-based communication also creates barriers that limit access for people who are not literate or who have conditions that restrict their ability to express themselves in writing. I know what some of you are thinking: on-device speech-to-text solves this. But given how limited many on-device models are, transcription is often unforgiving. The internet is littered with memes showcasing these misunderstandings.

Text Is Giving Way to Voice

Just as the AI industry evolves at lightning speed, so do the ways users interact with these new AI models and agents. It wasn’t that long ago that interacting with AI meant typing prompts and waiting for a written response. That static nature made sense in the early days, when language models were just learning how to handle context and semantics. But as the industry looks toward mass adoption, it faces high expectations from an audience accustomed to seamless, immediate communication.

What started with Siri, Alexa, and Google Assistant has given way to a generational shift in how people expect to interact with AI. Real-time voice interactions are overtaking a landscape once dominated by text-based prompts.

At the same time, mobile device manufacturers have stepped up with improved microphones, noise-canceling hardware, and dedicated AI processors, turning real-time responsiveness from a nice-to-have into a baseline requirement. Audio-first interactions offer an immediacy that text can’t match. Rather than searching for words to frame a request, you can simply say what’s on your mind.

These changes are setting the stage for a new era in human-computer interaction, one where advanced speech recognition, nuanced language understanding, and the ability to interpret a speaker’s intent merge into fluid, on-demand dialogue. This shift toward voice interfaces brings AI closer to our natural cognitive processes. Humans typically speak at around 150 words per minute but type at only about 40, so a 100-word request takes roughly 40 seconds to say versus two and a half minutes to type. This speed differential becomes crucial in settings where quick decision-making and rapid information exchange are essential.

The Evolving Landscape

The voice assistants that ushered in the previous era — systems like Siri, Alexa, and Google Assistant — paved the way by giving people a taste of hands-free convenience. But these early assistants are limited by their own scripted frameworks. Ask a question slightly off-script, and they either stumble or return information that may or may not be relevant. They have trouble picking up nuance, context, or subtle cues in your tone of voice.

Since their launch, we’ve only seen incremental improvements in their handling of off-script tasks. But with innovations like the recent release of OpenAI’s Realtime API, these legacy voice assistants are exposed to disruption by agents that leap far beyond their capabilities. Challengers include generative models that can maintain context across multiple conversational turns, making recommendations and performing functions that stand apart from the status quo.

This evolution from text to voice interfaces is built on remarkable advances in several key technologies. Large Language Models have evolved beyond basic text processing to understand and generate natural speech patterns, maintaining context across extended conversations while grasping subtle linguistic nuances. Modern LLMs can now process parallel streams of information — analyzing speech patterns, semantic content, and emotional undertones simultaneously — creating a more comprehensive understanding of user intent.

The advancement in Automatic Speech Recognition (ASR) has been equally crucial. Modern ASR systems can achieve near-human accuracy in ideal conditions and increasingly maintain high performance even in challenging acoustic environments. This robustness comes from deep learning architectures that can filter out background noise and adapt to different accents and speaking styles.
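To make that concrete, here is a minimal transcription sketch using the open-source Whisper model via the openai-whisper Python package. The model size and audio file name are placeholders; larger checkpoints trade speed for accuracy.

```python
# Minimal transcription sketch with the open-source Whisper model
# (pip install openai-whisper). Model size and file path are illustrative.
import whisper

# Load a small multilingual checkpoint; bigger models are slower but more accurate.
model = whisper.load_model("base")

# Transcribe a local recording; Whisper handles resampling internally.
result = model.transcribe("meeting_clip.wav")
print(result["text"])
```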

Perhaps most importantly, neural speech synthesis has progressed from robotic monotone to generating natural, emotive voices that can adapt their tone and emphasis based on context. Just look at the latest releases from the teams at OpenAI, ElevenLabs, and Fish.Audio. These systems can now modulate pitch, pace, and emphasis to convey emotion and intent, and even mimic a given voice, making interactions feel more human and engaging. The ability to maintain consistent voice characteristics across sessions creates a sense of continuity and personality that was previously impossible.
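For a sense of how simple the developer surface has become, here is a rough text-to-speech sketch using the OpenAI Python SDK. The model and voice names are examples only, and the exact streaming helper may differ across SDK versions; check the current documentation before relying on it.

```python
# Rough text-to-speech sketch with the OpenAI Python SDK (pip install openai).
# Model and voice names are examples; streaming helpers vary by SDK version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Real-time voice is changing how people build with AI.",
) as response:
    # Write the synthesized audio to disk; a live app would stream these
    # bytes straight to the playback device instead.
    response.stream_to_file("greeting.mp3")
```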

Adding Depth Through Visual Understanding

The integration of video capabilities takes AI interaction to an entirely new level. Computer vision systems now process real-time video feeds with unprecedented accuracy, recognizing everything from objects to facial expressions and environmental context. Users no longer have to describe visual inputs in text, thanks to on-device models like MediaPipe and server-side models like Segment Anything. AI can now perceive and understand the world through a camera lens.

Image credit: https://ai.meta.com/sam2/

This visual processing capability extends beyond simple object recognition. Modern systems can understand spatial relationships, track movement over time, and interpret complex scenes with multiple objects and actors. For example, in a video call, AI can now recognize when someone is raising their hand to ask a question, detect confusion through facial expressions, or understand gestural commands. This environmental awareness creates opportunities for more contextually appropriate responses and actions.
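As a toy illustration of that raised-hand scenario, the sketch below uses MediaPipe’s hand-landmark solution with OpenCV. The "raised" heuristic, a wrist landmark in the upper half of the frame, is a deliberate simplification for illustration, not a production gesture recognizer.

```python
# Toy sketch: flag a "raised hand" in a webcam feed using MediaPipe Hands
# (pip install mediapipe opencv-python). The heuristic below is simplistic:
# a hand counts as raised if its wrist landmark sits in the top half of the frame.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.6)
capture = cv2.VideoCapture(0)

while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV captures BGR.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            wrist = hand.landmark[mp.solutions.hands.HandLandmark.WRIST]
            if wrist.y < 0.5:  # normalized y runs from 0 (top) to 1 (bottom)
                print("Hand raised: queue this participant's question")
    cv2.imshow("camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

capture.release()
hands.close()
cv2.destroyAllWindows()
```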

The Rise of Multimodal Experiences

The next generation of AI agents is multimodal, powered by conversational LLMs that accept more than just text input. Developers can pass a list of tools or functions the model is allowed to invoke, and the AI can respond with suggestions on which actions to take next based on the given input. This fusion of modalities supports more complex and intelligent interactions. It allows AI to move beyond being a mere assistant and step into roles that require understanding and the ability to perform complex actions.
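Here is a minimal sketch of that tool-calling pattern using the OpenAI Python SDK. The get_calendar_events function is hypothetical, and other model providers expose similar interfaces.

```python
# Minimal tool-calling sketch with the OpenAI Python SDK. The
# get_calendar_events tool is hypothetical; the model decides whether to
# call it based on the user's (voice-transcribed) request.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_calendar_events",
        "description": "List the user's calendar events for a given day",
        "parameters": {
            "type": "object",
            "properties": {"date": {"type": "string", "description": "ISO date"}},
            "required": ["date"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's on my schedule tomorrow?"}],
    tools=tools,
)

# If the model chose to act, it returns structured tool calls rather than prose.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```

The key design point is that the model returns a structured request to run a tool; your application stays in control of whether and how that action is actually executed.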

What’s emerging is more akin to a virtual companion. Instead of feeling like a query-response machine, these agents can recall previous discussions, adjust their style based on user feedback, and go beyond answers to offer insights and take action.

The true power of multimodal AI lies in its ability to combine different types of input seamlessly. For instance, a user might start with a voice command, gesture to indicate a specific area of interest in their environment, and receive both auditory and visual feedback. This kind of natural interaction loop mirrors human-to-human communication, where we unconsciously integrate multiple channels of information to convey meaning and respond to each other.

Developers Must Adapt

Shifts of this magnitude don’t affect only end-users. Developers and businesses need to adapt their strategies, tools, and products. Crafting experiences around voice and video requires a different mindset. Designers must consider pacing, intonation, and visual cues rather than just written prompts.

It’s no longer enough to offer an app or website; users expect AI interactions that feel natural and demonstrate a deeper understanding of their current environment and context.

The learning curve is steep, but so are the rewards. Those who embrace this new landscape stand to differentiate themselves with more engaging customer service channels, more immersive educational tools, and more intuitive productivity suites.

Infrastructure Demands

Building this voice and video-first future requires robust infrastructure. Low-latency networks, edge computing setups, and dedicated neural processing units are essential for delivering smooth, real-time responses. This isn’t just about raw processing power — it’s about creating reliable, stable systems that can handle the complexities of multimodal interaction at scale.

The infrastructure challenge extends to data management and processing architectures. Systems need to handle multiple streams of data — audio, video, and contextual information — while maintaining synchronization and real-time processing capabilities. This requires sophisticated orchestration of cloud and edge resources, along with intelligent data routing and caching strategies.

Another challenge emerges in how current AI services are designed. Many APIs, including OpenAI’s Realtime API, are geared toward server-to-server communication and place the onus on developers to manage delivery from the client app back to their own servers. While this can seem manageable at a small scale, it quickly becomes daunting as user bases grow. Scaling real-time voice and video interactions means tackling network unpredictability, varying device capabilities, and the nuances of the “last mile” in data delivery. Yet users have grown accustomed to solutions that “just work,” regardless of bandwidth or conditions. This puts pressure on infrastructure teams to incorporate intelligent load balancing, adaptive streaming techniques, and even localized processing to ensure that the experience remains seamless and reliable.
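To make that relay pattern concrete, here is a stripped-down sketch in Python using a recent version of the websockets library: client apps stream audio frames to your own server, which forwards them to an upstream realtime speech API and relays responses back. The upstream URL is a placeholder, authentication is omitted, and a production relay would also need backpressure and reconnection handling.

```python
# Stripped-down relay sketch: clients stream audio to this server, which
# forwards it to an upstream realtime speech API and relays responses back.
# The upstream URL is a placeholder; auth, backpressure, and reconnection
# logic are omitted for brevity. (pip install websockets)
import asyncio
import websockets

UPSTREAM_URL = "wss://realtime.example-provider.com/v1/stream"  # placeholder

async def pump(source, sink):
    # Copy messages in one direction until that side closes the connection.
    async for message in source:
        await sink.send(message)

async def handle_client(client_ws):
    # Pass your provider's auth headers here per its documentation.
    async with websockets.connect(UPSTREAM_URL) as upstream:
        await asyncio.gather(
            pump(client_ws, upstream),   # client audio up to the provider
            pump(upstream, client_ws),   # synthesized audio/events back down
        )

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run until the process is stopped

if __name__ == "__main__":
    asyncio.run(main())
```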

Privacy considerations become even more critical when dealing with voice and video data. These interactions are inherently more intimate, capturing sensitive information about speech patterns, facial expressions, and environmental details. The industry must prioritize responsible data handling through advances in federated learning, on-device processing, and enhanced encryption methods. User trust will increasingly depend on transparency about data usage, access controls, and learning mechanisms.

Shaping Tomorrow’s Digital Interactions

The evolution from text to voice and video interfaces marks a fundamental shift in human-computer interaction. As these technologies mature, we’re moving toward a future where AI interactions feel less like using a walkie-talkie connected to a magic eight-ball and more like engaging with a knowledgeable, empathetic companion.

The next generation of digital experiences will be defined by natural, multimodal interactions that seamlessly blend speech, visuals, and contextual understanding. For users, this means more intuitive, engaging, and productive relationships with technology. For developers, it represents an opportunity to pioneer new forms of digital engagement that could reshape entire industries.
