As conversational AI technology continues to advance, major hurdles remain in rolling out real-time voice and video communication with LLMs. In a previous blog we highlighted the importance of overcoming the challenging conditions of the last mile. Latency (delay) is another challenge that must be overcome when enabling speech-driven conversational AI in an application.
In this blog we focus on the impact of latency (delay) in speech-driven conversational AI applications, including both the mouth-to-ear delay and the delay in turn-taking in conversation.
On the OpenAI GPT-4o announcement page, the company highlights that GPT-4o “can respond to audio inputs in as little as 232 milliseconds (ms), with an average of 320 ms, which is similar to human response time in a conversation.” The referenced study is titled “Universals and cultural variation in turn-taking in conversation.” It covers 10 representative languages and found that the mean response offset for turn transitions was about 208 ms. The conversations analyzed were from videotaped interactions of participants in the same location. For in-person conversations, the mouth-to-ear delay (the time between one person speaking and the other person hearing it) is quite low: with speakers about 2 meters apart, it is about 6 ms. See Figure 1 below as an illustrative example.
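As a quick sanity check on that ~6 ms figure, the in-person mouth-to-ear delay is simply the acoustic propagation time over the distance between speakers. A minimal sketch, assuming the speed of sound in air at room temperature:

```python
# Acoustic propagation delay for an in-person conversation.
SPEED_OF_SOUND_M_PER_S = 343.0  # ~343 m/s in air at room temperature

def in_person_mouth_to_ear_ms(distance_m: float) -> float:
    """Mouth-to-ear delay (ms) for two speakers distance_m apart."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

print(f"{in_person_mouth_to_ear_ms(2.0):.1f} ms")  # ~5.8 ms, i.e. about 6 ms
```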
For an application intended to support ‘Conversational AI,’ it is important to emulate natural conversation. Doing so requires accounting for two kinds of latency: the mouth-to-ear delay and the delay in turn-taking in conversation. Further, since conversational AI applications today require interaction between a user’s device and infrastructure in the cloud, every element that contributes to latency must be understood and minimized for the best experience.
Let us look at a case (see Figure 2 below) where two people are in separate locations and are using an application on their mobile devices to communicate with each other on an audio call.
If the user of mobile phone 1 is holding the phone microphone directly to their mouth and the user of mobile phone 2 is holding the phone speaker directly to their ear, the mouth-to-ear delay in this case would be the sum of the individual delays in all the boxes shown above. The delay contribution for both mobile devices is shown in Table 1. For comparison, we show the typical delay as well as the reduced delay which Agora has achieved with device and operating system optimizations.
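To make the ‘sum of the boxes’ concrete, here is a minimal sketch that totals per-stage delay budgets. The stage names and millisecond values below are illustrative assumptions for this sketch, not the measured figures from Table 1:

```python
# Mouth-to-ear delay as the sum of the per-stage delays ("the boxes").
# All values below are illustrative assumptions, NOT the Table 1 figures.
device_stages_ms = {
    "capture (mic + OS audio stack)": 20,
    "encode": 5,
    "sender network stack": 5,
    "receiver network stack": 5,
    "jitter buffer": 40,
    "decode": 5,
    "render (OS audio stack + speaker)": 20,
}

def mouth_to_ear_ms(stages_ms: dict, transit_ms: float) -> float:
    """Total one-way delay: device-side stages plus network transit."""
    return sum(stages_ms.values()) + transit_ms

# Hypothetical 50 ms edge-to-edge transit for illustration.
print(f"{mouth_to_ear_ms(device_stages_ms, transit_ms=50):.0f} ms total")
```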
The Network Stacks and Transit delay, defined as the total time it takes for speech packets to transit the network edge-to-edge, can vary significantly depending on whether the users are in the same city or in different cities, states, or countries. In our testing, we compared the one-way latency over the public internet and over Agora’s proprietary Software Defined Real-Time Network (SD-RTN™). This one-way latency is measured from network edge to network edge, not including the last mile hop on each end. We compared data both within a continent (intra-region) and between continents (inter-region). The results are shown in Figure 3 below.
The key takeaway: for 95% of users, whether within the same region or across geographic regions, SD-RTN™ delivers more than a 50% reduction in latency compared to the public internet.
Let us assume that the two mobile phone users are both located within North America. In this case, 95% of users on the public internet would see no more than ~94 ms of edge-to-edge latency, versus ~33 ms using Agora’s SD-RTN™. The best possible latency for the mobile last mile hop is approximately 10 ms between the mobile device and servers on the public internet, and likewise about 10 ms between the mobile device and Agora’s SD-RTN™ when using Agora’s SDK. This 10 ms figure assumes that the last hop is in the same city as the mobile user and that the last mile connection is excellent. Using these numbers, the total mouth-to-ear delay can be estimated as shown in Table 2.
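As a quick sketch of the network-path portion of that estimate, using the figures just stated (the device contributions from Table 1 would be added on top):

```python
# Network-path delay for two users in North America (figures from the text).
LAST_MILE_MS = 10  # best case per end: edge server in the same city, excellent connection

def network_path_ms(edge_to_edge_ms: float) -> float:
    """Last mile on each end plus edge-to-edge transit."""
    return LAST_MILE_MS + edge_to_edge_ms + LAST_MILE_MS

print(f"Public internet: {network_path_ms(94):.0f} ms")  # ~114 ms
print(f"Agora SD-RTN:    {network_path_ms(33):.0f} ms")  # ~53 ms
```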
The figure below, extracted from the ITU G.114 standard, depicts the telecommunications industry’s findings on the relationship between voice latency and user satisfaction.
Referring to Figure 4, users are satisfied with up to 275 ms of mouth-to-ear delay. Between 275 ms and 385 ms, some users are dissatisfied. Beyond this, the experience is poor.
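These thresholds are easy to encode as a simple lookup. A minimal sketch of the G.114 bands as described above:

```python
def g114_rating(mouth_to_ear_ms: float) -> str:
    """Map one-way mouth-to-ear delay (ms) to the G.114 bands described above."""
    if mouth_to_ear_ms <= 275:
        return "Users satisfied"
    if mouth_to_ear_ms <= 385:
        return "Some users dissatisfied"
    return "Many users dissatisfied"

for delay_ms in (150, 300, 450):
    print(delay_ms, "->", g114_rating(delay_ms))
```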
Referring to Table 3, Agora’s device- and operating-system-level optimizations, combined with its network-level latency optimizations, result in far lower overall latency and higher G.114 user satisfaction ratings.
With this background and context, let us now look at an example of speech-driven conversational AI where the AI agent is at the network edge, as shown in Figure 5. For simplicity, we will assume that the AI workflow and inference take place at the edge of the network. In this example, we assume the LLM supports a direct speech interface (Audio LLM), which means no Speech-to-Text conversion is required. The acronym TTS TTFB refers to the Time-To-First-Byte: the duration from when the LLM issues the request to generate the Text-To-Speech response until the first byte of that response is generated.
Using this example, let us estimate the mouth-to-ear delay from a human using a conversational AI app on their mobile phone to the Audio LLM based AI, the turn-taking delay for the Audio LLM based AI, and the mouth-to-ear delay from the Audio LLM based AI back to the human user. In this example, we will assume the human user is on an Android phone.
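One way to estimate the agent’s turn-taking delay is to budget each stage of the pipeline in Figure 5. A minimal sketch follows; the stage names and values are illustrative assumptions rather than measured results, with the 320 ms figure borrowed from the GPT-4o average quoted earlier:

```python
# Turn-taking delay budget for an Audio LLM agent (illustrative assumptions).
turn_stages_ms = {
    "end-of-speech detection (VAD / endpointing)": 200,  # hypothetical
    "Audio LLM response (time-to-first-token)": 320,     # GPT-4o average quoted earlier
    "TTS TTFB (first byte of synthesized speech)": 100,  # hypothetical
}

turn_taking_ms = sum(turn_stages_ms.values())
print(f"Estimated agent turn-taking delay: {turn_taking_ms} ms")  # 620 ms with these assumptions
```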
In this example, the estimated mouth-to-ear delay from the Audio LLM based AI to the Android phone user is near the ‘Some Users Dissatisfied’ threshold according to ITU G.114. This is a scenario where the network stacks and transit delay are minimal, given that the AI workflow and inference are assumed to be performed at the edge of the network closest to the user. There will be many scenarios where humans interact with other humans and with one or more conversational AI agents over distance. Referring to Figure 3, the latency contribution of the network stacks and transit delay, in conjunction with the mobile device delay contribution, can often push the mouth-to-ear delay beyond the threshold at which users become dissatisfied with their conversational AI experience.
Finally, let us look at the same scenario where the AI agent is located intra-region rather than right at the network edge. This scenario will become more common as conversational AI solutions scale up and people interact with one or more AI agents during a session.
For simplicity, let us assume that the user and the AI agent are both located within North America. In this case, 95% of users on the public internet would see no more than ~94 ms of edge-to-edge latency, versus ~33 ms using Agora’s SD-RTN™.
In this example, the estimated mouth-to-ear delay from the Audio LLM based AI to the Android phone user is well within the ‘Some Users Dissatisfied’ region according to ITU G.114. For an inter-region case, the experience can easily enter the ‘Many Users Dissatisfied’ region.
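To see why moving the agent away from the network edge degrades the experience, here is a sketch combining the pieces above. The ~270 ms edge-case baseline and the 150 ms inter-region transit figure are hypothetical placeholders (the Table 3 totals are not reproduced here); the 33 ms and 94 ms figures are from the intra-region comparison above:

```python
def g114_rating(mouth_to_ear_ms: float) -> str:
    """G.114 bands as described in Figure 4 above."""
    if mouth_to_ear_ms <= 275:
        return "Users satisfied"
    if mouth_to_ear_ms <= 385:
        return "Some users dissatisfied"
    return "Many users dissatisfied"

EDGE_BASELINE_MS = 270  # hypothetical total for the AI-at-the-edge case (near the 275 ms threshold)

scenarios = {
    "AI agent at the network edge": 0,        # no extra transit
    "Intra-region via SD-RTN": 33,            # figure from the text
    "Intra-region via public internet": 94,   # figure from the text
    "Inter-region via public internet": 150,  # illustrative assumption
}

for label, extra_transit_ms in scenarios.items():
    total_ms = EDGE_BASELINE_MS + extra_transit_ms
    print(f"{label}: ~{total_ms} ms -> {g114_rating(total_ms)}")
```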
In conclusion, minimizing latency is essential when implementing speech-driven conversational AI in your application. As discussed in this blog, emulating natural conversation requires attention to both the mouth-to-ear delay and the latency of turn-taking in conversation. To minimize the mouth-to-ear delay, partner with a provider that offers a proven solution optimizing latency at both the device level and the network level, ensuring a satisfying conversational AI experience in your application. To minimize the turn-taking latency, consider an LLM provider and solution provider with demonstrated performance in this area. Learn more about how Agora helps developers build conversational AI.