
The Impact of Latency in Speech-Driven Conversational AI Applications

As conversational AI technology continues to advance, major hurdles remain in rolling out real-time voice and video communication with LLMs. In a previous blog we highlighted the importance of overcoming the challenging conditions of the last mile. Latency (delay) is another challenge that must be overcome when enabling speech-driven conversational AI in an application.

In this blog we focus on the impact of latency (delay) in speech-driven conversational AI applications, including:

  • Research studies and industry standards that characterize the latency that is typical of, and acceptable for, natural and fluent conversation between humans.
  • The components that contribute to higher latency when a human interacts with a Large Language Model (LLM) using speech over the internet.
  • How latency can be minimized to deliver the best possible human-machine conversational experience.

Latency in natural human conversation

On the OpenAI GPT-4o announcement page, the company highlights that GPT-4o “can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.” The referenced study, titled “Universals and cultural variation in turn-taking in conversation,” covers 10 representative languages and found that the mean response offset of turn transitions was about 208 ms. The conversations analyzed were videotaped interactions between participants in the same location. For in-person conversation the mouth-to-ear delay (the time between one person speaking and the other person hearing) is quite low: with speakers about 2 meters apart, it is roughly 6 ms, since sound travels at about 343 m/s. See Figure 1 below as an illustrative example.

Figure 1: Latency with in-person conversation

For an application intended to support conversational AI, it is important to emulate natural conversation. That means accounting for both the mouth-to-ear delay and the latency of turn-taking in conversation. Further, since conversational AI applications today involve interaction between a user's device and infrastructure in the cloud, every element that contributes latency must be understood and minimized for the best experience.

Latency in human conversations via RTC applications

Let us look at a case (see Figure 2 below) where two people are in separate locations and are using an application on their mobile devices to communicate with each other on an audio call.

Figure 2: Mouth-to-Ear Delay in Conversation Between 2 People Using a Mobile Phone Application to Talk

If the user of mobile phone 1 is holding the phone microphone directly to their mouth and the user of mobile phone 2 is holding the phone speaker directly to their ear, the mouth-to-ear delay in this case would be the sum of the individual delays in all the boxes shown above. The delay contribution for both mobile devices is shown in Table 1. For comparison, we show the typical delay as well as the reduced delay which Agora has achieved with device and operating system optimizations.

| Mobile Device Delay Contributor | Typical Delay (ms), iOS | Agora Optimized (ms), iOS | Typical Delay (ms), Android* | Agora Optimized (ms), Android |
| --- | --- | --- | --- | --- |
| Mic input delay | 25 | 15 | 60-80 | 25 |
| Pre-processing delay (HW/SW) | 60 | 10 | 60-100 | 10 |
| Codec encoding delay | 10 | ~0 | 10 | ~0 |
| Packetization delay | ~0-40 | ~0 | ~0-40 | ~0 |
| Jitter buffer delay | >60 | 20 | >60 | 20 |
| Codec decoding delay | 1 | ~0 | 1 | ~0 |
| Speaker output (playout) delay | 25 | 15 | 160-250 | 20 |
| Total for Device | >181 | 60 | >350 | 100 |
Table 1: Sum of all delays contributed by mobile devices 
*By default, the Java audio device module (ADM) is typically used on Android due to its broader compatibility across devices; however, its playout delay is often notably high.
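The device-side totals in Table 1 are simple sums of the per-contributor delays. As an illustrative sketch (values are taken from the iOS columns of the table; ranged or "greater than" entries use their minimum, which is how the table's ">" totals are derived):

```python
# Device-side delay contributors (ms), from Table 1 (iOS columns).
# Ranges and ">" entries are represented by their minimum value.
ios_typical = {
    "mic_input": 25,
    "pre_processing": 60,
    "codec_encoding": 10,
    "packetization": 0,    # ~0-40 ms; minimum used
    "jitter_buffer": 60,   # ">60" ms; minimum used
    "codec_decoding": 1,
    "speaker_output": 25,
}
ios_optimized = {
    "mic_input": 15,
    "pre_processing": 10,
    "codec_encoding": 0,   # ~0
    "packetization": 0,    # ~0
    "jitter_buffer": 20,
    "codec_decoding": 0,   # ~0
    "speaker_output": 15,
}

print(sum(ios_typical.values()))    # 181 -> the table's ">181"
print(sum(ios_optimized.values()))  # 60
```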

The Network Stacks and Transit delay, defined as the total time it takes for speech packets to transit the network edge-to-edge, can vary significantly depending upon whether the users are in the same city or in different cities, states, or countries. In our testing, we compared the one-way latency over the public internet and over Agora’s proprietary Software Defined Real-Time Network (SD-RTN™). This one-way latency is measured from network edge to network edge, not including the last mile hop on each end. We compared data both within a continent (intra-region) and between continents (inter-region). The results are shown in Figure 3 below.

Figure 3: One-way Latency over the public internet vs. Agora SD-RTN™

The key takeaway is that 95% of users, whether within the same region or across geographic regions, see a greater than 50% reduction in latency on SD-RTN™ compared to the public internet.

Let us assume that the two mobile phone users are both located within North America. In this case 95% of users on the public internet would have no more than ~94 ms latency and using Agora’s SD-RTN™ would have ~33 ms latency. The best possible latency for the mobile last mile hop is approximately 10 ms between servers on the public internet and the mobile device and 10 ms between Agora’s SD-RTN™ and the mobile device using Agora’s SDK. This 10 ms number assumes that the last hop is in the same city as the user on the mobile device and that the last mile connection is excellent. Using these numbers the total mouth-to-ear delay can be estimated as shown in Table 2.

| Case | Total for Device (ms) | Network Stacks & Transit Delay (ms) | Total Mouth-to-Ear Delay (ms) |
| --- | --- | --- | --- |
| Two iOS Devices on Public Internet | >181 | 94 + 20 = 114 | >295 |
| Two Agora Optimized iOS Devices on SD-RTN™ | 60 | 33 + 20 = 53 | 113 |
| Two Android Devices on Public Internet | >350 | 94 + 20 = 114 | >464 |
| Two Agora Optimized Android Devices on SD-RTN™ | 100 | 33 + 20 = 53 | 153 |
Table 2: Total mouth-to-ear delay estimation 
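The arithmetic behind these estimates is device delay plus backbone transit plus a ~10 ms last-mile hop on each end. A quick sketch using the optimized iOS figures from the text:

```python
# Mouth-to-ear delay estimate (ms) for two Agora-optimized iOS
# devices on SD-RTN, using the figures from the text.
device_total = 60   # Table 1: total for an Agora-optimized iOS device
backbone = 33       # 95th-percentile intra-region SD-RTN latency
last_mile = 10      # per end, assuming an excellent last-mile connection

network_total = backbone + 2 * last_mile   # 33 + 20 = 53
mouth_to_ear = device_total + network_total
print(mouth_to_ear)                        # 113, matching Table 2
```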
Now that we have these estimations, how do we know whether these mouth-to-ear delay levels are acceptable to users? Fortunately, the International Telecommunication Union publishes Recommendation G.114, which answers this question.

The figure below, extracted from the ITU G.114 standard, depicts the telecommunication industry’s findings on voice latency vs. user satisfaction.

Figure 4: From ITU G.114 standard on mouth-to-ear delay vs. user satisfaction

Referring to Figure 4, with up to 275 ms of mouth-to-ear delay, users are satisfied. Between 275 ms and 385 ms, some users are dissatisfied. Beyond this, the experience is poor.

| Case | Total Mouth-to-Ear Delay (ms) | ITU G.114 User Satisfaction Rating |
| --- | --- | --- |
| Two iOS Devices on Public Internet | >295 | Some Users Dissatisfied |
| Two Agora Optimized iOS Devices on SD-RTN™ | 113 | Users Very Satisfied |
| Two Android Devices on Public Internet | >464 | Many Users Dissatisfied |
| Two Agora Optimized Android Devices on SD-RTN™ | 153 | Users Very Satisfied |
Table 3: ITU G.114 user satisfaction rating for mobile-to-mobile audio call from Table 2. 
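The thresholds from Figure 4 can be expressed as a simple lookup. The cut-off values (275 ms and 385 ms) are taken from the discussion above; the band labels here are a simplified three-way split of the figure's satisfaction curve:

```python
def g114_rating(mouth_to_ear_ms: float) -> str:
    """Map a mouth-to-ear delay to a simplified ITU G.114
    satisfaction band (275 ms / 385 ms cut-offs from the text)."""
    if mouth_to_ear_ms <= 275:
        return "Users Satisfied"
    if mouth_to_ear_ms <= 385:
        return "Some Users Dissatisfied"
    return "Many Users Dissatisfied"

print(g114_rating(113))  # Users Satisfied
print(g114_rating(295))  # Some Users Dissatisfied
print(g114_rating(464))  # Many Users Dissatisfied
```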

Referring to Table 3, Agora's device- and operating-system-level optimizations, combined with its network-level latency optimizations, result in far lower overall latency and higher G.114 user satisfaction ratings.

Latency in human-to-AI conversations

With this background and context, let us now look at an example of speech-driven conversational AI where the AI agent is at the network edge, as shown in Figure 5. For simplicity, we assume the AI workflow and inference take place at the edge of the network. We also assume the LLM supports a direct speech interface (an Audio LLM), which means no Speech-to-Text conversion is required. TTS TTFB refers to Time-To-First-Byte: the duration between the LLM requesting the Text-to-Speech response and the first byte of that response being generated.

Figure 5: Latency in a Conversational AI Example
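In practice, TTS TTFB can be measured by timing from the synthesis request to the arrival of the first audio chunk. A minimal sketch, in which `synthesize_stream` is a hypothetical stand-in for whatever streaming TTS provider API is actually used:

```python
import time

def synthesize_stream(text):
    """Hypothetical streaming TTS call: yields audio chunks as the
    provider generates them. A real implementation would call a TTS
    provider's streaming API here instead of yielding silence."""
    yield b"\x00" * 320  # first audio chunk
    yield b"\x00" * 320  # subsequent chunks

def measure_tts_ttfb(text):
    """Return seconds elapsed from issuing the synthesis request
    until the first byte of audio arrives (the TTFB)."""
    start = time.perf_counter()
    stream = synthesize_stream(text)
    next(stream)  # block until the first chunk is produced
    return time.perf_counter() - start

ttfb = measure_tts_ttfb("Hello, how can I help?")
print(f"TTS TTFB: {ttfb * 1000:.1f} ms")
```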

Using this example, let us estimate three delays: the mouth-to-ear delay from a human using a conversational AI app on their mobile phone to the Audio LLM based AI, the turn-taking delay of the Audio LLM based AI, and the mouth-to-ear delay from the Audio LLM based AI back to the human. In this example, we assume the human user is on an Android phone.

| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
| --- | --- | --- |
| Mic input delay | 60-80 | 25 |
| Pre-processing delay (HW/SW) | 60-100 | 10 |
| Codec encoding delay | 10 | ~0 |
| Packetization delay | ~0-40 | ~0 |
| Network Stacks & Transit Delay | 10 | 10 |
| Audio LLM Jitter buffer delay | 40 | 40 |
| Audio LLM Codec decoding delay | 1 | 1 |
| Total | >181 | 86 |
Table 4: Estimated mouth-to-ear delay from Android phone user to Audio LLM based AI 
| Delay Contributor | Optimized Delay (ms) |
| --- | --- |
| Audio LLM Delay | 100 |
| Sentence Aggregation Delay | 100 |
| TTS TTFB Delay | 80 |
| Total | 280 |
Table 5: Estimated turn-taking delay of the Audio LLM based AI 
| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
| --- | --- | --- |
| Audio LLM Codec encoding delay | 21 | 21 |
| Audio LLM Packetization delay | 2 | 2 |
| Network Stacks & Transit Delay | 10 | 10 |
| Jitter buffer delay | >60 | 20 |
| Codec decoding delay | 1 | ~0 |
| Speaker output (playout) delay | 160-250 | 45 |
| Total | >254 | 98 |
Table 6: Estimated mouth-to-ear delay from Audio LLM based AI to Android Phone User 

In this example, the estimated mouth-to-ear delay from the Audio LLM based AI to the Android phone user is near the ‘Some Users Dissatisfied’ threshold according to ITU G.114. This is a scenario where the network stacks and transit delay are minimal, given that the AI workflow and inference are assumed to be performed at the edge of the network closest to the user. There will be many scenarios where humans interact with other humans and with one or more conversational AI agents over distance. Referring to Figure 3, the latency contribution of the network stacks and transit delay, in conjunction with the mobile device delay, can often push the mouth-to-ear delay past the threshold where users become dissatisfied with their conversational AI experience.
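Taken together, Tables 4 through 6 also imply the response latency the user perceives after finishing a sentence: the uplink mouth-to-ear delay, plus the AI's turn-taking delay, plus the downlink mouth-to-ear delay. A sketch using the optimized figures, under the assumption that these stages are sequential and simply additive:

```python
# Perceived AI response latency (ms), assuming it is the sum of the
# optimized totals from Tables 4, 5, and 6 (sequential stages).
uplink = 86        # Table 4: user -> Audio LLM, Agora optimized
turn_taking = 280  # Table 5: Audio LLM turn-taking delay
downlink = 98      # Table 6: Audio LLM -> user, Agora optimized

response_latency = uplink + turn_taking + downlink
print(response_latency)  # 464 ms
```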

Latency in human-to-AI conversations with AI agent located intra-region

Finally, let us look at the same scenario where the AI agent is located intra-region rather than right at the network edge. This scenario will become more common as conversational AI solutions scale up and people interact with one or more AI agents during a session.

For simplicity, let us assume that the user and the AI agent are both located within North America. In this case 95% of users on the public internet would have no more than ~94 ms latency and using Agora’s SD-RTN™ would have ~33 ms latency.
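The Network Stacks & Transit rows in the tables that follow are derived from these numbers: backbone latency plus the user's ~10 ms last-mile hop (only one last mile is assumed, since the AI agent sits inside the region's infrastructure rather than behind a mobile connection):

```python
# Network Stacks & Transit delay (ms) for the intra-region AI agent
# case: 95th-percentile backbone latency plus the user's last mile.
public_internet = 94 + 10  # public internet backbone + last mile
sd_rtn = 33 + 10           # Agora SD-RTN backbone + last mile
print(public_internet)     # 104
print(sd_rtn)              # 43
```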

| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
| --- | --- | --- |
| Mic input delay | 60-80 | 25 |
| Pre-processing delay (HW/SW) | 60-100 | 10 |
| Codec encoding delay | 10 | ~0 |
| Packetization delay | ~0-40 | ~0 |
| Network Stacks & Transit Delay | 104 | 43 |
| Audio LLM Jitter buffer delay | 40 | 40 |
| Audio LLM Codec decoding delay | 1 | 1 |
| Total | >275 | 119 |
Table 7: Estimated mouth-to-ear delay from Android phone user to Audio LLM based AI 
| Delay Contributor | Optimized Delay (ms) |
| --- | --- |
| Audio LLM Delay | 100 |
| Sentence Aggregation Delay | 100 |
| TTS TTFB Delay | 80 |
| Total | 280 |
Table 8: Estimated turn-taking delay of the Audio LLM based AI 
| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
| --- | --- | --- |
| Audio LLM Codec encoding delay | 21 | 21 |
| Audio LLM Packetization delay | 2 | 2 |
| Network Stacks & Transit Delay | 104 | 43 |
| Jitter buffer delay | >60 | 20 |
| Codec decoding delay | 1 | ~0 |
| Speaker output (playout) delay | 160-250 | 45 |
| Total | >348 | 131 |
Table 9: Estimated mouth-to-ear delay from Audio LLM based AI to Android Phone User 

In this example, the typical estimated mouth-to-ear delay from the Audio LLM based AI to the Android phone user (>348 ms) falls well within the ‘Some Users Dissatisfied’ region according to ITU G.114. For an inter-region case, the experience can easily enter the ‘Many Users Dissatisfied’ region.

In conclusion, it is essential to minimize latency when implementing speech-driven conversational AI in your application. As discussed in this blog, emulating natural conversation requires considering both the mouth-to-ear delay and the latency of turn-taking. To minimize mouth-to-ear delay, partner with a provider that offers a proven solution optimizing latency at both the device level and the network level, ensuring a satisfying conversational AI experience in your application. To minimize turn-taking latency, consider an LLM provider and solution provider who have demonstrated actual performance in this area. Learn more about how Agora helps developers build conversational AI.
