Jim is CEO of Synervoz Communications, a Toronto-based software development company focused on building apps and SDKs to enhance voice and video calls with music, movies, TV, games, live streams, and other interactive components. Synervoz helps developers solve challenging audio problems, including noise, echo, mixing audio streams, cross-platform issues, voice-controlled user interfaces, Bluetooth, and more. Customers include well-established brands like Bose and Unity, as well as startups building the next generation of virtual hangouts. Join Jim on September 1-2 at the RTE2021 Virtual Conference to hear his panel session on Addressing and Solving the Audio Challenges of Remote Lifestyles.
Noise. Is. Aggravating. Especially on voice and video calls. It's an unwelcome guest in most conversations, and until recently there was little we could do about it. Machine learning and artificial intelligence have led to highly effective noise-suppression and cancellation techniques, many of which are now being used at various levels of the technology stack: in hardware, middleware, apps, and, more recently, SDKs that will soon be available to a broad range of developers.
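To make concrete what noise suppression does, here is a minimal sketch of classical spectral gating, the pre-ML baseline that modern learned models improve upon: estimate a per-frequency noise profile from a noise-only clip, then attenuate spectral bins that don't rise well above it. All names, thresholds, and frame sizes below are illustrative choices, not any particular product's implementation.

```python
import numpy as np

def spectral_gate(signal, noise_sample, frame_len=256, hop=128, floor=0.1):
    """Classical spectral gating: attenuate STFT bins near the noise floor.

    signal:       mono audio, values roughly in [-1, 1]
    noise_sample: a clip containing only background noise
    floor:        residual gain applied to gated (noisy) bins
    """
    window = np.hanning(frame_len)

    def stft(x):
        n = 1 + (len(x) - frame_len) // hop
        frames = [x[i * hop:i * hop + frame_len] * window for i in range(n)]
        return np.array([np.fft.rfft(f) for f in frames])

    # Average magnitude per frequency bin over the noise-only clip.
    noise_mag = np.abs(stft(noise_sample)).mean(axis=0)

    spec = stft(signal)
    mag, phase = np.abs(spec), np.angle(spec)
    # Keep bins well above the noise profile; attenuate the rest.
    gain = np.where(mag > 2.0 * noise_mag, 1.0, floor)
    cleaned = gain * mag * np.exp(1j * phase)

    # Overlap-add resynthesis (windowed again; amplitude ripple is
    # acceptable for a sketch).
    out = np.zeros(len(signal))
    for i, frame in enumerate(cleaned):
        s = i * hop
        out[s:s + frame_len] += np.real(np.fft.irfft(frame, frame_len)) * window
    return out
```

ML-based suppressors replace the fixed noise profile and hard gate with a learned, time-varying estimate of which time-frequency bins belong to speech, which is why they handle nonstationary noise (keyboards, dogs, traffic) so much better.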
The industry developing these new techniques and technologies has mostly focused on obvious use cases such as call centers, telephony, and simple voice and video calls that power your daily business meetings, calls with friends and family, and so on. But calls are no longer just calls—increasingly, they are becoming online rooms or meeting spaces in which you can interact in new ways or participate in other activities together. Until recently, voice and video calls only needed to consider a single layer of audio: the Voice over IP layer. But as calls transform into rooms and spaces, opportunities abound for additional audio layers, and so too do the associated challenges.
As someone who has been innovating in audio technology for the last several years, I can attest that noise suppression now works well enough to unlock groundbreaking use cases that were previously infeasible, particularly those that require combining multiple layers of audio simultaneously.
Let’s say you want to add a media player like Spotify or YouTube to a voice or video call. Now imagine that you’re hanging out in this voice or video call while listening to music as a group or watching something together, and one of you speaks. The voice competes with the music or video, each microphone picks up the local playback and sends it back into the call, and it quickly becomes hard to hear anyone clearly.
These situations are likely familiar to people who have tried watching something together over FaceTime or a Zoom call. Even in the many watch party apps that have sprung into existence, text-based chat still dominates. The above problems don't exist when you’re hanging out together in person, because your brain’s capacity to process spatial audio helps to separate audio sources. To some extent, new spatial audio technology will help in the digital realm as well. Nonetheless, the ability to cancel noise and better separate audio sources is key to unlocking more digital hangout use cases.
In the media player + voice call example described above, you probably want a way to reduce the volume of the music or video when someone speaks. Technically, this requires a voice activity detector (VAD). However, if you feed the VAD a noisy signal, you’ll end up ducking (lowering the volume of) the media player in response to noise. Or, if you overcompensate, the VAD fails to detect voice when someone speaks, so the ducker never triggers or, worse, the voice signal is treated as noise and never sent over the wire. An inaccurate VAD degrades the user experience badly enough that it’s still rare to find apps where people leave a microphone open while watching or listening to something together.
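The ducking control loop described above can be sketched in a few lines. This is a toy energy-threshold VAD driving a media gain, purely to show the flow (detect voice, ramp the media down, hold, then release); the class and parameter names are hypothetical, and a production system would swap the energy threshold for an ML-based VAD fed a denoised signal, which is exactly where noise suppression earns its keep:

```python
import numpy as np

def frame_energy_db(frame):
    """RMS energy of one audio frame (samples in [-1, 1]), in dBFS."""
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    return 20.0 * np.log10(rms)

class VadDucker:
    """Toy voice-activity-driven ducker for a media stream."""

    def __init__(self, threshold_db=-40.0, duck_gain=0.2,
                 attack=0.5, release=0.05, hang_frames=10):
        self.threshold_db = threshold_db  # "voice present" energy threshold
        self.duck_gain = duck_gain        # media gain while voice is active
        self.attack = attack              # per-frame ramp speed when ducking
        self.release = release            # per-frame ramp speed when recovering
        self.hang = hang_frames           # frames to stay ducked after voice stops
        self.gain = 1.0
        self.hang_count = 0

    def process(self, mic_frame):
        """Return the media gain to apply for this frame."""
        voiced = frame_energy_db(mic_frame) > self.threshold_db
        if voiced:
            self.hang_count = self.hang
        target = self.duck_gain if self.hang_count > 0 else 1.0
        if self.hang_count > 0:
            self.hang_count -= 1
        step = self.attack if target < self.gain else self.release
        self.gain += (target - self.gain) * step
        return self.gain
```

Note the failure mode the article describes: any background noise louder than `threshold_db` trips `voiced` and ducks the music for no reason, which is why the quality of the signal feeding the VAD matters as much as the VAD itself.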
Gaming is a counterexample where people do leave the microphone open, but the audio challenges in gaming are often easier than those in the use case above. This is because spatial audio is often a feature of the game, and the masking of the Voice over IP channel by in-game sounds may be less frequent than is the case with music or video streams. And yet audio issues are still a frequent complaint among gamers. Discord, one of the industry leaders in this space, has demonstrated the importance of noise reduction via its partnership with Krisp. As Discord is increasingly used for nongaming use cases, noise reduction will be an important enabling technology. Consider a group of cyclists keeping a voice call open during their ride (some cyclists use Discord for this). Wind, traffic, and ambient noise present UX issues, especially in combination with listening to music simultaneously. It’s even harder to solve these issues on motorcycles, with added wind and engine noise.
It’s difficult to find a single application that addresses all aspects of all use cases simultaneously. While Discord may be great for gamers and many other use cases, it’s unlikely to be the optimal solution for cyclists and motorcyclists, whose interfaces will need to be optimized for things like hands-free operation and offline functionality. For similar reasons, there’s little question that many applications will continue to be built for specific use cases. So, what do we think developers will build? How will they get access to this state-of-the-art noise-cancellation technology?
One platform to keep an eye on is Agora, a platform many development teams have turned to when building interactive voice, video, and live-streaming apps. A lot of unique use cases are already being built on top of Agora by developers, and even more will be unlocked with reliable noise suppression.
Noise suppression is used in hardware and software for many reasons, but a primary focus at the moment is improving internet-based voice and video calls. Let’s call these “online use cases.” As meeting online has become a mainstream behavior, there has been a demand for more interactivity on calls. Now that a vastly improved audio experience is possible, you’re likely to see more shared listening, watch parties, games, and other interactive experiences layered on top of everyday calls.
The integration of content and activities into calls is a movement that has been gathering momentum in recent months. Apple recently announced updates to FaceTime and SharePlay at its Worldwide Developers Conference. A raft of announcements from the likes of Spotify, Twitter, Discord, Facebook, Reddit, and a host of others signifies that the shift toward audio-centric online hangouts is already happening. It will all lead to new ways to hang out together online, and much of it will be possible only with modern noise-suppression techniques.
Offline use cases are also being made feasible: that is, use cases where an internet connection is not required, such as those where communication happens directly between devices (peer to peer) or within a single device (e.g., embedded in headphones, to alter the audio in your environment).
The offline world gets even more interesting when you consider layers of audio that could be incorporated from the ambient environment. “Transparency” is a feature you may be familiar with on headphones, allowing you to hear what’s going on around you while listening to your music or podcast. But the technologies that help to separate voice from background noise can also be used to distinguish different sounds in one’s environment (audio source separation). Some sounds can be enhanced while others are suppressed.
Any of the following could be made louder, quieter, or positioned in space, depending on the use case: sirens, birds chirping, people talking nearby, mechanical equipment sounds, various construction sounds, and so on. And the ability to identify specific sounds and isolate them will likely spawn many new utilities (such as apps that listen for mechanical issues or other specific events), entertainment apps (like ways to listen to music together while walking or commuting), and creative works (like the bygone RjDj app in which environmental sounds were incorporated into the music or audio output).
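Once an upstream source-separation model has split the environment into stems, the "make some sounds louder, others quieter, and position them in space" step is a straightforward remix. The sketch below assumes separated stems are already available (the separation model itself is not shown), and all names and the pan law are illustrative:

```python
import numpy as np

def remix(stems, gains, pans):
    """Remix separated audio stems with per-source gain and stereo position.

    stems: dict name -> mono np.ndarray (all the same length), assumed to
           come from an upstream source-separation model (not shown here)
    gains: dict name -> linear gain (0.0 mutes, >1.0 boosts; default 1.0)
    pans:  dict name -> pan in [-1.0 (left), +1.0 (right)]; default center
    Returns a stereo array of shape (n, 2).
    """
    n = len(next(iter(stems.values())))
    out = np.zeros((n, 2))
    for name, x in stems.items():
        g = gains.get(name, 1.0)
        p = pans.get(name, 0.0)
        theta = (p + 1.0) * np.pi / 4.0  # constant-power pan law
        out[:, 0] += g * np.cos(theta) * x
        out[:, 1] += g * np.sin(theta) * x
    return out
```

For example, a safety-minded transparency mode might boost a "siren" stem and pan it toward the side it was detected on, while muting a "construction" stem entirely.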
Another group of use cases combines elements of online and offline, blending local, device-level audio processing with internet-based communication.
Audio is experiencing a renaissance; however, the industry still has lots of room to grow. We anticipate that many new use cases will be built with solutions like Agora combined with innovative audio technology. We have positioned our own company, Synervoz, accordingly: we are a software development team focused on helping other companies build use cases like those discussed above. We have in-house SDKs, expertise, and partnerships with trusted brands like Bose and Agora to help with rapid prototyping and to minimize time to market for production-ready applications.
Ready to integrate audio technology into your application? Let us know your use case or head over to https://www.synervoz.com/ to learn more. To learn more about real-time engagement use cases, join Jim and other thought leaders on September 1-2 at the RTE2021 Virtual Conference.