When you look around at the different video calling platforms, you will see they have a lot of similarities. As people move to working remotely and communicating mostly through video calls, a huge market has opened up for real-time engagement. Not only are video call applications being used more than ever, but video calls are also getting integrated into existing apps. Technologies like Agora are democratizing real-time engagement, making it easy for developers to add live communication and live streaming into their apps even if their main focus is not video or audio calling. Herein lies the developers’ problem: Zoom, Google, and Microsoft have set a high bar with regard to a user’s expectations, so how do you create an experience that can meet those expectations?
In this article, we will discuss five video call characteristics that, if executed well, provide a seamless experience rivaling those offered by these companies.
The first two characteristics are very obvious, but we will take a close look at each of them. Let’s start with the first: user controls for the current user. Every video call needs an option to end the call. But other user controls are also needed: muting and unmuting the microphone, disabling or enabling the camera, and switching from one camera to another.
When deploying to the desktop, you should assume that the user may have more than one camera, so they will need the option to switch between them. Camera switching isn’t relevant only to physical cameras: many users are now also running virtual cameras such as Snap’s Desktop Camera App or OBS’s Virtual Camera feature.
When deploying to mobile environments, whether web or native, users will always need the ability to switch between the front and rear cameras. iOS 13 brought support for multi-camera capture on iPhone and iPad, so it’s also worth considering a button that lets the user activate both cameras at once.
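The controls described above boil down to a small piece of local state. Here is a minimal sketch of that state machine; all class and method names are hypothetical, and in a real app each method would also call into your SDK (for example, enabling/disabling a track or switching the capture device):

```typescript
// Sketch of local user-control state: mute, camera toggle, camera switching.
// Pure logic only; SDK calls would be made inside each method.

interface CameraDevice {
  deviceId: string;
  label: string;
}

class CallControls {
  micMuted = false;
  cameraEnabled = true;
  private cameras: CameraDevice[];
  private activeCamera = 0;

  constructor(cameras: CameraDevice[]) {
    if (cameras.length === 0) throw new Error("at least one camera required");
    this.cameras = cameras;
  }

  toggleMic(): boolean {
    this.micMuted = !this.micMuted;
    return this.micMuted;
  }

  toggleCamera(): boolean {
    this.cameraEnabled = !this.cameraEnabled;
    return this.cameraEnabled;
  }

  // Cycle to the next camera: front/rear on mobile, or any physical or
  // virtual camera on desktop.
  switchCamera(): CameraDevice {
    this.activeCamera = (this.activeCamera + 1) % this.cameras.length;
    return this.cameras[this.activeCamera];
  }

  endCall(): void {
    // In a real app this would leave the channel and release local tracks.
    this.micMuted = false;
    this.cameraEnabled = false;
  }
}
```

Keeping this state in one place makes it easy to drive both the button UI and the indicator shown to remote users from a single source of truth.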
The other obvious characteristic is that we need a way to see all the users in the call at the same time as our own video. But this comes with a lot of its own complexity. There are features that you might take for granted from using other video call solutions. One example is handling a disabled video feed: if a user can disable their video, you need to handle that state and show something other than a black square, such as a placeholder with the user’s name.
Another big feature appears whenever you use a floating view (a layout where one user’s video is larger than the rest): you need a way to update which user occupies the largest view. Common criteria include the active (loudest) speaker, a user manually pinned by the viewer, and whoever is currently sharing their screen. All of this needs to be handled within your app.
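One way to handle the featured-view decision is a small selection function. The sketch below is illustrative, not from any particular SDK: it assumes your SDK periodically reports each user’s speaking volume, lets a manual pin override everything, and holds the current choice when nobody is speaking so the big view doesn’t flicker:

```typescript
// Sketch of choosing which user fills the floating (largest) view.
// A pinned user always wins; otherwise the loudest recent speaker does.

interface VolumeReport {
  uid: string;
  volume: number; // 0-100, as many SDKs report per interval
}

function pickFeaturedUser(
  reports: VolumeReport[],
  pinnedUid: string | null,
  current: string | null,
  threshold = 10 // ignore background noise below this level
): string | null {
  if (pinnedUid !== null) return pinnedUid;
  let loudest: VolumeReport | null = null;
  for (const r of reports) {
    if (r.volume >= threshold && (loudest === null || r.volume > loudest.volume)) {
      loudest = r;
    }
  }
  // Keep the current featured user when nobody is speaking,
  // so the big view doesn't bounce between silent users.
  return loudest ? loudest.uid : current;
}
```

In practice you would also debounce the result over a second or two, since raw volume reports switch speakers faster than viewers can follow.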
Aside from obvious things like user controls and the video view, there are usually a lot of little features you will find, specifically in group video calls. These might seem like extras, but most people don’t realize how useful they can be. For example, how many times have you heard: “I think you are on mute”? It seems to come up all the time, and debugging the problem doesn’t always go smoothly. Thankfully, the biggest video calling platforms show each user’s audio and video state to the other users, so you know before someone even starts talking whether they are muted. This means less awkwardness during the call, and problems get solved more quickly.
Another very useful piece of information is the number of users in the call. Most platforms have a user count somewhere on the screen. This count tells the presenter how many people they are presenting to, and depending on the team size, the presenter can quickly see whether all team members are present and know at a glance when to start the presentation.
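Both ideas above, per-user state indicators and a live user count, can be driven by one small store that reacts to join and mute events. This is a sketch with hypothetical names; real SDKs fire comparable "user joined" and "user muted/unmuted" callbacks that you would wire into it:

```typescript
// Sketch of tracking remote users' audio/video state so the UI can show
// mute icons and a user count. Event names are illustrative.

type MediaKind = "audio" | "video";

interface RemoteState {
  audioMuted: boolean;
  videoMuted: boolean;
}

class RemoteStateStore {
  private states = new Map<string, RemoteState>();

  userJoined(uid: string): void {
    this.states.set(uid, { audioMuted: false, videoMuted: false });
  }

  userLeft(uid: string): void {
    this.states.delete(uid);
  }

  onMuteEvent(uid: string, kind: MediaKind, muted: boolean): void {
    const s = this.states.get(uid);
    if (!s) return;
    if (kind === "audio") s.audioMuted = muted;
    else s.videoMuted = muted;
  }

  // What the UI reads to draw the indicator next to each user's tile.
  indicator(uid: string): RemoteState | undefined {
    return this.states.get(uid);
  }

  userCount(): number {
    return this.states.size;
  }
}
```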
The next characteristic is one you might never have noticed, because the only time you would notice it is when it is not implemented well: video and audio quality. This is probably one of the hardest characteristics to get right. It is so difficult that entire companies are devoted to solving it. Clear audio is the single most important feature that will make or break your video call, and it requires efficient encoding and decoding as well as the ability to fall back from full-duplex audio to mono to maintain a clear and audible stream.
Video quality is of course a very important part of video calls, and it is the most contentious feature among providers on the market. To be honest, at the end of the day, the contention is not so much about the maximum quality a provider can support, but about how well the provider maintains quality under poor network conditions, as well as how the developer implements the video. Consider a multi-user video stream: when a single user is broadcasting, the video is expected to have relatively high resolution. But as more streams join a channel, that high resolution can overload the CPU/GPU for no apparent reason, causing what appears to the end user as lag.
Let’s put some numbers to our example to illustrate what’s happening. With a single user in the channel, as a developer you’d want 720p video because the stream fills the screen. When two users are streaming, that’s 2 × (1280×720) for two streams that no longer “fill” the screen. Once you get to two users, the quality of each stream should drop down (from the sender) to 480p. This cuts the video quality roughly in half, but given that the device is now processing two streams, it helps offset the increased load on the processor. As the channel scales up to four or more users, it’s safe to halve the video quality from each sender again. As the number of hosts increases, depending on how the UI is set up, the streams could drop as low as 120p. As a live stream scales beyond four streams, the interface may also shift so that not all video streams are displayed on the screen at the same time.
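The ladder described above can be written as a single lookup. The exact breakpoints and resolutions here are illustrative (the article's numbers, with common pixel dimensions assumed), and in a real app the chosen profile would be passed to your SDK's encoder configuration:

```typescript
// Sketch of the send-resolution ladder: 720p for a solo host, 480p for
// two or three, roughly halving again as the channel grows, with a 120p
// floor for large grids.

interface VideoProfile {
  width: number;
  height: number;
}

function profileForHostCount(hosts: number): VideoProfile {
  if (hosts <= 1) return { width: 1280, height: 720 }; // solo: fill the screen
  if (hosts <= 3) return { width: 640, height: 480 };  // small group
  if (hosts <= 8) return { width: 320, height: 240 };  // grid of tiles
  return { width: 160, height: 120 };                  // large grid floor
}
```

Calling this whenever a host joins or leaves keeps the per-sender load roughly constant as the grid grows.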
Video quality expectations go beyond adjusting the video quality when users join or leave a stream. Users expect that when they have a poor connection, their app will drop to a lower resolution to maintain fluidity. In traditional HLS streaming, the protocol itself has contingencies built in for adjusting the quality.
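A minimal sketch of that network-driven adjustment, similar in spirit to HLS's adaptive bitrate, steps down a quality ladder under sustained packet loss and steps back up on a clean link. The thresholds here are illustrative assumptions, not values from any specific protocol:

```typescript
// Sketch of adaptive quality: degrade one rung under heavy packet loss,
// recover one rung when the link is clean, hold steady in between.

const LADDER = [720, 480, 240, 120]; // heights, best to worst

function adaptQuality(currentIndex: number, packetLossPct: number): number {
  if (packetLossPct > 10 && currentIndex < LADDER.length - 1) {
    return currentIndex + 1; // degrade one step under heavy loss
  }
  if (packetLossPct < 2 && currentIndex > 0) {
    return currentIndex - 1; // recover one step on a clean link
  }
  return currentIndex; // the gap between thresholds provides hysteresis
}
```

The dead zone between the two thresholds prevents the stream from oscillating between rungs on a marginal connection.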
This last feature is a bit more advanced, but it is a necessity for some use cases: a way for the host to control the other users’ audio and video states. For any implementation where joining users might be random or unknown to the host, it’s important that the host can moderate them. If someone is being disruptive or otherwise inappropriate in the call, without host controls you can’t do anything about it.
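The core of host controls is an authorization check: moderation requests succeed only when they come from the host. The names below are hypothetical; in a real app, a successful request would also send the corresponding SDK command or signaling message to the target user:

```typescript
// Sketch of host-only moderation: only the host may mute another user
// or remove them from the stage.

class ModeratedChannel {
  private muted = new Set<string>();
  private onStage = new Set<string>();

  constructor(private hostUid: string) {}

  join(uid: string): void {
    this.onStage.add(uid);
  }

  muteUser(requesterUid: string, targetUid: string): boolean {
    if (requesterUid !== this.hostUid) return false; // not authorized
    this.muted.add(targetUid);
    return true;
  }

  removeFromStage(requesterUid: string, targetUid: string): boolean {
    if (requesterUid !== this.hostUid) return false; // not authorized
    this.onStage.delete(targetUid);
    this.muted.delete(targetUid);
    return true;
  }

  isMuted(uid: string): boolean {
    return this.muted.has(uid);
  }

  isOnStage(uid: string): boolean {
    return this.onStage.has(uid);
  }
}
```

The same gatekeeping pattern extends naturally to promoting audience members to the stage or force-disabling a camera.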
There are also use cases that don’t involve inappropriate behavior, where you simply want more control over the users. One clear example is an application I built, called streamer (https://www.youtube.com/watch?v=wpa3Ium0yAc). This app live streams the users to both Twitch and YouTube Live. If you have a guest panel, guests will be coming on and off the stage, and you need a way to handle all of that.
Most of these features are expected by users due to the precedent set by companies like Zoom, Google, and Microsoft. Implementing these features on your own can be overwhelming if not impossible, but that is where Agora steps in. All these features are offered by the Agora UI Kits right out of the box or with minimal configuration.
Agora supports most major languages and frameworks.