Why choose WebRTC for audio and video communication?

I often see claims on the Internet like "online education live streaming is built with WebRTC", "audio and video conferencing is built with WebRTC", and "Shengwang, Tencent, Alibaba... all use WebRTC". But have you ever wondered why these first-tier vendors use WebRTC? In other words, what are the benefits of WebRTC?

For veterans who have worked in real-time audio and video communication for a long time, the answer is self-evident; for newcomers, it is a question they are eager to answer yet find hard to. So in this article, I will explain the advantages of WebRTC in detail by way of comparison.

The indicators we will compare include performance, ease of use, maintainability, popularity, coding style, and so on. Making such a comparison, however, is not easy. The first difficulty is that there is currently no open source library on the market that comes close to WebRTC or offers similar functionality, which really puts us in a bind!

Fortunately, this difficulty does not stop us. Since there is no comparable open source library, we will "build" one ourselves: we will sketch out a self-developed system and compare it with WebRTC, then judge which one, the self-developed system or a client built on WebRTC, costs less and delivers better quality. I believe this comparison will help you understand WebRTC better and see just how good it is.

Self-developed live broadcast client architecture

First, let's look at the architecture of a self-developed live broadcast client, as shown in (Figure 1). This is the simplest possible audio and video live broadcast client architecture. From this diagram, you can get a rough idea of which modules a self-developed system must implement.

(Figure 1)

From (Figure 1) you can see that even the simplest live broadcast client must include at least five parts: an audio and video collection module, an audio and video encoding module, a network transmission module, an audio and video decoding module, and an audio and video rendering module (a minimal sketch of how these modules fit together appears after the list).

  • Audio and video collection module: calls the system APIs to read the data captured by the microphone and camera. Audio is collected as PCM data and video as YUV data.
  • Audio and video encoding module: compresses and encodes the raw data (PCM, YUV) collected from the audio and video devices.
  • Network transmission module: packs the encoded data into RTP packets and sends them to the remote end over the network; at the same time, it receives RTP data from the remote end.
  • Audio and video decoding module: decodes the compressed data received by the network module and restores it to raw data (PCM, YUV).
  • Audio and video rendering module: takes the decoded data, outputs the audio to the speaker, and renders the video on the display.
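
To make the division of labor concrete, below is a minimal TypeScript sketch of how these five modules might be wired together in a self-developed client. All of the interface and function names (Capturer, Encoder, Transport, wireSender, and so on) are hypothetical and exist only to illustrate the data flow described above; they do not correspond to any real SDK.

```typescript
// Hypothetical module interfaces for the simplest live broadcast client.
// Names and types are illustrative only; a real client would be far richer.

interface Capturer {                              // audio/video collection module
  onFrame(cb: (raw: Uint8Array) => void): void;   // delivers PCM or YUV frames
}

interface Encoder {                               // audio/video encoding module
  encode(raw: Uint8Array): Uint8Array;            // PCM/YUV -> compressed frame
}

interface Transport {                             // network transmission module
  sendRtp(payload: Uint8Array): void;             // packetize and send as RTP
  onRtp(cb: (payload: Uint8Array) => void): void; // receive RTP from the remote end
}

interface Decoder {                               // audio/video decoding module
  decode(compressed: Uint8Array): Uint8Array;     // compressed frame -> PCM/YUV
}

interface Renderer {                              // audio/video rendering module
  render(raw: Uint8Array): void;                  // play to speaker / draw to display
}

// Sending direction: collect -> encode -> transmit.
function wireSender(cap: Capturer, enc: Encoder, net: Transport): void {
  cap.onFrame((raw) => net.sendRtp(enc.encode(raw)));
}

// Receiving direction: receive -> decode -> render.
function wireReceiver(net: Transport, dec: Decoder, out: Renderer): void {
  net.onRtp((payload) => out.render(dec.decode(payload)));
}
```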

From the introduction so far, you probably feel that developing a live broadcast client yourself is not particularly difficult. In reality, however, the architecture described above is extremely simplified. It can hardly be called a live broadcast client architecture at all; it is only a schematic diagram, because a great deal of refinement is still needed to turn it into a real, commercially viable architecture.

Splitting the audio and video modules

Next, we will refine the live broadcast client architecture step by step. The first step is to split the audio and video modules. In actual development, audio and video processing are completely independent, each with its own processing pipeline. For example, audio has its own capture device (sound card), its own playback device (speaker), and its own system APIs for accessing audio devices, as well as a variety of audio codecs such as Opus, AAC, and iLBC. Similarly, video has its own capture device (camera), its own rendering device (display), and its own variety of video codecs such as H264 and VP8. The refined live broadcast client architecture is shown in (Figure 2).

(Figure 2)

As you can see from (Figure 2), in the refined architecture the audio collection module is separated from the video collection module, and the audio codec module is separated from the video codec module. In other words, audio follows one processing flow and video follows another, and the two do not intersect. In audio and video processing, we generally call each such channel of audio or video a track.
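
In the browser, this track separation is visible directly in the standard WebRTC capture API: a single getUserMedia call yields a stream whose audio and video tracks are handled independently. The snippet below uses only standard browser APIs (navigator.mediaDevices.getUserMedia, getAudioTracks, getVideoTracks); the constraints are deliberately minimal.

```typescript
// Capture from microphone and camera: one stream, two independent tracks.
async function captureTracks(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,   // microphone; raw data is PCM internally
    video: true,   // camera; raw data is YUV internally
  });

  const [audioTrack] = stream.getAudioTracks();   // the audio "track"
  const [videoTrack] = stream.getVideoTracks();   // the video "track"

  // Each track is processed along its own pipeline from here on.
  console.log(audioTrack.kind, audioTrack.label); // "audio", device name
  console.log(videoTrack.kind, videoTrack.label); // "video", device name
}
```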

You can also see that a self-developed audio and video live broadcast client needs far more than five modules. It should include at least: an audio collection module, a video collection module, an audio encoding/decoding module, a video encoding/decoding module, a network transmission module, an audio playback module, and a video rendering module.

Cross-platform

Beyond implementing the seven modules introduced above, an audio and video live broadcast client must also address cross-platform issues. Only when audio and video interoperate across all platforms can it be called a qualified live broadcast client. So it should support at least four platforms: Windows, Mac, Android, and iOS. Of course, it would be even better if it also supported Linux and the browser.

Be aware that achieving real-time audio and video interoperability with the browser without the help of WebRTC is extremely difficult, which is a major weakness of a self-developed system. Apart from interoperating with the browser, getting the other platforms to interoperate with each other is relatively easy.

After adding cross-platform support, the architecture of the audio and video live broadcast client becomes much more complex than before, as shown in (Figure 3). From this diagram you can see that the modules hit first and hardest by cross-platform support are the ones that access hardware devices, such as the audio collection, audio playback, video collection, and video playback modules; they change the most in the architecture.

(Figure 3)

Take audio collection as an example. The APIs used to capture audio data differ completely across platforms. On Windows, the CoreAudio family of APIs is used; coincidentally, the system API for capturing audio on the Mac is also called CoreAudio, though the actual function names are of course different. On Android, audio is captured with AudioRecord; on iOS, with AudioUnit; and on Linux, with PulseAudio.

In short, every platform has its own APIs for capturing audio and video data, and because the API designs differ across systems, the calling conventions and logic differ widely as well. As a result, the workload for developing this part of the client is enormous.
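
A common way to contain this divergence is to hide each platform API behind one capture interface and pick the concrete implementation per platform. The sketch below (written in TypeScript for consistency with the other examples) is purely illustrative: every class and function name is made up, and the real work would live inside the platform-specific implementations that call CoreAudio, AudioRecord, AudioUnit, or PulseAudio.

```typescript
// Hypothetical cross-platform audio capture abstraction (names are made up).
interface AudioCapturer {
  start(onPcm: (pcm: Int16Array) => void): void;
  stop(): void;
}

// Each implementation would wrap the corresponding platform API; the
// bodies here are empty stubs purely to keep the sketch self-contained.
class WindowsCoreAudioCapturer implements AudioCapturer {
  start(_onPcm: (pcm: Int16Array) => void): void { /* call CoreAudio here */ }
  stop(): void {}
}
class AndroidAudioRecordCapturer implements AudioCapturer {
  start(_onPcm: (pcm: Int16Array) => void): void { /* drive AudioRecord here */ }
  stop(): void {}
}
class IosAudioUnitCapturer implements AudioCapturer {
  start(_onPcm: (pcm: Int16Array) => void): void { /* drive AudioUnit here */ }
  stop(): void {}
}

// The rest of the client only ever sees AudioCapturer, so platform
// differences stay confined to this one factory.
function createCapturer(platform: "windows" | "android" | "ios"): AudioCapturer {
  switch (platform) {
    case "windows": return new WindowsCoreAudioCapturer();
    case "android": return new AndroidAudioRecordCapturer();
    case "ios":     return new IosAudioUnitCapturer();
  }
}
```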

Plug-in management

For an audio and video live broadcast client, we not only want it to process audio and video data; we also want it to share the screen, play multimedia files, share a whiteboard, and so on. Even for plain audio and video, we want it to support multiple codec formats: for audio, not just Opus and AAC but also G.711/G.722, iLBC, Speex, and so on; for video, not just H264 but also H265, VP8, VP9, AV1, and so on, so that it can be applied as widely as possible.

In fact, each of these codecs has its own strengths, weaknesses, and scope of application. For example, G.711/G.722 is mainly used in telephone systems, so if the live broadcast client needs to interface with a telephone system, it must support that codec; Opus is mainly used for real-time calls; AAC is mainly used for music applications such as piano teaching. Therefore, we want the live broadcast client to support as many codecs as possible so that it is powerful enough.

How can this be achieved? The best design is plug-in management: when you need to support a certain feature, you write a plug-in and drop it in; when it is no longer needed, you can remove the plug-in at any time. This design is flexible, safe, and reliable.

(Figure 4)

To give the live broadcast client plug-in management, I adjusted the earlier architecture diagram, as shown in (Figure 4). As you can see, to support plug-in management I replaced the audio and video codecs in the original diagram with an audio and video codec plug-in manager, and the various codecs (Opus, AAC, iLBC...) can each be registered as a plug-in. When you want to use a particular encoder, you select it through a parameter, so the data coming from the collection module is sent to the corresponding encoder; when audio and video data arrives in RTP format, the Payload Type in the RTP header tells us which decoder the data should be handed to. With this design, our audio and video live broadcast client is more powerful and can be applied more widely.
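
As a rough sketch of what such a plug-in manager might look like, the hypothetical registry below lets decoders register under an RTP payload type and dispatches incoming payloads by looking that type up. The class and interface names are invented; PCMU's static payload type 0 comes from RFC 3551, while 111 is merely a commonly negotiated dynamic payload type for Opus.

```typescript
// Hypothetical codec plug-in manager: decoders register under an RTP
// payload type and incoming payloads are dispatched by that type.
interface AudioDecoderPlugin {
  name: string;
  decode(payload: Uint8Array): Int16Array;        // compressed frame -> PCM
}

class DecoderRegistry {
  private plugins = new Map<number, AudioDecoderPlugin>();

  register(payloadType: number, plugin: AudioDecoderPlugin): void {
    this.plugins.set(payloadType, plugin);        // "drop the plug-in in"
  }

  unregister(payloadType: number): void {
    this.plugins.delete(payloadType);             // "remove it at any time"
  }

  // Dispatch: the Payload Type from the RTP header selects the decoder.
  decode(payloadType: number, payload: Uint8Array): Int16Array {
    const plugin = this.plugins.get(payloadType);
    if (!plugin) throw new Error(`no decoder registered for payload type ${payloadType}`);
    return plugin.decode(payload);
  }
}

// Usage sketch (the decode bodies are stubs that just return silence).
const registry = new DecoderRegistry();
registry.register(0, { name: "G.711/PCMU", decode: (p) => new Int16Array(p.length) });
registry.register(111, { name: "Opus", decode: () => new Int16Array(960) });
```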

Taking the audio codec as an example, let me briefly describe the difference before and after adding plug-in management. Before plug-in management, the client can only use a single audio codec, say Opus, so in a given live broadcast every participating terminal must use that same codec. That does not sound like a problem, does it? But if we then want to bring a telephone voice call into the broadcast (telephone voice uses G.711/G.722), the client can do nothing about it. With plug-in management the situation is different: each terminal can invoke a different audio decoder according to the type of audio data it receives, allowing different codecs to interoperate within the same live broadcast. That is the benefit plug-in management brings us.

Service quality

Beyond the points introduced above, there is still a great deal of work to do to build a powerful, high-performance, widely applicable audio and video live broadcast client. Quality of service, in particular, is what everyone cares about most: if the live broadcast client cannot deliver good service quality, it loses its commercial value.

What does quality of service mean in real-time communication? It mainly covers three aspects: first, low communication latency; second, clearer and smoother video under the same network conditions; third, lower voice distortion under the same network conditions. How can we ensure low latency, clear video, and undistorted voice?

The key lies in the network. If the live broadcast client can guarantee users a high-quality network path with minimal transmission delay, no packet loss, and no reordering, then the quality of our audio and video service naturally improves.

But as we all know, network problems are the hardest to solve: packet loss, jitter, and reordering are commonplace. Some readers may point out that TCP already solves packet loss and out-of-order delivery. It does, but at the cost of latency. When the network is of good quality, either TCP or UDP can be used for real-time transmission, but in most cases we prefer UDP, because using TCP in a weak network environment introduces huge delays.

To understand why TCP produces huge delays on a weak network, we need to look briefly at how TCP works. To guarantee that no packets are lost or reordered, TCP uses a send, acknowledge, detect-loss, retransmit mechanism. Under normal conditions, transmitting data from one end to the other is not a problem, but when packets are lost, things get much worse, as shown in (Figure 5).

(Figure 5)

The figure shows the delay incurred when several packets in a row are lost. When the client sends a data packet to the server, the server returns an ACK to confirm it; only after receiving the confirmation can the client continue sending subsequent data (the actual behavior with a sliding window is similar). Each time the client sends data it starts a timer, whose minimum timeout is 200ms. If for some reason the client does not receive the ACK within 200ms, it retransmits the previous packet. Because TCP uses a backoff mechanism to avoid retransmitting too frequently, the timeout for the retransmission is extended to 400ms, and if that retransmission is still not acknowledged, the next timeout grows to 800ms. You can see that after several consecutive losses the delay becomes very large, which is the fundamental reason TCP cannot be used in a weak network environment.
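
To put concrete numbers on this, the small sketch below simply adds up the retransmission timeouts described above (a 200ms minimum that doubles after every consecutive loss). The exact timer values vary between TCP implementations, so treat the function as an illustration of the exponential backoff pattern rather than a precise model.

```typescript
// Worst-case extra delay after N consecutive losses of the same packet,
// assuming a 200 ms minimum RTO that doubles on each retransmission.
function tcpBackoffDelayMs(consecutiveLosses: number, minRtoMs = 200): number {
  let total = 0;
  let rto = minRtoMs;
  for (let i = 0; i < consecutiveLosses; i++) {
    total += rto;  // wait out the current timer before retransmitting
    rto *= 2;      // exponential backoff: 200 -> 400 -> 800 -> ...
  }
  return total;
}

console.log(tcpBackoffDelayMs(1)); // 200
console.log(tcpBackoffDelayMs(2)); // 600  (200 + 400)
console.log(tcpBackoffDelayMs(3)); // 1400 (200 + 400 + 800), far beyond 500 ms
```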

By the usual real-time communication yardstick, anything above 500ms can no longer be called real-time communication. Therefore, under weak network conditions the TCP protocol must not be used. The latency targets for real-time communication are shown in (Figure 6).

(Figure 6)

As the table in (Figure 6) shows, an end-to-end delay within 200ms means the call is of high quality, as if everyone were chatting in the same room; within 300ms, most people are satisfied; within 400ms, a few people can feel the delay, but interaction is basically unaffected; above 500ms, the delay noticeably affects interaction and most people are dissatisfied. So the critical threshold is 500ms: only when the delay stays below 500ms can a system be called a qualified real-time interactive system.

From the description above, you can see that achieving good service quality in a self-developed live broadcast client is a very difficult task. And beyond the features already discussed, there are still many more details to handle.

Other issues

Audio and video synchronization. After audio and video data travel over the network, jitter and delay can put them out of sync. So when you implement an audio and video live broadcast client, you must add an audio and video synchronization module to keep them in sync.
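
As a rough illustration, synchronization usually comes down to comparing the capture times of the audio and video currently being played and delaying whichever stream is ahead. The sketch below is hypothetical and greatly simplified; WebRTC's real mechanism works on RTP timestamps mapped to a common reference clock and is far more involved.

```typescript
// Hypothetical lip-sync check: both frames carry a capture time on a
// shared clock; hold back the stream that is running ahead of the other.
interface TimedFrame { captureTimeMs: number; }

function extraVideoDelayMs(audio: TimedFrame, video: TimedFrame): number {
  const drift = video.captureTimeMs - audio.captureTimeMs;
  // Positive drift: video is ahead of audio and must be delayed by that amount.
  return Math.max(0, drift);
}

console.log(extraVideoDelayMs({ captureTimeMs: 1000 }, { captureTimeMs: 1040 })); // 40
```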

Echo. Echo means that during a real-time interaction you can hear your own voice played back at you. Real-time audio and video communication suffers not only from echo but also from noise and low volume; together these are known as the 3A problems, and they are very hard to solve. Among today's open source projects, only WebRTC and Speex provide open source echo cancellation algorithms, and WebRTC's echo cancellation is currently the best in the world.

Real-time performance. Network quality is critical for real-time communication, but it is difficult to guarantee service quality at the physical layer of the network, so it has to be controlled in software. The widely used TCP protocol has a complete set of mechanisms for guaranteeing network quality, but it performs poorly in terms of real-time behavior: TCP guarantees quality at the expense of timeliness, and timeliness is the lifeblood of real-time audio and video communication. That makes TCP a poor choice for real-time audio and video transmission, so under normal circumstances UDP should be preferred for real-time live broadcasting. The trade-off is that we then have to write our own network control algorithms to guarantee network quality.

On top of all this there are network congestion, packet loss, delay, jitter, audio mixing... the list goes on. Implementing a real-time audio and video live broadcast client involves a great many problems, and I will not enumerate them all here. In short, from the description above you can already see how difficult it is to develop an audio and video live broadcast client yourself.

WebRTC client architecture

In fact, WebRTC already implements everything I described in the live broadcast client architecture section. Let's look at the WebRTC architecture diagram, shown in (Figure 7).

(Figure 7)

As the WebRTC architecture diagram shows, it can be roughly divided into four layers: the interface layer, the Session layer, the core engine layer, and the device layer. Below is a brief introduction to the role of each layer.

The interface layer has two parts: the Web interface and the Native interface. In other words, you can develop an audio and video live broadcast client either in the browser or natively (C++, Android, Objective-C, and so on).

The Session layer is mainly responsible for controlling business logic, such as media negotiation and the collection of Candidates; these operations are all handled in the Session layer.
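
In the browser, the Session layer's work surfaces through the standard RTCPeerConnection API: addTrack registers the media to negotiate, createOffer/setLocalDescription produce the SDP, and onicecandidate reports the gathered Candidates. The sketch below uses only standard browser APIs; sendToPeer is a placeholder for whatever signaling channel your application provides, and the STUN server URL is just an example.

```typescript
// Minimal offer-side negotiation sketch using the standard browser API.
// sendToPeer() is a stand-in for the application's own signaling channel.
function sendToPeer(message: object): void {
  console.log("to signaling server:", JSON.stringify(message));
}

async function startCall(stream: MediaStream): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],  // example STUN server
  });

  // Register the media to be negotiated.
  for (const track of stream.getTracks()) {
    pc.addTrack(track, stream);
  }

  // Candidate collection: forward each gathered ICE candidate via signaling.
  pc.onicecandidate = (event) => {
    if (event.candidate) sendToPeer({ candidate: event.candidate.toJSON() });
  };

  // Media negotiation: create the SDP offer and send it to the remote side.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToPeer({ sdp: pc.localDescription });

  return pc;
}
```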

The core engine layer contains the most. Broadly, it includes the audio engine, the video engine, and the network transport layer. The audio engine includes NetEQ, audio codecs (such as Opus and iLBC), 3A processing, and so on. The video engine includes the JitterBuffer, video codecs (VP8/VP9/H264), and so on. The network transport layer includes SRTP, network I/O multiplexing, P2P, and so on. Of all this content, this book focuses on the network-related parts, which are spread across Chapter 3 (The essence of audio and video real-time communication), Chapter 6 (ICE implementation in WebRTC), Chapter 9 (Detailed explanation of the RTP/RTCP protocol), and Chapter 10 (Congestion control in WebRTC). Due to space constraints, the remaining content will be published gradually on my personal website https://avdancedu.com.

The device layer mainly deals with hardware. It covers audio capture and playback on each terminal device, video capture, and the network layer. This part will be introduced in detail in the last chapter of this book, WebRTC source code analysis.

From the description above you can see that, of the four layers in the WebRTC architecture, the most complex and most central is the third one, the engine layer, so let me briefly describe how its parts relate to each other. The engine layer consists of three parts: the audio engine, the video engine, and the network transport. The audio engine and the video engine are relatively independent of each other, but both must interact with the network transport layer: each sends the data it produces through that layer and receives data from remote ends through it. In addition, the audio engine and the video engine are linked to each other by the need for audio and video synchronization.

(Figure 8)

Finally, let's again take audio as an example (as shown in Figure 8) and look at how data flows through WebRTC. When WebRTC acts as the sender, it captures audio data from the audio device, first applies 3A processing, then passes the processed data to the audio encoder; after encoding, the data is sent out by the network transport layer. In the other direction, when the network transport layer receives data it must determine the data type; if it is audio, it hands the data to the audio engine, where it is first placed in the NetEQ module for smoothing and audio concealment, then decoded, and finally the decoded data is played out through the speaker. The video processing flow is similar to the audio flow, so I won't repeat it here.
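
To give a feel for the buffering NetEQ does on the receive side, here is a deliberately tiny jitter-buffer sketch: packets are inserted in whatever order the network delivers them and handed to the decoder in sequence-number order. The real NetEQ additionally performs adaptive delay, time stretching, and packet loss concealment, none of which appears here, and every name in the sketch is hypothetical.

```typescript
// Minimal jitter buffer: reorder incoming audio packets by sequence number
// before handing them to the decoder (a tiny slice of what NetEQ does).
interface RtpPacket { seq: number; payload: Uint8Array; }

class JitterBuffer {
  private packets = new Map<number, RtpPacket>();
  private nextSeq: number | null = null;

  insert(packet: RtpPacket): void {
    this.packets.set(packet.seq, packet);
    // Simplification: assume the first arrival carries the first sequence
    // number; a real buffer cannot rely on that.
    if (this.nextSeq === null) this.nextSeq = packet.seq;
  }

  // Pop the next packet in order, or null if it has not arrived yet
  // (the real NetEQ would conceal the gap instead of just waiting).
  pop(): RtpPacket | null {
    if (this.nextSeq === null) return null;
    const pkt = this.packets.get(this.nextSeq);
    if (!pkt) return null;
    this.packets.delete(this.nextSeq);
    this.nextSeq += 1;
    return pkt;
  }
}
```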

Summary

Having walked through the architecture of a self-developed audio and video client and the architecture of a WebRTC client, I believe the advantages of WebRTC are now very clear in your mind. Below I summarize the two in terms of performance, cross-platform support, audio and video service quality, stability, and so on, as shown in (Figure 9):

(Figure 9)

(Figure 9) shows that the advantages of WebRTC for real-time audio and video broadcasting are self-evident, and this, together with Google's strong backing, is the real reason everyone chooses WebRTC.

Original article: Why choose WebRTC for audio and video communication? (Zhihu)
