Latency in real-time audio and video chat: how low is low enough? Scenarios and optimization

1. Introduction


Real-time audio and video communication is now everywhere: in-game voice chat in battle-royale ("chicken dinner") games, co-hosting in live streams, live quiz shows, teaming up in multiplayer games, and remote video account opening at banks. For developers, beyond quickly implementing audio and video communication for each application scenario, the other thing that deserves close attention is "low latency". But how low does real-time audio and video transmission delay need to be to meet the needs of your application scenario?

 

2. Delay generation and optimization


Before talking about low latency, let's first explain how latency is generated.

Since audio and video travel the same transmission path, a single diagram can illustrate where delay is generated:

In the process of audio and video transmission, delays will occur at different stages. Generally, it can be divided into three categories:

Delay on the device side


Device-side delay can be further subdivided by the stage that produces it. It depends mainly on hardware performance, the codec algorithm used, and the volume of audio and video data, and can reach 30~200ms or even higher. As the table above shows, the pipeline that introduces delay is essentially the same at the capture end and the playback end, but the causes of the delay differ.

Audio delay on device side:
 

  • Audio capture delay: captured audio first passes through the sound card for signal conversion, and the sound card itself introduces delay. For example, an M-Audio sound card adds about 1ms, while an Aiken sound card adds about 37ms;
  • Encoding and decoding delay: the audio then enters pre-processing and encoding. With the Opus codec, the algorithmic delay alone is about 2.5~60ms;
  • Audio playback delay: this part depends on the hardware performance of the playback device;
  • Audio processing delay: pre- and post-processing algorithms such as AEC (echo cancellation), ANS (noise suppression), and AGC (automatic gain control) introduce algorithmic delay determined by the filter order, usually within 10ms;
  • Device-side network delay: this mainly occurs in the jitter buffer before decoding. If retransmission and forward error correction (FEC) are added for packet-loss resistance, the delay here is generally about 20ms~200ms, and the jitter buffer can push it higher.
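To make the list above concrete, here is a small Python sketch that sums best- and worst-case device-side audio delay. The stage names and ranges are taken from the bullets above and are illustrative only, not measurements of any real device:

```python
# Hypothetical device-side audio latency budget. The ranges mirror the
# figures quoted in the text above (sound card, Opus codec, AEC/ANS/AGC
# processing, jitter buffer) and are illustrative, not measured.

AUDIO_STAGES_MS = {
    "sound_card":    (1.0, 37.0),   # M-Audio ~1 ms vs. Aiken ~37 ms
    "opus_codec":    (2.5, 60.0),   # Opus algorithmic delay range
    "processing":    (0.0, 10.0),   # AEC/ANS/AGC filters, usually < 10 ms
    "jitter_buffer": (20.0, 200.0), # retransmission + FEC + buffering
}

def latency_budget(stages):
    """Return (best_case_ms, worst_case_ms) summed across all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

best, worst = latency_budget(AUDIO_STAGES_MS)
print(f"device-side audio delay: {best:.1f} ms (best) to {worst:.1f} ms (worst)")
```

The worst case already exceeds the 30~200ms range quoted earlier, which matches the text's caveat that device-side delay can go "even higher".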


Video delay on the device side:
 

  • Capture delay: imaging itself introduces delay, mainly from the CCD sensor hardware. The best CCDs on the market reach 50 frames per second, giving an imaging delay of about 20ms; a 20~25 fps CCD introduces a delay of 40~50ms;
  • Encoding and decoding delay: take H.264 as an example, which uses I, P, and B frames (analyzed in detail below). In a 30 fps stream without B frames, a captured frame can enter the encoder directly and the encoding delay is negligible; with B frames, the decoder must wait for later reference frames, which adds algorithmic delay;
  • Video rendering delay: rendering delay is normally very small, but it can grow under system load or because of audio-video synchronization;
  • Device-side network delay: like audio, video is also subject to device-side network delay from the jitter buffer and packet-loss handling.
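The capture figures above follow directly from the sensor's frame interval: a sensor cannot deliver a frame faster than one frame period. A quick Python sketch of the arithmetic:

```python
# Imaging delay lower bound: one frame interval at the sensor's frame
# rate. The fps values match the CCD examples in the text above.

def frame_interval_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps

print(frame_interval_ms(50))  # 50 fps sensor -> 20.0 ms per frame
print(frame_interval_ms(25))  # 25 fps sensor -> 40.0 ms per frame
```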


In addition, on the device side, the CPU and cache usually serve requests from multiple applications and external devices at once. If a request from a misbehaving device monopolizes the CPU, processing of audio and video requests is delayed. Taking audio as an example, the CPU may then fail to fill the audio buffer in time and playback stutters. So the overall performance of the device also affects the delay of capture, encoding/decoding, and playback.

Latency between client and server


The delay between the capture end and the server, and between the server and the playback end, is mainly affected by:
 

  • the physical distance between the client and the server;
  • the network operators of the client and the server;
  • the speed of the terminal's network;
  • the terminal network's load and network type, etc.

If the server is deployed in a nearby service region and the server and client use the same network operator, the main factors affecting uplink and downlink network delay are the load and type of the terminal network. Generally speaking, transmission delay fluctuates sharply in wireless networks, usually varying between 10 and 100ms. Over wired broadband, transmission delay within the same city can be as low as a stable 5~10ms. However, China has many small and medium-sized operators, overlapping network environments, and cross-border links, where the delay will be higher.
     

Latency between servers


Here we consider two cases:
 
  • Case 1: both ends connect to the same edge node. This is the optimal path: the data is forwarded directly through that edge node to the playback end;
  • Case 2: the capture end and playback end are not covered by the same edge node. The data then travels through the edge node nearer the capture end onto the backbone network, and from there to the edge node nearer the playback end. Transmission and queuing between servers add further delay.

On the backbone itself, data takes about 30ms from Heilongjiang to Guangzhou, and about 110ms~130ms from Shanghai to Los Angeles.
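A rough lower bound on those backbone figures comes from the speed of light in optical fiber, roughly 200,000 km/s. The distances below are my own rough estimates of the routes, not figures from the article; real routes add switching and queuing delay, which is why measured values run higher than this bound:

```python
# Propagation-delay lower bound in optical fiber (~2/3 the vacuum speed
# of light, i.e. about 200,000 km/s). Distances are rough estimates.

FIBER_KM_PER_MS = 200.0  # ~200,000 km/s expressed as km per millisecond

def fiber_delay_ms(distance_km):
    """One-way propagation delay over `distance_km` of fiber."""
    return distance_km / FIBER_KM_PER_MS

print(fiber_delay_ms(3000))   # ~Heilongjiang -> Guangzhou route: 15 ms
print(fiber_delay_ms(10400))  # ~Shanghai -> Los Angeles: 52 ms one way
```

The gap between these lower bounds and the quoted ~30ms and ~110-130ms figures is the routing, switching, and queuing overhead of real networks.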

In practice, to cope with poor networks and network jitter, buffering strategies are added on the capture side, the server side, and the playback side. Each time a buffering strategy is triggered, delay is introduced, and frequent stalls make the delay accumulate. Solving stalls and accumulated delay requires optimizing the network conditions end to end.
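One common buffering heuristic can be sketched briefly: size the playback-side jitter buffer from the jitter actually observed, holding roughly the mean plus a safety margin of a few standard deviations. This is a generic illustration, not the algorithm of any specific product:

```python
# Minimal adaptive jitter-buffer target: hold enough audio to absorb
# observed network jitter. Heuristic for illustration only.

import statistics

def jitter_buffer_target_ms(arrival_jitter_ms, margin_sigmas=2.0):
    """Target buffer depth from a window of observed jitter samples (ms)."""
    mean = statistics.fmean(arrival_jitter_ms)
    sigma = statistics.pstdev(arrival_jitter_ms)
    return mean + margin_sigmas * sigma

samples = [20, 25, 18, 40, 22, 30, 19, 26]  # ms, illustrative window
print(round(jitter_buffer_target_ms(samples), 1))
```

A larger margin absorbs more jitter but adds exactly the kind of buffering delay this section describes; tuning that trade-off is the heart of jitter-buffer design.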

To sum up: since device-side delay on the capture and playback ends depends on hardware performance and codec optimization, it varies from device to device. The "end-to-end delay" commonly quoted in the market therefore usually refers only to the two network legs above: between the client and the server, and between servers.

3. Low latency ≠ reliable call quality


Whether the scenario is education, social networking, finance, or something else, it is tempting to assume that "low latency" is always the best choice when building a product. But this pursuit of the extreme can itself be a trap: low latency does not necessarily mean reliable communication quality. Because audio and video behave fundamentally differently, we need to look at the relationship between real-time communication quality and delay for each.

Audio quality and latency

Factors that affect real-time audio quality include the sampling rate, the bit rate, and delay. An audio signal is a continuous waveform with time on the horizontal axis (as shown in the figure above):
 

  • Sampling rate: the number of samples per second taken from the continuous signal to form a discrete signal. The higher the sampling rate, the closer the reconstructed audio is to the original sound;
  • Bit rate: the amount of data needed to describe the media per unit of time. The higher the bit rate, the more information each sample carries, the more accurately it is described, and the better the sound quality.


Assuming a stable network, a higher sampling rate and a higher bit rate yield better sound quality, but the larger per-sample data volume may also take longer to transmit.
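The relationship between sampling rate, sample size, and data volume is simple arithmetic. The sketch below computes the raw (uncompressed PCM) bitrate, which is the data volume a codec such as Opus then compresses:

```python
# Raw PCM audio bitrate from sample rate, bit depth, and channel count.
# This is the pre-compression data volume fed into the encoder.

def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed audio bitrate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

print(pcm_bitrate_kbps(48_000, 16, 1))  # 48 kHz mono: 768.0 kbps
print(pcm_bitrate_kbps(44_100, 16, 2))  # CD-quality stereo: 1411.2 kbps
```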

Looking back at how delay is produced, one way to achieve low latency is to improve network transmission efficiency, for example by increasing bandwidth and network speed; this is easy in a laboratory. In the real world, however, uncontrollable problems such as weak networks and small and medium-sized operators inevitably reduce transmission efficiency, and communication quality ends up unguaranteed. The other way is to lower the bit rate, which sacrifices sound quality.


Video Quality and Latency

Factors that affect real-time video quality include: bit rate, frame rate, resolution, and delay. The video bit rate is similar to the audio bit rate, which refers to the number of data bits transmitted per unit time. The larger the bit rate, the richer the picture details and the larger the video file size.

Frames: a video is a sequence of image frames; the figure above shows frames under the H.264 standard, which organizes the picture into GOPs (groups of pictures) composed of I, P, and B frames (as shown in the figure below). An I frame is a key frame carrying the complete image; a P frame is a forward-predicted frame, recording the difference from the previous I or P frame; a B frame is a bidirectionally predicted frame, recording differences against both the preceding and the following frames.
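Because a B frame depends on a later frame, the encoder must reorder: display order I B B P is transmitted as I P B B, and the decoder cannot emit the first B until the P arrives. The reordering delay is therefore roughly the number of consecutive B frames times the frame interval, as this sketch shows:

```python
# Algorithmic delay introduced by consecutive B frames: the decoder
# must wait for the future reference frame before it can decode them.

def b_frame_delay_ms(num_consecutive_b, fps):
    """Extra reordering delay from consecutive B frames, in ms."""
    return num_consecutive_b * 1000.0 / fps

print(b_frame_delay_ms(0, 30))  # no B frames: 0.0 ms extra delay
print(b_frame_delay_ms(2, 30))  # IBBP pattern at 30 fps: ~66.7 ms extra
```

This is why real-time communication profiles typically disable B frames, while video-on-demand keeps them for compression efficiency.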

Frame rate: the number of image frames refreshed per second, which directly affects the smoothness of the video: the higher the frame rate, the smoother the video. Since the human eye and brain process image information very quickly, the picture looks coherent above 24fps, but that is only a starting point. In game scenes, a frame rate below 30fps feels choppy; raising it to 60fps makes interaction feel noticeably more immediate, but above 75fps most people can hardly perceive any difference.

Resolution: the pixel dimensions of the image (for example, 1024 x 768), which directly affect its clarity. Play a 640 x 480 video and a 1024 x 768 video full-screen on the same device and the difference in clarity is obvious.

In the case of a certain resolution, the bit rate is proportional to the definition. The higher the bit rate, the clearer the image; the lower the bit rate, the less clear the image.
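One way to see the bitrate-resolution trade-off is bits per pixel: the same bitrate spread over more pixels leaves less data to describe each pixel. The sketch below is a rule-of-thumb illustration, not a standard metric definition:

```python
# Bits available per pixel per frame: the same stream bitrate spread
# over more pixels (or more frames) leaves less data per pixel.

def bits_per_pixel(bitrate_kbps, width, height, fps):
    """Average encoded bits per pixel for one frame."""
    return bitrate_kbps * 1000.0 / (width * height * fps)

# The same 1 Mbps stream at two resolutions, both at 30 fps:
print(round(bits_per_pixel(1000, 640, 360, 30), 3))   # 360p
print(round(bits_per_pixel(1000, 1280, 720, 30), 3))  # 720p
```

At 720p the per-pixel budget is a quarter of the 360p figure, which is why pushing resolution up without raising bitrate makes the picture look softer, not sharper.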

Real-time video calls suffer a variety of quality problems: blurring, poor clarity, and picture jumps related to codecs, plus delays and freezes caused by network transmission. So solving low latency addresses only a small part of the quality problems in real-time audio and video communication.

To sum up: If you want to obtain lower latency when the network transmission is stable, you need to make trade-offs in terms of fluency, video clarity, and audio quality.

4. Summary of delay problems in different scenarios


We use the table below to see the approximate demand for real-time audio and video features in each industry. However, each industry has different requirements not only for low latency, but also for the balance between latency, sound quality, image quality, and even power consumption. In some industries, low latency doesn't always come first.

Game scenes

In mobile game scenarios, different game types place different demands on real-time audio and video. For example, in a social-deduction game like Werewolf, whether voice communication is smooth has a great impact on the game experience, so the latency requirements are high.
Other types of games are shown in the table below:

But meeting low latency alone does not satisfy mobile game development, which has many pain points of its own, such as power consumption, installation package size, and security. From a technical point of view, when combining real-time audio and video with mobile games, developers care about two classes of issues: performance and experience.

 When combining real-time audio and video with mobile games, in addition to delay, more attention is paid to package size and power consumption. The size of the installation package directly affects whether the user installs it, and the power consumption directly affects the game experience.

Social live broadcast scene

Current social live-streaming products can be classified by feature: audio-only social apps, such as Lizhi FM; and audio-plus-video social apps, such as Momo.

The requirements for real-time audio and video in these two scenarios include:

 Live quiz scene

In the live answering scene, the requirements for real-time audio and video mainly include the following two points:

In the past we often saw the host finish reading a question before the question had even arrived on our phone; by the time it appeared, only 3 seconds remained to answer, and we were eliminated without really seeing it. The pain point of this scenario is not raw low latency but keeping the live audio and video synchronized with the question, so that the game is fair for everyone splitting the prize money.

K song chorus scene 

Karaoke applications such as Tian Tian K Song and Sing Bar offer chorus functions. The mainstream form is that user A uploads a complete recording, and user B sings along with it to create the chorus.
The main requirements for realizing real-time chorus are as follows:

In this scenario, synchronizing the two singers' vocals with the backing music places high demands on low latency. At the same time, sound quality is just as critical: sacrificing it heavily for the sake of delay would defeat the purpose of a karaoke application.

Financial scenes

Insurance underwriting and bank account opening require one-to-one audio and video calls. Given the particular constraints of the financial industry, the needs of such applications for real-time audio and video, sorted by importance, are as follows:

In this scenario, low latency is not critical. What matters is security, the dual-recording (audio and video) function, and compatibility across system platforms.

Online education

Online education falls into two main categories. The first is non-K12 online education, such as teaching technology development; its main requirements for real-time audio and video are:

Much non-K12 teaching takes place as one-way live broadcasts, so the latency requirement is not high.

The other category is K12 online education, such as English lessons with foreign teachers and hobby or interest classes, which usually offer one-to-one or one-to-many teacher-student connections. Its requirements for live scenarios include:

In K12 online education, teachers and students have high requirements for low latency. If cross-border English teaching is involved, or students in remote areas must be reached, overseas node deployment and support for small and medium-sized operators' networks must also be considered.


Online claw machines

Catching claw-machine dolls online is a recently emerging hotspot; it relies on real-time audio and video combined with physical claw machines on site.

Its requirements for real-time audio and video include:

5. Technical bottlenecks and trade-off suggestions


Product development pursues the ultimate, and latency should be kept as low as possible. But the ideal is full while reality is thin: as described above, delay arises across multiple stages of data processing and transmission, so at some point it runs into a hard physical limit.

Let's make a bold assumption: transmit audio and video from Beijing Capital Airport to Shanghai Hongqiao Airport. Ignoring every physical, financial, and human constraint, we lay a perfectly straight fiber between the two and even assume vacuum transmission (physically impossible). The distance is about 1061 km, so propagation alone takes about 3.5ms. Counting the capture, encoding/decoding, and playback buffering on the devices at a generous 30ms, the end-to-end delay is still about 33.5ms. Note that we have also ignored the media data itself, the operating system, signal attenuation, and other factors.
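As a sanity check, recomputing the thought experiment's straight-line propagation delay at the vacuum speed of light shows that over this distance propagation is only a few milliseconds, so the device-side processing dominates:

```python
# Propagation delay of light in vacuum over the Beijing-Shanghai
# straight line used in the thought experiment (~1061 km).

C_KM_PER_S = 299_792.0  # speed of light in vacuum, km/s

def vacuum_delay_ms(distance_km):
    """One-way propagation delay in vacuum, in milliseconds."""
    return distance_km / C_KM_PER_S * 1000.0

print(round(vacuum_delay_ms(1061), 2))  # Beijing -> Shanghai: ~3.54 ms
```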

So even so-called "ultra-low latency" hits bottlenecks: extremely low delay is achievable in an experimental environment, but in production the actual delay grows with edge-node deployment, backbone congestion, weak-network environments, and device and system performance. Under given network conditions, choosing a low-latency solution or making a technology selection for a scenario means weighing delay against stalls, audio quality, and video clarity.


Origin blog.csdn.net/m0_60259116/article/details/123402405