A Plain-Language Guide to WebRTC Audio NetEQ and Optimization Practice

NetEQ is one of the core technologies of WebRTC audio and video, and it has a marked effect on VoIP quality. This article introduces the concepts, background, and framework of audio NetEQ in WebRTC, together with related optimization practice, from a macro perspective and in plain language.

Author | Liang Yi
Reviewer | Tai Yi

Why "Vernacular" NetEQ?

A quick search turns up plenty of articles about WebRTC's audio NetEQ, and several of them are very good learning materials and references. In particular, Wu Jiangrui's 2013 master's thesis at Xidian University, "Research on NetEQ Technology in WebRTC Speech Engine", covers the implementation details of NetEQ very thoroughly and is cited in many other articles.

Most of these articles dissect the details of NetEQ from a rather "academic" or "algorithmic" angle, so here I want to share my personal understanding from a more macro perspective. Plain language is easier for everyone to absorb: the ideas should come across clearly without fighting through a single mathematical formula or line of code. If anything below is inaccurate, please feel free to point it out.

Understanding packet loss, jitter, and optimization

In real-time audio and video communication, especially with mobile work (4G) and, under the epidemic, home office and online classrooms (Wi-Fi), the network environment has become the most critical factor affecting audio and video quality. Against a poor network, even the best audio and video algorithms are a drop in the bucket. Poor network quality mainly shows up as delay, reordering, packet loss, and jitter; whoever handles and balances these problems well delivers the better audio and video experience. The base delay of the network is determined by the choice of link, so it must be addressed at the link-scheduling layer; and reordering is infrequent under most network conditions and usually mild. The discussion below therefore focuses on packet loss and jitter.

Jitter means that data arrives over the network in bursts, now fast, now slow. Packet loss means that packets are dropped in transit for various reasons; a lost packet may still be successfully received as a recovery packet after several retransmissions, but if retransmission also fails, or the recovery packet arrives too late to be useful, it becomes a real loss, and a packet loss concealment (PLC) algorithm must conjure data out of nothing to compensate. Packet loss and jitter are really the same thing viewed on the time axis: what arrives a little late is jitter, what arrives much later is a retransmitted packet, and what never arrives in time is "true packet loss". Our goal is to minimize the probability of a packet becoming a "true packet loss".

Intuitively, optimization means some metric improved from xxx to xxx after a heroic effort. But judging the quality of an optimization cannot stop at that dimension. Optimization is about "knowing yourself and knowing the other side": yourself is your product requirements, the other side is what the existing algorithms can do, and the best optimization is the one that matches the two. It does not matter whether an algorithm is simple or sophisticated; if it fits your product requirements perfectly, it is the best algorithm. As the saying goes, a cat that catches mice is a good cat.

NetEQ and related modules

The provenance of NetEQ

"GIPS NetEQ Original Document" , this is the original NetEQ documentation provided by GIPS Company ( Chinese translation ), which introduces what NetEQ is and a brief description of its performance. NetEQ is essentially an audio JitterBuffer (jitter buffer), the name is very appropriate, Network Equalizer (network equalizer). Everyone knows that Audio Equalizer is an effector used to equalize sound, and NetEQ here is an effector used to equalize network jitter. And GIPS also registered a trademark for this name, so NetEQ (TM) is seen in many places.
The official document carries one very important message: "minimize the impact of delay caused by the jitter buffer". In other words, one of NetEQ's design goals is to pursue extremely low latency. This is critical information and provides an important clue for the optimizations discussed later.
[Figure: excerpts from the original GIPS NetEQ document]

NetEQ's position in the audio and video communication QoS process

For ordinary users, as long as the network is connected, Wi-Fi or 4G, audio and video calling just works: the call connects, you see the person and hear the voice, done. It looks simple, but the underlying implementation is anything but. The WebRTC open-source engine alone has on the order of 200,000 related code files; I do not know whether anyone has counted the lines of code, but it should be in the tens of millions, and who knows how many programmers have lost their hair over it :).

The picture below abstracts and simplifies the actually rather complicated audio and video communication process. On the left is the sending (push) side: capture, encode, packetize, send; in the middle, transmission over the network; on the right is the receiving (pull) side: receive, depacketize, decode, play. The focus here is on QoS (Quality of Service) and its relationship to the main media flow. As you can see, QoS functions are scattered across the whole pipeline, so a thorough understanding of QoS requires understanding the entire process. There appear to be more QoS functions on the sending side; that is because the purpose of QoS is to solve user-experience problems in communication, and the best way to solve a problem is at its source. But some problems cannot be solved at the source. In a multi-party conference, for example, one receiver's broken network must not be allowed to degrade everyone else's experience; one mouse dropping must not spoil the whole pot of porridge, so the source cannot be "polluted" on its behalf. The receiving side therefore needs QoS functions of its own, and the indispensable one today is the JitterBuffer, for both video and audio. This article focuses on the audio JitterBuffer: NetEQ.
[Figure: QoS functions in the audio and video communication pipeline]

NetEQ principles and the relationship to related modules

[Figure: workflow of NetEQ and related modules]
The picture above abstracts the workflow of NetEQ and its related modules. It has four main parts: NetEQ input, NetEQ output, the audio retransmission (Nack) request module, and the audio/video synchronization module. Why include the Nack module and the A/V sync module in an analysis of NetEQ? Because both depend directly on NetEQ, and the three influence one another. The dotted lines in the figure mark the information each module takes from other modules and where that information comes from. The whole flow is described next.

1. First, the input side of NetEQ:

When a UDP packet arrives from the socket at the bottom of the stack, the RTP packet is parsed out of it, the SSRC and PayloadType are matched to find the receive Channel of the corresponding audio stream, and the packet enters NetEQ through its input interface, InsertPacketInternal.

The received audio RTP packet may carry RED redundancy. It is unpacked, according to the RFC 2198 standard or some proprietary encapsulation format, to restore the original packets; duplicate originals are discarded. The recovered RTP packets are then inserted into the packet buffer according to a certain algorithm. As each original packet is received, its sequence number is reported to the Nack retransmission-request module via UpdateLastReceivedPacket; the Nack module, triggered either by packet arrival or by a timer, calls GetNackList to generate a retransmission request, which is packaged as an RTCP NACK packet and sent back to the sending side.
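To make the RED unpacking concrete, here is a minimal sketch of RFC 2198 header parsing. This is not WebRTC's actual implementation; the struct and function names are illustrative, but the bit layout follows the RFC (each redundant block has a 4-byte header, and the final primary block has a 1-byte header with the F bit cleared).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct RedBlock {
  uint8_t payload_type = 0;       // codec payload type of this block
  uint32_t timestamp_offset = 0;  // offset from the primary RTP timestamp
  size_t offset = 0;              // byte offset of the block payload
  size_t length = 0;              // payload length in bytes
};

// Returns true on success; `blocks` lists redundant blocks first and the
// primary block last (its length is whatever remains of the payload).
bool ParseRedHeaders(const uint8_t* payload, size_t size,
                     std::vector<RedBlock>* blocks) {
  size_t pos = 0;
  std::vector<RedBlock> headers;
  // Read 4-byte headers while the F bit (MSB of the first byte) is set.
  while (pos < size && (payload[pos] & 0x80)) {
    if (pos + 4 > size) return false;
    RedBlock b;
    b.payload_type = payload[pos] & 0x7F;
    // 14-bit timestamp offset, then 10-bit block length.
    b.timestamp_offset = (static_cast<uint32_t>(payload[pos + 1]) << 6) |
                         (payload[pos + 2] >> 2);
    b.length = (static_cast<size_t>(payload[pos + 2] & 0x03) << 8) |
               payload[pos + 3];
    headers.push_back(b);
    pos += 4;
  }
  if (pos >= size) return false;
  // Final 1-byte header: F=0 plus the primary payload type.
  RedBlock primary;
  primary.payload_type = payload[pos] & 0x7F;
  pos += 1;
  // Payloads follow in header order: redundant blocks, then the primary.
  for (RedBlock& b : headers) {
    if (pos + b.length > size) return false;
    b.offset = pos;
    pos += b.length;
  }
  primary.offset = pos;
  primary.length = size - pos;
  headers.push_back(primary);
  *blocks = headers;
  return true;
}
```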

At the same time, each recovered packet gets a unique receive time on the time axis, from which the arrival gap between consecutive packets can be computed. This gap divided by the packet duration is the IAT (inter-arrival time) that NetEQ uses internally for jitter estimation. For example, if the gap between two packets is 120 ms and the packet duration is 20 ms, the current packet's IAT is 120/20 = 6. The IAT of each packet is fed into the core network jitter estimation module (DelayManager), which produces the final target level (TargetLevel), and the input side of NetEQ ends there.
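As a tiny illustration of the IAT computation just described (the function name is mine, not WebRTC's):

```cpp
#include <cstdint>

// Returns the IAT in packets: how many packet durations elapsed between
// two consecutive arrivals. 1 means "on time"; the 120 ms gap with 20 ms
// packets from the example above yields 6.
int ComputeIatPackets(int64_t arrival_ms, int64_t last_arrival_ms,
                      int packet_duration_ms) {
  const int64_t delta_ms = arrival_ms - last_arrival_ms;
  return static_cast<int>(delta_ms / packet_duration_ms);
}
```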

2. Next, the output side of NetEQ:

Output is driven by the audio hardware's playback thread: every 10 ms, the playback device fetches 10 ms of data from NetEQ through the GetAudioInternal interface.

Inside GetAudioInternal, the first step is to decide how to handle the current data request. This is the job of the operation decision module, which, based on the current data and state, issues a final verdict on the operation type. NetEQ defines several operation types: normal, accelerate, preemptive expand (decelerate), merge, expand (packet loss concealment), and comfort noise; their meanings are described in detail later. Given the decided operation, some RTP packets are taken out of the packet buffer and handed to the abstract decoder, which calls the real decoder through the DecodeLoop, placing the decoded PCM audio in the DecodedBuffer. Then the chosen operation is performed; each operation has its own audio digital signal processing (DSP) algorithm inside NetEQ. The "normal" operation uses the decoded data in the DecodedBuffer directly, while every other operation applies secondary DSP processing to the decoded data. The processing result is first put into the Algorithm Buffer and then inserted into the Sync Buffer. The Sync Buffer is a cleverly designed circular buffer that holds both data that has already been played and decoded data that has not yet been played; the data just inserted from the Algorithm Buffer goes at the end of the Sync Buffer, as shown in the figure above. Finally, the earliest unplayed data is taken from the Sync Buffer, handed to the external mixing module, and after mixing sent to the audio hardware for playback.
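A toy model of the Sync Buffer idea may help: one contiguous timeline where new output from the Algorithm Buffer is appended at the end, and the playout thread consumes the earliest pending 10 ms. A deque stands in for the real circular buffer (which also keeps the played history for merging); all names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

class ToySyncBuffer {
 public:
  // Called after each NetEQ operation with the Algorithm Buffer output.
  void Append(const std::vector<int16_t>& algorithm_buffer_output) {
    samples_.insert(samples_.end(), algorithm_buffer_output.begin(),
                    algorithm_buffer_output.end());
  }

  // Pop the earliest pending samples, e.g. 480 samples = 10 ms at 48 kHz.
  std::vector<int16_t> ReadNext(size_t count) {
    count = std::min(count, samples_.size());
    std::vector<int16_t> out(samples_.begin(), samples_.begin() + count);
    samples_.erase(samples_.begin(), samples_.begin() + count);
    return out;
  }

  size_t pending_samples() const { return samples_.size(); }

 private:
  std::deque<int16_t> samples_;
};
```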

The figure also shows that the decision module, through the BufferLevelFilter, combines the amount of data buffered in the packet buffer with the amount buffered in the Sync Buffer, and filters them into the current audio buffer level. The audio/video synchronization module takes the current audio and video buffer levels, together with the timestamp of the latest RTP packet and the timestamps obtained from the audio and video SR (sender report) packets, computes how far audio and video are out of sync, and finally calls SetMinimumPlayoutDelay to set a minimum target level inside NetEQ, steering the TargetLevel so as to achieve audio/video synchronization.
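A simplified sketch of the buffer-level filtering idea, assuming a plain exponential smoother; the real BufferLevelFilter differs in details and coefficients:

```cpp
// Smooths the total buffered duration (packet buffer + Sync Buffer) so
// the decision logic reacts to the trend, not to per-tick noise.
class ToyBufferLevelFilter {
 public:
  // `forget` close to 1.0 means slow, stable tracking.
  explicit ToyBufferLevelFilter(double forget = 0.9) : forget_(forget) {}

  // Called once per 10 ms output tick with the current buffered amounts.
  void Update(int packet_buffer_ms, int sync_buffer_ms) {
    const double level = packet_buffer_ms + sync_buffer_ms;
    filtered_ms_ = forget_ * filtered_ms_ + (1.0 - forget_) * level;
  }

  int filtered_level_ms() const { return static_cast<int>(filtered_ms_); }

 private:
  double forget_;
  double filtered_ms_ = 0.0;
};
```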

NetEQ internal modules

NetEQ jitter estimation module (DelayManager)

1. Stationary jitter estimation:

The IAT value of each packet is accumulated into the IAT statistics histogram below with a certain weight (the weight is determined by the forgetting-factor calculation in the next part). Then, scanning from left to right, the position where the cumulative probability reaches 0.95 is found, and the IAT value at that position is taken as the final jitter estimate. In the figure below, for example, if the resulting target level (TargetLevel) is 9, the target buffered duration will be 180 ms (assuming a packet duration of 20 ms).

[Figure: IAT histogram and the 95% cumulative-probability target level]
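A minimal sketch of the 95th-percentile lookup on the IAT histogram; `hist[i]` holds the (forgetting-factor weighted) probability mass of IAT value i, and names are illustrative rather than WebRTC's:

```cpp
#include <cstddef>
#include <vector>

// Returns the smallest IAT value whose cumulative mass reaches 95% of
// the histogram total; a result of 9 with 20 ms packets means a target
// buffered duration of 180 ms, as in the example above.
int TargetLevelFromHistogram(const std::vector<double>& hist) {
  double total = 0.0;
  for (double p : hist) total += p;
  double cumulative = 0.0;
  for (size_t i = 0; i < hist.size(); ++i) {
    cumulative += hist[i];
    if (cumulative >= 0.95 * total) return static_cast<int>(i);
  }
  return static_cast<int>(hist.size()) - 1;  // everything sits in the tail
}
```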

2. Calculating the forgetting factor for stationary jitter:

The forgetting factor is a coefficient that controls how much of the current packet's IAT is added to the histogram above. Its calculation uses a seemingly complicated formula, but the essence is the yellow curve below: at the start the forgetting factor is small, so the current packet's IAT counts heavily in the accumulation; as time goes on the forgetting factor grows and the current packet's IAT counts less and less. The process looks complicated, but from an engineering point of view it can be simplified to a straight line, because testing shows it essentially converges to the target value of 0.9993 within about 5 s. This 0.9993 is in fact the factor with the greatest influence on jitter estimation, and many optimizations adjust the estimator's sensitivity simply by modifying this coefficient.
[Figure: forgetting-factor curve, converging to 0.9993 in about 5 s]
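A sketch of the histogram update driven by the forgetting factor f: all bins decay by f and the current packet's IAT bin gains (1 - f). With f at 0.9993, history dominates and the estimate is very stable; lowering f makes it more reactive. Illustrative only, not WebRTC's fixed-point code:

```cpp
#include <vector>

void UpdateIatHistogram(std::vector<double>& hist, int iat,
                        double forget_factor /* e.g. 0.9993 */) {
  // Decay every bin, then credit the observed IAT bin, so the histogram
  // stays (approximately) normalized over time.
  for (double& p : hist) p *= forget_factor;
  if (iat >= 0 && iat < static_cast<int>(hist.size()))
    hist[iat] += 1.0 - forget_factor;
}
```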

3. Peak jitter estimation:

The DelayManager contains a peak detector (PeakDetector) to identify jitter peaks. If peaks are detected frequently, the estimator enters a peak-jitter state and takes the largest peak as the final estimate; once in this state, it stays there for a full 20 s, regardless of whether the jitter has already returned to normal. Below is a schematic diagram.
[Figure: peak jitter estimation]

NetEQ Operation Decision Module (DecisionLogic)

The simplified basic logic of the decision module is shown in the figure below and is fairly self-explanatory. Here is what the operation types mean (a minimal decision sketch follows the figure):

  • ComfortNoise: generates comfort noise, which sounds more pleasant during silence than plain silence packets;
  • Expand (PLC): packet loss concealment, the most important "something from nothing" module; it fills the gap when "true packet loss" occurs, well enough to fool even professional listeners;
  • Merge: if the previous output was faked by Expand, a merge algorithm blends it with the next normal packet so the seam sounds smoother;
  • Accelerate: speeds up playback without changing the pitch;
  • PreemptiveExpand: slows down playback without changing the pitch;
  • Normal: ordinary decoding and playback, introducing no artificial data;
[Figure: simplified decision logic of DecisionLogic]
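A highly simplified sketch of the decision idea, not WebRTC's actual DecisionLogic: compare the filtered buffer level with the target level and pick an operation. The thresholds here are illustrative.

```cpp
enum class Operation {
  kNormal, kAccelerate, kPreemptiveExpand, kMerge, kExpand, kComfortNoise
};

Operation DecideOperation(int buffer_level_packets, int target_level_packets,
                          bool next_packet_available, bool last_was_expand) {
  if (!next_packet_available)
    return Operation::kExpand;            // conceal the missing data (PLC)
  if (last_was_expand)
    return Operation::kMerge;             // smooth the expand-to-normal seam
  if (buffer_level_packets > target_level_packets * 2)
    return Operation::kAccelerate;        // drain an over-full buffer
  if (buffer_level_packets < target_level_packets / 2)
    return Operation::kPreemptiveExpand;  // stretch to build up the buffer
  return Operation::kNormal;
}
```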

Optimization points for NetEQ and related modules

NetEQ anti-jitter optimization

1. NetEQ's design goal is extremely low latency, which is a poor match for scenarios that are not latency-critical, such as video conferencing, online classrooms, and live streaming; the sensitivity, mainly of the jitter estimation module, needs to be retuned;
2. In live-streaming scenarios, where the tolerable delay can be above a second, the StreamMode feature needs to be enabled (it appears to have been removed in newer versions) and its parameters adapted;
3. In service of the extremely-low-latency goal, the original packet buffer is small, which easily causes flushes; it needs to be enlarged according to business needs;
4. Some services identify network conditions themselves according to their business scenario and then simply and bluntly control NetEQ's water level by setting the minimum TargetLevel directly (a minimal sketch follows this list).
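A sketch of item 4, in the spirit of SetMinimumPlayoutDelay; the function name is illustrative:

```cpp
#include <algorithm>

// The jitter estimate may ask for less buffering than the application
// (e.g. a live-streaming profile or an A/V sync module) requires, so the
// application-set minimum acts as a floor on the target level.
int ApplyMinimumTargetLevel(int estimated_target_ms, int app_minimum_ms) {
  return std::max(estimated_target_ms, app_minimum_ms);
}
```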

NetEQ anti-packet-loss optimization:

5. The original WebRTC Nack retransmission request is packet-triggered, which degrades retransmission under a weak network; changing it to timer-triggered solves this (see the sketch after this list);
6. Retransmissions do occur under packet loss, but if the buffer is too small the retransmitted packets are discarded anyway; to improve retransmission efficiency, adding an ARQ delay-reservation feature significantly reduces the stretch rate;
7. A more algorithm-level optimization is to improve the PLC algorithm itself and adjust NetEQ's existing stretching mechanism to improve the perceived audio quality;
8. With Opus DTX enabled, the audio buffer grows under packet loss, so the DTX-related handling logic needs separate optimization.
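A sketch of item 5: moving the NACK request from packet-triggered to timer-triggered, so retransmission requests keep flowing even when no packets arrive. GetNackList mirrors the function named earlier in the text; the tracker itself is illustrative, not WebRTC's.

```cpp
#include <cstdint>
#include <set>
#include <vector>

class SimpleNackTracker {
 public:
  // Receive path: record the gaps between consecutive sequence numbers.
  void OnReceivedPacket(uint16_t seq) {
    if (!has_last_) {
      last_seq_ = seq;
      has_last_ = true;
      return;
    }
    const uint16_t ahead = static_cast<uint16_t>(seq - last_seq_);
    if (ahead != 0 && ahead < kMaxGap) {
      // Forward jump: every skipped sequence number is missing for now.
      // uint16_t arithmetic handles sequence-number wraparound.
      for (uint16_t s = static_cast<uint16_t>(last_seq_ + 1); s != seq; ++s)
        missing_.insert(s);
      last_seq_ = seq;
    } else {
      // A reordered or retransmitted packet fills an earlier gap.
      missing_.erase(seq);
    }
  }

  // Timer path (e.g. every 20 ms): request all outstanding gaps, even if
  // no new arrival triggered it; the result would be packaged into an
  // RTCP NACK toward the sender.
  std::vector<uint16_t> GetNackList() const {
    return std::vector<uint16_t>(missing_.begin(), missing_.end());
  }

 private:
  static constexpr uint16_t kMaxGap = 1000;  // ignore wild jumps
  std::set<uint16_t> missing_;
  uint16_t last_seq_ = 0;
  bool has_last_ = false;
};
```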
Below is a comparison of the effect after enabling the ARQ delay-reservation feature: the average stretch rate drops by 50%, while the delay increases correspondingly:
[Figure: stretch rate before and after enabling ARQ delay reservation]

Audio/video synchronization optimization:

[Figure: audio/video synchronization mechanism]

9. The original WebRTC P2P audio/video sync algorithm is fine as such, but today's architectures usually include a media-forwarding server (SFU), and due to various constraints or bugs the server's SR packet generation may not be entirely correct, breaking synchronization. To avoid depending on SR packet correctness, the calculation in the A/V sync module can be changed to use buffer levels as the main reference for synchronization, that is, to keep the buffered durations of audio and video equal at the receiving end (a minimal sketch appears after this list). Below is a before/after comparison:
[Figures: audio/video sync before and after the optimization]
10. There is another kind of audio/video desync that is not caused by the sync mechanism at all: the device's performance cannot keep up with video decoding and rendering, video data accumulates, and audio and video drift apart. This kind of problem can be diagnosed by comparing the trend of the desync duration against the trend of the video decode/render time; the two will match closely, as shown in the following figure:
[Figure: desync duration vs. video decode/render time]
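A sketch of the buffer-level-based sync from item 9: instead of trusting SR timestamps from the SFU, keep the audio and video buffered durations equal by adding delay to whichever stream runs lighter. The names are illustrative; on the audio side the extra delay would be fed to NetEQ via SetMinimumPlayoutDelay.

```cpp
struct SyncDecision {
  int extra_audio_delay_ms = 0;
  int extra_video_delay_ms = 0;
};

// The stream with the smaller buffer plays out earlier; delaying it by
// the difference lines the two streams up at the receiving end.
SyncDecision SyncByBufferLevel(int audio_buffer_ms, int video_buffer_ms) {
  SyncDecision d;
  const int diff = audio_buffer_ms - video_buffer_ms;
  if (diff < 0)
    d.extra_audio_delay_ms = -diff;  // audio runs lighter: delay audio
  else
    d.extra_video_delay_ms = diff;   // video runs lighter: delay video
  return d;
}
```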

Summary

NetEQ, as the core of the audio receiving side, touches basically every aspect of it, which is why traces of NetEQ can be found in so many audio and video communication technologies. With WebRTC open source for nearly ten years now, NetEQ has spread far and wide. I hope this plain-language article helps you understand it better.

A final word from the author: requirements never stop, and neither does optimization!

"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field.


Origin: blog.51cto.com/14968479/2661268