In-depth exploration of how NetEQ in the WebRTC open source audio/video library counters network delay and packet loss in audio

Table of contents

1. Introduction

2. Introduction to WebRTC

3. What is NetEQ?

4. Detailed explanation of NetEQ technology

4.1. Overview of NetEQ

4.2. Jitter elimination technology

4.3. Packet loss compensation technology

4.4. NetEQ outline design

4.5. NetEQ command mechanism

4.6. NetEQ playback mechanism

4.7. MCU control mechanism

4.8. DSP algorithm processing

4.9. Simulation test of DSP algorithm

5. NetEQ source file description

6. Reference documents


As application scenarios and operating environments change, audio and video software faces ever higher requirements for audio quality. To achieve high-quality audio, it helps to learn from mature solutions in the audio and video field. WebRTC is currently one of the most advanced voice engines for solving voice quality problems; its NetEQ network equalizer module handles delay, jitter and packet loss of audio data well under low bandwidth. This article analyzes in detail the implementation principles, processing flow and packet loss concealment mechanism of the NetEQ network equalizer in WebRTC.

1. Introduction

Since IP networks are mainly designed for data transmission and, unlike traditional telephony, do not occupy a dedicated logical or physical circuit, they provide no quality of service (QoS) guarantee and suffer from out-of-order arrival, delay, packet loss and jitter. For packet loss, retransmission or multiple-transmission mechanisms can be used at the application level. However, audio and video software is a real-time service with strict requirements on bandwidth, delay and jitter, so certain QoS guarantees must be provided.

Two main factors affect audio quality in audio and video software: delay jitter and packet loss handling, and the two interact, because the jitter buffer used at the receiver directly affects how packet loss is handled. A receive buffer can absorb delay jitter, but when packets are lost the output either freezes or is filled with silence or interpolated data, and in networks with large delay, large jitter and severe packet loss the result is far from ideal.

To borrow WebRTC's NetEQ network equalizer technology to improve the audio quality of our own software, we first need to analyze the principles and processing flow of NetEQ, then understand the principles and applicable scenarios of its packet loss concealment algorithms, and finally apply them effectively in product design.

2. Introduction to WebRTC

        Before introducing the NetEQ network equalizer in WebRTC in detail, let us first have a general understanding of WebRTC.

WebRTC (Web Real-Time Communication) is a real-time audio and video communication C++ open source library initiated by Google. It provides a complete set of audio and video capabilities such as capture, encoding, network transmission, decoding and rendering, and we can use this open source library to quickly build an audio and video communication application.

A real-time audio and video application generally includes the following links: audio and video capture, encoding (compression), pre- and post-processing (beautification, filters, echo cancellation, noise suppression, etc.), network transmission, decoding and rendering (playback). Each of these links in turn contains more fine-grained technical modules.

Although it is called WebRTC, it supports not only audio and video communication between web browsers but also native platforms such as Windows, Android and iOS. The bottom layer of WebRTC is developed in C/C++ and has good cross-platform portability.

  • WebRTC is mainly developed and implemented in C++. The code uses a large number of new features of C++11 and above. Before reading the source code, you need to have a general understanding of these new features of C++.
  • It is necessary to learn the new features of C++11. Not only are new features frequently used in C++ open source code, but they are also often asked during written interviews when changing jobs.
  • It is recommended that you carefully read the free and public "Google Open Source Project Style Guide (zh-google-styleguide)". It is not just Google's coding standard: it tells you not only what to do when coding, but also why to do it, and it is very helpful for learning the new features of C++11 and above. Our project team systematically studied this style guide last year and gained a lot from it; it is of great reference value.

Because of its good audio and video quality and network adaptability, WebRTC has been widely used in video conferencing, real-time audio and video live streaming and other fields. In the video conferencing field, domestic products such as Tencent Conference, Huawei WeLink, ByteDance Feishu, Alibaba DingTalk, Xiaoyu Yilian and Xiamen Yealink all provide video conferencing solutions based on WebRTC.

Agora (Shengwang), a well-known professional audio and video service provider, builds on the open source WebRTC library to provide audio and video interactive solutions for social live streaming, education, gaming and e-sports, IoT, AR/VR, finance, insurance, medical, enterprise collaboration and other industries. Companies using Agora's services include Xiaomi, Momo, Douyu, Bilibili, New Oriental, Xiaohongshu, HTC VIVE, The Meet Group, Bunch, Yalla and other global giants, unicorns and startups. Besides Agora, many companies have developed audio and video applications based on open source WebRTC, providing communication solutions in multiple fields.

3. What is NetEQ?

NetEQ is essentially an audio jitter buffer; its full name is Network Equalizer.

NetEQ was one of the two core technologies of the GIPS voice engine: an advanced adaptive jitter buffer that incorporates a packet loss concealment algorithm (the other core technology being the 3A algorithms). In 2010 Google acquired Global IP Solutions for about US$68.2 million, then integrated the technology into WebRTC and released it as open source in 2011.

NetEQ integrates an adaptive jitter control algorithm and a speech packet loss concealment algorithm, and is tightly integrated with the decoder, so NetEQ can maintain good voice quality even in high packet-loss environments.

4. Detailed explanation of NetEQ technology

4.1. Overview of NetEQ

NetEQ combines an adaptive jitter control algorithm with a speech packet loss concealment algorithm. The adaptive jitter algorithm adapts quickly to changing network conditions, while the packet loss concealment algorithm maintains acceptable sound quality and clarity with minimal buffering delay. In addition, simulation tests of the NetEQ algorithms help evaluate the resulting sound quality and how NetEQ can be combined organically with an existing software design.


       The module overview diagram of NetEQ is as follows:

As can be seen from the figure above, NetEQ is divided into four parts: the adaptive packet buffer, the speech decoder, jitter control and error concealment, and playout. The jitter control and packet loss concealment module is the core of NetEQ: it controls the adaptive buffer as well as the decoder and the concealment algorithms, and hands the final output to the sound card for playback.

First, NetEQ is currently one of the most complete jitter elimination technologies. Compared with fixed jitter buffering and traditional adaptive jitter buffering, NetEQ adapts quickly to changing network conditions, ensuring smaller delay and less packet loss. A performance comparison of the NetEQ adaptive jitter algorithm is shown in the figure below:

       Secondly, the jitter control and packet loss compensation module consists of three major operations, namely Expansion, Normal and Accelerate:

Expansion: stretches the speech duration and includes the expand and preemptive_expand modes. The former is NetEQ's packet loss concealment, used to wait for late packets and compensate for lost ones; the latter, preemptive expansion, stretches the speech based on existing data so as to slow down playback.
Normal: normal playback, used when the network is normal and relatively stable.
Accelerate: speeds up playback.

To sum up, this article mainly discusses NetEQ's jitter elimination and packet loss concealment technology, combining simulation testing with product design analysis to further improve the call quality of video conferencing products. The NetEQ performance summary is as follows:

4.2. Jitter elimination technology

       There are two definitions of jitter:

Jitter definition 1: the change in packet arrival rate caused by various delays in the network. Specifically, jitter is the difference between the sending interval of the data stream at the sender and its receiving interval at the receiver; this definition suits variable bit rate scenarios.
Jitter definition 2: the difference between the arrival interval of a particular packet at the receiver and the average packet arrival interval; this definition suits constant bit rate scenarios.

Jitter forms a zero-mean random sequence of delay differences between queued IP packets. When packets accumulate (arrive early), speech integrity is preserved but the receive buffer can overflow and end-to-end delay grows. When a packet times out, it has still not reached the receiver some time after transmission, meaning it arrived late or was lost. Since both overflow and timeout cause packet loss, they increase the end-to-end loss rate, so jitter must be controlled effectively to reduce the loss it causes.
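To make definition 2 concrete, the sketch below keeps a running estimate of delay jitter by comparing each packet's receive interval with its send interval and smoothing the result. The 1/16 smoothing factor follows the RFC 3550 convention; the struct and field names are illustrative, not NetEQ's internal code.

```cpp
#include <cmath>
#include <cstdint>

// Running estimate of delay jitter (definition 2): compare the receive
// interval of each packet with its send interval and smooth the result.
struct JitterEstimator {
  double jitter = 0.0;        // smoothed jitter, in timestamp units
  int64_t prev_send_ts = 0;   // RTP timestamp of the previous packet
  int64_t prev_recv_ts = 0;   // local arrival time of the previous packet
  bool has_prev = false;

  void OnPacket(int64_t send_ts, int64_t recv_ts) {
    if (has_prev) {
      // D = (receive interval) - (send interval); its magnitude is the
      // per-packet delay variation.
      const double d = std::abs(static_cast<double>(
          (recv_ts - prev_recv_ts) - (send_ts - prev_send_ts)));
      jitter += (d - jitter) / 16.0;  // exponential smoothing
    }
    prev_send_ts = send_ts;
    prev_recv_ts = recv_ts;
    has_prev = true;
  }
};
```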

Jitter is usually removed with a jitter buffer at the receiver: arriving voice packets first enter the buffer and are stored temporarily, then the system extracts them at a steady rate, decodes them and plays them from the audio device. The ideal state of jitter elimination is that the transmission delay of each packet plus its waiting time in the buffer is constant, and the buffer size equals the sum of the jitter of early-arriving packets and the data already buffered.

       Jitter buffer control algorithms include static jitter buffer and adaptive buffer jitter control algorithms:

Static jitter control algorithm: the delay and size of the buffer are fixed once the voice call is established and stay fixed until the call ends; data whose timeout or jitter exceeds the buffer size is discarded. The model is simple and easy to implement, but when network delay and jitter are large the packet loss rate is high, and when they are small the voice delay is unnecessarily large. The buffer delay and size cannot adapt to network conditions, and the initial values limit the network conditions to which the algorithm applies.

Adaptive jitter control algorithm: the delay and size of the buffer change with the actual network jitter. The receiver compares the delay of the currently received packet with the delay information maintained by the algorithm to obtain the current maximum network jitter, and then selects an appropriate buffer delay and size. The advantage is a small packet loss rate when jitter is large and a small delay when jitter is small; the disadvantage is that such algorithms are diverse and relatively complex.

Considering the complexity and variability of today's networks, adaptive jitter algorithms are generally used, and NetEQ's jitter elimination belongs to this category.

4.3. Packet loss compensation technology

Packet loss compensation, also called Packet Loss Concealment (PLC), can be divided into two categories: sender-based compensation and receiver-based compensation. The family of packet loss compensation techniques is shown below:


Sender-based compensation is also called packet loss recovery (Packet Loss Recovery). Generally speaking, sender-based compensation performs better than receiver-based compensation, but it increases network bandwidth and delay.

FEC (Forward Error Correction) is currently the most promising redundant coding technique for improving VoIP voice quality; it aims to improve the reliability of voice transmission. Besides the original data, FEC transmits redundant data derived from it so that the decoder can reconstruct lost packets from the correlation between packets. The simplest scheme in VoIP is the parity code: for every n-1 data packets, one check packet containing the XOR of those packets is transmitted, and as long as at most one packet in every group of n is lost, it can be rebuilt from the other n-1. FEC based on parity packets looks like this:
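As a concrete illustration of the parity scheme described above, the sketch below builds one XOR parity packet per group of equally sized media packets and rebuilds a single lost packet from the survivors. The function names and packet representation are assumptions made for the example, not an actual FEC wire format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build an XOR parity packet over a group of equally sized media packets.
std::vector<uint8_t> BuildParity(const std::vector<std::vector<uint8_t>>& group) {
  std::vector<uint8_t> parity(group.front().size(), 0);
  for (const auto& pkt : group) {
    for (size_t i = 0; i < parity.size(); ++i) parity[i] ^= pkt[i];
  }
  return parity;
}

// Recover a single lost packet: XOR the parity with every packet that arrived.
std::vector<uint8_t> RecoverLost(const std::vector<std::vector<uint8_t>>& received,
                                 const std::vector<uint8_t>& parity) {
  std::vector<uint8_t> lost = parity;
  for (const auto& pkt : received) {
    for (size_t i = 0; i < lost.size(); ++i) lost[i] ^= pkt[i];
  }
  return lost;  // valid only if exactly one packet in the group was lost
}
```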

When packet loss is continuous, FEC and similar techniques are not effective. To resist long bursts of lost speech, interleaving can be used. Interleaving is not a true loss recovery technique because it cannot restore lost packets, but it reduces the damage loss causes. It divides the original data into units smaller than an IP packet and reorders those units before sending, so that the data in each IP packet comes from several different voice frames. When a packet is lost, each frame loses only part of its data rather than an entire frame, and the units are reordered back at the receiver. Interleaving exploits the human brain's ability to perceptually fill in a small amount of missing data per frame, so the impact on hearing is small and sound quality improves. Because no extra information is sent, bandwidth does not increase, but the reordering at the receiver adds delay, which beyond a certain point becomes intolerable. The GSM system uses interleaving.

       The interleaving technique is as follows:
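A minimal sketch of the interleaving idea: each frame is split into small units, and packet j collects unit j of every frame, so losing one packet removes only a thin slice of each frame. The function name and data layout are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Interleave: row i holds the units of frame i; packet j collects unit j of
// every frame, so the loss of one packet removes only a small slice of each
// frame rather than a whole frame. All frames are assumed equally sized.
template <typename Unit>
std::vector<std::vector<Unit>> Interleave(const std::vector<std::vector<Unit>>& frames) {
  const size_t units_per_frame = frames.front().size();
  std::vector<std::vector<Unit>> packets(units_per_frame);
  for (size_t j = 0; j < units_per_frame; ++j) {
    for (const auto& frame : frames) {
      packets[j].push_back(frame[j]);  // unit j of every frame goes to packet j
    }
  }
  return packets;  // the receiver applies the inverse mapping before playback
}
```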

Low-rate redundant coding (Low-rate Redundant Coding) is a redundancy technique in which each data packet carries, in addition to its own data, a compressed low-quality copy of the previous frame that occupies few bits. When the receiver loses a packet, this copy in a subsequent packet can be used to reconstruct the lost packet quickly. Unlike FEC, the added bits are not derived from the correlation between frames but are simply a copy carried in later packets, so this method also increases bandwidth and delay. Like FEC, it does not cope well with the continuous packet loss that occurs when the network is congested, where it can even aggravate the loss. G.729A uses redundant coding. The schematic of recovering lost packets with redundant coding is as follows:
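The sketch below illustrates the idea with an assumed packet layout (not a real RTP format): each packet piggybacks a coarse copy of the previous frame, and the receiver falls back to that copy when the original packet is missing.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// A media packet that carries, besides its own frame, a low-bitrate copy of
// the previous frame (illustrative layout only).
struct RedundantPacket {
  uint16_t seq;
  std::vector<uint8_t> primary;        // current frame, full quality
  std::vector<uint8_t> prev_low_rate;  // previous frame, coarse copy
};

// If packet `lost_seq` was lost but packet `lost_seq + 1` arrived, decode the
// coarse copy it carries instead of running receiver-side concealment.
std::optional<std::vector<uint8_t>> RecoverFromRedundancy(
    uint16_t lost_seq, const std::vector<RedundantPacket>& received) {
  for (const auto& pkt : received) {
    if (pkt.seq == static_cast<uint16_t>(lost_seq + 1) && !pkt.prev_low_rate.empty()) {
      return pkt.prev_low_rate;  // lower quality than the original frame
    }
  }
  return std::nullopt;  // consecutive losses cannot be covered this way
}
```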

The basic principle of receiver-based packet loss concealment is to generate a replacement that resembles the lost voice packet. Its feasibility rests on the short-term self-similarity of speech and the masking effect of the human ear, and it works for relatively small loss rates (<15%) and small voice packets (4–40 ms). Receiver-side concealment cannot replace sender-side compensation because the lost data cannot be recovered exactly; when the network loss rate is high, sender-side techniques are needed, and when it is extremely high the only remedy is to improve the network itself.

Insertion-based methods conceal loss by inserting a simple waveform at the position of the lost packet. This waveform usually has no correlation with the lost waveform; typical choices are silence, white noise and repetition of the previous frame.

Silence substitution has a very limited scope: it works acceptably only when the packet loss rate is below 2% and voice frames are shorter than 4 ms. White noise (comfort noise) exploits the human brain's subconscious ability to repair missing waveforms with plausible speech, and performs better than silence.

Repetition is the method used by the GSM system: during continuous loss, the compensating waveform is generated by gradually attenuating the data of the previous frame. Because the correlation between adjacent frames is not considered, the effect is not ideal. Interpolation techniques, by contrast, use similar waveforms to replace the loss: they estimate the lost data from the data before and after the gap and substitute the most similar waveform, which is more complex but performs better than insertion techniques.

Inter-frame interpolation is a traditional error concealment technique. For codecs based on transform coding or linear prediction, the decoder can compensate by interpolating the parameters of the previous frame, relying on the short-term stationarity of speech and the correlation of parameters between adjacent frames. G.723.1 interpolates LSP coefficients and excitation signals separately to compensate lost frames; G.729 also interpolates from the previous frame's parameters, using the previous frame's linear prediction coefficients (LPC) and a gain attenuation factor to conceal error frames.

Reconstruction-based compensation rebuilds the lost packet from the decoding information before and after the loss. It requires the most computation and gives the best results. It runs entirely at the receiver and needs neither the sender's cooperation nor an extra bitstream, so it meets real-time transmission requirements and is the most effective and practical approach in modern networks.

Waveform substitution based on pitch detection first computes the pitch period and then classifies the frame as unvoiced or voiced. For unvoiced frames, the most recent waveform before the loss is reused; for voiced frames, a suitable waveform one pitch period long, taken from before the loss, is used instead. Short-term energy and zero-crossing rate are then combined to recover the lost speech. The effect is better than insertion techniques, but the method is more complex.

A basic concept in digital speech processing is the fundamental tone (pitch). The fundamental is the lowest-frequency component produced when the vocal cords vibrate, and it carries most of the energy of speech; the remaining components are overtones. The frequency of this vocal-cord vibration is called the fundamental frequency, and its corresponding period is the pitch period. Estimating the pitch period is called pitch detection, and its goal is to obtain a pitch period length that matches the vibration frequency of the vocal cords.

The G.711 codec, a waveform coder, does not itself include packet loss concealment, but to improve voice quality a waveform substitution technique based on pitch detection was added in an appendix to the G.711 standard. It estimates the pitch period of the current speech from the previously decoded data and uses the most recent pitch period, plus the quarter pitch period before it, to fill the missing data; the quarter-period data is overlap-added with the speech before the loss to ensure a smooth transition between the original and compensated signals. If the next frame arrives normally, the same quarter-period overlap-add is applied against the normally decoded data to keep that transition smooth. If the next frame is also lost, one more pitch period of data is taken for compensation, up to a maximum of three pitch periods. The more consecutive frames are lost, the larger the gap between the compensated and the real speech, so from the second lost frame onward the compensation is attenuated frame by frame at a rate of 20%. Because the speech signal is quasi-stationary, and the voiced part in particular is periodic, reconstructing lost frames from the speech just before the loss works well.
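The following is a much-simplified sketch of the pitch-based waveform substitution just described: repeat the last pitch period of good speech, cross-fade a quarter pitch period at the seam, and attenuate by 20% for each additional lost frame. Buffer handling and helper names are assumptions for the example; the real G.711 appendix algorithm keeps more state.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Conceal one lost frame by repeating the last pitch period of the previously
// decoded speech. `history` holds good decoded samples, `pitch_period` comes
// from a pitch detector, `consecutive_losses` starts at 1 for the first loss.
std::vector<int16_t> ConcealFrame(const std::vector<int16_t>& history,
                                  size_t pitch_period, size_t frame_len,
                                  int consecutive_losses) {
  std::vector<int16_t> out(frame_len);
  const size_t start = history.size() - pitch_period;  // last pitch period
  // Attenuate 20% per extra lost frame: continuous losses drift from truth.
  const float gain = std::max(0.0f, 1.0f - 0.2f * (consecutive_losses - 1));
  for (size_t i = 0; i < frame_len; ++i) {
    out[i] = static_cast<int16_t>(history[start + (i % pitch_period)] * gain);
  }
  // Cross-fade the start of the concealed frame with the most recent good
  // samples so the seam between real and concealed speech stays smooth.
  const size_t ola = pitch_period / 4;
  for (size_t i = 0; i < ola && i < frame_len; ++i) {
    const float w = static_cast<float>(i) / ola;  // linear cross-fade weight
    out[i] = static_cast<int16_t>(
        w * out[i] + (1.0f - w) * history[history.size() - ola + i]);
  }
  return out;
}
```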

Time-domain repair extends the waveforms on both sides of the gap toward each other to fill it: overlapping pitch-period vectors on either side of the gap are found, shifted to cover the gap, and the overlapping parts are averaged. This avoids phase discontinuities at the gap boundary, so no click is heard at the junction, and the subjective effect is better than pitch-detection-based waveform substitution.

NetEQ's packet loss concealment in WebRTC incorporates the concealment used by the iLBC algorithm. iLBC (Internet Low Bit rate Codec) is a codec developed by GIPS specifically for packet-switched network communication: it works at an 8 kHz sampling rate, with frame lengths of 20 ms and 30 ms at bit rates of 15.2 kbps and 13.3 kbps respectively, and is highly robust to packet loss. iLBC's concealment is performed entirely at the decoder and uses a model-based recovery method to generate the compensation packet. The specific steps are as follows (a simplified sketch follows the list):

1) Reconstruct the linear prediction coefficients (LPC): reuse the LPC of the last good frame, because that frame is the most correlated, both spectrally and temporally, with the lost frame. However, simply copying the coefficients clearly introduces larger distortion when several consecutive frames are lost.
2) Reconstruct the residual signal: the residual can usually be split into a quasi-periodic component and a noise-like component. The quasi-periodic component is approximated from the pitch period of the previous frame, the noise-like component is generated as random noise, and the energy ratio between the two is carried over from the previous frame. So the pitch of the previous frame is detected first, the lost frame's speech is rebuilt pitch-synchronously, the noise gain is obtained via correlation, and finally the two are mixed to reconstruct the whole residual.
3) For continuous frame loss, every frame produced by the PLC has the same spectral envelope (same LPC) and pitch frequency; to reduce the correlation between successive compensation frames, the energy is reduced frame by frame.
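The sketch below condenses step 2 (and the attenuation of step 3) under simplified assumptions: the residual is rebuilt as a pitch-repeated component plus scaled random noise, with a gain carried over from the previous frame, and the result would then be passed through the previous frame's LPC synthesis filter (step 1). All names and coefficients are illustrative.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Rebuild a lost frame's residual as a mix of a pitch-synchronous
// (quasi-periodic) part and scaled random noise, with the mixing gain carried
// over from the previous frame. Illustrative only, not the iLBC reference code.
std::vector<float> RebuildResidual(const std::vector<float>& prev_residual,
                                   size_t pitch_period, size_t frame_len,
                                   float noise_gain, int consecutive_losses) {
  std::vector<float> res(frame_len);
  const size_t start = prev_residual.size() - pitch_period;
  // Lower the energy frame by frame during continuous losses (step 3).
  const float atten = 1.0f / (1.0f + 0.3f * (consecutive_losses - 1));
  for (size_t i = 0; i < frame_len; ++i) {
    const float periodic = prev_residual[start + (i % pitch_period)];
    const float noise =
        noise_gain * (static_cast<float>(std::rand()) / RAND_MAX - 0.5f);
    res[i] = atten * (periodic + noise);
  }
  return res;  // step 1: feed this through the previous frame's LPC filter
}
```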

Opus packet loss concealment has two modes, CELT and SILK. The Opus codec was designed by the Internet Engineering Task Force (IETF) for interactive voice and audio transmission over the Internet, combining Skype's SILK technology (not compatible with Skype's original SILK) and Xiph.Org's CELT technology. Concealment in CELT mode is similar to iLBC's.

The SILK encoder module framework is as follows:

4.4. NetEQ outline design

The NetEQ module consists of two major processing units, the MCU and the DSP. The MCU (Micro Control Unit) is the control unit of the jitter buffer: since the jitter buffer temporarily stores received data, the MCU's main job is to insert incoming packets and control when packets are taken out. Jitter elimination is handled in the MCU control module.

       The outline design of NetEQ is as follows:

The jitter buffer contains 240 slots. Each original packet received from the network is placed in an appropriate slot; the slot mainly stores the packet's timestamp, sequence number, data type and similar information, while the actual payload is stored in a memory buffer. When a new packet arrives, space is allocated in the memory buffer for its payload. This buffering is what absorbs jitter.

       The DSP module is mainly responsible for algorithm processing of data packets read from the MCU, including decoding, signal processing, data output, etc. Packet loss compensation technology is included in the DSP module.

The voice buffer stores decoded and signal-processed data waiting to be played, with room for 565 samples. curPosition marks the start of the data still to be played, and sampleLeft is the number of samples left to play.

The shared memory, decoding buffer and algorithm buffer are all temporary buffers. The shared memory temporarily stores the data to be decoded that was read from the jitter buffer, together with the number of lost samples and the MCU control command; the decoding buffer holds the decoded data; the NetEQ algorithm buffer holds the data processed by the DSP algorithms and is used to replenish the voice buffer with new data; the playback buffer is the playback driver's buffer, which reads data from the voice buffer and plays it.
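For orientation, the sketch below lays out the buffers described in this section as illustrative C++ structures; the field names are assumptions and the real NetEQ structures differ in detail.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative layout of the buffers described above (not the real structs).
struct PacketSlot {          // one of the 240 slots in the jitter buffer
  uint32_t timestamp;        // RTP timestamp of the packet
  uint16_t seq_number;       // RTP sequence number
  uint8_t payload_type;      // codec / data type
  size_t payload_offset;     // where the payload lives in the memory buffer
  size_t payload_size;
};

struct SpeechBuffer {        // decoded, processed samples waiting to be played
  std::array<int16_t, 565> samples;
  size_t cur_position = 0;   // start of the data still to be played
  size_t sample_left = 0;    // number of samples still to be played
};
```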

4.5. NetEQ command mechanism

NetEQ's processing flow is driven by commands; which command to use is decided from the status of the received packets, as the sketch after the following list illustrates:

1) Both the previous and the current frame were received normally: the packet enters the normal flow, and the decoded data is handled with Normal, Accelerate or Preemptive Expand depending on what jitter elimination requires.
2) Only the current frame is lost or late: PLC is entered to reconstruct the LPC and residual signals, i.e. the Expand operation. NetEQ waits at most 100 ms for a late or lost frame; beyond that it moves on and plays the next frame directly.
3) Several consecutive frames are lost or late: multiple PLC operations are needed. The further the concealment extends, the harder it is to reconstruct the lost data accurately, so the energy of the compensation frames is reduced frame by frame to avoid introducing larger distortion.
4) The previous frame was lost and the current frame is normal: the previously played frame was produced by PLC, so to keep the speech continuous between the concealed frame and the newly decoded frame, smoothing based on the correlation between the two is required. In this case Normal or Merge is selected.
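The decision logic above can be compressed into a small illustrative function; the enum and function names are assumptions, and the real MCU additionally weighs the buffer levels discussed in section 4.7.

```cpp
enum class FrameStatus { kReceived, kLostOrLate };
enum class Operation { kNormal, kAccelerate, kPreemptiveExpand, kExpand, kMerge };

// Map the state of the previous and current frame to a NetEQ-style operation
// (illustrative; the real MCU also considers buffer levels, see section 4.7).
Operation ChooseOperation(FrameStatus prev, FrameStatus curr,
                          bool need_speedup, bool need_slowdown) {
  if (curr == FrameStatus::kLostOrLate) {
    return Operation::kExpand;          // PLC: rebuild LPC + residual
  }
  if (prev == FrameStatus::kLostOrLate) {
    return Operation::kMerge;           // smooth PLC output into the new frame
  }
  if (need_speedup) return Operation::kAccelerate;
  if (need_slowdown) return Operation::kPreemptiveExpand;
  return Operation::kNormal;
}
```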

In addition, the decoder is reset when NetEQ receives its first packet or after the whole NetEQ instance has been reset. Also, when NetEQ receives a packet delayed by more than 3.75 s, the packet is not discarded as timed out; instead it is inserted into the jitter buffer and the buffer state is reset.

4.6. NetEQ playback mechanism

WebRTC's voice engine runs two threads: one receives packets and inserts them into the jitter buffer; the other reads 10 ms of data from the voice buffer every 10 ms and plays it.

NetEQ combines the amount of data in the voice buffer with the state of the jitter buffer to decide whether to read data from the jitter buffer:

1) When the control command is Normal, Expand, decoder reset or packet-buffer reset, and sampleLeft is at least 10 ms, no data is read from the jitter buffer and the DSP performs the Normal operation.
2) When the control command is Normal, decoder reset or packet-buffer reset, and sampleLeft is less than 10 ms, data is read from the jitter buffer and then the DSP performs the Normal operation.
3) When the control command is Expand and sampleLeft is less than 10 ms, no data is read from the jitter buffer and the DSP performs the Expand operation.
4) When the control command is Merge, data is read from the jitter buffer and then the DSP performs the Merge operation.
5) When the control command is Accelerate: if sampleLeft is at least 30 ms, no data is read from the jitter buffer and the DSP performs the Accelerate operation; if sampleLeft is between 10 ms and 30 ms, no data is read and the DSP performs the Normal operation; if sampleLeft is less than 10 ms, data is read from the jitter buffer and then the DSP performs the Accelerate operation.
6) When the control command is Preemptive Expand: if sampleLeft is at least 10 ms, no data is read from the jitter buffer and the DSP performs the Preemptive Expand operation; if sampleLeft is less than 10 ms, data is read from the jitter buffer and then the DSP performs the Preemptive Expand operation.

In these rules, to avoid adding extra call delay, no data is read from the jitter buffer while sampleLeft is at least 10 ms; data is read only when less than 10 ms remains, which keeps an appropriate amount of data in the voice buffer. When the command is Merge, data must always be read from the jitter buffer to preserve continuity between old and new data. When the command is Expand, the concealment itself generates data, so nothing needs to be read; and when sampleLeft is at least 10 ms the Expand operation is skipped entirely, because there is already enough data in the voice buffer to play, and only the Normal operation is performed.
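Putting the six rules and this reasoning together, the sketch below condenses the playout decision into one function. The names are illustrative (the Operation enum matches the sketch in section 4.5), and the millisecond thresholds are the ones quoted above.

```cpp
// Same operation names as the sketch in section 4.5.
enum class Operation { kNormal, kAccelerate, kPreemptiveExpand, kExpand, kMerge };

struct PlayoutDecision {
  Operation op;                  // what the DSP should do this 10 ms tick
  bool read_from_jitter_buffer;  // whether to pull a packet first
};

// Condensed version of the six playout rules above (illustrative sketch only).
PlayoutDecision DecidePlayout(Operation cmd, int sample_left_ms) {
  switch (cmd) {
    case Operation::kMerge:
      return {Operation::kMerge, true};              // always pull new data
    case Operation::kExpand:
      // With a full tick already buffered, plain playback is enough;
      // otherwise the concealment generates its own data, no read needed.
      if (sample_left_ms >= 10) return {Operation::kNormal, false};
      return {Operation::kExpand, false};
    case Operation::kAccelerate:
      if (sample_left_ms >= 30) return {Operation::kAccelerate, false};
      if (sample_left_ms >= 10) return {Operation::kNormal, false};
      return {Operation::kAccelerate, true};
    case Operation::kPreemptiveExpand:
      if (sample_left_ms >= 10) return {Operation::kPreemptiveExpand, false};
      return {Operation::kPreemptiveExpand, true};
    case Operation::kNormal:
    default:
      if (sample_left_ms >= 10) return {Operation::kNormal, false};
      return {Operation::kNormal, true};
  }
}
```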

4.7. MCU control mechanism

In NetEQ's jitter algorithm, BLo (optBufferLevel) estimates the network jitter with a forgetting-factor algorithm, and BLc (bufferLevelFilt) estimates the jitter buffer's fill level with an adaptive averaging algorithm. The MCU control mechanism selects the operation command based on the relationship between the timestamp of the data being played (playedOutTS, written TSplay), the timestamp of the next packet to be read (availableTS, written TSavail), BLo and BLc. The relationship between TSplay and TSavail is used to decide whether the incoming data is normal.
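A minimal sketch of the two statistics, with made-up coefficients (NetEQ derives its forgetting factor and thresholds adaptively): BLo tracks network jitter from packet inter-arrival times, BLc smooths the current buffer fill level, and comparing the two suggests Accelerate or Preemptive Expand.

```cpp
// Illustrative tracking of the two levels the MCU compares (section 4.7).
// Coefficients are invented for the sketch; NetEQ adapts them at run time.
struct McuLevels {
  double opt_buffer_level = 0.0;   // BLo: predicted network jitter, in packets
  double buffer_level_filt = 0.0;  // BLc: smoothed jitter-buffer fill level

  // Called per received packet with its inter-arrival time, in packet units.
  void UpdateNetworkJitter(double iat_packets) {
    const double forgetting = 0.97;  // forgetting factor
    opt_buffer_level =
        forgetting * opt_buffer_level + (1.0 - forgetting) * iat_packets;
  }

  // Called per 10 ms playout tick with the current buffer fill level.
  void UpdateBufferLevel(double packets_in_buffer) {
    const double alpha = 0.75;       // adaptive-average weight
    buffer_level_filt =
        alpha * buffer_level_filt + (1.0 - alpha) * packets_in_buffer;
  }

  // Too much buffered -> speed up; too little -> stretch.
  bool SuggestsAccelerate() const { return buffer_level_filt > opt_buffer_level; }
  bool SuggestsPreemptiveExpand() const { return buffer_level_filt < opt_buffer_level; }
};
```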

A comparison of BLo and BLc is shown below:

The relationship between BLo and BLc in the figure above determines the selection of the Accelerate, Preemptive Expand, Expand and Merge operations.

In short, jitter elimination in NetEQ is accomplished mainly by comparing the network delay with the jitter buffer delay and sending the corresponding command to the DSP to perform the appropriate operation.

4.8. DSP algorithm processing

The DSP module is NetEQ's speech signal processing module, and its operations are commanded by the MCU. WebRTC uses the autocorrelation method for pitch detection; since speech is a non-stationary signal, the short-term autocorrelation function is used. Pitch detection with the short-term autocorrelation function exploits the fact that the autocorrelation reaches a maximum at the signal's period: the pitch period is found by comparing the similarity between the original signal and shifted copies of it, and the similarity is greatest when the shift equals the pitch period. The classic method applies a window function: the window stays fixed while the speech signal is shifted, and the window length must be at least twice the pitch period. A longer window gives a more accurate pitch period but also costs more computation.
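A bare-bones sketch of pitch detection by the short-term autocorrelation method described above: search for the lag with the highest normalized correlation within the expected pitch range (roughly 2.5 ms to 15 ms, i.e. 20 to 120 samples at 8 kHz). The function and its crude normalization are simplified assumptions, not WebRTC's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Find the pitch period of a speech segment by short-term autocorrelation:
// the lag at which the signal best matches a shifted copy of itself.
size_t DetectPitchPeriod(const std::vector<int16_t>& x,
                         size_t min_lag = 20, size_t max_lag = 120) {
  size_t best_lag = min_lag;
  double best_corr = -1.0;
  for (size_t lag = min_lag; lag <= max_lag && lag < x.size(); ++lag) {
    double corr = 0.0, energy = 1e-9;
    for (size_t n = 0; n + lag < x.size(); ++n) {
      corr += static_cast<double>(x[n]) * x[n + lag];
      energy += static_cast<double>(x[n + lag]) * x[n + lag];
    }
    const double norm_corr = corr / energy;  // crude normalization
    if (norm_corr > best_corr) {
      best_corr = norm_corr;
      best_lag = lag;
    }
  }
  return best_lag;  // pitch period in samples
}
```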

The acceleration and deceleration operations in WebRTC adjust speech duration based on the WSOLA algorithm: speech is compressed or stretched on the time axis without changing its pitch while keeping good quality, commonly known as variable speed without changing pitch. Duration adjustment algorithms fall into two classes, time-domain and frequency-domain. Time-domain algorithms are represented by the waveform similarity overlap-add (WSOLA) algorithm; for speech signals they give high quality at modest computational cost. For audio with rapid spectral change, such as music, time-domain algorithms have difficulty achieving high quality, and the more computationally expensive frequency-domain algorithms, such as sub-band WSOLA, are usually used instead. Since GIPS designed NetEQ for VoIP, where the data is mainly speech with modest spectral variation, the time-domain WSOLA algorithm is used.
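The core move behind Accelerate (and, mirrored, Preemptive Expand) can be sketched as follows: two pitch-aligned segments are cross-faded into one, shortening the signal by one pitch period without changing the pitch. A real WSOLA implementation also searches for the best-matching segment pair; the function below assumes the positions are already known and that the frame is long enough.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Shorten a frame by one pitch period (the core of Accelerate): overlap-add
// two consecutive pitch-period segments starting at `start` into one, leaving
// the rest of the frame untouched. Requires start + 2*pitch_period <= size.
// Preemptive Expand does the mirror image, inserting a cross-faded period.
std::vector<int16_t> RemoveOnePitchPeriod(const std::vector<int16_t>& frame,
                                          size_t start, size_t pitch_period) {
  std::vector<int16_t> out(frame.begin(), frame.begin() + start);
  for (size_t i = 0; i < pitch_period; ++i) {
    const float w = static_cast<float>(i) / pitch_period;  // linear cross-fade
    const float mixed = (1.0f - w) * frame[start + i] +
                        w * frame[start + pitch_period + i];
    out.push_back(static_cast<int16_t>(mixed));
  }
  // Append whatever follows the two merged periods.
  out.insert(out.end(), frame.begin() + start + 2 * pitch_period, frame.end());
  return out;  // out.size() == frame.size() - pitch_period
}
```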

DSP processing is the key to jitter elimination, packet loss concealment, low latency and good playback quality. The DSP operations are introduced below:

1) Accelerate: reduces the number of samples in a voice packet. The removed amount is one pitch period, obtained from the correlation of the speech samples: the speech of two pitch periods is smoothed into one pitch period of data. Acceleration works on 30 ms frames and is performed only when the correlation is very strong (>0.9) or the signal energy is very low. The algorithm mirrors the deceleration processing. As shown below:


2) Preemptive Expand (slow down): increases the number of samples in a voice packet. The added amount is one pitch period, obtained from the correlation of the speech samples: the speech of two pitch periods is smoothed and the result is inserted between them. Deceleration works on 30 ms frames and requires at least 0.625 ms of data to play, otherwise the previous frame is repeated; when less than 30 ms of decoded data is cached, it is copied directly to the playback cache. The pitch period ranges from 2.5 ms to 15 ms, so at most 15 ms can be added per operation. As shown below:


3) Expand (frame insertion): iLBC's packet loss concealment algorithm is used, i.e. the linear prediction coefficients are reconstructed, the residual signal is reconstructed, and the energy of the compensation frames is reduced frame by frame. During Expand, the data waiting to be played should be less than one playback read interval (10 ms). As shown below:


4) Merge (fusion): after the previous frame was concealed by Expand, once new data is decoded, the new decoded data and the concealed data are smoothed together and the energy is gradually ramped up in order to improve the continuity of the speech. The merge operation produces a relatively large amount of intermediate data, so maximum space needs to be reserved. As shown below:

5) Normal: the newly decoded data is output and played normally, but if the previous frame was a concealed (Expand) frame, it is first smoothed with that concealed data.

4.9. Simulation test of DSP algorithm

For the DSP operations above, this article ran simulation tests on intermittent speech (Test1), continuous speech (Test2) and a music signal (Test3) at different packet loss rates, estimating MOS values with the ITU-T P.563 objective measurement standard. LostData fills losses with silence, Expand conceals losses with frame insertion, and Expand+Merge performs frame insertion followed by merging.

| Test sequence | Output frame duration | Packet loss rate | MOS: LostData | MOS: Expand | MOS: Expand+Merge |
|---|---|---|---|---|---|
| Test1 | 10 ms | 10.00% | 2.131 | 2.163 | 2.377 |
| | | 11.11% | 2.237 | 3.085 | 3.073 |
| | | 12.50% | 1.717 | 2.930 | 3.231 |
| | | 14.29% | 2.159 | 3.138 | 3.348 |
| | | 16.67% | 1.656 | 2.650 | 3.275 |
| | | 20.00% | 2.364 | 2.161 | 2.333 |
| | | 25.00% | 1.838 | 3.001 | 3.033 |
The intermittent speech signal is not very sensitive to the packet loss rate, because when the lost packets do not contain much speech the MOS value holds up. The clicks produced at loss positions are clearly improved after concealment, and the subjective experience is good.

| Test sequence | Output frame duration | Packet loss rate | MOS: LostData | MOS: Expand | MOS: Expand+Merge |
|---|---|---|---|---|---|
| Test2 | 10 ms | 10.00% | 2.656 | 2.872 | 3.598 |
| | | 11.11% | 2.568 | 2.997 | 2.926 |
| | | 12.50% | 2.634 | 3.162 | 3.038 |
| | | 14.29% | 2.530 | 3.169 | 3.007 |
| | | 16.67% | 2.290 | 2.903 | 2.980 |
| | | 20.00% | 2.522 | 3.206 | 3.108 |
| | | 25.00% | 1.952 | 2.943 | 2.952 |
The continuous speech signal shows a gradual downward trend as the packet loss rate increases. The effect of concealment is similar to Test1: the clicks are clearly improved and the subjective experience is good.

| Test sequence | Output frame duration | Packet loss rate | MOS: LostData | MOS: Expand | MOS: Expand+Merge |
|---|---|---|---|---|---|
| Test3 | 10 ms | 10.00% | 2.428 | 2.658 | 2.855 |
| | | 11.11% | 2.577 | 2.708 | 2.663 |
| | | 12.50% | 3.420 | 2.478 | 2.739 |
| | | 14.29% | 3.552 | 2.444 | 2.863 |
| | | 16.67% | 3.251 | 2.421 | 2.792 |
| | | 20.00% | 1.000 | 2.208 | 2.527 |
| | | 25.00% | 1.099 | 1.993 | 2.474 |

The music signal is Joe Hisaishi's "Castle in the Sky", whose spectrum varies greatly. When the packet loss rate is small the concealment works well; the larger the loss rate, the worse the subjective result of the concealment.

The acceleration, deceleration and normal operations are not compared one by one here; when there is no packet loss, acceleration and deceleration are very effective at eliminating jitter while preserving the integrity of the speech signal.


In the dynamic simulation of the figure above, the initial value is 37 samples, i.e. 37 samples are waiting to be played at the start. The MOS value obtained by Merge after Expand decreases, for two possible reasons: first, fewer samples were waiting to be played before Expand, which lowers the quality of the inserted frames; second, the more data Expand has generated before the Merge, the more it drags down the MOS value after the Merge. Ideally there should be a one-to-one correspondence between Expand and Merge operations.


In the fixed simulation of the figure above, the value is fixed at 500 samples, i.e. the number of samples waiting to be played stays at 500. In this case the MOS value obtained by Expand improves considerably, and performing a Merge after each Expand further helps raise the MOS value.


In the dynamic simulation of the figure above, the initial value is 0, i.e. no samples are waiting to be played at the start. The number of samples waiting before Expand is then small, so the MOS of the Expand frames drops. Although Expand and Merge no longer correspond one to one, there is less data waiting before each Merge, so less data needs to be smoothed, which helps raise the MOS value.

Several important parameters of the P.563 standard are the signal-to-noise ratio (SNR), silence intervals and the average pitch period; phase changes are not taken into account, yet phase has a strong subjective impact on hearing: a waveform with a phase discontinuity after frame insertion can produce clicks that degrade the perceived quality. Therefore, although the P.563 MOS estimates for frame insertion and fusion are sometimes nearly identical, the subjective experience of fusion is clearly better than that of frame insertion.

5. NetEQ source file description

mcu_dsp_common.h: NetEQ instance.

neteq_error_codes.h: NetEQ error codes.

neteq_statistics.h: NetEQ PLC statistics.

dsp.h: Header file for PLC operation on dsp.

accelerate.c: Accelerate operations to reduce latency. Processing is performed when the signal correlation is strong and the signal energy is weak.

expand.c: Packet loss concealment (Expand): generates replacement audio and background noise.

preemptive_expand.c: Slow down to increase delay.

normal.c: normal playback.

merge.c: Smooth the new frame of data and the expanded previous frame of data.

bgn_update.c: Background noise estimation and update.

cng_internal.c: Generate comfortable background noise.

dsp.c: Initialization function and constant table definition of dsp operation.

recin.c: Add RTP packets.

recout.c: Decode and output PCM.

dsp_helpfunctions.h

dsp_helpfunctions.c: DSP helper functions, e.g. handling sampling rates that are multiples of 8 kHz and downsampling to 4 kHz.

min_distortion.c: Minimum distortion calculation.

correlator.c: Calculate the correlation of signals.

peak_detection.c: Correlation peak detection and positioning.

mix_voice_unvoice.c: Mix voiced and unvoiced components.

mute_signal.c: Fade a signal out (attenuate).

unmute_signal.c: Fade a signal in (crescendo).

random_vector.c: Random vector.

codec_db_defines.h

codec_db.c: manages the database of the algorithm library.

dtmf_buffer.h

dtmf_buffer.c: DTMF messages and decoding.

dtmf_tonegen.h

dtmf_tonegen.c: Generate DTMF signal.

mcu.h: MCU side operation.

mcu_address_init.c: MCU address initialization.

mcu_dsp_common.c: Communication between MCU and DSP.

mcu_reset.c: Reset the data on the MCU side.

set_fs.c: DTMF sampling rate.

signal_mcu.c: Notifies the MCU that data is available and requests a PLC command.

split_and_insert.c: Split RTP header and add to packet buffer.

rtcp.h

rtcp.c: RTCP statistics.

rtp.h

rtp.c: RTP function.

automode.h

automode.c: Adaptive (dynamic) buffer-level strategy.

packet_buffer.h

packet_buffer.c: Packet cache management.

buffer_stats.h

bufstats_decision.c: Gives PLC commands based on buffer jitter.

webrtc_neteq.h

webrtc_neteq_help_macros.h: NetEQ macro definition

webrtc_neteq.c: NetEQ API.

webrtc_neteq_internal.h: NetEQ internal function

webrtc_neteq_unittest.cc: NetEQ unit test.

6. Reference documents

"GIPS NetEQ", Global IP Solutions

"Research on NetEQ Technology in WebRTC Speech Engine", Wu Jiangrui

"Research on Packet Caching Technology in WebRTC Speech Engine", Xiao Hongliang

"Research and Development of VoIP Packet Loss Processing Technology", Li Ruwei, Bao Changchun

"ITU-T P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications", ITU (International Telecommunication Union)

http://en.wikipedia.org/wiki/Opus_(codec)
