Detailed Explanation of the Voice Noise Reduction Module ANS in WebRTC (3)

The previous article (Detailed Explanation of the Voice Noise Reduction Module ANS in WebRTC (2)) covered the processing flow of ANS and the conversion of speech between the time domain and the frequency domain. This article starts on the core of speech noise reduction: the initial noise estimate, and the calculation of the prior SNR and posterior SNR from that estimate.

1. Initial noise estimation

The initial noise estimation of ANS in WebRTC uses quantile based noise estimation (QBNE); the corresponding paper is "Quantile Based Noise Estimation For Spectral Subtraction And Wiener Filtering". The idea is that even within a speech segment, the input signal may carry no speech energy in some frequency bands. So for each frequency band, gather statistics on the energy over all frames and choose a quantile: values below the quantile value are treated as noise, and values above it are treated as speech. In outline, the energies of each frequency band are collected over many frames, sorted, and the value at the chosen quantile is taken as the noise estimate for that band.
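The quantile idea can be illustrated with a minimal batch sketch. It is not how WebRTC does it (WebRTC tracks the quantile recursively, as described in the rest of this section); the function name, the quantile value 0.25 and the sample data are all assumptions made for illustration.

```python
def quantile_noise_estimate(frames, q=0.25):
    """Per-frequency-point q-quantile of the magnitudes over all frames.

    frames: list of frames, each a list of magnitudes (one per frequency point).
    q: illustrative quantile setting (an assumption, not taken from the article).
    """
    num_bins = len(frames[0])
    noise = []
    for k in range(num_bins):
        # Sort the history of this frequency point and pick the q-quantile.
        history = sorted(frame[k] for frame in frames)
        index = int(q * (len(history) - 1))
        noise.append(history[index])
    return noise


# Two frequency points over five frames: point 0 is mostly quiet with
# occasional speech energy, point 1 carries a steady noise floor.
frames = [[0.1, 1.0], [5.0, 1.1], [0.2, 0.9], [4.0, 1.0], [0.1, 1.2]]
noise_floor = quantile_noise_estimate(frames)
```

The loud frames at point 0 do not pull the estimate up, because the quantile is taken from the low end of the sorted history.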

The initial noise estimate in WebRTC ANS is divided into three stages: the first 50 frames, frames 51 to 200, and everything after frame 200. After the first 50 frames, only quantile noise estimation is used; during the first 50 frames, it is combined with a noise model to make the estimate more accurate. First, look at the quantile noise estimation that runs in every stage. The process is as follows:

1) Calculate the natural logarithm of the amplitude spectrum at each frequency point, that is, the log spectrum inst->lmagn (referred to below as lmagn).

2) Update the natural logarithm of the quantile (inst->lquantile, referred to below as lquantile) and the probability density (inst->density, referred to below as density). There are three groups of lquantile and density values, and each frame has 129 frequency points, so the lquantile and density arrays each hold 387 (129*3) values. The memory layout is shown in Figure 1:

Figure 1. Memory layout of lquantile and density

The update of the three groups of lquantile and density is controlled by inst->counter (referred to below as counter). The counter array holds three integers, one per group. Its initial values are obtained by dividing 200 (the first 200 frames) into three parts, giving [66, 133, 200]. Each counter value increases by 1 for every processed frame and wraps to 0 once it has reached 200. So when the second frame is processed the counter is [67, 134, 0], when the third frame is processed it is [68, 135, 1], and so on. Once the initial 200 frames have been processed, every counter has traversed the whole range 0~200.
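The counter schedule can be traced with a small sketch. It follows the description above literally; the actual source may order the reset and the increment differently within a frame, and the constant and function names here are chosen for illustration.

```python
END_STARTUP_LONG = 200  # length of the long startup phase, in frames
SIMULT = 3              # number of simultaneous quantile estimates (groups)


def initial_counters():
    # 200 divided into three parts: floor(200 * (i + 1) / 3) -> [66, 133, 200]
    return [END_STARTUP_LONG * (i + 1) // SIMULT for i in range(SIMULT)]


def advance(counters):
    """Counter state for the next frame: wrap to 0 after 200, else add 1."""
    return [0 if c == END_STARTUP_LONG else c + 1 for c in counters]


state = initial_counters()
trace = [state]
for _ in range(2):
    state = advance(state)
    trace.append(state)
# trace now holds the counter state for frames 1, 2 and 3.
```

Because the three counters start a third of the range apart, they reach 200 at different frames, which staggers the moments at which each group is used.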

Let's see how the counter controls lquantile and density. For the j-th frequency point of the i-th group, first define the variable:

Update the quantile: when the log spectrum of the frequency point satisfies lmagn[j] > lquantile[i*129 + j], lquantile is too small and needs to be increased; otherwise it needs to be decreased. The update is given by Equation 1:

(1)           

Update the probability density: when |lmagn[j] - lquantile[i*129+j]| < WIDTH (value 0.01), the current noise estimate is fairly accurate, so the probability density is updated. The update is given by Equation 2:

3) While fewer than 200 frames have been processed, take the natural exponential of the lquantile of the last group (the group with index 2) and use it as the noise estimate (noise[j], one value per frequency point). The estimated noise therefore changes every frame. Once 200 or more frames have been processed, a group's lquantile is exponentiated and used as the noise estimate only when that group's value in the counter array equals 200. From then on, the noise estimate is updated every 66 or 67 frames.
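Steps 1) to 3) for a single group can be sketched as below. Only WIDTH = 0.01 and the direction of each update come from the text. The step-size variable delta, the constants QUANTILE = 0.25 and FAC_UPDATE = 0.4, and the exact update expressions are my recollection of WebRTC's ns_core.c and should be treated as assumptions to check against the source.

```python
import math

QUANTILE = 0.25   # assumed target quantile (not stated in the article)
WIDTH = 0.01      # width of the density window, from the text
FAC_UPDATE = 0.4  # assumed step-size factor (not stated in the article)


def update_group(lmagn, lquantile, density, counter):
    """Update one group's log-quantile and density in place for one frame."""
    for j in range(len(lmagn)):
        # Assumed step size: smaller when the density is already high.
        delta = FAC_UPDATE / density[j] if density[j] > 1.0 else FAC_UPDATE
        step = delta / (counter + 1)
        if lmagn[j] > lquantile[j]:
            lquantile[j] += QUANTILE * step          # estimate too small
        else:
            lquantile[j] -= (1.0 - QUANTILE) * step  # estimate too large
        # Count the frame toward the density only if it lies near the quantile.
        if abs(lmagn[j] - lquantile[j]) < WIDTH:
            density[j] = (counter * density[j] + 1.0 / (2.0 * WIDTH)) / (counter + 1)


def noise_from_lquantile(lquantile):
    """Step 3: the noise estimate is the natural exponential of lquantile."""
    return [math.exp(v) for v in lquantile]


# Illustrative values for two frequency points.
lmagn = [1.0, -1.0]
lquantile = [0.0, 0.0]
density = [0.3, 0.3]
update_group(lmagn, lquantile, density, counter=0)
noise = noise_from_lquantile(lquantile)
```

With an asymmetric step (QUANTILE up, 1 - QUANTILE down) the tracked value settles where roughly a quarter of the frames lie below it, which is what makes it a quantile tracker instead of a mean tracker.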

Now look at the first 50 frames, where quantile noise estimation is combined with a noise model to estimate the initial noise. First define the following four variables:

Note that the first 5 frequency points are not used in the definitions of these four variables, because i starts from 5. The variables defined above are then used to express the parameters of white noise and pink noise, as follows:

Where overdrive is a value obtained according to the set noise reduction degree (set in initialization).

 

Where blockInd represents the index of the current frame.

In this way, the parameters of white noise and pink noise can be used to estimate the model noise, as follows:

Here usedBin = 5 when the frequency point index is less than 5; otherwise usedBin equals the frequency point index.

Finally, the estimated noise is obtained from the quantile-estimated noise (noise) and the model-estimated noise (parametric_noise). For each frequency point j, the expression is given by formula 3:

                                  (3)
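The model noise and the blending step can be sketched as follows. The clamping of usedBin at 5 and the 50-frame window come from the text. The form of the pink noise model (a numerator divided by usedBin raised to an exponent) and the blending weights are my recollection of WebRTC's ns_core.c, not something this article states, so treat them as assumptions. All function and parameter names are illustrative.

```python
END_STARTUP_SHORT = 50  # the first stage covers the first 50 frames
START_BAND = 5          # frequency points below 5 reuse point 5


def pink_model_noise(parametric_num, parametric_exp, num_bins):
    """Assumed pink noise model: parametric_num / usedBin ** parametric_exp."""
    model = []
    for i in range(num_bins):
        used_bin = max(i, START_BAND)
        model.append(parametric_num / used_bin ** parametric_exp)
    return model


def blend(quantile_noise, model_noise, block_ind):
    """Assumed weighting of quantile noise and model noise for frame block_ind."""
    blended = []
    for q, p in zip(quantile_noise, model_noise):
        weighted = q * block_ind + p * (END_STARTUP_SHORT - block_ind) / (block_ind + 1)
        blended.append(weighted / END_STARTUP_SHORT)
    return blended


model = pink_model_noise(parametric_num=10.0, parametric_exp=1.0, num_bins=8)
first_frame = blend([1.0], [3.0], block_ind=0)   # relies on the model only
last_frame = blend([1.0], [3.0], block_ind=49)   # dominated by the quantile estimate
```

Under this weighting the model noise carries the estimate in the very first frames, when the quantile tracker has seen too little data, and its influence fades as blockInd approaches 50.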

This completes the noise estimate for the first 50 frames, which combines quantile noise estimation with model noise estimation. The initial noise can now be estimated for any frame. Next, the prior SNR and the posterior SNR are calculated from the estimated initial noise.

2. Calculate the prior SNR and posterior SNR

As explained in Detailed Explanation of the Voice Noise Reduction Module ANS in WebRTC (1), the posterior SNR σ is the power ratio of noisy speech Y to noise N, and the prior SNR ρ is the power ratio of clean speech S to noise N. The expressions are given by formulas 4 and 5:

$$\sigma(k,m) = \frac{|Y(k,m)|^2}{|N(k,m)|^2} \tag{4}$$

$$\rho(k,m) = \frac{|S(k,m)|^2}{|N(k,m)|^2} \tag{5}$$

Here m is the frame index and k is the frequency point index, so every frequency point has its own prior SNR and posterior SNR. Since the noise N has been estimated by the quantile estimation method and the noisy speech Y is known, the posterior SNR can be calculated.

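Because speech and noise are assumed to be uncorrelated,

$$|Y(k,m)|^2 = |S(k,m)|^2 + |N(k,m)|^2$$

and thereby

$$\sigma(k,m) = \frac{|S(k,m)|^2 + |N(k,m)|^2}{|N(k,m)|^2} = \rho(k,m) + 1$$

So we get formula 6:

$$\rho(k,m) = \sigma(k,m) - 1 \tag{6}$$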

That is, the prior SNR is equal to the posterior SNR – 1.
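A small numeric check of these definitions, with made-up power values and hypothetical function names: if the clean speech power is 3 and the noise power is 1 at some frequency point, the noisy power is 4 under the uncorrelated assumption.

```python
def posterior_snr(noisy_power, noise_power):
    """sigma: power of noisy speech Y over power of noise N."""
    return noisy_power / noise_power


def prior_snr(clean_power, noise_power):
    """rho: power of clean speech S over power of noise N."""
    return clean_power / noise_power


clean_power = 3.0   # |S|^2, made-up value
noise_power = 1.0   # |N|^2, made-up value
noisy_power = clean_power + noise_power  # |Y|^2 when S and N are uncorrelated

sigma = posterior_snr(noisy_power, noise_power)
rho = prior_snr(clean_power, noise_power)
```

In practice the clean speech power is unknown, which is why ρ has to be estimated from σ and from earlier frames.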

The prior SNR is calculated with the decision-directed method (DD for short). Combining formula 5 and formula 6 gives formula 7:

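$$\rho(k,m) = \frac{1}{2}\,\frac{|S(k,m)|^2}{|N(k,m)|^2} + \frac{1}{2}\left(\sigma(k,m) - 1\right) \tag{7}$$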

The prior SNR estimate is obtained by turning the above formula into a recursion, as shown in formula 8:

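$$\rho(k,m) = \alpha\,\frac{|S(k,m-1)|^2}{|N(k,m-1)|^2} + (1-\alpha)\max\left(\sigma(k,m) - 1,\ 0\right) \tag{8}$$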

Here α is a weight (or smoothing coefficient) that replaces the 1/2 in the previous formula. The prior SNR of the current frame is thus estimated from the prior SNR of the previous frame and the posterior SNR of the current frame. max() ensures that the estimate is non-negative. The smoothing coefficient satisfies 0 < α < 1; a typical value is 0.98, which is the value used in WebRTC ANS.

In the actual software implementation, to reduce the computational load, WebRTC does not follow the defining formulas strictly but uses the ratio of amplitude spectra instead, that is, the rightmost expression in Equation 9 and Equation 10.

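$$\sigma(k,m) = \frac{|Y(k,m)|^2}{|N(k,m)|^2} \approx \frac{|Y(k,m)|}{|N(k,m)|} \tag{9}$$

$$\rho(k,m) = \frac{|S(k,m)|^2}{|N(k,m)|^2} \approx \frac{|S(k,m)|}{|N(k,m)|} \tag{10}$$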

When the prior SNR of the current frame is calculated, the noisy speech Y(k, m-1) of the previous frame is known, and so is the Wiener filter coefficient H(k, m-1) of the previous frame (the value in the inst->smooth array). By the principle of Wiener filtering, the estimated clean speech of the previous frame, S(k, m-1) = H(k, m-1)Y(k, m-1), is therefore also known, so the prior SNR of the previous frame is calculated as Equation 11:

$$\rho(k,m-1) = \frac{|S(k,m-1)|}{|N(k,m-1)|} = \frac{H(k,m-1)\,|Y(k,m-1)|}{|N(k,m-1)|} \tag{11}$$

Putting it into Equation 8, the a priori SNR calculation expression of the current frame can be obtained as Equation 12:

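$$\rho(k,m) = \alpha\,\frac{H(k,m-1)\,|Y(k,m-1)|}{|N(k,m-1)|} + (1-\alpha)\max\left(\frac{|Y(k,m)|}{|N(k,m)|} - 1,\ 0\right) \tag{12}$$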
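A minimal per-frame sketch of this calculation, using amplitude ratios and the previous frame's filter coefficients. The function and parameter names are hypothetical, and the small eps that guards against division by zero is an assumption added for the sketch; only α = 0.98 and the structure of the update come from the text.

```python
ALPHA = 0.98  # smoothing coefficient of the decision-directed update


def snr_estimates(magn, noise, magn_prev, noise_prev, smooth_prev, eps=1e-4):
    """Return (posterior, prior) SNR lists, one value per frequency point.

    magn, noise: amplitude spectrum and noise estimate of the current frame.
    magn_prev, noise_prev: the same quantities for the previous frame.
    smooth_prev: Wiener filter coefficients H of the previous frame.
    eps: hypothetical guard against division by zero.
    """
    posterior, prior = [], []
    for k in range(len(magn)):
        # Posterior SNR as an amplitude ratio.
        sigma = magn[k] / (noise[k] + eps)
        # Prior SNR of the previous frame: H * |Y| / |N|.
        rho_prev = smooth_prev[k] * magn_prev[k] / (noise_prev[k] + eps)
        # Decision-directed update, floored at zero by max().
        rho = ALPHA * rho_prev + (1.0 - ALPHA) * max(sigma - 1.0, 0.0)
        posterior.append(sigma)
        prior.append(rho)
    return posterior, prior


post, prior = snr_estimates(
    magn=[4.0], noise=[2.0],
    magn_prev=[3.0], noise_prev=[2.0],
    smooth_prev=[0.5], eps=0.0,
)
```

With α = 0.98 the estimate leans heavily on the previous frame, which smooths the prior SNR over time at the cost of reacting more slowly to speech onsets.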

In this way, both the prior SNR and the posterior SNR of the current frame are calculated. They are used in the subsequent calculation of the speech and noise probabilities. The next article will cover how the probabilities of speech and noise are calculated from the noisy speech and its features, how the noise estimate is updated, and how noise is reduced with Wiener filtering.


Origin blog.csdn.net/david_tym/article/details/121040570