Detailed explanation of the speech noise reduction module ANS in webRTC (2)

The last article ( Detailed explanation of the speech noise reduction module ANS in webRTC (1) ) covered the basic principle of Wiener filtering. This article first gives the overall processing flow of ANS in webRTC, and then discusses some details of two of its steps: the transform from the time domain to the frequency domain, and the transform from the frequency domain back to the time domain.

The basic processing flow of ANS is shown in Figure 1:

                                            Figure 1

As can be seen from Figure 1, the processing is divided into six main steps, as follows:

1) Transform the input noisy signal from the time domain to the frequency domain, mainly including framing, windowing and short-time Fourier transform (STFT), etc.

2) Do the initial noise estimation, and calculate the prior SNR and posterior SNR based on the estimated noise

3) Compute classification features, which include likelihood ratio test (LRT), spectral flatness, and spectral difference. The speech/noise probability is determined according to these features, so as to determine whether the current signal is speech or noise.

4) Update the noise estimate based on the calculated speech/noise probability

5) Denoising based on Wiener filtering

6) Convert the denoised signal from the frequency domain back to the time domain, mainly including inverse short-time Fourier transform (ISTFT), windowing, and overlap-add.

The version I use for understanding and debugging is the earlier C version, which comes in floating-point and fixed-point implementations. For understanding the algorithm it is best to read the floating-point implementation, because it maps directly onto the mathematical expressions in the algorithm derivation. The fixed-point implementation is full of engineering tricks such as scaling, which are hard to relate directly to the math. If there are constraints such as CPU load at deployment time, the fixed-point implementation is the better choice, since its load is usually much lower than that of the floating-point one.

ANS supports three sampling rates: 8, 16 and 32 kHz. For speech the most common is 16 kHz, so this article and the subsequent ones all assume a sampling rate of 16 kHz. Speech is processed frame by frame, and one frame in ANS is 10 ms, so one frame works out to 16000 x 0.01 = 160 sampling points.

Speech signal processing is usually carried out in the frequency domain, so the time-domain signal must first be converted into a frequency-domain signal, and after processing the frequency-domain signal must be converted back into a time-domain signal. The time-to-frequency conversion happens at the very beginning of the ANS pipeline and the frequency-to-time conversion at the very end; the two are largely symmetrical and have nothing to do with the noise reduction algorithm itself, so I cover them together here. Let's talk about some details of this time-frequency conversion.
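To keep the sizes straight, here is a small snippet collecting the constants used throughout this series. The two #define lines are the actual constants quoted from the code later in this article; the kXxx names are my own labels, added only for illustration:

    /* Key sizes in ANS for 16 kHz input. */
    #define ANAL_BLOCKL_MAX  256   /* STFT block length (from the code)     */
    #define HALF_ANAL_BLOCKL 129   /* 256/2 + 1 unique bins (from the code) */

    enum {
      kSampleRateHz = 16000,
      kFrameMs      = 10,
      kFrameLen     = kSampleRateHz / 1000 * kFrameMs,  /* = 160 samples */
      kOverlapLen   = ANAL_BLOCKL_MAX - kFrameLen       /* = 96 samples  */
    };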

First, look at taking the signal from the time domain to the frequency domain. The main steps are framing, windowing and the short-time Fourier transform (STFT). As for framing, as mentioned above one frame is 10 ms, i.e. 160 sampling points. The purpose of windowing is to avoid spectral leakage. There are many kinds of window functions, such as the rectangular window, triangular window, Hanning window and Hamming window; the Hanning and Hamming windows are the ones commonly used in speech processing. ANS uses a hybrid window in which a Hanning window and a rectangular window are combined. The number of points fed to the STFT must be a power of 2. A frame has 160 points, and the smallest power of 2 greater than 160 is 256, so the STFT processes 256 points at a time (this is the origin of the 256 in the code: #define ANAL_BLOCKL_MAX 256). A frame has only 160 points, so 96 more points are needed to make up 256. One way is to pad zeros after the 160 points. ANS takes a better approach: it prepends the last 96 points of the previous frame to form the 256 points. The processing flow from the time-domain signal to the frequency-domain signal is then as shown in Figure 2; a sketch of the buffering follows the figure.

                                                     Figure 2
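To make the buffering concrete, here is a minimal sketch of the block assembly. The function and variable names are my own choices for illustration; the real code keeps this history inside its instance struct rather than passing it in:

    #include <string.h>

    #define FRAME_LEN   160                        /* 10 ms at 16 kHz    */
    #define ANAL_BLOCKL 256                        /* STFT block length  */
    #define OVERLAP_LEN (ANAL_BLOCKL - FRAME_LEN)  /* 96 overlap samples */

    /* Assemble a 256-sample analysis block: the last 96 samples of the
     * previous frame followed by the 160 new samples.  history must
     * persist across calls and start zeroed. */
    static void build_analysis_block(float history[OVERLAP_LEN],
                                     const float frame[FRAME_LEN],
                                     float block[ANAL_BLOCKL]) {
      memcpy(block, history, OVERLAP_LEN * sizeof(float));
      memcpy(block + OVERLAP_LEN, frame, FRAME_LEN * sizeof(float));
      /* Remember the tail of this block for the next frame. */
      memcpy(history, block + ANAL_BLOCKL - OVERLAP_LEN,
             OVERLAP_LEN * sizeof(float));
    }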

Because the STFT is performed on 256 points, the number of windowed points is also 256. ANS uses a hybrid of the Hanning and rectangular windows. The Hanning window function is w(n) = 0.5 * (1 - cos(2*pi*n / (N-1))), its range is [0, 1], and its waveform is shown in Figure 3.

                                      Figure 3

The hybrid window is formed by splitting a 192-point (96 x 2) Hanning window at its peak and inserting a 64-point rectangular section of amplitude 1 in the middle, giving a 256-point (256 = 192 + 64) hybrid window. The waveform is shown in Figure 4, and a code sketch for building such a window follows the figure.

                                           Figure 4
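Here is a sketch that builds such a 256-point hybrid window. Note that I use the sine form (the square root of the Hanning window) for the two ramps, because that is what makes the sin/cos overlap argument in the synthesis section below work out; as far as I can tell the precomputed table in the code (kBlocks160w256 in ns_core.c) follows this sine form as well:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* 256-point hybrid window: 96-point rising ramp, 64-point flat
     * rectangular middle of amplitude 1, 96-point falling ramp. */
    static void build_hybrid_window(float w[256]) {
      int i;
      for (i = 0; i < 96; i++)         /* sine-like head   */
        w[i] = sinf((float)(M_PI * i / 192.0));
      for (i = 96; i < 160; i++)       /* rectangular top  */
        w[i] = 1.0f;
      for (i = 160; i < 256; i++)      /* cosine-like tail */
        w[i] = sinf((float)(M_PI * (256 - i) / 192.0));
    }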

As for why this is done, I will explain it when discussing the conversion from the frequency domain back to the time domain below. The 256 point values are multiplied by the corresponding window function values to get the input to the STFT. The STFT then produces the values of 256 frequency points. Except for the 0th point and the N/2-th point (N = 256, so the 128th point), which are purely real, the values are complex, and they are conjugate-symmetric about the N/2-th point. Because of this conjugate symmetry, knowing one point also gives its symmetric counterpart, so the STFT effectively yields (N/2 + 1) distinct values; with N = 256 that is 129 points. This is the origin of the 129 in the code (#define HALF_ANAL_BLOCKL 129). After obtaining the values of the 129 frequency bins, the amplitude spectrum and the energy of each bin are computed, which the subsequent noise reduction algorithm uses. A sketch of this computation follows.
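Since the original code screenshot is not reproduced here, the following sketch shows the kind of computation involved: per-bin magnitude from the real and imaginary parts, plus the accumulated energy. The names are mine, not the originals:

    #include <math.h>

    #define HALF_ANAL_BLOCKL 129   /* 256/2 + 1 unique STFT bins */

    /* Magnitude spectrum and total energy of one analysis block. */
    static float compute_magnitude_spectrum(const float re[HALF_ANAL_BLOCKL],
                                            const float im[HALF_ANAL_BLOCKL],
                                            float magn[HALF_ANAL_BLOCKL]) {
      float energy = 0.0f;
      int i;
      for (i = 0; i < HALF_ANAL_BLOCKL; i++) {
        float power = re[i] * re[i] + im[i] * im[i];
        magn[i] = sqrtf(power);   /* amplitude spectrum          */
        energy += power;          /* energy is amplitude squared */
      }
      return energy;
    }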

After the noise reduction processing in the frequency domain, the signal must be converted from the frequency domain back to the time domain, i.e. the signal is reconstructed or synthesized. The main steps are the inverse short-time Fourier transform (ISTFT), windowing and overlap-add (OLA); the processing flow is shown in Figure 5.

                                      Figure 5

First do the ISTFT (inverse short-time Fourier transform) to get 256 real values. These 256 points include 96 points that overlap the end of the previous frame. How should frames be spliced so that the sound stays coherent? As mentioned above, the window used in the time-to-frequency transform is the Hanning/rectangular hybrid window. The first half of its Hanning part (the 96 points at the head) behaves like a sine ramp, and the second half (the 96 points at the tail) behaves like a cosine ramp. The overlapping region is windowed by the cosine-like tail in the previous frame and by the sine-like head in the current frame. When a signal is reconstructed by overlap-add, the energy (the square of the amplitude) is generally required to be preserved. An overlapping point of amplitude m is weighted by a cosine-like value in the previous frame, becoming m*cosθ after windowing, and by a sine-like value in the current frame, becoming m*sinθ after windowing; the energy sum m^2*cos^2θ + m^2*sin^2θ = m^2 is exactly the energy of the original signal. This shows that simply adding the overlapping parts preserves the coherence of the speech signal, and it explains why the code applies a windowing operation to the ISTFT output and then adds the overlapping parts. The specific code is shown in Figure 6 below, and a numerical check of this identity follows the figure.

                                                   Figure 6
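The sin/cos argument above can be checked numerically. Reusing build_hybrid_window() from the earlier sketch, each overlapped sample position gets combined weight w[i]^2 + w[160+i]^2, which should equal 1:

    #include <math.h>
    #include <stdio.h>

    /* Check the overlap-add identity for the sketch window: a sample
     * weighted by the cosine-like tail of frame k and by the sine-like
     * head of frame k+1 has total weight sin^2 + cos^2 = 1. */
    int main(void) {
      float w[256];
      float max_err = 0.0f;
      int i;
      build_hybrid_window(w);   /* from the sketch after Figure 4 */
      for (i = 0; i < 96; i++) {
        float sum = w[i] * w[i] + w[160 + i] * w[160 + i];
        float err = fabsf(sum - 1.0f);
        if (err > max_err) max_err = err;
      }
      printf("max deviation from 1: %g\n", max_err);  /* ~1e-7 */
      return 0;
    }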

As for the rectangular part of the window, its amplitude is 1, i.e. the signal amplitude is unchanged after windowing, so those points need no special treatment and are written out directly. Note that Figure 6 also contains an energy scaling factor (the variable factor). It defaults to 1 for the first 200 frames; for subsequent frames it is obtained according to the logic shown in the code. A hedged sketch of this whole synthesis step follows.
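Since Figure 6 is a screenshot, here is my own reconstruction of what that synthesis step does: window the ISTFT output a second time, apply the scaling factor, overlap-add the head with the saved tail of the previous frame, output 160 samples, and save the new tail. The names are illustrative, and the factor computation itself is omitted and taken as given:

    #define FRAME_LEN   160
    #define ANAL_BLOCKL 256
    #define OVERLAP_LEN 96

    /* Synthesis windowing + overlap-add.  synth_tail persists across
     * calls and starts zeroed; w is the 256-point hybrid window. */
    static void synthesis_overlap_add(const float w[ANAL_BLOCKL],
                                      const float istft_out[ANAL_BLOCKL],
                                      float synth_tail[OVERLAP_LEN],
                                      float factor,
                                      float out[FRAME_LEN]) {
      float block[ANAL_BLOCKL];
      int i;
      for (i = 0; i < ANAL_BLOCKL; i++)        /* second windowing + scaling */
        block[i] = factor * w[i] * istft_out[i];
      for (i = 0; i < OVERLAP_LEN; i++)        /* head + previous tail       */
        out[i] = synth_tail[i] + block[i];
      for (i = OVERLAP_LEN; i < FRAME_LEN; i++)  /* rectangular middle       */
        out[i] = block[i];
      for (i = 0; i < OVERLAP_LEN; i++)        /* save tail for next frame   */
        synth_tail[i] = block[FRAME_LEN + i];
    }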

Figure 7 shows a schematic diagram of the data splicing after the ISTFT. After the ISTFT there are 256 points of data: the 96 points at the head of the current frame are added to the 96 points saved from the tail of the previous frame, the 64 points in the middle are taken over unchanged, and the 96 points at the tail are saved to be added to the head of the next frame. In this way the voice data is spliced together coherently.

                                                  Figure 7

The next article will talk about the initial estimation of the noise and the calculation of the prior SNR and the posterior SNR based on the estimated noise.


Original post: blog.csdn.net/david_tym/article/details/120817304