WebRTC’s noise suppression (NS) algorithm

The core algorithm of WebRTC noise suppression is in the ns_core.c file.

The noise spectrum can be estimated using, for example, a speech/noise likelihood function that classifies each frame and each frequency component of the received signal as noise or speech.

Algorithm principle

The core idea of this algorithm is to use a Wiener filter to suppress the estimated noise. The signal model is

    y(t) = x(t) + n(t)

where x and n represent speech and noise respectively, and y represents the signal picked up by the microphone.

In the frequency domain the relationship is Y(k, m) = X(k, m) + N(k, m): speech and noise are additive and uncorrelated. (Non-additive interference, such as acoustic echo, is handled by other algorithms such as AEC.) Appealing to the central limit theorem, the noise and speech spectral coefficients are generally assumed to follow zero-mean Gaussian distributions; other distributions are also used in some arrangements.

So the central idea becomes: estimate the noise N from Y, then suppress N to obtain the speech, i.e. X_hat = Y - N_hat.

Therefore, the accuracy of the noise estimate is crucial: the more accurate it is, the better the suppression result. This leads to several common methods of estimating noise.

1. Noise estimation based on VAD detection. A voice activity detector runs on Y; frames in which no speech is detected are treated as noise.

2. Minimum statistics: based on the principle that the global minimum of the magnitude spectrum must correspond to times when no speech is present, the tracked minimum is taken as the noise estimate.

3. Noise estimation based on singular value decomposition (SVD) of the signal matrix.
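To make method 2 concrete, here is a minimal, hypothetical sketch of per-bin minimum tracking. This is not WebRTC's code; the function name and the drift constant are illustrative. The slow upward drift lets the tracked floor recover when the noise level rises.

    #include <stddef.h>

    /* Track a per-bin noise floor: follow new minima immediately; otherwise let
     * the floor drift slowly upward so it can recover after loud passages. */
    void MinimumStatisticsUpdate(const float* magn,  /* |Y(k)| for one frame */
                                 float* noiseFloor,  /* running minimum per bin */
                                 size_t numBins,
                                 int firstFrame) {
      for (size_t k = 0; k < numBins; ++k) {
        if (firstFrame || magn[k] < noiseFloor[k]) {
          noiseFloor[k] = magn[k];
        } else {
          noiseFloor[k] *= 1.001f;  /* slow upward drift per frame */
        }
      }
    }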

WebRTC uses none of the above methods. Instead, it builds on the likelihood-ratio function (the same idea used in VAD detection) and merges multiple speech/noise classification features into one model, forming a multi-feature comprehensive probability density function that is applied to the spectrum of every input frame. It can effectively suppress stationary noise from fans, office equipment, and the like.

The suppression process is as follows:

For each received frame of noisy speech: based on the frame's initial noise estimate, define a speech probability function; measure the classification features of the noisy signal; from the measured features, compute a multi-feature speech probability for the frame; weight that speech probability by dynamic factors (signal classification features and threshold parameters); use the resulting feature-based speech probability to modify the per-frame speech probability function; and use the modified per-frame speech probability function to update the initial noise estimate (the quantile noise of each frame over consecutive frames).

 

The feature-based speech probability function is obtained with a mapping function (sigmoid/tanh, i.e. an "S"-shaped function of the kind commonly used as an activation function in neural-network classifiers) that maps each frame's signal classification features to a probability value.

The classification features are: the time-averaged likelihood ratio, a measure of spectral flatness, and a measure of spectral template difference. The spectral template difference is based on comparing the input signal spectrum to a template noise spectrum.

 

Signal analysis: preprocessing steps including buffering, windowing, and the discrete Fourier transform (DFT).

Noise estimation and filtering include: the initial noise estimate; decision-directed (DD) updates of the posterior and prior SNR; speech/noise likelihood determination based on a likelihood-ratio (LR) factor, which uses the posterior and prior SNR together with a speech probability density function (PDF) model (Gaussian, Laplacian, gamma, super-Gaussian, etc.); and updating and applying a Wiener gain filter determined by the probability obtained from feature modeling and the noise estimate.

Signal synthesis: inverse discrete Fourier transform, scaling, and window synthesis.

 

The initial noise estimate is based on a quantile noise estimate, controlled by the quantile parameter, denoted q. The noise estimate from this initial step serves only as an initial condition to bootstrap the subsequent noise update/estimation process.
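A minimal single-quantile sketch of this idea follows. It is a simplification: ns_core.c actually runs SIMULT = 3 staggered estimates in parallel, and the function name here is illustrative; the constants mirror QUANTILE, WIDTH, and FACTOR in ns_core.c.

    #include <math.h>
    #include <stddef.h>

    /* Stochastic-approximation update of the q-quantile of the log-magnitude
     * in each frequency bin. */
    void QuantileNoiseUpdate(const float* magn, /* |Y(k)| for one frame */
                             float* lquantile,  /* running log-quantile per bin */
                             float* density,    /* density estimate per bin */
                             size_t numBins,
                             int counter) {
      const float q = 0.25f;     /* quantile parameter, denoted q in the text */
      const float width = 0.01f; /* half-width of the density window */
      for (size_t k = 0; k < numBins; ++k) {
        const float lmagn = logf(magn[k] + 1e-6f);
        const float delta = (density[k] > 1.0f) ? 40.0f / density[k] : 40.0f;
        /* Step the quantile up with weight q, down with weight 1 - q. */
        if (lmagn > lquantile[k]) {
          lquantile[k] += q * delta / (float)(counter + 1);
        } else {
          lquantile[k] -= (1.0f - q) * delta / (float)(counter + 1);
        }
        /* Sharpen the density estimate near the current quantile. */
        if (fabsf(lmagn - lquantile[k]) < width) {
          density[k] = ((float)counter * density[k] + 1.0f / (2.0f * width)) /
                       (float)(counter + 1);
        }
      }
    }

The per-bin noise magnitude estimate is then exp(lquantile[k]).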

set_feature_extraction_parameters

This function sets the parameters used for feature extraction. The current WebRTC noise suppression algorithm uses three features: the LRT feature, spectral flatness, and spectral difference; the spectral entropy and spectral variance features are not used.

WebRtcNs_InitCore

This initializes the NS (noise suppression) module; the code below is annotated assuming fs = 8000 Hz.

 


    // Length of one block of speech data: 8 kHz / 10 ms gives 80 samples.
    self->blockLen = 80;
    // Analysis length. Analysis is done in the frequency domain, so the length
    // is rounded up to a power of 2, with a minimum of 128; this is the FFT length.
    self->anaLen = 128;
    // Window function: a hybrid Hanning/flat-top window.
    self->window = kBlocks80w128;

Initialize the storage members used by the FFT

 

 


    // Initialize FFT work arrays.
    self->ip[0] = 0;  // Setting this triggers initialization.
    memset(self->dataBuf, 0, sizeof(float) * ANAL_BLOCKL_MAX);
    WebRtc_rdft(self->anaLen, 1, self->dataBuf, self->ip, self->wfft);

    // analyzeBuf is a sliding analysis window. With 80-sample blocks and a
    // 128-point FFT, the last 128 - 80 = 48 samples of the previous frame are
    // retained each time, rather than simply zero-padding 80 samples up to 128
    // for the FFT. The resulting overlap would cause synthesis problems, so a
    // window is applied to prevent discontinuities. The same window function as
    // the FFT analysis can be used, but it must then be power-preserving: the
    // sum of squares of the overlapping window segments must equal 1.
    memset(self->analyzeBuf, 0, sizeof(float) * ANAL_BLOCKL_MAX);
    // dataBuf stores the original time-domain signal.
    memset(self->dataBuf, 0, sizeof(float) * ANAL_BLOCKL_MAX);
    // syntBuf holds the result of spectral subtraction: the signal transformed
    // back to the time domain after the noise has been subtracted.
    memset(self->syntBuf, 0, sizeof(float) * ANAL_BLOCKL_MAX);

    // For HB (high-band) processing: the high-frequency part, up to two bands.
    memset(self->dataBufHB,
           0,
           sizeof(float) * NUM_HIGH_BANDS_MAX * ANAL_BLOCKL_MAX);
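The sliding analysis window described in the comments can be sketched as follows (an assumed simplification of the buffer handling in ns_core.c; the function name is illustrative):

    #include <string.h>

    /* Keep the last anaLen - blockLen = 48 samples of the previous frame and
     * append the new blockLen = 80 input samples. */
    void UpdateAnalysisBuffer(float* analyzeBuf,     /* length anaLen (128) */
                              const float* newFrame, /* length blockLen (80) */
                              size_t anaLen,
                              size_t blockLen) {
      /* Shift the retained overlap to the front of the buffer... */
      memmove(analyzeBuf, analyzeBuf + blockLen,
              sizeof(float) * (anaLen - blockLen));
      /* ...and append the incoming block at the end. */
      memcpy(analyzeBuf + (anaLen - blockLen), newFrame,
             sizeof(float) * blockLen);
    }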

Initialize variables used for quantile estimation

 

 


    // For quantile noise estimation.
    memset(self->quantile, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // SIMULT (3) quantile estimates run simultaneously; lquantile is the log
    // quantile, and density is the probability density used to compute quantiles.
    for (i = 0; i < SIMULT * HALF_ANAL_BLOCKL; i++) {
      self->lquantile[i] = 8.f;
      self->density[i] = 0.3f;
    }

    for (i = 0; i < SIMULT; i++) {
      // counter can be understood as a weight: the proportion each frame
      // contributes to the quantile estimate.
      self->counter[i] =
          (int)floor((float)(END_STARTUP_LONG * (i + 1)) / (float)SIMULT);
    }

    self->updates = 0;

Wiener filter initialization

 

 


    for (i = 0; i < HALF_ANAL_BLOCKL; i++) {
      self->smooth[i] = 1.f;
    }

Set the aggressiveness of noise suppression

 

 


    // Set the aggressiveness: default.
    self->aggrMode = 0;


Variables used by the noise estimation

 

 


    // Initialize variables for the new method.
    self->priorSpeechProb = 0.5f;  // Prior prob for speech/noise.
    // Previous analyze mag spectrum.
    memset(self->magnPrevAnalyze, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // Previous process mag spectrum.
    memset(self->magnPrevProcess, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // Current noise-spectrum.
    memset(self->noise, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // Previous noise-spectrum.
    memset(self->noisePrev, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // Conservative noise spectrum estimate.
    memset(self->magnAvgPause, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // For estimation of HB in second pass.
    memset(self->speechProb, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    // Initial average magnitude spectrum.
    memset(self->initMagnEst, 0, sizeof(float) * HALF_ANAL_BLOCKL);
    for (i = 0; i < HALF_ANAL_BLOCKL; i++) {
      // Smooth LR (same as threshold).
      self->logLrtTimeAvg[i] = LRT_FEATURE_THR;
    }

Feature quantities used to compute the noise estimate

 

 


    // Feature quantities.
    // Spectral flatness (start on threshold).
    self->featureData[0] = SF_FEATURE_THR;
    self->featureData[1] = 0.f;  // Spectral entropy: not used in this version.
    self->featureData[2] = 0.f;  // Spectral variance: not used in this version.
    // Average LRT factor (start on threshold).
    self->featureData[3] = LRT_FEATURE_THR;
    // Spectral template diff (start on threshold).
    self->featureData[4] = SF_FEATURE_THR;
    self->featureData[5] = 0.f;  // Normalization for spectral difference.
    // Window time-average of input magnitude spectrum.
    self->featureData[6] = 0.f;

    // Histogram quantities: used to estimate/update thresholds for features.
    memset(self->histLrt, 0, sizeof(int) * HIST_PAR_EST);
    memset(self->histSpecFlat, 0, sizeof(int) * HIST_PAR_EST);
    memset(self->histSpecDiff, 0, sizeof(int) * HIST_PAR_EST);

    // Update flag for parameters:
    // 0 no update, 1 = update once, 2 = update every window.
    self->modelUpdatePars[0] = 2;
    self->modelUpdatePars[1] = 500;  // Window for update.
    // Counter for update of conservative noise spectrum.
    self->modelUpdatePars[2] = 0;
    // Counter if the feature thresholds are updated during the sequence.
    self->modelUpdatePars[3] = self->modelUpdatePars[1];

White noise and pink noise

 

 


    self->signalEnergy = 0.0;
    self->sumMagn = 0.0;
    self->whiteNoiseLevel = 0.0;
    self->pinkNoiseNumerator = 0.0;
    self->pinkNoiseExp = 0.0;

 

Compute Spectral Flatness

Spectral flatness calculation. The algorithm assumes that speech is more harmonic than noise: the speech spectrum tends to have peaks at the fundamental frequency (pitch) and its harmonics, whereas the noise spectrum is relatively flat. Flatness therefore serves as a feature that distinguishes noise from speech.

In the spectral flatness calculation, N is the number of frequency bins after the STFT, B is the number of frequency bands, k is the frequency bin index, and j is the frequency band index. Each band contains many bins; for example, the 128 bins can be divided into 4 bands (low, mid-low, mid-high, and high), each with 32 bins. For noise, the flatness is large and roughly constant, while for speech it is smaller and more variable.
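Written out, the flatness is the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum, which matches the code below:

    F_flat(m) = \frac{(\prod_{k=0}^{N-1} |Y(k,m)|)^{1/N}}{\frac{1}{N}\sum_{k=0}^{N-1} |Y(k,m)|}
              = \frac{\exp(\frac{1}{N}\sum_k \ln|Y(k,m)|)}{\frac{1}{N}\sum_k |Y(k,m)|}

The code computes the numerator in the log domain, which is why the geometric mean appears as the exponential of an average of logs.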

 


    // Compute spectral flatness on input spectrum.
    // |magnIn| is the magnitude spectrum.
    // Spectral flatness is returned in self->featureData[0].
    static void ComputeSpectralFlatness(NoiseSuppressionC* self,
                                        const float* magnIn) {
      size_t i;
      size_t shiftLP = 1;  // Option to remove first bin(s) from spectral measures.
      float avgSpectralFlatnessNum, avgSpectralFlatnessDen, spectralTmp;

      // Compute spectral measures.
      // For flatness.
      avgSpectralFlatnessNum = 0.0;
      avgSpectralFlatnessDen = self->sumMagn;
      for (i = 0; i < shiftLP; i++) {
        // Skip the first bin (the DC bin). "Den" is short for denominator:
        // avgSpectralFlatnessDen holds the arithmetic-mean (denominator) part.
        avgSpectralFlatnessDen -= magnIn[i];
      }
      // Compute log of ratio of the geometric to arithmetic mean: check for log(0) case.
      // "Num" is short for numerator (the geometric-mean part). log(0) is
      // negative infinity, so that case is treated specially.
      for (i = shiftLP; i < self->magnLen; i++) {
        if (magnIn[i] > 0.0) {
          avgSpectralFlatnessNum += (float)log(magnIn[i]);
        } else {
          // TAVG is short for time-average. On an abnormal (zero) bin, decay
          // the previous flatness value and return.
          self->featureData[0] -= SPECT_FL_TAVG * self->featureData[0];
          return;
        }
      }
      // Normalize.
      avgSpectralFlatnessDen = avgSpectralFlatnessDen / self->magnLen;
      avgSpectralFlatnessNum = avgSpectralFlatnessNum / self->magnLen;

      // Ratio and inverse log: check for case of log(0).
      spectralTmp = (float)exp(avgSpectralFlatnessNum) / avgSpectralFlatnessDen;

      // Time-avg update of spectral flatness feature.
      self->featureData[0] += SPECT_FL_TAVG * (spectralTmp - self->featureData[0]);
      // Done with flatness feature.
    }

 

Compute Spectral Difference

Another assumption about the noise spectrum is that it is more stationary than the speech spectrum. It can therefore be assumed that the overall shape of the noise spectrum tends to stay the same at any given stage. This third feature measures the deviation of the input spectrum from the shape of the noise spectrum.

The calculation, as written in the code, is as follows:

 


    // avgDiffNormMagn = var(magnIn) - cov(magnIn, magnAvgPause)^2 / var(magnAvgPause)
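This expression is the least-squares residual of fitting the input spectrum to the pause (template) spectrum with a linear amplitude and shift. Writing Y for magnIn and P for magnAvgPause, minimizing the fit error J over the two shape parameters a and b gives exactly the per-bin quantity in the comment:

    J(a,b) = \sum_k (|Y(k,m)| - a\,P(k) - b)^2,
    \qquad
    \frac{1}{N}\min_{a,b} J(a,b) = \mathrm{var}(Y) - \frac{\mathrm{cov}(Y,P)^2}{\mathrm{var}(P)}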

 

Compute SNR

Compute the prior and posterior signal-to-noise ratios based on the quantile noise estimate.

The posterior SNR is the instantaneous SNR of the observed noisy input power relative to the noise power:

    \sigma(k,m) = \frac{|Y(k,m)|^2}{|N(k,m)|^2}

 

where Y is the noisy input spectrum (see the signal model above) and N is the noise spectrum. The prior SNR is the expected power of the pure (not necessarily speech) signal relative to the noise power, which can be expressed as:

    \rho(k,m) = \frac{E\,|X(k,m)|^2}{|N(k,m)|^2}

Here X is the pure input signal, which corresponds to the speech signal. In WebRTC's actual computation, magnitudes are used rather than squared magnitudes.

Since the pure signal is unknown, the prior SNR is estimated as a weighted average of the previous frame's (filtered) prior SNR and the instantaneous SNR:

    \hat\rho(k,m) = \gamma_{dd}\, H(k,m-1)\,\frac{|Y(k,m-1)|}{|N(k,m-1)|} + (1-\gamma_{dd})\,\sigma(k,m)

H in the formula corresponds to smooth in the code: the Wiener filter of the previous frame, used to smooth the instantaneous SNR. The first term is the prior SNR of the previous frame and the second term is the instantaneous estimate of the prior SNR; the estimate is updated by decision-directed (DD) averaging. The time-smoothing parameter is \gamma_{dd}: the larger its value, the smoother the estimate but the greater the delay. The value chosen in the code (DD_PR_SNR) is 0.98.

 


    static void ComputeSnr(const NoiseSuppressionC* self,
                           const float* magn,
                           const float* noise,
                           float* snrLocPrior,
                           float* snrLocPost) {
      size_t i;

      for (i = 0; i < self->magnLen; i++) {
        // Previous post SNR.
        // Previous estimate: based on previous frame with gain filter. This is
        // the smoothed previous-frame term H(k,m-1)|Y(k,m-1)|/|N(k,m-1)| of the
        // DD update above.
        float previousEstimateStsa = self->magnPrevAnalyze[i] /
            (self->noisePrev[i] + 0.0001f) * self->smooth[i];
        // Post SNR.
        snrLocPost[i] = 0.f;
        if (magn[i] > noise[i]) {
          // magn contains speech plus noise; subtracting one removes the noise
          // contribution, giving the posterior SNR.
          snrLocPost[i] = magn[i] / (noise[i] + 0.0001f) - 1.f;
        }
        // DD estimate is sum of two terms: current estimate and previous estimate.
        // Directed decision update of snrPrior (the DD formula above).
        snrLocPrior[i] =
            DD_PR_SNR * previousEstimateStsa + (1.f - DD_PR_SNR) * snrLocPost[i];
      }  // End of loop over frequencies.
    }

SpeechNoiseProb

 


    // Function parameters:
    // |magn| is the input magnitude spectrum (signal plus noise).
    // |noise| is the noise spectrum.
    // The following two arrays are computed by ComputeSnr:
    // |snrLocPrior| is the prior SNR for each frequency.
    // |snrLocPost| is the post SNR for each frequency.

 

The calculations in the code related to the first feature (the average LRT factor) are:

 


    // Compute feature based on average LR factor.
    // This is the average over all frequencies of the smooth log LRT.
    logLrtTimeAvgKsum = 0.0;
    for (i = 0; i < self->magnLen; i++) {
      tmpFloat1 = 1.f + 2.f * snrLocPrior[i];
      tmpFloat2 = 2.f * snrLocPrior[i] / (tmpFloat1 + 0.0001f);
      besselTmp = (snrLocPost[i] + 1.f) * tmpFloat2;
      self->logLrtTimeAvg[i] +=
          LRT_TAVG * (besselTmp - (float)log(tmpFloat1) - self->logLrtTimeAvg[i]);
      logLrtTimeAvgKsum += self->logLrtTimeAvg[i];
    }
    logLrtTimeAvgKsum = (float)logLrtTimeAvgKsum / (self->magnLen);
    self->featureData[3] = logLrtTimeAvgKsum;

To understand the physical and mathematical meaning of the above code, first look at the following derivation.

 

The function calculates the speech/noise probability, which is returned in the probSpeechFinal parameter.

Let us first derive the speech/noise probability calculation, starting from the probability model of speech/noise. Define the speech state as H1(k, m) and the noise state as H0(k, m), where m is the frame index and k the frequency index. The speech/noise probability can then be expressed as P(H1(k, m) | Y(k, m), {F}).

This probability depends on the observed noisy input spectral coefficients Y(k, m) and on feature data of the processed signal, denoted {F} (the signal's classification features). The feature data can be the noisy input spectrum, past spectral data, model data, etc.; for example, {F} can include spectral flatness measures, formant distances, LPC residuals, and template matching. By Bayes' rule, the speech/noise probability can be expressed as:

    P(H_1 | Y, \{F\}) = \frac{P(Y | H_1, \{F\})\, P(H_1 | \{F\})}{P(Y | \{F\})}

where P({F}) is the prior probability of the signal's feature data, treated as a constant in the expressions below. The quantity q = P(H1 | {F}) is the speech probability given the feature data {F}. Ignoring the feature prior P({F}) and writing \Delta for the likelihood ratio, the normalized speech probability can be written as:

    P(H_1 | Y, \{F\}) = \frac{q\,\Delta(k,m)}{1 - q + q\,\Delta(k,m)}

where the likelihood ratio (LR) is:

    \Delta(k,m) = \frac{P(Y(k,m) | H_1, \{F\})}{P(Y(k,m) | H_0, \{F\})}

The likelihood ratio is determined by a linear model and a Gaussian probability density function (PDF) assumption for the speech and noise spectral coefficients. Specifically, the linear model of the noisy input is, in the speech state:

    H_1: \; Y(k,m) = X(k,m) + N(k,m)

and in the noise state:

    H_0: \; Y(k,m) = N(k,m)

Assuming Gaussian PDFs for the complex coefficients, with noise variance \lambda_N and speech variance \lambda_X, the conditional densities are:

    P(Y | H_0) = \frac{1}{\pi\lambda_N} \exp\left(-\frac{|Y|^2}{\lambda_N}\right), \qquad
    P(Y | H_1) = \frac{1}{\pi(\lambda_N + \lambda_X)} \exp\left(-\frac{|Y|^2}{\lambda_N + \lambda_X}\right)

Since these probabilities are determined entirely by the linear model and the Gaussian PDF assumption, the feature dependence can be dropped from the expression. The likelihood ratio then becomes:

    \Delta(k,m) = \frac{1}{1 + \rho(k,m)} \exp\left(\frac{\sigma(k,m)\,\rho(k,m)}{1 + \rho(k,m)}\right)

where \rho(k,m) is the SNR of the unknown pure signal (the prior SNR) and \sigma(k,m) is the posterior (instantaneous) SNR at frequency k and frame m. In the actual implementation, the prior and posterior SNR used in this expression are estimated with magnitude rather than power definitions:

    \rho(k,m) \approx \frac{|X(k,m)|}{|N(k,m)|}, \qquad \sigma(k,m) \approx \frac{|Y(k,m)|}{|N(k,m)|} - 1

From the above, the speech/noise state probability is obtained from the likelihood ratio and the quantity q, where the likelihood ratio is determined by the frequency-dependent posterior and prior SNR, and q is a feature-based or model-based prior probability described in detail below. The speech/noise state probability is therefore:

    P(H_1 | Y, \{F\}) = \frac{q\,\Delta(k,m)}{1 - q + q\,\Delta(k,m)}

The frequency-dependent likelihood ratio factor sometimes fluctuates strongly from frame to frame, so a time-smoothed likelihood ratio factor is used:

    \log \tilde\Delta(k,m) = (1-\tau)\,\log \tilde\Delta(k,m-1) + \tau\,\log \Delta(k,m)

This smoothing is what the preceding code implements, with LRT_TAVG playing the role of the smoothing constant \tau; note, however, that the code does not follow the formula exactly.

The last term of the smoothing formula is the logarithm of the likelihood ratio \Delta(k,m) given above. It is worth noting that the code does not evaluate that formula exactly.
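Reading the code against the formula (with the magnitude-based SNR definitions), each loop iteration accumulates

    \text{besselTmp} - \log(\text{tmpFloat1}) = \frac{(\sigma+1)\,2\rho}{1+2\rho} - \log(1+2\rho)

into the smoothed average. That is, the code uses 1 + 2\rho where the Gaussian-model formula has 1 + \rho, and uses \sigma + 1 = |Y|/|N| in the exponent term.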

 

 

The geometric mean of the time-smoothed likelihood ratio factors over all frequencies can be used as a reliable measure for frame-based speech/noise classification:

    F_1(m) = \frac{1}{N} \sum_k \log \tilde\Delta(k,m)

 


    logLrtTimeAvgKsum = (float)logLrtTimeAvgKsum / (self->magnLen);

 

When calculating the speech/noise probability, a Gaussian assumption is used as the speech PDF model to obtain the likelihood ratio. Other PDF models can also serve as the basis of the likelihood ratio, including Laplacian, gamma, and super-Gaussian. For example, while the Gaussian assumption is a reasonable representation of noise, it does not necessarily apply to speech, especially over short time frames (~10 ms). In such cases another speech PDF model could be used, though this would most likely add complexity.

Determining the speech/noise probability in the noise estimation and filtering process requires not only the local SNR (the prior and instantaneous SNR) but also speech model knowledge obtained from feature modeling. Incorporating speech model knowledge into the speech/noise probability determination allows the noise suppression process to better handle and distinguish highly fluctuating noise levels; relying solely on the local SNR risks biased likelihoods. Here the feature-based probability, written q(k, m), is updated and adapted every frame using the local SNR and the speech feature/model data. Because the process described here models and updates this quantity only on a per-frame basis, the frequency index k is suppressed and it is abbreviated q(m).

The feature-based probability update can use the following model, a smoothed (time-averaged) update driven by a mapping function M(z):

    q(m) = (1 - \gamma_q)\, q(m-1) + \gamma_q\, M(z)

where M(z) is a smooth mapping function bounded, for example, between 0 and 1 for a given time and frequency, and \gamma_q is a smoothing constant. The variable z of the mapping function is z = F - T, where F is the feature being measured and T is a threshold. The parameter w represents the shape/width characteristics of the mapping function. The mapping function classifies time-frequency data as speech (M close to 1) or noise (M close to 0) based on the measured features and the threshold and width parameters.
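In ns_core.c this update appears, in essence, as a single smoothed step applied to the combined feature indicator computed further below, with PRIOR_UPDATE playing the role of \gamma_q (a sketch of the relevant line):

    // Move the prior speech probability a fraction PRIOR_UPDATE of the way
    // toward the combined, weighted indicator value indPrior.
    self->priorSpeechProb += PRIOR_UPDATE * (indPrior - self->priorSpeechProb);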

In the noise estimation and filtering process, the following features of the speech signal are considered when determining the speech/noise likelihood: (1) the LRT mean, derived from the local SNR; (2) spectral flatness, derived from the harmonic model of speech; and (3) the spectral template difference measure. Other speech signal features may be used in addition or instead.

1. LRT mean feature

The geometric mean of the time-smoothed likelihood ratio (LR) factors, i.e. the feature F1 defined above, is a reliable indicator of the speech/noise state.

The time-smoothed LR factor is obtained from the expression given earlier. For the LRT mean feature, an example mapping function M(z) is an "S"-shaped curve, such as:

 


    // Compute indicator function: sigmoid map.
    indicator0 =
        0.5f *
        ((float)tanh(widthPrior * (logLrtTimeAvgKsum - threshPrior0)) + 1.f);

 

where logLrtTimeAvgKsum is the feature value F and widthPrior is the transition/width parameter w that controls the smoothness of the mapping from 0 to 1. The threshold parameter threshPrior0 is determined by the parameter estimation described later.

2. Spectral flatness feature

 

The spectral flatness feature assumes that speech has more harmonic behavior than noise: the speech spectrum tends to have peaks at the fundamental frequency (pitch) and its harmonics, while the noise spectrum is relatively flat. Therefore, at least in some arrangements, local spectral flatness measurements can serve as a good basis for distinguishing speech from noise.

When calculating spectral flatness, N is the number of frequency bins and B the number of frequency bands; k is the bin index and j the band index. Each band contains many bins: for example, a 128-bin spectrum can be divided into 4 bands (low, mid-low, mid-high, and high) of 32 bins each. In another example, a single band covering all frequencies is used. Spectral flatness is found by computing the ratio of the geometric mean to the arithmetic mean of the input magnitude spectrum (the formula given in the flatness section above),

where N is the number of frequencies in the band. For noise the computed quantity is large and roughly constant, while for speech it is small and variable. As before, an example mapping function for updating the feature-based prior probability is a sigmoid:

 

 


    // Compute indicator function: sigmoid map.
    indicator1 =
        0.5f *
        ((float)tanh((float)sgnMap * widthPrior * (threshPrior1 - tmpFloat1)) +
         1.f);

 

3. Spectral template difference feature

Besides the flatness assumption, another assumption about the noise spectrum is that it is more stationary than the speech spectrum; the overall shape of the noise spectrum therefore tends to stay the same at any given stage. The template spectrum is determined by updating it (it starts at zero) on the portions of the spectrum most likely to be noise or speech pauses; it is thus a conservative estimate of the noise, updated only in segments where the speech probability is determined to be below a threshold. In other arrangements, template spectra may be imported into the algorithm or derived from the shapes of known noise types. Given the input spectrum Y(k, m) and the template spectrum P(k) (magnAvgPause in the code), the spectral template difference feature starts from the spectral difference measure defined as the fit error J introduced in the spectral difference section above.

The shape parameters a and b (a linear amplitude and shift) are obtained by minimizing J; since they solve a linear equation, they are easily extracted for each frame. These parameters absorb any simple shift/scale change of the input spectrum (e.g., increasing volume), so the feature becomes a normalized measure of deviation from the template,

where the normalization is the average input spectrum over all frequencies and previous time frames within some time window.

As mentioned above, the spectral template difference feature measures the difference/deviation between the template (learned noise) spectrum and the input spectrum. In at least some arrangements, it is used to modify the feature-based speech/noise probability. If it is small, the input frame spectrum is "close" to the template spectrum and the frame is likely noise; if it is large, the input spectrum differs strongly from the noise template and the frame is judged to be speech. An S-curve maps the spectral template difference feature to a probability weight. Note that the spectral template difference measure is more general than the spectral flatness measure: if the template is a constant, flat spectrum, the template difference feature reduces to a measure of spectral flatness.

Weighting terms can also be added to the spectral template difference measure to emphasize particular frequency bands of the spectrum.

The features above (LRT mean, spectral flatness, and spectral template difference) can all appear simultaneously in the update model of the speech/noise probability, as follows:

Different features carry different information, and they complement one another to provide a more stable and adaptive speech/noise probability update.

 


    // Combine the indicator function with the feature weights.
    indPrior = weightIndPrior0 * indicator0 + weightIndPrior1 * indicator1 +
               weightIndPrior2 * indicator2;


Finally, the log-domain LRT is converted back and combined with the prior to give a normal probability.

 

 


    // Final speech probability: combine prior model with LR factor.
    gainPrior = (1.f - self->priorSpeechProb) / (self->priorSpeechProb + 0.0001f);
    for (i = 0; i < self->magnLen; i++) {
      invLrt = (float)exp(-self->logLrtTimeAvg[i]);
      invLrt = (float)gainPrior * invLrt;
      probSpeechFinal[i] = 1.f / (1.f + invLrt);
    }
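This loop is exactly the normalized probability formula from the derivation: with q = priorSpeechProb, gainPrior = (1 - q)/q, and invLrt = e^{-\log\tilde\Delta},

    \text{probSpeechFinal} = \frac{1}{1 + \frac{1-q}{q}\, e^{-\log\tilde\Delta}}
                           = \frac{q\,\tilde\Delta}{1 - q + q\,\tilde\Delta}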

 

Update Noise Estimate (UpdateNoiseEstimate)

After the speech/noise probability is determined, the noise estimate is updated as follows:

 


    probSpeech = self->speechProb[i];
    probNonSpeech = 1.f - probSpeech;
    // Temporary noise update:
    // Use it for speech frames if update value is less than previous.
    noiseUpdateTmp = gammaNoiseTmp * self->noisePrev[i] +
                     (1.f - gammaNoiseTmp) * (probNonSpeech * magn[i] +
                                              probSpeech * self->noisePrev[i]);

where \hat N(k,m) is the estimated magnitude of the noise spectrum at frame m and frequency bin k, and the parameter \gamma_n (gammaNoiseTmp in the code) controls the smoothness of the noise update. The second term updates the noise from the input spectrum and the previous noise estimate, weighted by the speech/noise probability as described above. This can be expressed as:

    |\hat N(k,m)| = \gamma_n\,|\hat N(k,m-1)|
                  + (1-\gamma_n)\,\big(P(H_0|Y,\{F\})\,|Y(k,m)| + P(H_1|Y,\{F\})\,|\hat N(k,m-1)|\big)

where the speech probability is determined from the LR factor as before:

    P(H_1|Y,\{F\}) = \frac{q\,\tilde\Delta(k,m)}{1 - q + q\,\tilde\Delta(k,m)}

The quantity q is the model- or feature-based speech probability derived from the multi-feature update model above. This noise estimation model updates the noise at every frame and frequency bin where noise is more likely (i.e., speech is less likely); for frames and bins where noise is unlikely, the previous frame's estimate is carried over as the noise estimate.

The noise estimate update is controlled by the speech/noise probability and the smoothing parameter \gamma_n, which can be set to a value like 0.85. In a different example, the smoothing parameter might be increased to \gamma_n = 0.99 in regions where the speech probability exceeds a threshold, to keep the noise level from rising too much at the onset of speech; in one or more arrangements that threshold is set to 0.2 or 0.25.

After completing the noise estimate update, the noise estimation and filtering process employs a Wiener gain filter to reduce or eliminate the estimated noise from the input frame. The standard Wiener filter is expressed as:

    H(k,m) = \frac{|Y(k,m)|^2 - |\hat N(k,m)|^2}{|Y(k,m)|^2}

where \hat N is the estimated noise spectral coefficient, Y the observed noisy spectral coefficient, and X the pure speech spectrum (frame m, frequency k). Replacing squared magnitudes by magnitudes, the Wiener filter becomes:

    H(k,m) = \frac{|Y(k,m)| - |\hat N(k,m)|}{|Y(k,m)|}

In one or more conventional methods, time averaging is applied directly to the filter to reduce inter-frame fluctuations. Here, instead, the Wiener filter is expressed in terms of the prior SNR, and the decision-directed (DD) update provides the time averaging of the prior SNR. Expressed in terms of the prior SNR, the Wiener filter is:

    H(k,m) = \frac{\rho(k,m)}{1 + \rho(k,m)}

where \rho(k,m) is the prior SNR defined above, with the noise spectrum replaced by the estimated noise spectrum \hat N(k,m).

As mentioned, the prior SNR is estimated via the DD update. Adding an over-subtraction (overdrive) parameter \beta and a gain floor to this filter gives:

    H(k,m) = \frac{\hat\rho(k,m)}{\beta + \hat\rho(k,m)}, \qquad H(k,m) \ge H_{\min}

Because the DD update already performs time averaging on the prior SNR, no additional time averaging is applied to the gain filter. The parameters \beta and H_{\min} are set by the active (aggressiveness) configuration of the noise suppressor.

The Wiener filter is applied to the input magnitude spectrum to obtain the suppressed signal:

    |\hat X(k,m)| = H(k,m)\,|Y(k,m)|
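A minimal sketch of this gain computation and application (an assumed simplification; in ns_core.c, overdrive and the gain floor denoiseBound are set by the configured aggressiveness):

    #include <stddef.h>

    /* Compute the prior-SNR Wiener gain with overdrive and a gain floor, and
     * apply it to the magnitude spectrum in place. */
    void ComputeAndApplyGain(const float* snrPrior, /* DD-updated prior SNR */
                             float* magn,           /* in: |Y|; out: |X_hat| */
                             size_t magnLen,
                             float overdrive,       /* beta */
                             float denoiseBound) {  /* minimum gain H_min */
      for (size_t i = 0; i < magnLen; ++i) {
        /* H = rho / (beta + rho), limited below by the gain floor. */
        float gain = snrPrior[i] / (overdrive + snrPrior[i]);
        if (gain < denoiseBound) {
          gain = denoiseBound;
        }
        magn[i] *= gain; /* suppressed magnitude spectrum */
      }
    }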

Signal Synthesis

Signal synthesis comprises the post-suppression processing that produces output frames of clean speech. After the Wiener filter is applied, the frame is converted back to the time domain with an inverse DFT. In one or more arrangements, the conversion back to the time domain can be expressed as:

    \hat x(n,m) = \mathrm{IDFT}\{\hat X(k,m)\}

where \hat X(k,m) is the speech estimate after Wiener suppression and \hat x(n,m) is the corresponding time-domain signal, with time index n and frame index m.

After the inverse DFT, energy scaling is performed on the noise-suppressed signal as part of signal synthesis. Energy scaling helps reconstruct speech frames by restoring energy to the suppressed speech: scaling should amplify only speech frames, to a certain extent, while leaving noise frames unchanged. Since noise suppression can reduce the speech signal level, it is beneficial to amplify speech segments appropriately during scaling. In one arrangement, speech frames are scaled according to the energy they lost during noise estimation and filtering; the gain can be determined from the ratio of the frame's energy before and after noise suppression.

In the current example, the scale factor is derived from a model that blends an amplification term, based on the frame's energy before and after suppression, with a pass-through term, the two weighted by the frame's speech probability P(m) and 1 - P(m) respectively.

Here P(m) is the speech probability of frame m, obtained by averaging the speech probability function over all frequencies:

    P(m) = \frac{1}{N} \sum_k P(H_1(k,m) \mid Y, \{F\})

In this scaling model, if the speech probability is close to 1, the first (amplifying) term dominates; if the frame is noise, the second (pass-through) term dominates.

A further parameter in the scaling model controls how strongly the input frame is scaled.

Signal synthesis concludes with a window synthesis (overlap-add) operation that provides the final output frame of estimated speech. In one example, the window synthesis windows the inverse-DFT frame again, scales it, and overlap-adds it into a synthesis buffer,

where the per-frame scaling parameter is taken from the scaling model above, as shown in the sketch below.
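A minimal sketch of such windowed overlap-add synthesis (an assumed simplification of the synthesis-buffer handling in ns_core.c; names are illustrative):

    #include <string.h>

    /* Window, scale, and overlap-add one inverse-DFT frame, then emit the
     * blockLen samples that are complete. */
    void WindowSynthesis(const float* frame,  /* anaLen samples from the IDFT */
                         const float* window, /* synthesis window */
                         float scale,         /* per-frame energy scale factor */
                         float* syntBuf,      /* anaLen-sample synthesis buffer */
                         float* out,          /* blockLen output samples */
                         size_t anaLen,
                         size_t blockLen) {
      size_t i;
      for (i = 0; i < anaLen; ++i) {
        syntBuf[i] += scale * window[i] * frame[i]; /* overlap-add */
      }
      memcpy(out, syntBuf, sizeof(float) * blockLen); /* finished samples */
      /* Slide the buffer: keep the tail that still awaits future overlaps. */
      memmove(syntBuf, syntBuf + blockLen, sizeof(float) * (anaLen - blockLen));
      memset(syntBuf + (anaLen - blockLen), 0, sizeof(float) * blockLen);
    }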

Parameter Estimation

The update model for the feature-based speech/noise probability function includes weights and threshold parameters applied to each feature measurement.

These weights are used to keep unreliable feature measurements out of the update model. The mapping function also includes a width parameter that controls the shape of the mapping function.

For example, if the LRT mean feature F1 of a given input is unreliable, its weight can be reduced toward zero so that it does not influence the probability update.


Source: blog.csdn.net/lindonghai/article/details/102715851