Speech enhancement

As research on speech enhancement has deepened, a variety of speech enhancement algorithms have been developed. As mentioned earlier, because noise characteristics differ, the most widely used algorithms generally fall into the following categories:

  1. Wavelet decomposition;
  2. Auditory masking;
  3. Noise cancellation;
  4. Harmonic enhancement;
  5. Enhancement algorithms based on a speech production model;
  6. Enhancement algorithms based on short-time spectral estimation.

Although these speech enhancement algorithms differ in their specific implementations, they all must trade off speech intelligibility against subjective speech quality. Which aspect an algorithm favors depends on the choice of its internal parameters.

Among them, the basic principle of noise cancellation is to subtract the noise from the noisy speech. The principle is obvious; the problem is how to obtain a copy of the noise. If a signal acquisition system with two (or more) microphones is available, one collecting noisy speech and the other(s) collecting the noise alone, the task becomes much easier, and in a strong noise environment this method can give good cancellation results. If the collected noise is "realistic" enough, it can even be subtracted directly from the noisy speech in the time domain. Noise cancellation can be used to remove stationary or quasi-stationary noise. When it is used, there must be considerable isolation between the two microphones. Even so, there is inevitably a time difference between the two collected signals, so the noise components contained in them differ, and echo and other time-varying attenuation effects also degrade the "purity" of the collected noise. The collected noise must therefore pass through a digital filter so that it matches the noise in the noisy speech as closely as possible. In practice an adaptive filter is used to align the subtracted noise with the noise in the noisy speech; its principle is similar to that of an echo canceller. The adaptive filter is usually an FIR filter whose coefficients can be estimated with the least mean square (LMS) algorithm so as to minimize the energy of the residual signal.
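As an illustration, here is a minimal sketch of such an LMS adaptive noise canceller. The signal names (d for the noisy-speech microphone, x for the reference-noise microphone) and the filter order and step size are assumptions chosen for the example, not values from the original text:

% Minimal LMS adaptive noise canceller (sketch; d and x assumed given)
% d: primary microphone signal (speech + noise), column vector
% x: reference microphone signal (noise only), column vector
M  = 32;              % FIR filter order (assumed)
mu = 0.01;            % LMS step size (assumed; must be small for stability)
w  = zeros(M, 1);     % adaptive filter coefficients
e  = zeros(size(d));  % residual = enhanced speech estimate
for k = M:length(d)
    xk   = x(k:-1:k-M+1);       % most recent M reference samples
    yk   = w' * xk;             % filter output: estimate of the noise in d(k)
    e(k) = d(k) - yk;           % residual: noise-cancelled sample
    w    = w + mu * e(k) * xk;  % LMS coefficient update
end

Minimizing the energy of e drives the filter output toward the noise component of d, so e converges toward the clean speech.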

Speech sounds can be divided into three categories according to the form of excitation. The first category is voiced sound: when airflow passes through the glottis and the tension of the vocal cords is such that the cords vibrate in relaxation oscillation, a quasi-periodic train of air pulses is produced; these pulses excite the vocal tract to produce voiced sound, which corresponds to u(n) being a pulse train with pitch period T. The second category is fricative or unvoiced sound: if the vocal tract is constricted somewhere, air is forced through the constriction at high speed, generating turbulence; the resulting broadband noise source excites the vocal tract, which corresponds to u(n) in the figure being broadband noise. The third category is plosive sound: the vocal tract closes completely, air pressure builds up behind the closure and is then suddenly released, giving a popping sound. Generally speaking, a speech signal can be regarded as composed of voiced segments, unvoiced segments, and the transitions between them.

1. The spectral components of the speech signal are relatively concentrated

Through research on the production of speech and observation of many recorded speech waveforms, it has been found that the spectral components of the speech signal are concentrated mainly in the range 300~3400Hz, because the shape of the human vocal tract cannot change too quickly. This brings great convenience to speech research and computation: we only need to focus on this band.
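For example, restricting analysis to this band can be done with an ordinary bandpass filter. The following sketch is one simple way to do it; the sampling rate Fs = 8000 Hz and the Butterworth design order are assumptions for illustration:

Fs = 8000;                                % assumed sampling rate
[b, a] = butter(4, [300 3400] / (Fs/2));  % Butterworth bandpass, 300-3400 Hz
x_band = filter(b, a, x);                 % keep only the speech-dominant band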

2. Speech is a time-varying, non-stationary random process

The rate at which the physiology of the human vocal system can change is limited. Over a short interval (5-50ms) the shapes of the vocal cords and vocal tract are relatively stable, and their characteristics can be treated as approximately unchanged, so the short-time spectral analysis of speech is also relatively stable. This short-time stationarity is the basis of many speech processing algorithms and techniques.
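In practice, short-time analysis means cutting the signal into overlapping frames and windowing each one. A minimal sketch follows; the frame length, frame shift, and window type are assumptions for illustration:

Fs     = 8000;                      % assumed sampling rate
N      = round(0.025 * Fs);         % 25 ms frame length (assumed)
hop    = round(0.010 * Fs);         % 10 ms frame shift (assumed)
win    = hamming(N);                % analysis window
nFrame = floor((length(x) - N) / hop) + 1;
frames = zeros(nFrame, N);          % one frame per row
for k = 1:nFrame
    seg         = x((k-1)*hop + (1:N));
    frames(k,:) = (seg(:) .* win)'; % windowed frame for short-time analysis
end

A frame matrix of this shape (one frame per row) also appears to be what the short-time energy and zero-crossing code later in this section assumes.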

3. Speech can be roughly divided into two categories: unvoiced and voiced

Generally speaking, a human speech signal shows obvious periodicity (voiced) in some time periods; such speech has a formant structure in the frequency domain, and most of its energy is concentrated in the lower band. In other time periods it is completely random (unvoiced); such segments have no obvious formant structure, and their spectrum resembles white noise. The rest is a mixture of the two. This is reflected in Figure 2-1: the excitation source u(n) is produced either by a pulse generator, by a white noise generator, or by a mixture of the two in a certain ratio.
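The two excitation types in Figure 2-1 are easy to reproduce. Below is a sketch of both generators; the pitch period, sampling rate, and length are assumed values for illustration:

Fs = 8000; L = 8000;                  % assumed sampling rate, 1 s of excitation
T  = 80;                              % assumed pitch period in samples (100 Hz)
u_voiced          = zeros(L, 1);      % voiced excitation: pulse train, period T
u_voiced(1:T:end) = 1;
u_unvoiced        = randn(L, 1);      % unvoiced excitation: broadband white noise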

4. As a random process, the speech signal can be described by its statistical characteristics

Under the Gaussian model assumption, the Fourier expansion coefficients are treated as independent Gaussian random variables with zero mean and time-varying variance. This Gaussian model is only an approximate description when applied to a finite frame length. In the enhancement of speech corrupted by broadband noise, this assumption can serve as the premise of the analysis.

1) Initial operation:

x = wavread('samples.wav');   % clean speech (two-channel recording)
n = wavread('white.wav');     % Gaussian white noise
[p, q] = size(x);             % p = number of samples, q = number of channels
[a, b] = size(n);
x1 = x(150000:250000, 1);     % channel 1, samples 150000-250000
n1 = n(1:length(x1), 1);      % truncate the noise to the same length
y  = x1 + n1;                 % noisy speech

Among them, wavread('filename.wav', k) reads the first k samples of the .wav file named filename into the current workspace; if k is not given, the whole file is read. Arrays x and n store the clean speech and Gaussian white noise data respectively. The size() function returns the array dimensions p and q; since the recording is two-channel, q = 2 and p is the speech length. To keep the simulation from occupying too much memory and stalling the system, we take samples 150,000-250,000 of array x for processing, and the noise data is truncated in the same way so that the array dimensions match for the operations that follow.
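A side note for readers running this today: wavread has been removed from newer MATLAB releases; the modern equivalent (this substitution is mine, not the original author's) is audioread:

[x, Fs] = audioread('samples.wav');   % modern replacement for wavread
[n, ~]  = audioread('white.wav');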

2) The core spectrum subtraction part:

aa = 2;                                % power exponent (subtraction in the power domain)
bb = 6;                                % over-subtraction factor
anglen = angle(fft(n1));               % noise phase (not used further)
ampn   = abs(fft(n1));                 % noise amplitude spectrum
ampy   = abs(fft(y));                  % noisy-speech amplitude spectrum
angley = angle(fft(y));                % noisy-speech phase, reused for reconstruction
cn = bb * (ampn.^aa);                  % over-subtracted noise power estimate
xx = max(ampy.^aa - cn, 0).^(1/aa);    % subtract; clamp at 0 to avoid complex values
ifftx = real(ifft(xx .* exp(j*angley)));   % rebuild with noisy phase, back to time domain

This program is an improved version of basic spectral subtraction, where aa and bb correspond to the exponent and the over-subtraction factor in Equation 2-10 (see the reconstruction below). The current values are empirical values obtained after repeated experiments.
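Equation 2-10 itself is not reproduced in this excerpt, but judging from the code it has the standard over-subtraction form (this reconstruction is an assumption based on the code, with Y the noisy spectrum, N the noise spectrum, and X the enhanced amplitude estimate):

|X(k)|^aa = max( |Y(k)|^aa - bb * |N(k)|^aa , 0 )

with aa = 2 and bb = 6 as used above. aa = 2 means subtraction is performed on power spectra; bb > 1 over-subtracts the noise estimate to suppress residual "musical" noise.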

The main functions used in the program are as follows: fft(y, N) is the fast Fourier transform function, where y is the signal to be transformed and N is the number of transform points (if omitted, N defaults to the length of the signal). The abs and angle functions take the magnitude and phase angle of the Fourier transform, respectively. ifft(x, N) is the standard inverse Fourier transform function, where x is the operand and N is the number of inverse transform points. real takes the real part of the complex result of the inverse transform; the resulting ifftx is the enhanced speech, which is played back with soundview('filename', Fs), where Fs is the sampling frequency.

Endpoint detection divides the whole speech signal into four states: silence, transition, speech, and end. The program uses a variable status to indicate the current state. In the silence state, because the parameter values are still small, it is not yet certain whether real speech has started, so as long as both parameters fall back below their low thresholds, the state returns to silence. If either of the two parameters exceeds its high threshold during the transition state, it is judged that the speech segment has begun.
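A minimal sketch of this double-threshold state machine is given below, operating on the per-frame energy amp and zero-crossing rate zcr whose computation is shown next. The threshold values and the simple two-parameter test are assumptions for illustration:

% status: 0 = silence, 1 = transition, 2 = speech (sketch; assumed thresholds)
ampLo = 2;  ampHi = 10;  zcrLo = 3;  zcrHi = 15;
status = 0;
for k = 1:length(amp)
    switch status
        case 0   % silence: a parameter above its low threshold -> transition
            if amp(k) > ampHi || zcr(k) > zcrHi
                status = 2;              % strong evidence: straight to speech
            elseif amp(k) > ampLo || zcr(k) > zcrLo
                status = 1;
            end
        case 1   % transition: a high threshold confirms speech
            if amp(k) > ampHi || zcr(k) > zcrHi
                status = 2;
            elseif amp(k) < ampLo && zcr(k) < zcrLo
                status = 0;              % both parameters low again -> silence
            end
        case 2   % speech: stay until both parameters fall below the low thresholds
            if amp(k) < ampLo && zcr(k) < zcrLo
                status = 0;
            end
    end
end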

The three defined short-time energies are computed by the following three lines of MATLAB (here y is the framed signal, one frame per row, so each sum runs along a row):

amp1 = sum(abs(y), 2);            % short-time average magnitude
amp2 = sum(y.*y, 2);              % short-time energy
amp3 = sum(log(y.*y + eps), 2);   % short-time log energy; adding eps prevents log(0)

Calculation of the zero-crossing rate:

zcr = zeros(size(y,1), 1);    % one zero-crossing count per frame
delta = 0.02;                 % amplitude threshold to reject low-level noise

for i = 1:size(y,1)
    x = y(i,:);               % current frame
    for j = 1:length(x)-1
        % count a crossing only when the sign changes and the jump exceeds delta
        if x(j)*x(j+1) < 0 && abs(x(j)-x(j+1)) > delta
            zcr(i) = zcr(i) + 1;
        end
    end
end
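The inner loop can also be vectorized, which is more idiomatic MATLAB; this equivalent alternative is mine, not from the original text:

for i = 1:size(y,1)
    x = y(i,:);
    sgnChange = x(1:end-1) .* x(2:end) < 0;          % sign change between neighbours
    bigJump   = abs(x(1:end-1) - x(2:end)) > delta;  % jump large enough
    zcr(i)    = sum(sgnChange & bigJump);
end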

Here delta = 0.02 is the threshold; it is an empirical value obtained through many experiments and can be adjusted slightly. Because of the time limit of the graduation project, the current endpoint detection can only recognize the pronunciation of a single digit; applied to the full-length data it introduces larger distortion, and the algorithm needs further improvement.

The original clean speech waveform is shown in Figure 4-1 below:

Figure 4-1 Clean speech waveform

When twice the white noise signal is added, i.e. the system input is Y = x1 + 2*n1, the waveforms obtained with each algorithm are as follows (display order: noisy speech, basic spectral subtraction, STFT spectral subtraction, Wiener filtering, wavelet transform):

Figure 4-2 Noisy speech waveform

Figure 4-3 Waveform after basic spectral subtraction

Figure 4-4 Waveform after STFT spectral subtraction

Figure 4-5 Waveform after Wiener filtering

Figure 4-6 Waveform after wavelet processing

Similarly, when the system input is Y = x1 + 2*n1, the spectrograms obtained with each algorithm are as follows (display order: clean speech, noisy speech, basic spectral subtraction, STFT spectral subtraction, Wiener filtering, wavelet transform):

Spectrogram of clean speech

Spectrogram of noisy speech

Spectrogram after basic spectral subtraction

Spectrogram after STFT spectral subtraction

Spectrogram after Wiener filtering

Spectrogram after wavelet transform
