Speech Separation: A Detailed Explanation - Speech Signal Processing Learning (7)

references:

Speech Separation (1/2) - Deep Clustering, PIT (bilibili)

Li Hongyi Human Language Processing (New Course, March 2020), Exclusive Notes: Speech Separation - 12 - Zhihu (zhihu.com)

Speech Separation (2/2) - TasNet (bilibili)

Li Hongyi Human Language Processing (New Course, March 2020), Exclusive Notes: TasNet - 13 - Zhihu (zhihu.com)

The papers cited in the lectures are omitted this time.

Table of contents

1. Introduction

2. Evaluation (calculation of evaluation indicators)

Signal-to-noise ratio (SNR)

Scale-invariant signal-to-distortion ratio (SI-SDR or SI-SNR)

Others (not discussed further)

3. Permutation Issue (the output arrangement problem)

4. Deep Clustering

Ideal Binary Mask (IBM)

Deep Clustering

5. Permutation Invariant Training (PIT)

6. Time-domain Audio Separation Network (TasNet)

Encoder and Decoder

Separator

Summary and questions

7. Other exploration directions in Speech Separation

Unknown number of speakers

Multiple Microphones

Visual Information

Task-oriented Optimization

Other valuable papers


1. Introduction

  1. Cocktail Party Effect: Humans can easily focus on the sound they want to hear among many overlapping (complex) sounds, extracting one voice from the rest.

    The "cocktail party effect" is a term derived from the fields of information theory and signal processing. It describes the phenomenon of aliasing and indistinguishability when multiple sources transmit information or signals simultaneously. The term behind this relates to issues of signal aliasing and information crossover, similar to attending a crowded cocktail party where you might hear multiple people talking at the same time, making individual voices difficult to distinguish.

  2. Classification:

    • Speech Enhancement: one person is speaking, but many other noises interfere; the task is to enhance that voice.

    • Speaker Separation: many people are talking at the same time; the task is to separate each person's voice.

  3. Speaker Separation task introduction:

    • Main task: input one piece of audio and output multiple pieces of audio, where the input and output have the same length (so there is no need for a Seq2Seq model, since Seq2Seq is designed for the case where input and output sequence lengths differ).

    • The discussion below assumes the following conditions: two speakers, a single microphone, and speakers in the test set that never appear in training.

    • Sound data is easy to collect: we can simply mix two different people's utterances together to create training data (as in the sketch below).
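A minimal numpy sketch of building such a training example by mixing two utterances at a chosen SNR; `make_mixture` is a hypothetical helper of my own, and the random vectors stand in for real recordings.

```python
import numpy as np

def make_mixture(x1: np.ndarray, x2: np.ndarray, snr_db: float = 0.0):
    """Mix two single-channel utterances at a given SNR (in dB).

    x1, x2 : 1-D float arrays at the same sampling rate.
    Returns (mixture, target1, target2) trimmed to the same length.
    """
    n = min(len(x1), len(x2))
    x1, x2 = x1[:n].astype(np.float64), x2[:n].astype(np.float64)

    # Scale x2 so that the energy ratio x1/x2 equals the requested SNR.
    p1, p2 = np.sum(x1 ** 2), np.sum(x2 ** 2)
    scale = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10) + 1e-12))
    x2 = x2 * scale

    mixture = x1 + x2
    return mixture, x1, x2

# Example with random "speech" standing in for two real utterances:
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
mix, t1, t2 = make_mixture(s1, s2, snr_db=0.0)
```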

2. Evaluation (calculation of evaluation indicators)

For voice conversion (changing timbre) there is no ground truth, so evaluation is not easy. For the speech separation task we do have ground truth, which makes objective evaluation convenient: we measure the similarity between the model output and the reference signal.

Signal-to-noise ratio(SNR)
  • In fact, each piece of audio can be regarded as a vector, so the similarity between the vectors is calculated.

  • This is where SNR comes in. The formula is as follows:


    \mathrm{SNR} = 10\log_{10} \frac{\Vert \hat{X} \Vert ^2}{\Vert E \Vert ^2}

     

    Here \hat{X} is the ground-truth sound vector, X^* is the model output vector, and E = \hat{X} - X^* is the difference between the two. Viewed this way, the more similar the two vectors are, the larger the SNR.

  • However, this metric is problematic. Consider two cases: in the first, the output points in the same direction as the ground truth (so it should be a good result), but because it is very quiet, the norm of the difference is large and the SNR is low; in the second, the directions differ, yet simply turning up the volume of the output can increase the SNR.
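A minimal numpy sketch of the SNR computation and of the two failure cases just described; `snr_db` is a hypothetical helper, and the random vectors stand in for real audio.

```python
import numpy as np

def snr_db(x_ref: np.ndarray, x_est: np.ndarray) -> float:
    """SNR = 10 * log10(||X_hat||^2 / ||E||^2), with E = X_hat - X*."""
    e = x_ref - x_est
    return 10 * np.log10(np.sum(x_ref ** 2) / (np.sum(e ** 2) + 1e-12))

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)              # stand-in for the ground-truth waveform

# Case 1: same direction as the ground truth but much quieter -> low SNR anyway.
print(snr_db(x, 0.1 * x))                   # ~0.9 dB despite the "perfect" shape

# Case 2: the same imperfect output at two volumes gets two different SNRs,
# even though how well it is separated has not changed.
est = 0.5 * (x + 0.3 * rng.standard_normal(16000))
print(snr_db(x, est), snr_db(x, 1.5 * est)) # the louder copy scores higher here
```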

Scale invariant signal-to-distortion ratio(SI-SDR or SI-SNR)
  • This is essentially an improved version of SNR. The formula is as follows:


    \mathrm{SISDR} = 10\log_{10} \frac{\Vert X_T \Vert ^2}{\Vert X_E \Vert ^2}

     

    Here X_T is the projection of the model output vector onto the direction of the ground-truth vector, and X_E is the remaining component of the output, i.e. X_E = X^* - X_T (the part orthogonal to the ground truth).

  • This solves the two problems we just mentioned very well.

  • This metric can be refined further. If the mixture happens to be already close to one of the target sources, the raw SI-SDR can look good even without any real separation, so an improved measure, SI-SDRi, is used:

    • It is computed as the SI-SDR of the separated output (with respect to the ground truth) minus the SI-SDR of the unprocessed mixture (with respect to the same ground truth).

    • This directly measures how much the signal improves after passing through the model.
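A minimal numpy sketch of SI-SDR (via the projection described above) and of SI-SDRi; the zero-mean step and the function names are my own choices, not a specific toolkit's API.

```python
import numpy as np

def si_sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant SDR: project the estimate onto the reference direction,
    then compare that projection (X_T) with the residual (X_E)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    x_t = alpha * reference              # component along the ground truth
    x_e = estimate - x_t                 # everything else
    return 10 * np.log10(np.sum(x_t ** 2) / (np.sum(x_e ** 2) + 1e-12))

def si_sdr_improvement(reference, estimate, mixture) -> float:
    """SI-SDRi: how much better the separated output is than the raw mixture."""
    return si_sdr_db(reference, estimate) - si_sdr_db(reference, mixture)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
est = x + 0.3 * rng.standard_normal(16000)
print(si_sdr_db(x, est), si_sdr_db(x, 2.0 * est))  # identical: volume no longer matters
```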

Others (not discussed further)

3. Permutation Issue (the output arrangement problem)

  • If the input is the acoustic features of the mixture, the output acoustic-feature matrices can be compared with the ground truth using an L1 or L2 distance as the loss. If the model is strong enough to output waveforms directly, we can even use SI-SDR as the loss. Sounds simple, right?

  • However, we quickly run into a problem: during training, why should the model's first output be assigned to speaker 1 and the second output to speaker 2? The way the correct answers are arranged directly determines the entire training process.

  • Can the outputs be ordered by some characteristic of the voices, such as gender or intonation? No, because there are too many uncertainties.

    There are many different voices in our training corpus, and it is clearly unreasonable to fix an output order for the model. Suppose blue is a male voice and red is a female voice; we feed in the male-female mixture, and the model learns to put the female voice on top and the male voice on the bottom. But in another mixture, where green is the female voice and yellow is the male voice, the model may learn the opposite arrangement, male on top and female on the bottom. Even if we decide to separate by gender and always place male voices on top and female voices on the bottom, the problem is not fully solved, because we then need a further criterion to decide what counts as a male or female voice. That criterion could be pitch or timbre, but a higher-pitched voice is not necessarily female, and a voice with more energy is not necessarily male. One might also think of ordering by pitch, sending the higher voice to X1 and the lower voice to X2, but this is very unfriendly to mixtures of two voices with similar pitch. So deciding whether a separated voice should go to X1 or X2 is actually a hard question.

4. Deep Clustering

  • The earliest use of deep learning for speaker-independent separation was Deep Clustering. Looking back at the task as defined above, it takes one sound matrix as input and outputs two sound matrices. Generating the outputs from scratch with a deep network is overkill, because the input and outputs are actually very similar: often an output is just the input with something subtracted.

  • So we need to think about it another way. Instead of directly generating new sounds, the model generates a MASK, a matrix with the same shape as the input. Multiplying the MASK element-wise with the input X gives the voice to be separated. This is a design commonly used in Speech Separation.

  • Interestingly, the MASK can be binary (only 0s and 1s) or continuous (real-valued).

Ideal Binary Mask (IBM)
  • Let's look at a typical example of a binary mask: the IBM.

  • First of all, how do we obtain this IBM? It is very simple: for the two source sound matrices being mixed, at each position we check which source is louder (has the larger value); the mask of that source gets a 1 at that position, and the other mask gets a 0.

  • Multiplying the mixed speech by this mask filters out the blue source: only the blue entries are retained and everything else becomes 0. How well does this work in practice? Surprisingly well.

  • The catch is that our goal is separation: at test time we do not know which source is louder at each position, so the IBM is not directly available. In practice we therefore need a model (mask generation) to predict the IBM. During training we do have the clean sources, so the ideal IBM can easily be computed as the training label. And since the IBM is binary, generating the IBM of one source immediately gives the other (its complement), so the permutation issue disappears.
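A minimal numpy sketch of computing the IBM from the two clean magnitude spectrograms and applying it to the mixture; the helper names are mine.

```python
import numpy as np

def ideal_binary_mask(spec1: np.ndarray, spec2: np.ndarray) -> np.ndarray:
    """IBM for source 1: 1 where source 1 has the larger magnitude in that
    time-frequency bin, 0 elsewhere.  spec1/spec2 are magnitude spectrograms
    of the two clean sources (freq x time)."""
    return (np.abs(spec1) > np.abs(spec2)).astype(np.float32)

def apply_mask(mix_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise product of mask and mixture spectrogram."""
    return mix_spec * mask

# The second mask is simply 1 - ibm1, which is why predicting one mask
# also fixes the permutation of the other.
```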

Deep Clustering
  • How does Deep Clustering work? Let's first look at how it is used at inference time. The key component is the embedding generator, which takes the two-dimensional sound matrix and outputs a three-dimensional tensor: each original grid cell (time-frequency bin) is expanded into a vector. We then run K-means clustering on these vectors; the cells whose vectors fall in the same cluster get a mask value of 1 for that source, and the remaining cells get 0.

  • However, K-means is a fixed algorithm and cannot be trained, so only the embedding generator can be trained. How should it be trained?

  • First we need a training objective. We want cells dominated by the same speaker to end up in the same cluster; in K-means terms, their embedding vectors should be close together, while cells belonging to different speakers should get embeddings that are as far apart as possible. So we first compute the IBM from the labels of the mixture, and then, based on the 0/1 pattern of the IBM, decide for each pair of positions whether the embedding generator should pull their vectors together or push them apart (see the loss sketch at the end of this subsection).

  • The magical part is that, because K-means lets us choose the number of clusters freely, we can train on mixtures of multiple people, and the model is also effective on the task of separating more speakers' voices.

    Another cool thing is that the paper came with a live demo: two audience members spoke at the same time and the model separated their voices very well. The only fly in the ointment of Deep Clustering is that it is not end-to-end, since the K-means step sits in the middle.
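As a rough illustration of that training objective, here is a sketch of the pairwise-affinity loss commonly associated with Deep Clustering, plus the K-means inference step; the exact formulation in the paper may differ, and the helper names are mine (scikit-learn's KMeans is assumed available).

```python
import numpy as np

def deep_clustering_loss(V: np.ndarray, Y: np.ndarray) -> float:
    """Affinity loss ||V V^T - Y Y^T||_F^2, written in the cheap form that
    avoids building the (TF x TF) affinity matrices.

    V : (TF, D) embeddings, one D-dim vector per time-frequency bin.
    Y : (TF, C) one-hot assignment of each bin to its dominant speaker (from the IBM).
    """
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)

def masks_from_embeddings(V: np.ndarray, n_speakers: int = 2) -> np.ndarray:
    """Inference: cluster the embeddings with K-means and turn the cluster
    labels back into binary masks of shape (n_speakers, TF)."""
    from sklearn.cluster import KMeans  # assumed available
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(V)
    return np.stack([(labels == k).astype(np.float32) for k in range(n_speakers)])
```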

5. Permutation Invariant Training (PIT)

  • Proposed at Tencent, PIT makes truly end-to-end training possible, fixing the shortcoming of Deep Clustering. The idea is very simple: compute the loss for every possible arrangement of the ground-truth targets, and the arrangement with the smallest loss is the one that suits this model. The catch is that we need an existing separation model before any loss can be computed at all.

  • Therefore, after initialization we first pick a random arrangement and train a first round of the model, and then iterate the cycle of [compute the losses, compare them, update the arrangement, train again]. This is PIT (a minimal loss sketch follows at the end of this section).

  • Does this procedure actually converge? The answer is yes; this is a result from the instructor's own work. In the figure, the blue line is the model accuracy, which rises steadily with the iterations, while the black line is the fraction of assignments that changed compared with the previous arrangement: it is very unstable early on, but settles down as the model converges.

  • We can also compare PIT with other assignment strategies. In the figure below, (a) assigns outputs by volume and (b) assigns them by speaker characteristics; PIT is clearly better than both. There is another neat trick: take the arrangement that PIT finally converges to and use it as the initial arrangement when training a new network from scratch with PIT. The final result gets better, and repeating this once more improves it yet again.
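A minimal PyTorch sketch of a permutation-invariant loss, assuming the model outputs and references are stacked as (batch, speakers, time) tensors; the MSE criterion is just a placeholder for whatever loss (e.g. negative SI-SDR) is actually used.

```python
import itertools
import torch

def pit_mse_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant MSE loss.

    est, ref : (batch, n_spk, time).  For every utterance we try all n_spk!
    assignments of model outputs to ground-truth speakers and keep the cheapest one.
    """
    n_spk = est.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_spk)):
        # loss of this particular output-to-speaker assignment
        l = torch.mean((est[:, list(perm)] - ref) ** 2, dim=(1, 2))
        losses.append(l)
    losses = torch.stack(losses, dim=1)   # (batch, n_spk!)
    best, _ = torch.min(losses, dim=1)    # pick the best permutation per utterance
    return best.mean()
```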

6. Time-domain Audio Separation Network (TasNet)

  • Briefly, TasNet consists of three parts: Encoder, Separator and Decoder. Its input is a raw sound waveform, and its output is again sound waveforms. The input waveform is a very long series of values; the Encoder turns this series into a feature matrix. Unlike an acoustic feature matrix produced with fixed templates, the Encoder's templates here are learned. The feature matrix is fed to the Separator, which outputs two MASKs; these are multiplied with the feature matrix respectively and then passed through the Decoder to obtain the output waveforms. Training still has to deal with the Permutation Issue, so the PIT technique described earlier is used.

Encoder and Decoder

The overall schematic diagram is as follows:

  • Encoder:

    • It is a weight matrix (a linear mapping).

    • The input is a very short segment of the sound signal, only 16 samples, i.e. a 16-dimensional vector. After passing through the Encoder it becomes a 512-dimensional vector.

    • Does the output need to be constrained to be positive, e.g. by adding a ReLU? Conclusion: it does not help!

    • What the Encoder ends up learning:

      Match Filter: A matched filter is a signal processing filter used to detect specific predefined patterns or signal characteristics in a signal. In audio processing, matched filters can be used to detect or enhance specific frequency components, often in connection with spectral analysis. It can be used to extract or emphasize specific frequency information in a signal, such as the acoustic components in a speech signal.

      Phases: Phase describes where a periodic waveform is within its cycle, usually expressed in degrees or radians. In audio signals, the phase of the waveform captures the state of its oscillation, and it matters in the spectral representation of the signal.

      Basis (basic element): in signal processing, a "basis" usually refers to a set of basic functions or elements used to represent and decompose complex signals. Such a set of basis functions can be used to build or represent the structure of a signal and is often used to analyze and synthesize signals.

      • It can be seen that the x-axis is 16 samples (2 ms)

      • The y-axis has 512 dimensions. Each row is a basis, i.e. the so-called matched filter.

      • From the heat map we can see that each filter picks out different content, i.e. it responds to different frequencies, both high and low. There are more low-frequency filters because speech carries more energy at low frequencies. The Encoder also encodes phase information (whereas in the past, the phase was usually thrown away after the Fourier transform).

  • Decoder:

    • It is also a linear-mapping weight matrix.

    • It takes a 512-dimensional vector and converts it back into a 16-sample sound signal.

    • Does the Decoder need to be the inverse of the Encoder matrix? Conclusion: it does not help!
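Below is a minimal PyTorch sketch of such a learned encoder/decoder pair, using the 16-sample window and 512 basis signals mentioned above; the stride and the use of Conv1d/ConvTranspose1d are my assumptions rather than details confirmed by the lecture.

```python
from typing import Optional

import torch
import torch.nn as nn

class TasNetEncoderDecoder(nn.Module):
    """Sketch of the learned analysis/synthesis pair: a 16-sample window mapped
    to 512 learned basis signals, with a decoder that is a separate matrix
    rather than the encoder's inverse."""

    def __init__(self, n_basis: int = 512, win: int = 16):
        super().__init__()
        # A strided 1-D convolution plays the role of the sliding 16-sample window.
        self.encoder = nn.Conv1d(1, n_basis, kernel_size=win, stride=win // 2, bias=False)
        self.decoder = nn.ConvTranspose1d(n_basis, 1, kernel_size=win, stride=win // 2, bias=False)

    def forward(self, wav: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        feats = self.encoder(wav.unsqueeze(1))   # (batch, samples) -> (batch, 512, frames)
        if mask is not None:                     # a mask from the Separator, same shape as feats
            feats = feats * mask
        return self.decoder(feats).squeeze(1)    # back to (batch, samples)

wav = torch.randn(2, 16000)                      # two fake 1-second clips
model = TasNetEncoderDecoder()
print(model(wav).shape)                          # torch.Size([2, 16000])
```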

Separator

WaveNet:WaveNet is a deep learning model, specifically a deep convolutional neural network (CNN) structure, used to generate audio waveform data. It is widely used for audio synthesis, speech recognition, audio enhancement and other audio processing tasks.

Dilated Convolutional Neural Network (DCNN): the traditional convolution operation slides a fixed-size kernel over the input feature map to extract features. Dilated convolution introduces a new parameter, the dilation rate, which controls the spacing between elements of the convolution kernel. By adjusting the dilation rate, the receptive field can be enlarged while keeping the kernel size unchanged, so that long-range dependencies in the input data can be captured better.

Depthwise Separable Convolution (DSC): a convolution operation commonly used in convolutional neural networks that keeps good feature-extraction capability while significantly reducing the number of parameters and computations. DSC consists of two parts: depthwise convolution and pointwise convolution. The depthwise stage performs an independent convolution on each input channel, while the pointwise stage uses a 1x1 kernel to linearly combine the outputs of the depthwise stage into the final feature map.

Long Short-Term Memory (LSTM): a variant of the Recurrent Neural Network (RNN) commonly used to process sequence data. It effectively mitigates the vanishing- and exploding-gradient problems of plain RNNs on long sequences. LSTM captures long-term dependencies by introducing gating structures that explicitly save and update information from the input sequence. An LSTM unit includes three gates, the input gate, forget gate and output gate, plus a cell state. These gates control the flow of information based on the current input and the previous state, allowing the network to selectively remember and forget, and thus to handle long sequences better.

  • The Separator adopts the classic WaveNet-style architecture, a 1-D CNN with many layers, combined with dilated convolutions (DCNN).

  • The first CNN layer uses a filter to combine each vector with its immediate left neighbor (one step away) into a new vector. The second layer combines each vector with the neighbor two steps to its left; the third layer with the neighbor four steps to its left; and the fourth layer with the neighbor eight steps away.

  • The final vector therefore contains a lot of context. It is passed through a linear mapping (transform) and then a sigmoid (which squashes values to 0~1) to produce the two MASKs.

  • Do the values at corresponding positions of the two masks need to sum to 1? Conclusion: no!

  • The above is only a small part of the Separator. The real Separator has many more CNN layers and repeats this stack several times.

  • Why repeat so many CNN layers? Because the more the stack is repeated, the longer the span of the waveform the model can see. For example, after three repetitions of the stack in the schematic below, the model can see about 1.53 s of waveform.

  • With so many CNN layers, the parameter count would be large, so TasNet uses a technique called Depthwise Separable Convolution (DSC) to reduce it.

  • Today's TasNet Separator uses a convolutional architecture, but what about LSTM? Early versions of TasNet ran such experiments, with the results shown below. It turns out the LSTM is too sensitive: if we cut off the first part of the audio sample, the LSTM's performance swings between good and bad. In other words, the LSTM has overfitted to utterances that start at the beginning of a sentence and needs to read from the start to work well.

  • A CNN does not have this problem. Why? Because for a CNN it does not matter where the signal starts, whereas the information an LSTM sees at the start can have a big impact on everything after it.
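For concreteness, here is a sketch of one dilated depthwise-separable 1-D convolution block and of how stacking such blocks with doubling dilation grows the receptive field; the channel count, activation and residual connection are illustrative assumptions, not the exact Conv-TasNet recipe.

```python
import torch
import torch.nn as nn

class DilatedDSConvBlock(nn.Module):
    """One depthwise-separable 1-D conv block with a given dilation."""

    def __init__(self, channels: int = 512, kernel: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (kernel - 1) // 2
        self.depthwise = nn.Conv1d(channels, channels, kernel, padding=pad,
                                   dilation=dilation, groups=channels)  # one filter per channel
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)   # 1x1 mixing of channels
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.pointwise(self.depthwise(x)))          # residual connection (assumed)

# Stacking blocks with dilations 1, 2, 4, 8, ... roughly doubles the reach each layer.
blocks = nn.Sequential(*[DilatedDSConvBlock(dilation=2 ** i) for i in range(8)])

# Receptive field in frames for kernel 3 is 1 + 2 * sum(dilations); repeating the
# whole stack several times is how the model ends up "seeing" on the order of a
# second of audio, as described above.
receptive = 1 + 2 * sum(2 ** i for i in range(8))
print(receptive)   # 511 frames for one pass of the stack
```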

Summary and questions
  • The Encoder plays a role similar to a Fourier transform, converting the sound signal into something like a spectrogram; that output is fed to the Separator to obtain two masks. The masks are multiplied element-wise with the Encoder output, and the results are sent to the Decoder, which acts like an inverse Fourier transform and finally outputs the separated sound clips.

  • In 2020 there was also a paper called Wavesplit, which reached state-of-the-art on the WSJ0-2mix benchmark. TasNet sits in the middle at around 15.3, which already sounds close to perfect, while Deep Clustering is around 10.8. That number does not look great today, but when Deep Clustering came out it was an amazing result. Numerically, TasNet seems to crush Deep Clustering, but is that really the whole story?

  • TasNet still has problems. Given a mixture of two speakers, one speaking Chinese and one speaking English, TasNet does not separate them well, because it was trained only on mixtures of English speech and has, in a sense, overfitted to that condition. Deep Clustering, on the other hand, separates this mixture very well. From this example it is clear that Deep Clustering generalizes better than TasNet.

7. Other exploration directions in Speech Separation

Unknown number of speakers
  • In practical applications we may not know how many speakers are in the mixed audio. Deep Clustering has a chance of handling this, for example training with two speakers but testing with three. TasNet, however, cannot: the number of masks it outputs is fixed in advance, so it cannot handle a varying number of input voices.

  • How can an architecture like TasNet be used when the number of speakers is unknown? One approach is to train a network that isolates just one voice at a time and treats the rest as background, so multiple voices can be peeled off recursively. But this requires a way to detect when the separation should stop. It is not necessarily the best approach; what the best approach is remains worth studying.
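A hedged sketch of this "peel one voice at a time" idea, with a made-up energy-based stopping rule; the `one_vs_rest_model` interface is hypothetical, not a published recipe.

```python
import torch

def recursive_separation(mixture: torch.Tensor, one_vs_rest_model,
                         max_speakers: int = 5, energy_stop_db: float = -30.0):
    """Peel off one voice at a time until the residual is quiet.

    one_vs_rest_model is assumed to map a mixture to (one_voice, residual);
    the stopping rule here (residual energy threshold) is just one possibility.
    """
    sources, residual = [], mixture
    for _ in range(max_speakers):
        voice, residual = one_vs_rest_model(residual)
        sources.append(voice)
        # stop when what is left is quiet compared with the original mixture
        ratio_db = 10 * torch.log10(residual.pow(2).mean() / (mixture.pow(2).mean() + 1e-12))
        if ratio_db < energy_stop_db:
            break
    return sources
```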

Multiple Microphones
  • In addition, the input signals so far were all single-channel. In reality we often have multi-channel sound from a microphone array. How do we separate voices using more than one microphone? Traditionally there is a range of signal-processing methods. With deep learning, the input is simply replaced with the multi-channel signals and the model is trained end-to-end ("hard train" it and be done).

Visual Information
  • Also, a lot of audio comes from video, so we can combine the visual information in a video to strengthen speech separation. Google has a demo in which you select a face and the corresponding voice is separated out. How was it built? Essentially by hard end-to-end training.

  • The input to this model is the mixed voice signal together with the talking faces cropped from the video frames. The face streams are encoded with a shared dilated convolution network. The mixed waveform is first converted into spectral features with the STFT and then encoded into a speech embedding using an auto-encoder. The two face embeddings and the speech embedding are then concatenated; no PIT is needed here, because the faces determine the order in which the output voices are arranged. The concatenated vector first passes through a linear mapping layer that fuses image and sound, and then BiLSTM and FC layers turn this fused embedding into two complex-valued masks. These masks are multiplied with the spectrum of the original signal, the result is converted back from spectrum to waveform, and the separated waveforms are output.
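Purely as an illustration of the data flow just described, here is a rough PyTorch sketch of fusing face embeddings with a mixture spectrogram via a linear layer, a BiLSTM and an FC layer that emits one mask per face; all layer sizes are placeholders, a real-valued mask stands in for the complex one, and the actual model differs in many details.

```python
import torch
import torch.nn as nn

class AVFusionSeparator(nn.Module):
    """Toy audio-visual separator: concatenate audio and face embeddings,
    fuse them, and predict one mask per face (sizes are made up)."""

    def __init__(self, n_freq: int = 257, face_dim: int = 256, hidden: int = 256, n_faces: int = 2):
        super().__init__()
        self.n_faces, self.n_freq = n_faces, n_freq
        self.audio_enc = nn.Linear(n_freq, hidden)                  # stand-in for the audio encoder
        self.fuse = nn.Linear(hidden + n_faces * face_dim, hidden)  # linear fusion of audio + faces
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.to_masks = nn.Linear(2 * hidden, n_faces * n_freq)     # FC layer -> one mask per face

    def forward(self, spec_mag, face_emb):
        # spec_mag: (batch, time, n_freq) magnitude spectrogram of the mixture
        # face_emb: (batch, time, n_faces * face_dim) visual embeddings, time-aligned
        a = self.audio_enc(spec_mag)
        h = torch.relu(self.fuse(torch.cat([a, face_emb], dim=-1)))
        h, _ = self.blstm(h)
        masks = torch.sigmoid(self.to_masks(h))                     # real-valued stand-in for complex masks
        return masks.view(*spec_mag.shape[:2], self.n_faces, self.n_freq)
```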

Task-oriented Optimization
  • In fact, what we do with the separated speech varies: it may be meant for humans to listen to, or for machines.

  • If the output is for humans, we care more about the intelligibility and perceived quality of the separated speech, so the optimization targets can be metrics such as STOI and PESQ. However, PESQ is computed by a very complex, non-differentiable procedure, so it cannot be used directly as a loss for end-to-end training. How to optimize non-differentiable objectives is therefore a research topic of its own.

  • Now suppose the separated sound is not for humans but for a machine, for example Speaker Verification. The optimization goal is then not intelligibility but overall system performance. For example, after training a denoising model we would not simply feed its results to ASR (Automatic Speech Recognition); instead we chain the denoiser and the ASR model together and train them end-to-end, and only then does the overall system work well. If the separately trained parts are simply cascaded without joint training, the final effect is not good (see the minimal sketch after this list).

  • Therefore, depending on the downstream task, our training and optimization objectives can be very diverse.
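A tiny sketch of what "training the denoiser through the ASR" means in practice; `denoise_model`, `asr_model` and the loss returned by `asr_model` are hypothetical interfaces, not a specific toolkit, and the only point is that the gradient of the ASR loss flows back into the front-end.

```python
import torch

def joint_training_step(denoise_model, asr_model, optimizer, noisy_wav, transcript_targets):
    """One optimization step of a jointly trained front-end + recognizer."""
    optimizer.zero_grad()
    cleaned = denoise_model(noisy_wav)             # front-end output stays differentiable
    loss = asr_model(cleaned, transcript_targets)  # e.g. a CTC or attention loss inside the ASR
    loss.backward()                                # gradients flow into the denoiser as well
    optimizer.step()
    return loss.item()
```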

Other valuable papers
  • Beyond the speech separation papers covered here, there are many more worth reading, for example on speech enhancement.
