Separation of single-channel vocals and music based on a deep recurrent neural network

SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING DEEP RECURRENT NEURAL NETWORKS

Main content: Voice recognition is a current hot topic, and as it is deployed ever more widely it must also adapt to the needs of different scenarios. On smartphones in particular, component miniaturization means the hardware available for speech processing cannot be very large, so single-channel speech separation, which forms the front end of speech recognition, is extremely important. However, limited by how they process the data, traditional techniques cannot handle complicated interference in the signal. The introduction of DNNs and RNNs in recent years has therefore greatly improved separation quality.

  • Abstract: The difficulty of single-channel speech separation comes precisely from having only a single channel to work with. In this paper, the authors perform supervised separation of mixed speech signals with a deep recurrent neural network (DRNN), which separates the different source signals through a nonlinear model at its output. The loss function compares the outputs against the targets through an ideal time-frequency mask, and different loss functions can also be used to further increase the signal-to-interference ratio. Compared with previous methods, this approach is a large improvement: 2.30~2.48 dB GNSDR and 4.32~5.42 dB GSIR (measured on the MIR-1K dataset).

  • Introduction: Usage scenarios for single-channel speech separation include noise removal in automatic speech recognition (ASR) and, by separating the vocal from the music, improving the accuracy of chord recognition and pitch estimation (mainly for identifying instruments and judging a singer's voice quality – my thoughts). Current methods, however, are still far from the accuracy of human listeners, and for the single-channel case the gap is even larger.
      This paper focuses on separating vocals from music. For this goal, the current mainstream methods [7, 13, 16, 17] rest on one assumption: that the data matrices of the vocal and music signals are low-rank and sparse (low-rank meaning the matrix can be represented with much less data; see the keywords section at the end).
      Compared with those traditional methods, the deep-learning approach has far fewer restrictions, and its nonlinear structure gives the model enough expressive power to find good feature representations of the data. In this paper, different deep recurrent neural networks are built using joint optimization with a soft masking function. Moreover, the training objective can be flexibly changed to optimize the network. The overall process is shown in Figure 1.
    Figure 1. Network Architecture
      This paper is organized as follows: Section 2 discusses connections to prior work as background for the method. Section 3 introduces the method of this paper, including the deep recurrent neural network, joint optimization of the deep network, the soft time-frequency masking function, and the different objective functions. Section 4 covers the experimental setup and results on the MIR-1K dataset. Section 5 concludes the paper.

    1. Connections to previous work
        Previous work is mainly based on the assumption that the matrix of audio signals is low-rank and sparse, as in [7, 13, 16, 17]. But this assumption is not always correct. Moreover, in the separation stage these models all act as a single-layer linear network, predicting the clean spectra through a linear transformation, which is a significant limitation. Therefore, to improve the expressiveness of the model, we adopt a deep recurrent network, which places no strong requirements on low rank or sparsity of the data.
        By using deep architectures, deep learning approaches can discover hidden structure and features at different levels of abstraction in the data. Recently, deep learning has been applied in related fields such as speech enhancement and ideal binary mask estimation [1, 9–11, 15].
        In ideal binary mask estimation, researchers employed a two-stage deep learning framework. In the first stage, d neural networks are trained, one per output dimension, where d is the feature dimension of the target. In the second stage, a classifier (a single-layer perceptron or an SVM) refines the first-stage predictions. This kind of network has a flaw: with 1024 FFT points the output has 513 dimensions, so the collection of networks becomes very large and there is a lot of redundancy between adjacent frequencies. This paper therefore adopts a general framework that predicts all feature dimensions with a single neural network.
        In addition, researchers have used a deep convolutional neural network (DCNN) to remove noise from audio signals, but that setup is not suitable here, because it separates only one source signal while we need to separate all of them. In our approach, when multiple signals are separated we can optimize the masked outputs using the differences between the signals, and thus obtain better discriminative training.

    2. The method in this paper
      3.1 Deep Recurrent Neural Network DRNN
        A DRNN combines a DNN and an RNN, bringing together the advantages of both: through its recurrent connections the RNN can capture the temporal context of the signal and thus its relevant characteristics, while the DNN, through its stacked layers, captures information at different levels of abstraction. Figure 2 shows the three main DRNN variants: the leftmost is a plain RNN, the middle is a DRNN in which only one layer has temporal connections, and the rightmost is a stacked RNN in which every layer has temporal connections.
      Figure 2. DRNN architectures, where gray, black, and white nodes are the output, hidden, and input layers, respectively
        Our DRNN scheme is as follows. For an L-layer DRNN whose l-th layer is the recurrent layer, the hidden activation at time t is:
      h_t^l = f_h(x_t, h_{t-1}^l) = φ_l( U^l h_{t-1}^l + W^l φ_{l-1}( W^{l-1} ( ... φ_1( W^1 x_t ) ) ) )
        Its output is defined as follows:
      y_t = f_o(h_t^l) = W^L φ_{L-1}( W^{L-1} ( ... φ_l( h_t^l ) ) )
        where x_t is the input at time t, φ_l is the nonlinear activation function of the l-th layer, W^l is the weight matrix of the l-th layer, U^l is the recurrent weight matrix of the l-th layer, and the output layer is a linear layer.
        The stacked RNN has recurrent connections at every layer, with transition functions defined as follows:
      h_t^l = φ_l( U^l h_{t-1}^l + W^l h_t^{l-1} )
        where h_t^l is the hidden state of the l-th layer at time t, and U^l and W^l are the weight matrices for the hidden state at the previous time step t-1 and for the layer below l-1, respectively. When l = 1, the layer below is the input, h_t^0 = x_t. For the activation function φ, we find that the rectified linear function f(x) = max(0, x) [2] works well in practice and is better than the sigmoid and tanh functions. For a plain DNN, the temporal weight matrix U is a zero matrix.
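        To make the stacked-RNN transition above concrete, here is a minimal NumPy sketch (not the authors' code) of the forward pass h_t^l = ReLU(U^l h_{t-1}^l + W^l h_t^{l-1}); the layer sizes, random initialization, and toy input are illustrative assumptions, and the linear output layer of the full model is omitted.

```python
import numpy as np

def relu(x):
    """Rectified linear activation, f(x) = max(0, x), preferred here over sigmoid/tanh."""
    return np.maximum(0.0, x)

def stacked_rnn_forward(X, Ws, Us):
    """X: (T, d_in) input sequence; Ws[l] / Us[l]: feed-forward / recurrent weights of layer l."""
    T = X.shape[0]
    L = len(Ws)
    h_prev = [np.zeros(U.shape[0]) for U in Us]   # h_{t-1}^l for each layer, zero-initialized
    outputs = []
    for t in range(T):
        below = X[t]                              # h_t^0 = x_t
        for l in range(L):
            h = relu(Us[l] @ h_prev[l] + Ws[l] @ below)
            h_prev[l] = h
            below = h                             # feeds the layer above
        outputs.append(below)                     # top hidden layer at time t
    return np.stack(outputs)

# Toy usage: 3 hidden layers of 1000 units (as in the experiments), 513-dim input frames.
rng = np.random.default_rng(0)
d_in, d_h, T = 513, 1000, 10
Ws = [rng.normal(scale=0.01, size=(d_h, d_in if l == 0 else d_h)) for l in range(3)]
Us = [rng.normal(scale=0.01, size=(d_h, d_h)) for _ in range(3)]
H = stacked_rnn_forward(rng.random((T, d_in)), Ws, Us)
print(H.shape)   # (10, 1000)
```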
      3.2 Model structure
        The input of the network is the magnitude spectrum of the mixed signal; the features at time t are propagated through the network, which outputs two different source signals, and the network is updated by comparing these two outputs with the corresponding clean sources.
        Our goal is to separate all the source signals, not just one kind of signal. Therefore we follow the method of reference [9] and model all the source signals. The specific architecture is shown in Figure 3.
      Figure 3. Neural network architecture
        Loss function: we use time-frequency masking, i.e. a binary time-frequency mask or a soft time-frequency mask [7, 9]. The time-frequency masking function enforces the constraint that the predicted source signals sum to the original mixture.
        The definition of the time-frequency masking function is as follows:
      m_t(f) = |y1_t(f)| / ( |y1_t(f)| + |y2_t(f)| )
        where y1_t and y2_t are the two separated result signals and f indexes frequency. Once the mask values are obtained, the respective source signals can be recovered by multiplying the mask (and its complement) with the mixed signal:
      s1_t(f) = m_t(f) z_t(f),    s2_t(f) = (1 - m_t(f)) z_t(f)
        where z_t(f) is the spectrum of the mixture. Different from previous work, the time-frequency masking function here is not applied as a post-processing step on the trained outputs; instead it is built into the model as an additional layer, computed as:
      ỹ1_t = ( |y1_t| / ( |y1_t| + |y2_t| ) ) ⊙ z_t
      ỹ2_t = ( |y2_t| / ( |y1_t| + |y2_t| ) ) ⊙ z_t
        where ⊙ denotes element-wise multiplication, and the time-domain signal is reconstructed with the ISTFT.
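        Below is a minimal sketch (an illustration, not the authors' code) of the soft time-frequency masking layer just described: the network's two magnitude estimates are turned into a mask, and each source is recovered by element-wise multiplication with the mixture spectrum; the small eps constant is an added assumption to avoid division by zero.

```python
import numpy as np

def soft_mask_layer(y1_hat, y2_hat, z_mix, eps=1e-8):
    """y1_hat, y2_hat: predicted magnitude spectra of the two sources, shape (T, F);
    z_mix: magnitude spectrum of the mixture, shape (T, F)."""
    denom = np.abs(y1_hat) + np.abs(y2_hat) + eps   # eps avoids division by zero
    m = np.abs(y1_hat) / denom                      # soft mask, values in [0, 1]
    y1_tilde = m * z_mix                            # element-wise product with the mixture
    y2_tilde = (1.0 - m) * z_mix                    # the two estimates add up to the mixture
    return y1_tilde, y2_tilde
```

        Because the two masked outputs always sum to the mixture spectrum, the constraint mentioned above is satisfied by construction; the time-domain signals are then obtained with the ISTFT.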
      3.3 Training Objectives
        We use the mean squared error and a KL-divergence criterion as the metrics. The mean-squared-error objective is:
      J_MSE = || ỹ1_t - y1_t ||² + || ỹ2_t - y2_t ||²
        where y1_t and y2_t are the clean source spectra. For a mixed signal, one source usually dominates each frame, which is what the signal-to-interference ratio reflects. A discriminative objective can therefore be used that makes each predicted signal more similar to its own target while keeping it different from the other source:
      J_DIS = || ỹ1_t - y1_t ||² - λ || ỹ1_t - y2_t ||² + || ỹ2_t - y2_t ||² - λ || ỹ2_t - y1_t ||²
        where λ is a trade-off constant chosen at training time.
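        As an illustration only, here is a small NumPy sketch of the two objectives above; the default value of lam (λ) is an arbitrary placeholder, not a value prescribed by the paper.

```python
import numpy as np

def mse_objective(y1_tilde, y2_tilde, y1, y2):
    """Plain MSE: each masked output should match its clean target spectrum."""
    return np.sum((y1_tilde - y1) ** 2) + np.sum((y2_tilde - y2) ** 2)

def discriminative_objective(y1_tilde, y2_tilde, y1, y2, lam=0.05):
    """MSE plus terms that push each output away from the *other* source."""
    return (np.sum((y1_tilde - y1) ** 2) - lam * np.sum((y1_tilde - y2) ** 2)
            + np.sum((y2_tilde - y2) ** 2) - lam * np.sum((y2_tilde - y1) ** 2))
```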

    3. Conducting experiments
      4.1 Experimental setup
        The dataset is MIR-1K [6]: clips taken from 110 Chinese karaoke songs (male and female singers), sampled at 16 kHz, 4–13 seconds long, with various attributes of the sound manually annotated. Each clip contains one singing voice and one background accompaniment, and our experiments are based on these.
        Following the evaluation framework of [13, 17], the dataset is split into a training set and a test set; the singing voice and the background music are extracted from their separate channels, and the mixture to be separated is synthesized at 0 dB signal-to-noise ratio.
        The experimental results are evaluated with the signal-to-interference ratio (SIR), the signal-to-artifacts ratio (SAR), and the signal-to-distortion ratio (SDR) from BSS-EVAL [14]. Performance is reported as the normalized SDR and its global averages:
      NSDR(v̂, v, x) = SDR(v̂, v) - SDR(x, v)
        where v̂ is the estimated voice, v the clean voice, and x the mixture; GNSDR, GSIR, and GSAR are the length-weighted means over all clips.
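        For reference, a sketch of how the per-clip metrics could be computed with the mir_eval package (using mir_eval here is my assumption; the paper itself uses the BSS-EVAL toolbox of [14]):

```python
import numpy as np
import mir_eval

def evaluate_clip(voice_ref, music_ref, voice_est, music_est):
    """All arguments are 1-D time-domain signals of equal length."""
    refs = np.vstack([voice_ref, music_ref])   # shape (n_sources, n_samples)
    ests = np.vstack([voice_est, music_est])
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(refs, ests)
    return sdr, sir, sar                       # one value per source, in dB
```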
        In the training framework, to increase the diversity of the data, we shift (circularly) the source signals each time the mixtures are created.
        The input features use an STFT with 1024 sampling points and a 50% overlap; mel-scale spectra and logarithmic power spectra performed worse.
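        A minimal feature-extraction sketch matching the description above (librosa is used here as an assumption, and the file name is a placeholder):

```python
import numpy as np
import librosa

# Load a mixture clip at the MIR-1K sampling rate of 16 kHz.
y, sr = librosa.load("mixture.wav", sr=16000)

# 1024-point STFT with 50% overlap (hop of 512 samples).
S = librosa.stft(y, n_fft=1024, hop_length=512)

# Magnitude spectra as network input: one 513-dimensional frame per time step.
mag = np.abs(S).T
print(mag.shape)   # (num_frames, 513)
```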
      4.2 Experimental results
        The neural network is compared mainly along five dimensions: the size of the input window, the number of circular-shift steps, the output format, the DRNN architecture, and the choice of training objective.
        The baseline configuration of the experiments: 3 hidden layers of 1000 units each, the mean-squared-error metric, 10k circular-shift steps, an input window of 3 frames, and the DRNN-k architecture (recurrent connections at the k-th layer); the evaluation standard is GNSDR.

      • Step 1: Vary the input window size over 1, 3, and 5 frames. The comparison results are shown in Table 1. Result: 1 frame is best, and the subsequent comparisons use 1 frame.
      • Step 2: Circular-shift steps: 50k, 25k, 10k, and 0. The comparison shows that using circular shifts is much better than none, but increasing the number of steps beyond 10k brings no further improvement, so 10k is used.
      • Step 3: Output format: a single source, two sources without masking, or two sources with a masking layer; two sources with masking works best.
      • Step 4: DRNN architecture and objective function, as shown in Table 4. The results show that the DRNN with recurrent connections in the 2nd hidden layer works best (the architectures here could be explored in more variety).
      • Step 5: Discriminative training (Table 5): discriminative training improves GSIR but reduces GSAR, while GNSDR improves slightly.
        Finally, the authors compare with the traditional method. Relative to RNMF [13], this method obtains 2.30~2.48 dB higher GNSDR, 4.32~5.42 dB higher GSIR, and comparable GSAR. A sample separation is shown in Figure 4.
    4. Summary and Outlook
        This paper explores the implementation of a DRNN for single-channel separation. In particular, compared with the traditional methods and with a plain DNN, the joint optimization with the masking function improves the results. The final model works well: 2.30~2.48 dB GNSDR, 4.32~5.42 dB GSIR, and comparable GSAR. The model can also be used in other application scenarios, such as main-melody extraction.

    5. Related keywords
        Low rank: for an m×n matrix whose rank r is much smaller than m and n, the matrix can be split into the product of an m×r matrix and an r×n matrix (similar to an SVD decomposition). The storage taken by these two factors is much smaller than that of the original m×n matrix. In other words, each sound source can be represented by a few basis components, and it suffices to find these few, characteristic components.
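        A tiny NumPy sketch of this idea (the sizes and the rank are arbitrary for illustration): a rank-r matrix can be stored as two small factors, recovered here via the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 300, 5
A = rng.random((m, r)) @ rng.random((r, n))   # construct a matrix of rank 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :r] * s[:r]                          # m x r factor
C = Vt[:r, :]                                 # r x n factor
print(np.allclose(B @ C, A))                  # True: the two factors reconstruct A
print(m * n, "values vs", m * r + r * n)      # 60000 vs 2500 stored values
```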

    6. References
      [1] N. Boulanger-Lewandowski, G. Mysore, and M. Hoffman. Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
      [2] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011.
      [3] M. Hermans and B. Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190–198, 2013.
      [4] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, Nov. 2012.
      [5] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
      [6] C.-L. Hsu and J.-S. R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Transactions on Audio, Speech, and Language Processing, 18(2):310–319, Feb. 2010.
      [7] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson. Singing-voice separation from monaural recordings using robust principal component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60, 2012.
      [8] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In ACM International Conference on Information and Knowledge Management (CIKM), 2013.
      [9] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
      [10] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng. Recurrent neural networks for noise reduction in robust ASR. In INTERSPEECH, 2012.
      [11] A. Narayanan and D. Wang. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2013.
      [12] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In International Conference on Learning Representations, 2014.
      [13] P. Sprechmann, A. Bronstein, and G. Sapiro. Real-time online singing voice separation from monaural recordings using robust low-rank modeling. In Proceedings of the 13th International Society for Music Information Retrieval Conference, 2012.
      [14] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, July 2006.
      [15] Y. Wang and D. Wang. Towards scaling up classification-based speech separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(7):1381–1390, 2013.
      [16] Y.-H. Yang. On sparse and low-rank matrix decomposition for singing voice separation. In ACM Multimedia, 2012.
      [17] Y.-H. Yang. Low-rank representation of both singing voice and music accompaniment via learned dictionaries. In Proceedings of the 14th International Society for Music Information Retrieval Conference, November 4–8, 2013.
