(IEEE Access 7) Effective Combination of DenseNet and BiLSTM for Keyword Spotting

Paper: Effective Combination of DenseNet and BiLSTM for Keyword Spotting
Published in: IEEE Access (Volume 7)
Release Date: January 10, 2019

Abstract

  In this paper, building on DenseNet's strong ability to extract local feature maps, we propose a new network architecture (DenseNet-BiLSTM) for keyword spotting (KWS). In DenseNet-BiLSTM, DenseNet is used to obtain local features, and BiLSTM is used to capture time-series features. DenseNet is normally used for computer vision tasks, and it may corrupt the contextual information of speech audio. To make DenseNet suitable for KWS, we propose a DenseNet variant, called DenseNet-Speech, which removes the pooling along the time dimension in the transition layers to preserve the time-series information of speech. In addition, DenseNet-Speech uses fewer dense blocks and filters to keep the model small, thereby reducing the time consumed on mobile devices. Experimental results show that the feature maps from DenseNet-Speech preserve time-series information well. In terms of accuracy on the Google Speech Commands dataset, our method outperforms the state-of-the-art methods. For the 20-command recognition task, DenseNet-BiLSTM achieves an accuracy of 96.6% with 223K trainable parameters.

Introduction

  Command recognition accuracy is very important for KWS to deliver a good voice experience. In addition, a KWS module should be small enough to fit mobile devices with limited power and performance. We use the number of trainable parameters to measure the model size.
  Recently, end-to-end models have become popular for KWS [2]. Convolutional neural networks (CNN) [3], [4], deep neural networks (DNN) [5], and residual networks (ResNet) [6] have been studied for KWS. One disadvantage of CNN and ResNet is that they cannot capture the long-term dependencies of speech well. Recurrent neural networks (RNN) have also been studied for KWS [7], [8]. A potential limitation of RNNs is that they model the input features directly, without learning the local structure between successive time and frequency steps. Some work combines CNN and RNN to improve KWS accuracy, such as the convolutional recurrent neural network (CRNN) [9] and gated convolutional long short-term memory (LSTM) [10]; both methods outperform using a CNN or an RNN alone.
  Conventional CNNs have limited capacity to extract local feature maps. In recent years, the densely connected convolutional network (DenseNet) [11] was proposed. It alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and reduces the number of trainable parameters. DenseNet is an efficient way to capture local feature maps with fewer parameters. On the RNN side, bidirectional long short-term memory (BiLSTM) obtains time-series features well by using both past and future context. We therefore propose a method combining DenseNet and BiLSTM (DenseNet-BiLSTM) to achieve higher KWS accuracy with a reasonable number of trainable parameters. An overview of DenseNet-BiLSTM is shown in Figure 1.
Figure 1: Overview of combining DenseNet and BiLSTM for keyword spotting. DenseNet focuses on extracting local feature maps, and BiLSTM obtains time-series information.

In DenseNet, a transition layer containing a pooling operation sits between two dense blocks [11]. For DenseNet-BiLSTM, we want the DenseNet module to focus on local features while keeping the time-series information along the time dimension, so we pool only along the feature dimension; this DenseNet structure is called DenseNet-Speech. The feature maps from DenseNet-Speech are fed to the BiLSTM module to obtain time-series features. To make our model smaller, we use fewer layers in each block than DenseNet-121 and DenseNet-169 [11]. For DenseNet-BiLSTM, we also introduce an attention mechanism, which helps to improve KWS accuracy [12].
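The idea of a transition layer that leaves the time axis untouched can be sketched as follows. This is only an illustration under stated assumptions, not the paper's code: the helper name `pool_frequency_only`, the pool size of 2, and the choice of average pooling are all hypothetical.

```python
def pool_frequency_only(feature_map, pool=2):
    """Average-pool a [time][frequency] feature map along frequency only.

    Hypothetical helper: the pool size and the use of average pooling
    are assumptions made for illustration.
    """
    pooled = []
    for frame in feature_map:  # one row per time step; rows are never merged
        pooled.append([
            sum(frame[i:i + pool]) / pool
            for i in range(0, len(frame) - pool + 1, pool)
        ])
    return pooled


# Toy input: 3 time steps x 4 frequency bins.
features = [[1.0, 3.0, 5.0, 7.0],
            [2.0, 2.0, 4.0, 4.0],
            [0.0, 4.0, 8.0, 0.0]]
out = pool_frequency_only(features)
print(len(out), len(out[0]))  # time steps unchanged, frequency bins halved
```

Because every time step survives the pooling, a recurrent layer placed after it still sees the full sequence.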

In summary, the main contributions of this paper are as follows:

1. We propose an effective method combining DenseNet and BiLSTM for KWS, which outperforms the state-of-the-art models in accuracy on Google Speech Commands dataset v1 [13] and v2 [14].

2. We design a DenseNet variant (DenseNet-Speech) for small-footprint voice command recognition.

3. We find a new way to reduce trainable parameters for KWS while maintaining recognition accuracy: at similar accuracy, introducing a CNN with stronger local feature extraction capability helps to reduce the total number of trainable parameters.

Related Work

CNN architectures have brought significant improvements to KWS [3]. In [15], the audio recognition problem is converted into an image classification problem, with good results. When a CNN becomes deeper, it runs into the vanishing-gradient problem, which means that gradients cannot be propagated well to the earlier layers. To solve this problem, ResNet [6] introduces shortcut connections, so that one or more layers can be skipped. Inspired by the shortcut connections of ResNet, DenseNet was proposed to maximize the flow of information between CNN layers. In DenseNet, the feature maps of each layer are passed to all subsequent layers [11]. Compared with ResNet or other deep CNNs, DenseNet can pass more information to subsequent layers with fewer parameters. Significant improvements on the CIFAR-10, CIFAR-100, SVHN, and ImageNet recognition tasks [11] show that DenseNet extracts local feature maps more efficiently than conventional convolutional neural networks. In [16], DenseNet is combined with transfer learning for voice command recognition, and an accuracy of 85.52% is obtained.
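To make the dense connectivity concrete, the sketch below counts how many feature maps each layer of a dense block receives when every layer adds a fixed number of new maps (the growth rate) and sees the outputs of all earlier layers. The function name and the numbers used (16 input channels, growth rate 8, 4 layers) are hypothetical and are not the configuration used in the paper.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Return the input channel count of each layer and the block's output channels.

    Each layer receives the block input plus the outputs of all earlier
    layers, and contributes `growth_rate` new feature maps.
    """
    layer_inputs = [in_channels + growth_rate * i for i in range(num_layers)]
    out_channels = in_channels + growth_rate * num_layers
    return layer_inputs, out_channels


# Hypothetical configuration, for illustration only.
layer_inputs, out_channels = dense_block_channels(
    in_channels=16, growth_rate=8, num_layers=4
)
print(layer_inputs, out_channels)  # [16, 24, 32, 40] 48
```

A small growth rate keeps each layer narrow, which is one reason dense blocks can stay parameter-efficient.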

LSTM [17] is an RNN architecture designed to solve the exploding and vanishing gradient problems. It can store values over arbitrary time intervals and controls the flow of information through the sequence. LSTM with connectionist temporal classification (CTC) outperforms deep neural networks (DNN) and hidden Markov models (HMM) for KWS [7]. BiLSTM can bring in both past and future information about the current time step, which introduces more contextual information, and it has achieved great success in automatic speech recognition. Graves et al. [18] proposed a hybrid of deep BiLSTM and HMM that outperforms the GMM and deep neural network benchmarks on the Wall Street Journal corpus. For keyword spotting, a BiLSTM-only model greatly outperformed an HMM-based speech recognizer [19].

There is much related work on combining CNN and RNN for KWS. Arik et al. [9] present a network architecture with a two-layer CNN and an RNN, with a total of [mathematical processing error] parameters. Another model, which uses a gated CNN with LSTM [10], has one CNN layer, two gated CNN layers, and a BiLSTM (C-1-G-2-Blstm). C-1-G-2-Blstm has a total of [mathematical processing error] trainable parameters, and its accuracy on the Google Speech Commands dataset is 6.4% higher than that of the transfer learning network [16]. Inspired by the successful combination of CNN and RNN and by DenseNet's powerful local feature extraction, we propose combining DenseNet and BiLSTM for KWS to achieve better performance.

In addition, the attention mechanism [20], [21] lets a neural network focus on different parts of its input. Attention-based models allow higher accuracy for KWS. An attention-based model was very successful on the Google Speech Commands dataset [12]. In [22], an attention-based end-to-end model for small-footprint KWS was introduced that outperformed the Deep KWS system. For better performance, we also incorporate an attention mechanism into our model.

Methods

A. Audio feature extraction
Speech audio is preprocessed with a Mel-scale spectrogram [23]. For each utterance, the Mel-scale spectrogram is computed with 80 Mel bands, a 1024-point discrete Fourier transform, and a hop size of 128. All frames of each utterance are stacked to form a two-dimensional matrix. The Mel spectrogram is then converted to a dB-scale spectrogram for further processing.
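The preprocessing described above might look like the following with librosa. This is a sketch under assumptions rather than the authors' code: the 16 kHz sample rate, the `ref=np.max` reference for the dB conversion, and the helper names are not given in the text. The frame-count helper assumes librosa's default centered framing.

```python
def expected_num_frames(num_samples, hop_length=128):
    """Frame count under librosa's default centered framing (center=True)."""
    return 1 + num_samples // hop_length


def log_mel_spectrogram(path, sample_rate=16000):
    """Load one utterance and return an (80, frames) dB-scale Mel spectrogram.

    Assumptions: 16 kHz audio and ref=np.max for the dB conversion.
    """
    # Imported lazily so the frame-count helper works without librosa installed.
    import numpy as np
    import librosa

    y, sr = librosa.load(path, sr=sample_rate)  # resample to the assumed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=128, n_mels=80
    )
    return librosa.power_to_db(mel, ref=np.max)


print(expected_num_frames(16000))  # 126 frames for one second at 16 kHz
```

With these settings a one-second clip becomes roughly an 80 x 126 matrix, which is the two-dimensional input the network consumes.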

B. Model architecture
For each audio clip, we first use librosa [23] to extract the Mel-scale spectrogram and convert it to a dB-scale spectrogram. The dB-scale spectrogram is normalized for each audio clip. DenseNet-Speech is then used to obtain local feature maps. A stack of bidirectional LSTMs captures the long-term dependencies in the audio. An attention mechanism [24] lets the model focus on the parts it should attend to. Two fully connected layers are introduced so that the dimension of the learned features matches the number of classes. Finally, softmax is used for classification, with cross entropy as the loss function. All activation functions are rectified linear units (ReLU) [25]. Batch normalization [26] is applied to the convolution operations.
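The last stages of this pipeline (attention over the recurrent outputs, softmax, cross entropy) can be illustrated with a small pure-Python sketch. The exact attention formulation is not given in these notes, so the dot-product scoring against a query vector used here is an assumption, and the fully connected layers are skipped entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, query):
    """Weight each time step by softmax(dot(output_t, query)) and sum.

    outputs: [time][features] sequence, e.g. BiLSTM outputs.
    query:   [features] vector; how it is produced is an assumption here.
    """
    scores = [sum(o * q for o, q in zip(step, query)) for step in outputs]
    weights = softmax(scores)
    dim = len(outputs[0])
    context = [
        sum(w * step[d] for w, step in zip(weights, outputs))
        for d in range(dim)
    ]
    return context, weights


def cross_entropy(probabilities, target_index):
    """Cross-entropy loss for one example with a one-hot target."""
    return -math.log(probabilities[target_index])


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence, 3 time steps
context, weights = attention_pool(outputs, query=[1.0, 1.0])
probs = softmax(context)  # stand-in for the fully connected layers + softmax
print(weights, cross_entropy(probs, target_index=0))
```

The attention weights sum to one, so the context vector is a weighted average of the time steps, dominated by the steps that score highest against the query.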
C. DenseNet-Speech (not read)
D. BiLSTM with attention (not understood)

Experiments

A. Experimental setup
We implement our model in TensorFlow [27]. The model is evaluated by comparing its test accuracy with other architectures on the same datasets. An Nvidia GeForce GTX 980 is used for our experiments.

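1) Datasets
We train and evaluate our model on Google Speech Commands dataset v1 and v2. Google Speech Commands dataset v1 [13] was released on August 3, 2017, and consists of 64,727 one-second-long utterances of 30 short words. Google Speech Commands dataset v2 [14] was released in April 2018 and consists of 105,829 utterances, one second long (or less), of 35 words. We evaluate our models on task-12cmds and task-20words. task-12cmds refers to the recognition of 12 voice commands, while task-20words is the recognition task for the 20 core commands [14].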

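2) Hyperparameters
The batch size is 100. Training runs for a total of 18,000 steps, with early stopping applied. The initial learning rate is 0.001. Validation accuracy is tested every 400 steps. If the validation accuracy decreases, the learning rate is reduced by 50%. Training stops when it converges. We use the checkpoint with the best validation accuracy to evaluate the test accuracy. Adam stochastic optimization [28] is used for training.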
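The learning-rate rule in these hyperparameters can be sketched in plain Python. The step counts, initial rate, and halving factor come from the text; the validation accuracies are made-up values for illustration, and the helper name is hypothetical.

```python
def update_learning_rate(learning_rate, previous_accuracy, current_accuracy,
                         factor=0.5):
    """Halve the learning rate when validation accuracy drops between checks."""
    if previous_accuracy is not None and current_accuracy < previous_accuracy:
        return learning_rate * factor
    return learning_rate


TOTAL_STEPS = 18000
EVAL_EVERY = 400
learning_rate = 0.001

# Hypothetical validation accuracies, one per evaluation; not results from the paper.
validation_accuracies = [0.80, 0.90, 0.89, 0.93, 0.92]

previous = None
best_accuracy, best_step = 0.0, None
for check, accuracy in enumerate(validation_accuracies, start=1):
    step = check * EVAL_EVERY
    learning_rate = update_learning_rate(learning_rate, previous, accuracy)
    if accuracy > best_accuracy:  # remember which checkpoint to use for testing
        best_accuracy, best_step = accuracy, step
    previous = accuracy

print(learning_rate, best_step, TOTAL_STEPS // EVAL_EVERY)
```

A full run of 18,000 steps with a check every 400 steps gives 45 validation points at which the rate may be halved.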

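B. Accuracy evaluation
The model is evaluated by comparing its accuracy with existing related work on Google Speech Commands dataset v1 and v2, respectively. In addition, we add bidirectional gated recurrent unit (BiGRU) [29] and deep BiGRU models for comparison. We mainly focus on the test accuracy of each model. The number of trainable parameters of each work is also listed.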

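There are two typical ways to generate training and test data from the Google Speech Commands dataset. One is Google's approach [3], which splits the dataset 8:1:1 and adds background noise. The other is the Attention RNN implementation [12], which uses the audio files listed in validation_list.txt and testing_list.txt as validation and test data, and the remaining audio files as training data. Compared with the latter, the former includes samples in the test data that contain only background noise. In this section, when following the Attention RNN implementation, DenseNet-BiLSTM is built and trained with 3 dense blocks; when following Google's tool, it is built and trained with 2 dense blocks.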
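The list-file split (the second approach) can be sketched as below. The file names are hypothetical and only mimic the relative paths that validation_list.txt and testing_list.txt contain; reading the two text files from disk is left out.

```python
def split_by_list_files(all_files, validation_list, testing_list):
    """Partition relative audio paths into train / validation / test sets."""
    validation = set(validation_list)
    testing = set(testing_list)
    split = {"train": [], "validation": [], "test": []}
    for path in all_files:
        if path in validation:
            split["validation"].append(path)
        elif path in testing:
            split["test"].append(path)
        else:  # everything not listed is training data
            split["train"].append(path)
    return split


# Hypothetical file names, shaped like entries of the two list files.
all_files = ["yes/a.wav", "yes/b.wav", "no/c.wav", "no/d.wav"]
split = split_by_list_files(all_files, ["yes/b.wav"], ["no/d.wav"])
print({name: len(paths) for name, paths in split.items()})
```

Because the partition is fixed by the list files, results obtained this way are directly comparable across papers that use the same lists.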

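For Google Speech Commands dataset v1, Table 2 shows that our model achieves the best performance on both tasks. Compared with BiGRU-5 and Res15 [6] on task-12cmds, our model reduces the word error rate (WER) by 9.5% with fewer parameters. Although BiGRU-based models perform better than BiLSTM-based models, DenseNet-BiLSTM is slightly more accurate than DenseNet-BiGRU. Compared with Attention RNN [12], using the same method of generating training and test data, our model reduces WER by 40% on task-12cmds and by 27% on task-20words. Our model has 25% more trainable parameters than Attention RNN. Our model therefore achieves state-of-the-art performance on Google Speech Commands dataset v1 with a reasonable number of trainable parameters.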
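The WER reductions quoted above read most naturally as relative reductions in error rate, although that is an interpretation and is not stated explicitly here. Under that reading, the arithmetic is as follows; the accuracies in the example are round hypothetical numbers, not figures from Table 2.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Relative reduction in error rate, where error = 1 - accuracy."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies chosen for round numbers; not figures from Table 2.
reduction = relative_error_reduction(baseline_accuracy=0.95, new_accuracy=0.97)
print(f"{reduction:.0%}")
```

Going from 95% to 97% accuracy is only two points of absolute gain, but it removes two fifths of the remaining errors, which is why relative reductions look large when baselines are already strong.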

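Table 2: Accuracy on Google Speech Commands dataset v1. ConvNet is from [3]. Tpool2 results are from [30]. Res26 and Res15 results are from [6]. DS-CNN results are from [31]. C-1-G-2-Blstm results are from [10]. Attention RNN results are from [12]. TDNN results are from [32]. HD-CNN is from [33]. BiLSTM-i (BiGRU-i) denotes a model with i layers of BiLSTM (BiGRU). DenseNet-BiGRU denotes the model combining DenseNet and BiGRU. Res15, Res26, TDNN, and DS-CNN follow Google's tool. C-1-G-2-Blstm splits the dataset in the same way as Attention RNN. We trained ConvNet, BiLSTM-2, BiLSTM-5, BiGRU-2, BiGRU-5, BiGRU-8, and HD-CNN ourselves.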
There are some other results as well; they are not written up here.

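C. Learning speed
The training loss curves are plotted. We add LSTM-2 and GRU-5 for comparison. The learning rate of all models starts from 0.001 and is reduced by 50% when the validation accuracy does not improve. Figure 5 shows that all three models converge quickly, which is attributed to Adam stochastic optimization [28]. After convergence, our model has the smallest loss value.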
Figure 5: Learning speed. smoothingWeight is 0.95. The models here are all trained and evaluated on Google Speech Commands dataset v1.
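A smoothing weight of 0.95 is typically applied to a loss curve as an exponential moving average. The sketch below shows the plain form of that average; the plotting tool behind Figure 5 may use a slightly different formula (for example with a bias correction), and the loss values are hypothetical.

```python
def smooth(values, weight=0.95):
    """Exponential moving average: s_t = weight * s_(t-1) + (1 - weight) * x_t."""
    smoothed = []
    last = values[0]  # seed with the first point
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


losses = [2.0, 1.0, 1.5, 0.5]  # hypothetical raw training losses
smoothed_losses = smooth(losses)
print(smoothed_losses)
```

With a weight this high each new point moves the curve by only 5% of its distance to the raw value, so plotted curves lag behind the raw loss but are much easier to compare.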

D. Influence of noisy data
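Figure 6 shows that the accuracy of all models decreases as the background volume increases. DenseNet-BiLSTM consistently outperforms the other three models. Judging from the trend of the accuracy curves, the accuracy gap between DenseNet-BiLSTM and the other models keeps widening, which indicates that DenseNet-BiLSTM is more robust.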
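One common way to vary background volume is to scale a noise clip by a volume factor, add it to the speech samples, and clip the result to the valid range. The sketch below assumes that scheme; whether the experiment behind Figure 6 mixed noise exactly this way is not stated in these notes, and the sample values are made up.

```python
def mix_background(speech, noise, volume):
    """Add scaled background noise to speech and clip to [-1, 1].

    Assumed mixing scheme for illustration; sample values are floats in [-1, 1].
    """
    mixed = []
    for s, n in zip(speech, noise):
        mixed.append(max(-1.0, min(1.0, s + volume * n)))
    return mixed


speech = [0.5, -0.5, 0.9]  # hypothetical speech samples
noise = [0.2, 0.2, 0.4]    # hypothetical background noise samples
for volume in (0.0, 0.5, 1.0):
    print(volume, mix_background(speech, noise, volume))
```

Sweeping the volume factor while keeping the test utterances fixed gives an accuracy-versus-noise curve of the kind the figure describes.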

E. Influence of different DenseNet-Speech configurations
1) Influence of the growth rate
2) Influence of the number of dense blocks

F. Influence of BiLSTM layers and hidden units
1) Influence of the number of BiLSTM layers
2) Influence of the number of hidden units

G. Analysis
The feature maps from DenseNet-Speech still retain time-series information. Compared with DenseNet-Org-BiLSTM, DenseNet-BiLSTM achieves higher accuracy with fewer parameters.

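H. Discussion
KWS is a classification task with time-series information. Combining CNN and RNN is a suitable approach for KWS, and our experiments confirm this.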

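In general, DenseNets [11] have a large number of trainable parameters and pooling operations, so using a DenseNet may corrupt contextual information, which makes it unsuitable for speech audio. Through careful design of the model, DenseNet-Speech extracts local feature maps well while retaining time-series information with a reasonable number of trainable parameters. This shows that, with careful design, complex CNNs can be used to process speech signals. Other complex CNNs combined with RNNs could be explored to further improve KWS accuracy.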

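Removing one BiLSTM layer saves 99K trainable parameters, and reducing the number of LSTM hidden units from 64 to 32 saves 109K, whereas adding one dense block to DenseNet-Speech costs only an extra 30K. Figure 10 shows that our model can reach similar accuracy with fewer parameters. Compared with Attention RNN [12], DenseNet-BiLSTM-1 reduces trainable parameters by 25%, and DenseNet-BiLSTM-2 by 30%. At similar accuracy, introducing a CNN with stronger local feature extraction helps to simplify the BiLSTM block and therefore reduces the number of trainable parameters. Our work thus proposes a new way to reduce trainable parameters while maintaining model accuracy.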

One drawback of our DenseNet-BiLSTM model is that there is still a trade-off between accuracy and trainable parameters. This problem is very common in KWS. For task-12cmds on Google Speech Commands dataset v1, DenseNet-BiLSTM obtains 97.5% accuracy with 250K trainable parameters. When the number of hidden units is set to 128, it achieves 97.7% accuracy with 666K trainable parameters. Although the latter is more accurate, it requires more trainable parameters.

Conclusions

In this article, we explore combining DenseNet and BiLSTM (DenseNet-BiLSTM) to solve the keyword spotting problem. In DenseNet-BiLSTM, DenseNet is modified to extract local features while retaining time-series information. In fact, maintaining contextual information plays an important role in improving accuracy. Experiments on the Google Speech Commands datasets show that our model effectively combines DenseNet and BiLSTM for voice command audio, achieves a significant reduction in WER, and reaches state-of-the-art performance with 250K trainable parameters.

We also investigated the effect of different hyperparameters on model performance and found that, at similar accuracy, introducing a CNN with stronger local feature extraction capability can help to reduce trainable parameters.


Origin blog.csdn.net/Pandade520/article/details/104156933