Temporal Convolutional Network (TCN)

1 Introduction

Experiments show that RNNs perform well on almost all sequence tasks, including speech/text recognition, machine translation, handwriting recognition, sequence data analysis (forecasting), and so on.

In practical applications, however, RNNs have a serious built-in problem: the network can only process one time step at a time, so each step must wait for its predecessor to finish before it can run. This means RNNs cannot be massively parallelized the way CNNs can, especially when an RNN/LSTM processes text bidirectionally. It also means RNNs are extremely resource-intensive, since all intermediate results must be kept until the entire task has finished running.

When a CNN processes an image, it treats the two-dimensional image as a "block" (an m x n matrix). Migrating to time series, a sequence can be treated as a one-dimensional object (a 1 x n vector). Through a multi-layer network architecture, a sufficiently large receptive field can be obtained. This approach makes the CNN very deep, but thanks to large-scale parallel processing, no matter how deep the network is, it can be computed in parallel, saving a great deal of time. This is the basic idea behind TCN.

2 CNN extension techniques

TCN is based on the CNN model, with the following improvements:

  1. Suitability for sequence modelling: causal convolution (Causal Convolution)
  2. Longer historical memory: dilated convolution (Dilated Convolution) and the residual module (Residual block)

The following sections describe these CNN extension techniques.

2.1 Causal convolution (Causal Convolution)

To handle sequential problems, the CNN model must be modified; the result is causal convolution. The sequence problem can be framed as: given $x_1, \dots, x_t$, predict $y_t$. Causal convolution is defined as follows: for a filter $F = (f_1, f_2, \dots, f_K)$ and a sequence $X = (x_1, x_2, \dots, x_T)$, the causal convolution at $x_t$ is $(F * X)(x_t) = \sum_{k=1}^{K} f_k\, x_{t-K+k}$. The figure below shows an example of causal convolution: assume the last two nodes of the input layer are $x_{t-1}$ and $x_t$, the last node of the first hidden layer is $y_t$, and the filter is $F = (f_1, f_2)$; then, by the formula, $y_t = f_1 x_{t-1} + f_2 x_t$.

Causal convolution has two characteristics:

  1. Future information is not used. Given the input sequence $x_1, \dots, x_T$, the goal is to predict $y_1, \dots, y_T$; but when predicting $y_t$, only the already observed sequence $x_1, \dots, x_t$ may be used, not $x_{t+1}, \dots, x_T$.
  2. The further back in history the model needs to look, the more hidden layers are required. In the figure above, if the second hidden layer is taken as the output layer, its last node is connected to three input nodes, i.e. $x_{t-2}, x_{t-1}, x_t$; if the output layer is used, its last node is connected to four input nodes, i.e. $x_{t-3}, x_{t-2}, x_{t-1}, x_t$.
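
As a concrete illustration, here is a minimal sketch of causal convolution (assuming PyTorch; the class name CausalConv1d is made up for this example). Left-padding the input by $K-1$ positions ensures that the output at time $t$ only sees inputs at times $\le t$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D causal convolution: the output at time t depends only on inputs at times <= t."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1                  # pad only on the left
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                                # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))                 # no right padding, so no future leakage
        return self.conv(x)

# Filter F = (f1, f2) applied to a length-10 sequence; output length equals input length
x = torch.randn(1, 1, 10)
y = CausalConv1d(1, 1, kernel_size=2)(x)
print(y.shape)                                           # torch.Size([1, 1, 10])
```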

2.2 Dilated convolution (Dilated Convolution)

A standard CNN enlarges its receptive field by adding pooling layers, but pooling inevitably loses information. Dilated convolution injects "holes" into the standard convolution in order to enlarge the receptive field. Dilated convolution adds one hyper-parameter, the dilation rate, which is the number of intervals between kernel elements (for a standard CNN the dilation rate equals 1). The benefit is that the receptive field grows without the information loss caused by pooling, so each convolution output covers a wider range of information. The figure below shows a standard CNN (left) and a dilated convolution (right); in the right panel the dilation rate equals 2.

Dilated convolution is defined as follows: for a filter $F = (f_1, f_2, \dots, f_K)$ and a sequence $X = (x_1, x_2, \dots, x_T)$, the dilated convolution at $x_t$ with dilation rate $d$ is:

$(F *_d X)(x_t) = \sum_{k=1}^{K} f_k\, x_{t-(K-k)d}$

The figure below shows an example of dilated convolution: assume the last five nodes of the first hidden layer are $y_{t-4}, y_{t-3}, y_{t-2}, y_{t-1}, y_t$, the last node of the second hidden layer is $z_t$, the filter is $F = (f_1, f_2, f_3)$, and $d = 2$; then, by the formula, $z_t = f_1 y_{t-4} + f_2 y_{t-2} + f_3 y_t$.

The receptive field of a dilated convolution is $(K-1)d + 1$, so increasing either $K$ or $d$ enlarges the receptive field. In practice, $d$ usually grows exponentially with base $2$ as the network gets deeper; for example, in the figure above $d$ takes the values $1, 2, 4$ in turn.
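
As a quick numerical check of the dilated-convolution formula (a sketch, assuming PyTorch and random weights): with $K = 3$ and $d = 2$, the last output value should equal $f_1 y_{t-4} + f_2 y_{t-2} + f_3 y_t$, and the receptive field of the layer is $(K-1)d + 1 = 5$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, d = 3, 2                                      # kernel size and dilation rate
conv = nn.Conv1d(1, 1, kernel_size=K, dilation=d, bias=False)

y = torch.randn(1, 1, 12)                        # hidden sequence y_1 ... y_12
out = conv(F.pad(y, ((K - 1) * d, 0)))           # causal left padding of (K-1)*d

# Manually compute z_t = f1*y_{t-4} + f2*y_{t-2} + f3*y_t for the last time step
f = conv.weight.view(-1)                         # (f1, f2, f3)
z_t = f[0] * y[0, 0, -5] + f[1] * y[0, 0, -3] + f[2] * y[0, 0, -1]
print(torch.allclose(out[0, 0, -1], z_t))        # True
print((K - 1) * d + 1)                           # receptive field of this layer = 5
```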

2.3 Residual module (Residual block)

A CNN can extract low/mid/high-level features; the more layers the network has, the richer the features of different levels it can extract. Moreover, the deeper the network, the more abstract and semantically meaningful the extracted features become.

Simply increasing the depth, however, leads to vanishing or exploding gradients. The remedies are careful weight initialization and normalization layers (Batch Normalization), which make it possible to train networks with tens of layers. Once the gradient problem is solved, another problem appears: network degradation. As the number of layers increases, the accuracy on the training set saturates and then even drops. Note that this is not overfitting, since an overfitted model would perform better on the training set. The figure below shows an example of degradation: the 20-layer network performs better than the 56-layer one.

In theory, the solution space of a 56-layer network contains that of a 20-layer network, so the 56-layer network should perform at least as well. But the training results show that both the training error and the test error of the 56-layer network are larger than those of the 20-layer network (which also explains why this is not overfitting: the 56-layer network cannot even drive its training error down). The reason is that, although the 56-layer solution space contains the 20-layer one, training uses stochastic gradient descent, which usually finds a local rather than a global optimum. The 56-layer solution space is clearly more complex, so stochastic gradient descent fails to reach a good solution.


Suppose the optimal network structure has 18 layers. When we design a network we do not know in advance how many layers would be optimal, so we might design one with 34 layers. The extra 16 layers are then redundant; ideally, during training the model would learn to make these 16 layers an identity mapping, i.e. their output would be exactly their input. In practice, however, the model finds it very hard to learn the parameters of these 16 identity-mapping layers correctly, so such a network necessarily performs worse than the optimal 18-layer one. This is why models degrade as networks get deeper.

Solving the degradation problem therefore comes down to making the redundant layers learn an identity mapping (so that the deep network is equivalent to a shallower one). It is usually hard for a layer to learn the identity mapping $H(x) = x$ directly; but if we design the network as $H(x) = F(x) + x$, learning the identity mapping turns into learning a residual function $F(x) = H(x) - x$. As long as $F(x) = 0$, we obtain an identity mapping $H(x) = x$. At initialization the weights are generally small, which is well suited to learning $F(x) = 0$, so fitting the residual is easier. This is the idea behind residual networks.

The figure below shows the structure of a residual module. It provides two paths: the identity mapping (i.e. $x$, the curved line on the right, called the shortcut connection) and the residual mapping (i.e. $F(x)$). If the network has already reached its optimum and is made deeper, the residual mapping is pushed towards $0$, leaving only the identity mapping; in theory the network then stays optimal, and its performance no longer degrades as depth increases.

This residual module can be implemented as a feed-forward network plus a shortcut connection. The shortcut simply performs an identity mapping, so it introduces no extra parameters and no additional computational complexity, and the whole network can still be trained end-to-end with back-propagation.

The residual module in the figure above contains two layers. Experiments show that a residual module generally needs at least two layers; a single-layer residual module brings no improvement. There are two kinds of shortcut connections:

(1) identity mapping with matching dimensions ($F(x)$ and $x$ have the same dimension):

$y = F(x, \{W_i\}) + x$

(2) identity mapping with different dimensions ($F(x)$ and $x$ have different dimensions):

$y = F(x, \{W_i\}) + W_s x$

The expressions above are written for fully connected layers; in fact the residual module can also be used with convolutional layers, in which case the addition becomes an element-wise sum of the two feature maps over corresponding channels.
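
The following is a minimal sketch of a two-layer convolutional residual block with an identity shortcut (assuming PyTorch; the layer sizes are illustrative, not taken from any particular network): the residual branch computes $F(x)$ and the shortcut carries $x$, so the block outputs $H(x) = F(x) + x$.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-layer residual block with an identity shortcut (same input/output dimensions)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.relu(self.conv1(x))   # first layer of F(x)
        residual = self.conv2(residual)       # second layer of F(x)
        return self.relu(residual + x)        # H(x) = F(x) + x, then activation

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)             # torch.Size([1, 64, 32, 32])
```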

Rules for designing a CNN:

(1) Layers whose output feature maps have the same size use the same number of filters, i.e. the same number of channels;

(2) When the feature map size is halved (by pooling), the number of filters is doubled.

In a residual network, shortcut connections between layers with matching dimensions are drawn as solid lines, and those with mismatched dimensions as dashed lines. When the dimensions do not match, there are two options for the identity mapping:

(1) Increase the number of channels directly by zero padding (filling with zero feature maps).

(2) Increase the number of filters by changing the number of filters of a 1x1 convolution; this adds parameters.

In practice, padding with zero feature maps is the more common choice.

A residual network contains many residual blocks; the figure below shows part of such a network. Each residual module contains two layers. Residual modules with matching dimensions are connected by solid lines, and those with different dimensions by dashed lines. The 2nd and 3rd convolutional layers are 3x3x64; they have the same number of channels, so the computation is $y = F(x, \{W_i\}) + x$. The 4th and 5th convolutional layers are 3x3x128; their channel count differs from that of the 3rd layer (64 vs. 128), so the computation is $y = F(x, \{W_i\}) + W_s x$, where $W_s$ is a convolution (128 filters of size 3x3x64) used to adjust the number of channels of $x$.
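
Here is a sketch of the dimension-changing case just described (assuming PyTorch, and assuming the spatial size is halved at the same time, as in design rule (2) above; the shapes are illustrative): the residual branch maps 64 channels to 128, so the shortcut applies $W_s$ (here 128 filters of size 3x3x64 with stride 2) before the addition.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                         # input feature map with 64 channels

# Residual branch F(x): 3x3 convolutions going from 64 to 128 channels; stride 2 halves H and W
branch = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
)

# Shortcut W_s x: 128 filters of size 3x3x64 with stride 2 adjust the channels and spatial size
shortcut = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

y = branch(x) + shortcut(x)                            # y = F(x, {W_i}) + W_s x
print(y.shape)                                         # torch.Size([1, 128, 28, 28])
```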


3 TCN (Temporal Convolutional Network)

Because TCN deals with time series, it uses one-dimensional convolutions. The figure below shows the TCN architecture with causal and dilated convolutions. At every layer, the value at time $t$ depends only on the values at times $\le t$ in the layer below, which reflects the causal convolution; and each layer extracts information from the layer below by skipping over positions, with the dilation rate growing exponentially with base 2 from layer to layer, which reflects the dilated convolution. Because of the dilated convolutions, each layer needs padding (usually filled with zeros); the padding size is $(K-1)d$.
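
A small plain-Python sketch (kernel size $K = 2$ is assumed purely for illustration) of how the per-layer padding $(K-1)d$ and the receptive field grow as the dilation rate doubles from layer to layer:

```python
K = 2                                        # kernel size (assumed for illustration)
dilations = [1, 2, 4, 8]                     # dilation rate doubles with depth

receptive_field = 1
for layer, d in enumerate(dilations, start=1):
    padding = (K - 1) * d                    # zero padding added on the left by this layer
    receptive_field += (K - 1) * d           # each layer extends how far back the output can see
    print(f"layer {layer}: d={d}, padding={padding}, receptive field={receptive_field}")

# layer 1: d=1, padding=1, receptive field=2
# layer 2: d=2, padding=2, receptive field=4
# layer 3: d=4, padding=4, receptive field=8
# layer 4: d=8, padding=8, receptive field=16
```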

The figure below shows the residual module of the TCN architecture: the input goes through a dilated causal convolution, weight normalization, an activation function, and dropout (twice) to form the residual function $F(x)$; the input also goes through a 1x1 convolution to match the number of filters and serves as the shortcut connection $x$.
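
Here is a sketch of such a residual module (assuming PyTorch; the name TemporalBlock and the hyper-parameters are illustrative, not the reference implementation): two dilated causal convolutions, each followed by weight normalization, ReLU, and dropout, plus a 1x1 convolution on the shortcut when the channel counts differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """TCN residual module: (dilated causal conv -> WeightNorm -> ReLU -> Dropout) x 2 + shortcut."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # causal left padding
        self.conv1 = nn.utils.weight_norm(
            nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation))
        self.conv2 = nn.utils.weight_norm(
            nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation))
        self.dropout = nn.Dropout(dropout)
        # 1x1 convolution on the shortcut when the channel counts differ
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else None

    def forward(self, x):                                # x: (batch, channels, time)
        out = self.dropout(torch.relu(self.conv1(F.pad(x, (self.pad, 0)))))
        out = self.dropout(torch.relu(self.conv2(F.pad(out, (self.pad, 0)))))
        shortcut = x if self.downsample is None else self.downsample(x)
        return torch.relu(out + shortcut)                # F(x) + x

x = torch.randn(8, 16, 100)                              # (batch, channels, time)
print(TemporalBlock(16, 32, kernel_size=3, dilation=2)(x).shape)  # torch.Size([8, 32, 100])
```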

The figure below shows an example of a TCN; when $d = 1$, the dilated convolution reduces to an ordinary convolution.


Translated from a Zhihu article: https://zhuanlan.zhihu.com/p/69919158


Origin: www.cnblogs.com/USTC-ZCC/p/11734436.html