"Model interpretation" of the residual connection resnet, are you sure really understand it?

https://www.toutiao.com/a6708715626782786061/

 

1 The residual connection

Anyone who has worked with deep learning has presumably heard of the skip connection, also known as the residual connection. So what exactly is a skip connection? See the figure below.

"Model interpretation" of the residual connection resnet, are you sure really understand it?

 

The figure above is a schematic of the skip-connection block from the ResNet paper [1]. We can describe the input-output mapping of a network block with a nonlinear function: given an input x, the output is F(x), where F typically consists of operations such as convolution and activation.

When we forcibly add the input onto the output, we can still describe the input-output relationship with some function G(x), but this G(x) can now be explicitly split into the linear superposition of F(x) and x.

This is the idea of the skip connection: express the output as the linear superposition of the input and a nonlinear transformation of the input. There is no new formula and no new theory, just a new way of writing the mapping.

Yet it solves the problem of training deep networks; the original authors pushed their networks past a thousand layers.
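
To make G(x) = F(x) + x concrete, here is a minimal sketch of a residual block. The PyTorch-style implementation, the two-convolution form of F, and the channel/size choices are our own illustrative assumptions, not the exact block of [1].

```python
# A minimal sketch of a residual block in the spirit of the figure above
# (assumed PyTorch-style implementation, not the exact block from [1]).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with an activation in between
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # G(x) = F(x) + x: the skip connection adds the input back onto the output
        return self.f(x) + x

block = ResidualBlock(channels=64)
y = block(torch.randn(1, 64, 32, 32))  # same shape in and out: (1, 64, 32, 32)
```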

Was ResNet the first to use a residual connection? Of course not. Traditional neural networks already had this concept, and reference [2] explicitly proposed a residual structure, an idea that comes from the gating mechanism of the LSTM:

y = H(x, WH) · T(x, WT) + x · (1 - T(x, WT))

As you can see, when T(x, WT) = 0 we get y = x, and when T(x, WT) = 1 we get y = H(x, WH). For the relevant background on LSTMs, you can read up elsewhere.
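
To see how the gate behaves at its extremes, here is a toy numeric sketch; h and t simply stand in for the values of H(x, WH) and T(x, WT), which in a real highway layer are learned functions of x.

```python
# A toy numeric sketch of the gating formula above (scalar values only).
def highway(x, h, t):
    # y = H(x)*T(x) + x*(1 - T(x)), with H(x)=h and T(x)=t supplied directly
    return h * t + x * (1.0 - t)

x, h = 2.0, 5.0
print(highway(x, h, t=0.0))  # 2.0 -> gate closed: y = x, the input is carried through
print(highway(x, h, t=1.0))  # 5.0 -> gate open:   y = H(x), the transform takes over
print(highway(x, h, t=0.5))  # 3.5 -> anything in between mixes the two
```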

In that paper, the researchers managed to train networks over a thousand layers deep without resorting to special initialization schemes. So why did that paper not catch fire the way ResNet did? There are naturally many reasons: ResNet ran more experiments to back up its claims, simplified the formula above, won the CVPR best paper award, enjoyed greater name recognition, and so on.

In short, the residual connection as we know it today is the following equation:

y = H(x, WH) + x, the so-called residual connection, or skip connection.

2 Why use a skip connection?

So why do this at all? First, a general consensus has formed: up to a point, the deeper the network, the stronger its expressive capacity and the better its performance.

However, good as that sounds, increasing depth brings many problems: vanishing gradients and exploding gradients. Had nobody tried to solve these before ResNet appeared? Of course they had. Better optimizers, better initialization strategies, BN layers, ReLU and other activation functions were all tried, but they were still not enough; their ability to mitigate the problem was limited, until the residual connection came into wide use.

We all know that deep learning relies on error backpropagation and the chain rule to update parameters. Suppose we have a function like this:

"Model interpretation" of the residual connection resnet, are you sure really understand it?

 

Here you can think of f, g, and k as, say, a convolution, an activation, and a classifier.

Taking the derivative of the cost with respect to the input by the chain rule gives:

"Model interpretation" of the residual connection resnet, are you sure really understand it?

 

The problem is that once any one of these factors is very small, the gradient can become smaller and smaller after repeated multiplication. This is the often-mentioned vanishing gradient: in a deep network, almost nothing makes it back to the shallow layers. With a residual, however, every such factor gains an identity term of 1: dH/dx = d(F + x)/dx = dF/dx + 1. Now even if the original derivative dF/dx is small, the error can still be backpropagated effectively. This is the core idea.
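
As a sanity check of the two formulas above, here is a toy scalar sketch of our own construction (not an experiment from any of the cited papers): the same weak layer is stacked 20 times, once plainly and once with a skip connection, and we compare the gradient that reaches the input.

```python
# Toy scalar "layers": compare the input gradient through a plain composition
# versus a residual one of the same depth.
import torch

def plain_layer(x):
    return 0.1 * torch.tanh(x)        # local derivative is at most 0.1

def residual_layer(x):
    return x + 0.1 * torch.tanh(x)    # same layer, plus the skip connection

def input_gradient(layer, depth=20):
    x = torch.tensor(1.0, requires_grad=True)
    y = x
    for _ in range(depth):
        y = layer(y)
    y.backward()                      # chain rule: product of the local derivatives
    return x.grad.item()

print("plain stack:    dy/dx =", input_gradient(plain_layer))     # ~1e-21, vanishes
print("residual stack: dy/dx =", input_gradient(residual_layer))  # stays on the order of 1
```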

Let us build some intuition with a concrete example.

Suppose the input is x = 1, the non-residual network is G, and the residual network is H, where H(x) = F(x) + x.

And suppose the input-output relationships evolve as follows:

At time t:

Non-residual network: G(1) = 1.1

Residual network: H(1) = 1.1, with H(1) = F(1) + 1, so F(1) = 0.1

At time t + 1:

Non-residual network: G'(1) = 1.2

Residual network: H'(1) = 1.2, with H'(1) = F'(1) + 1, so F'(1) = 0.2

Now compare the two:

Relative change of the non-residual network G: (1.2 - 1.1) / 1.1 ≈ 0.09

Relative change of the residual branch F: (0.2 - 0.1) / 0.1 = 1

Since both networks have parameters to update, you can see that this change in the output affects F far more than it affects G. In other words, after introducing the residual, the mapping becomes much more sensitive to changes in the output. And what does a change in the output reflect? Precisely the true value of the error.
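
Spelling out the arithmetic of the example above in a few lines:

```python
# The arithmetic of the toy example: how large a relative adjustment each
# mapping needs in order to move its output from 1.1 to 1.2.
g_t, g_t1 = 1.1, 1.2   # non-residual network: G(1) at time t and t+1
f_t, f_t1 = 0.1, 0.2   # residual branch: F(1) = H(1) - 1 at time t and t+1

change_G = (g_t1 - g_t) / g_t   # ~0.09: G only needs to change by about 9%
change_F = (f_t1 - f_t) / f_t   # 1.0: F has to change by 100%

print(f"relative change demanded of G: {change_G:.2%}")
print(f"relative change demanded of F: {change_F:.2%}")
```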

So intuitively a residual like this ought to be effective, and experimental results from all sides have confirmed it.

3 Is the skip connection really that simple?

Above we explained how the skip connection alleviates the vanishing-gradient problem during backpropagation and thus makes deep networks easier to train. But some researchers say no, no, no, it is not that simple.

While researchers, big companies, and products at home are busy throwing deep learning at live streaming and short video, making the whole environment restless, many heavyweights abroad are quietly studying the theoretical foundations of deep learning; the gap in depth is plain to see. The study in [3] shows directly that the failure to train deep neural networks is caused not by vanishing gradients but by the degeneration of the weight matrices, which attacks the problem at its root.

"Model interpretation" of the residual connection resnet, are you sure really understand it?

 

Of course, ResNet does help with vanishing gradients, and the paper runs experimental comparisons, shown above. But that is not the whole story: the next figure compares residual connections (blue curve) against random, dense orthogonal connection matrices, and you can see that the residual connection does not stand out as especially effective.

"Model interpretation" of the residual connection resnet, are you sure really understand it?

 

Combining the experiments above, the authors argue that network degeneration, not gradient vanishing, is the fundamental reason deep networks are hard to train. Even when the gradient norm is large, if the network's available degrees of freedom contribute to that norm very unevenly, that is, only a small number of hidden units in each layer change their activations for different inputs while most hidden units respond identically to every input, then the rank of the whole weight matrix is low. And as the number of layers grows, the repeated matrix products drive the overall rank even lower.

This is what we usually call the network degeneration problem: the matrix is nominally very high-dimensional, but most of those dimensions carry no information, so the expressive power is not as strong as it looks.

The residual connection forcibly breaks exactly this symmetry of the network.

"Model interpretation" of the residual connection resnet, are you sure really understand it?

 

In the first case (panel a), the incoming weight matrix (gray part) has completely degenerated to 0, so the output W has lost all discriminative power; adding the residual connection (blue part) restores the network's expressiveness. In the second case (panel b), the incoming weight matrix is symmetric, so the output W likewise cannot discriminate between the two parts; adding the residual connection (blue part) breaks the symmetry. The third case (panel c) is a variant of panel b and needs no further explanation.
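
A small numerical sketch of cases (a) and (b), using toy 4x4 matrices of our own choosing rather than anything from [3]: a collapsed or duplicated-unit weight matrix has low rank, and adding the identity contributed by the residual path restores full rank.

```python
# Toy illustration: rank of a degenerate weight matrix W versus W + I.
import numpy as np

n = 4
W_zero = np.zeros((n, n))                        # case (a): weights degenerated to 0
W_dup = np.tile(np.random.randn(1, n), (n, 1))   # case (b)-like: all units respond identically

for name, W in [("zero W", W_zero), ("duplicated-unit W", W_dup)]:
    print(name,
          "| rank(W) =", np.linalg.matrix_rank(W),
          "| rank(W + I) =", np.linalg.matrix_rank(W + np.eye(n)))
# rank(W) is 0 or 1, while rank(W + I) comes out full (4) in both cases.
```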

In one sentence: the residual connection breaks the symmetry of the network and improves its representational power. For more on the feature degeneration caused by symmetry, you can consult further material in [4].

Research on why skip connections are effective [5-6] has never stopped. As for how far it will go, keep an eye on the academic literature, and feel free to follow us as well.

References

[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.

[2] Srivastava R K, Greff K, Schmidhuber J. Highway networks[J]. arXiv preprint arXiv:1505.00387, 2015.

[3] Orhan A E, Pitkow X. Skip connections eliminate singularities[J]. arXiv preprint arXiv:1701.09175, 2017.

[4] Shang W, Sohn K, Almeida D, et al. Understanding and improving convolutional neural networks via concatenated rectified linear units[J]. 2016: 2217-2225.

[5] Greff K, Srivastava R K, Schmidhuber J. Highway and residual networks learn unrolled iterative estimation[J]. 2017.

[6] Jastrzebski S, Arpit D, Ballas N, et al. Residual connections encourage iterative inference[J]. arXiv preprint arXiv:1710.04773, 2017.


Origin blog.csdn.net/weixin_42137700/article/details/94437117