DL study notes --- Resnet


Foreword

After a few months of studying neural networks, I feel I have not really learned much beyond leveling up my "alchemy" (trial-and-error tuning) skills... I don't want to stop at applications; I want to grasp the theory as well, so I decided to record the theoretical knowledge I have learned here on the blog, which also helps reinforce my memory.

Recently I have been looking at several well-known network models, and I am starting this first post with ResNet (mainly because GoogleNet is too complicated...).

Why ResNet

Neural network models keep getting deeper and deeper, so the question is: is deeper always better?

First, the paper raises a problem: the degradation problem. Specifically, as the network gets deeper, accuracy saturates (both test_acc and train_acc) and then degrades rapidly. The cause is clearly not over-fitting (with over-fitting, train_acc would remain high), and the paper argues it is not vanishing/exploding gradients either (normalization techniques such as BN already deal with vanishing gradients effectively).

As for what actually causes degradation, the paper does not give a detailed explanation, and opinions online vary, but that does not affect the idea behind ResNet.

Imagine two networks \(N_1\) and \(N_2\), both meant to fit a function \(H(x)\), where \(N_1\) has only 10 layers and \(N_2\) has 100. Write the functions the trained networks actually represent as \(H_{N_1}(x)\) and \(H_{N_2}(x)\). Experiments show that \(N_2\) fits worse than \(N_1\), i.e. \(H_{N_1}(x)\) is closer to \(H(x)\).

This is clearly odd: if the first 10 layers of \(N_2\) were trained to exactly \(H_{N_1}(x)\) and the remaining 90 layers were all identity mappings, then \(N_2\) would be at least no worse than \(N_1\). The problem is that it is hard to train a conventional multi-layer nonlinear network to be an identity mapping. This is where ResNet comes in, using so-called residual learning to solve the problem.
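A toy sketch in PyTorch of the construction argument above (my own illustration, not from the paper; `nn.Identity` stands in for the trained-to-identity layers that are hard to obtain in practice):

```python
# Appending exact identity layers to a shallow network leaves its output
# unchanged, so a deeper network could in principle always match a shallower one.
import torch
import torch.nn as nn

torch.manual_seed(0)

shallow = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
# "Deep" network = the shallow one followed by 90 identity layers.
deep = nn.Sequential(shallow, *[nn.Identity() for _ in range(90)])

x = torch.randn(4, 8)
print(torch.allclose(shallow(x), deep(x)))  # True: no worse than the shallow net
```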

What is residual learning

Let's first explain a single residual unit with a figure.

[Figure: a single residual unit, with the residual branch \(F(x)\) on the left and an identity shortcut on the right]

The corresponding formula is
\[ y = F(x, \{W\}) + x \]

The figure shows one of the simplest residual units.

Suppose the mapping this unit needs to fit is still \(H(x)\), but the network now splits into two branches: the right branch is a direct identity mapping with no parameters to learn.

The left branch then needs to fit \(H(x) - x\), which we write as \(F(x)\); all the parameters to be learned live in this branch. \(F(x)\) is the residual to be learned, hence the name residual learning.

It looks like a trivial change, but its effect is substantial.

Back to the earlier problem: an ordinary multi-layer nonlinear network is hard to train into an identity mapping.

A residual unit, however, is relatively easy to train into an identity mapping: since the right branch is already the identity, the left branch only needs to learn \(F(x) = 0\) (for example by driving all its weights to 0), which is much easier than training a plain stack of layers to represent \(F(x) = x\). Once multi-layer nonlinear networks can be trained into identity transformations, the problem of deep networks being no better than shallow ones is solved. In fact, deep networks built from residual units turn out to perform better than shallow ones. (A minimal code sketch of such a unit is given below.)

ResNet inserts shortcuts (the right branch in the figure above) throughout a deep network, with each shortcut skipping at least two layers (experiments show that a shortcut skipping only a single layer does not improve accuracy). The rightmost network in the paper's architecture comparison figure is a typical ResNet.
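Here is a minimal sketch of such a residual unit in PyTorch (my own illustration, assuming the simple case where the shortcut is a pure identity and input/output shapes match; it is not the exact code from the paper). The left branch of conv/BN layers plays the role of \(F(x, \{W\})\), and the right branch is the parameter-free shortcut.

```python
import torch
import torch.nn as nn


class ResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Left branch: the residual function F(x, {W}) with learnable weights.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # Right branch: identity shortcut, no parameters to learn.
        # If the conv weights collapse to zero, the residual branch contributes
        # nothing and the unit passes x through (up to the final ReLU).
        return self.relu(residual + x)


if __name__ == "__main__":
    unit = ResidualUnit(channels=64)
    x = torch.randn(1, 64, 32, 32)
    print(unit(x).shape)  # torch.Size([1, 64, 32, 32])
```

In the paper, when the shapes do not match (e.g. when the number of channels changes), the shortcut uses a projection or zero-padding instead of the pure identity; the sketch above skips that case for clarity.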

What does ResNet solve

The paper says ResNet does not solve vanishing gradients, nor over-fitting. (That said, there are other papers arguing that ResNet does in fact alleviate vanishing gradients.)

According to the paper, what it solves is making identity mappings effectively trainable, so that deep networks no longer suffer from the degradation problem.

In another paper (Identity Mappings in Deep Residual Networks), the author mentions a further benefit:

the signal can be directly propagated from any unit to another, both forward and backward

The specific derivation can be found in that paper itself. The paper also introduces an improved version of ResNet, and explains why the unit should be
\[ y = F(x, \{W\}) + x \]
rather than
\[ y = \lambda_1 F(x, \{W\}) + \lambda_2 x, \qquad \lambda_1, \lambda_2 \neq 1 \]
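As a rough sketch of the argument in [2] (my paraphrase; see the paper for the exact statement): with pure identity shortcuts, the input of any deeper unit \(x_L\) can be written as a shallower unit's \(x_l\) plus a sum of residuals,
\[ x_L = x_l + \sum_{i=l}^{L-1} F(x_i, \{W_i\}) \]
so by the chain rule the gradient of the loss \(\mathcal{E}\) contains a term that flows back to \(x_l\) directly, without passing through any weight layers:
\[ \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, \{W_i\}) \right) \]
If the shortcut were scaled instead, \(x_{i+1} = \lambda_i x_i + F(x_i, \{W_i\})\), the same expansion would put a factor \(\prod_{i=l}^{L-1} \lambda_i\) in front of \(x_l\), which grows or vanishes exponentially with depth whenever \(\lambda_i \neq 1\). That is why the plain identity shortcut is kept.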

Epilogue

That is about all I have to say about ResNet; this is only my own understanding, so I cannot guarantee its correctness, and I would appreciate corrections if anything is wrong. The next post will tackle the more complicated GoogleNet.

ResNet really is that simple!

References

[1] Deep Residual Learning for Image Recognition. Kaiming He

[2] Identity Mappings in Deep Residual Networks. Kaiming He

[3] https://www.zhihu.com/question/64494691/answer/271335912
