[Classic convolutional network] ResNet theory explanation

We know that the deeper a neural network is, the more convolutional and pooling layers it contains, and the purpose of convolution and pooling is feature extraction. So, in theory, the deeper the network, the stronger and more representative the features it should output.

(Figure: error curves comparing plain networks of different depths; the deeper 56-layer network shows the higher error rate.)

In experiments, however, the deepest network (56 layers) actually produced the highest error rate. This shows that the degradation problem really exists: as the depth increases, the network's performance gets worse and worse. The residual network was designed to solve exactly this problem.

What exactly does the residual network do?

- Let the later layers of a deep network realize an identity mapping.
- Reduce the magnitude of the parameters inside each module, so that those parameters respond more sensitively to the loss propagated back through the network.

Let's explain these two points in detail.

**Explanation of the first point** (letting the later layers of a deep network realize an identity mapping): an identity mapping here means that the output of some layer already reflects the features of the image well, and the layers after it must not change those features. We know that the features inside a deep network reach their best state at some layer; if the later layers keep extracting and transforming those features, the network's performance drops. The role of the residual network is to let the deep network perform at least as well as the shallow one, that is, to let the later layers of the deep network act, at a minimum, as an identity mapping.

So how does the residual network realize this identity mapping? Let's look at the basic building block of a residual network.

(Figure: the basic residual block — a stack of weight layers computing F(x), plus a shortcut connection that adds the input x to give H(x) = F(x) + x.)

F(x) = H(x) - x, where x is the output of the shallower layers, H(x) is the output of the deeper layers, and F(x) is the transformation performed by the layers sandwiched in between. When the features represented by the shallow output x are already optimal, any further change to them would only increase the loss, so F(x) automatically tends to learn to be 0, and x keeps passing through along the identity-mapping path. In this way, the original goal is achieved without extra computational cost: during forward propagation, when the output of the shallow layers is already good enough, the later layers of the deep network can act as an identity mapping.
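To make the block concrete, here is a minimal sketch in PyTorch (the library choice, the two 3x3 convolutions, and the batch-norm layers are my own assumptions following the usual ResNet recipe, not something stated in the original post). The stacked weight layers compute F(x); the shortcut adds x back, so the block outputs H(x) = F(x) + x, and if F(x) learns to be zero the block reduces to an identity mapping.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x (identity shortcut)."""

    def __init__(self, channels: int):
        super().__init__()
        # Two 3x3 convolutions form the residual branch F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(x)))  # first weight layer
        residual = self.bn2(self.conv2(residual))      # second weight layer -> F(x)
        # Identity shortcut: if F(x) learns to be 0, the block simply passes x through.
        return self.relu(residual + x)                 # H(x) = F(x) + x


# Quick shape check on a dummy input: spatial size and channels are preserved.
block = ResidualBlock(channels=16)
y = block(torch.randn(1, 16, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])
```

Note that the shortcut adds no learnable parameters, which is why the identity path comes at no extra computational cost.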

**Explanation of the second point** (reducing the magnitude of the parameters inside the module so that the parameters in the network respond more sensitively to the back-propagated loss):

First, let's understand backpropagation:

In backpropagation, the network outputs a value, compares it with the ground-truth value to obtain a loss, and that loss is propagated backwards to update the parameters. The gradient that reaches a parameter depends on the loss and on the local gradients along the way. Since the whole point is to change the parameters, the problem is that the resulting change can be too weak; if the parameter values themselves are made smaller, the same back-propagated loss produces a relatively larger change in them.
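As a minimal toy illustration of this chain (the numbers, the 1/2 squared-error loss, and the learning rate are my own choices, not from the post): the gradient that reaches a weight is the upstream error multiplied by the local gradient, and the weight is then nudged against it.

```python
# Toy backpropagation for a single weight: y = w * x, loss = (y - target)^2 / 2.
x, w, target, lr = 2.0, 3.0, 5.0, 0.1

y = w * x              # forward pass: 6.0
error = y - target     # upstream error signal dL/dy: 1.0
grad_w = error * x     # chain rule: dL/dw = dL/dy * dy/dw, with dy/dw = x -> 2.0
w = w - lr * grad_w    # gradient-descent update: 3.0 -> 2.8

print(error, grad_w, w)  # 1.0 2.0 2.8
```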

For example, suppose the output without the residual module is h(x). When x = 10, h(x) = 11; simplifying h to a linear operation W_h · x, W_h is obviously 1.1. After adding the residual module, F(x) = 1 and H(x) = F(x) + x = 11; simplifying F to a linear operation as well, the corresponding W_F is 0.1. If the ground-truth label is 12, the back-propagated loss is 1. The gradient returned to the parameter in F and to the parameter in h is actually the same, and its value is x in both cases, but the parameter of F only has to move from 0.1 to 0.2, while the parameter of h has to move from 1.1 to 1.2. **So the residual module significantly reduces the value of the parameters inside the module, making the parameters in the network respond more sensitively to the back-propagated loss. Although this does not fundamentally solve the problem of the back-propagated loss being small, shrinking the parameters relatively amplifies the effect of that loss, and it also produces a certain regularization effect.** Secondly, **because the identity-mapping branch exists in the forward pass, backpropagation has more, and easier, paths along which the gradient can flow.**
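The arithmetic above can be checked with a small sketch (the 1/2 squared-error loss and the learning rate of 0.01 are my own assumptions for illustration): both weights receive exactly the same gradient, whose magnitude is x, so the same absolute update of 0.1 is only about a 9% change to W_h but a 100% change to W_F.

```python
# Plain layer:    h(x) = W_h * x        with W_h = 1.1  -> output 11
# Residual block: H(x) = W_F * x + x    with W_F = 0.1  -> output 11
x, target, lr = 10.0, 12.0, 0.01
W_h, W_F = 1.1, 0.1

# Both outputs are 11, so the error (target - output) is 1 in both cases, and
# for loss = (target - output)^2 / 2 the gradient is dL/dW = -(target - output) * x = -10.
grad = -(target - W_h * x) * x      # same value for W_F: -(12 - 11) * 10 = -10

W_h_new = W_h - lr * grad           # 1.1 -> 1.2  (~9% relative change)
W_F_new = W_F - lr * grad           # 0.1 -> 0.2  (100% relative change)

print(W_h_new, W_F_new)
print((W_h_new - W_h) / W_h, (W_F_new - W_F) / W_F)  # ~0.09 vs 1.0
```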

Therefore, the most important role of the residual module is to change the way information is transmitted in both the forward and backward passes, which greatly helps the optimization of the network.


Source: blog.csdn.net/weixin_51781852/article/details/125732755