The problem ResNet solves
In many papers, and among the top-performing entries on ImageNet, very deep networks dominate, which shows the importance of network depth; many tasks have benefited from deeper networks. This raises a question: can we obtain better performance simply by stacking more layers?
The first obstacle is that, as the number of layers increases, gradients tend to vanish or explode, which hinders the convergence of the network. This problem, however, has been largely addressed by proper initialization and by normalization between layers (e.g., batch normalization, BN).
The second problem is the degradation of network performance (degradation). As layers are added, accuracy first rises, saturates, and then degrades rapidly. Since a deeper network is also a more complex one, it is natural to suspect that this degradation is caused by overfitting. The answer is no, because the degraded network performs poorly on the training set as well. The paper's experimental results are given in the figure below. Degradation is the problem ResNet sets out to solve.
Residual structure (building block)
Suppose the mapping we want to learn is H(x) (H(x) need not be the final mapping of the whole network; it can also be the mapping realized by a few intermediate layers). The idea of residual learning is not to learn H(x) directly, but to learn the residual F(x) = H(x) - x. In other words, the desired mapping H(x) is split into two parts, the residual F(x) and the identity mapping x (the layer's input), so that H(x) = F(x) + x. The residual structure (building block) is shown below. The identity mapping introduces no extra parameters and no extra computation.
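The split into a residual branch plus an identity shortcut can be sketched with a tiny fully-connected example. This is a minimal sketch (the paper's blocks are convolutional; the weight shapes here are hypothetical, chosen just to illustrate y = relu(F(x) + x)):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal fully-connected residual block: y = relu(F(x) + x),
    where F(x) = w2 @ relu(w1 @ x) is a two-weight-layer residual branch.
    The shortcut adds x unchanged, so it adds no parameters or computation."""
    f = w2 @ relu(w1 @ x)   # residual branch F(x)
    return relu(f + x)      # element-wise add with the identity, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1 = rng.standard_normal((4, 4))
w2 = np.zeros((4, 4))       # with the last weights at zero, F(x) = 0 ...
y = residual_block(x, w1, w2)
# ... and the block reduces to y = relu(x): effectively an identity mapping
```

Note how setting the residual branch's weights to zero recovers the identity mapping, which is the key property exploited in the next section.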
The idea of the residual structure is inspired by the degradation problem itself: when nonlinear layers keep being stacked onto a shallow network, performance degrades. But if the added layers could be constructed as identity mappings, the deeper network would be guaranteed to do no worse than its shallower counterpart, and the degradation problem would be solved. ResNet therefore constructs identity mappings via shortcut connections.
If the identity mapping happens to be optimal, i.e., H(x) = x, then the weights of the residual branch F(x) can simply be driven to 0, which is much easier than making a stack of nonlinear layers fit an identity mapping by tuning their parameters. More generally, even when the identity mapping is not optimal, a plain stack of layers must both learn the difference (residual) between input and output and preserve the useful information in the original input; the residual structure only needs to learn the difference, which reduces the difficulty of learning.
Some notes on the residual structure
1. For the residual branch F(x) of a building block, note that the output of its last linear (weight) layer is first added element-wise to the identity mapping, and only then passed through the nonlinear activation. The shortcut must therefore span two or more weight layers: if it spanned only one, the residual branch before the addition would be just a linear function, and applying the activation afterwards would make the block essentially no different from directly stacked nonlinear layers without any residual structure.
2. When F(x) and x are added element-wise, their dimensions must match. If they do not, a linear mapping (e.g., a 1x1 convolution) can be inserted on the identity (shortcut) path so that the projected input has the same dimensions as F(x).
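When the residual branch changes the dimension, the shortcut needs a projection before the element-wise add. A fully-connected sketch (the matrix w_s plays the role of the 1x1 convolution; all shapes here are hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block_proj(x, w1, w2, w_s):
    """Residual block whose branch F(x) changes the dimension (here 4 -> 6).
    The shortcut applies a linear projection w_s (the 1x1-convolution
    analogue) so that the element-wise add y = relu(F(x) + w_s @ x) is valid."""
    f = w2 @ relu(w1 @ x)     # F(x) has 6 dimensions, x has only 4
    return relu(f + w_s @ x)  # project x to 6 dimensions, then add

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((6, 8))
w_s = rng.standard_normal((6, 4))
y = residual_block_proj(x, w1, w2, w_s)   # y has shape (6,)
```

Without w_s, the addition `f + x` would fail because the shapes (6,) and (4,) are incompatible.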
ResNet network structure
The ResNet network structure is shown below.
Two kinds of building blocks are used, as shown below. In the block on the left, the residual branch stacks two 3x3 convolution layers. In the block on the right, the residual branch stacks three convolution layers: 1x1, 3x3, then 1x1. The design on the right is called a bottleneck: the first 1x1 convolution compresses the channel dimension and the last 1x1 convolution restores it. This design reduces the number of model parameters and saves computation.
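The savings can be checked with a quick parameter count. Using 256 input/output channels and a 64-channel bottleneck width (the channel sizes are illustrative; biases are ignored):

```python
# Weight counts for the two block designs at 256 channels.
c = 256      # input/output channels of the block
mid = 64     # compressed channels inside the bottleneck

# Left design: two stacked 3x3 convolutions at full width
basic = 2 * (3 * 3 * c * c)

# Right design (bottleneck): 1x1 reduce, 3x3 at reduced width, 1x1 restore
bottleneck = (1 * 1 * c * mid) + (3 * 3 * mid * mid) + (1 * 1 * mid * c)

print(basic, bottleneck)   # 1179648 69632: roughly 17x fewer parameters
```

Because the expensive 3x3 convolution runs at the reduced width, the bottleneck block is far cheaper even though it is one layer deeper.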