[Paper Notes] Understanding the Residual Network ResNet in One Article (with Code)


Since its debut in 2015, the Residual Network (ResNet) has won first place in the ILSVRC competition by a clear margin thanks to its excellent performance, and it has since been applied successfully in many fields.

1. The Problem with Traditional Deep Networks

In deep learning and neural network research, it is generally assumed that the deeper the network (the more layers it has), the better it should perform, because a deeper network tends to have stronger representational power, i.e., a larger capacity.

In practice, however, we find that an overly deep network actually degrades performance. There seems to be a "threshold" in network design: beyond a certain depth, both the training error and the test error increase. The figure below shows the training behavior of a 20-layer and a 56-layer plain network on the CIFAR-10 dataset.
[Figure: training error and test error of a 20-layer vs. a 56-layer plain network on CIFAR-10]
Clearly, this degradation is not caused by overfitting, because overfitting would mean the training error keeps decreasing normally while the test error increases significantly.

One explanation for this phenomenon is that when the network is too deep, tiny changes in the low-layer parameters cause drastic changes in the high-layer parameters, and the optimization algorithm is unable to find a good solution.

Consider the following thought experiment: suppose we have a 50-layer network, but the depth at which the optimizer can most easily reach the best solution is 25. Then the last 25 layers of this network should act as an identity mapping:
$$\mathbf{x}_{25}=f_{1}(\mathbf{x}),\qquad \mathbf{out}=f_{2}(\mathbf{x}_{25})$$
where $f_{1}$ denotes the first 25 layers and $f_{2}$, the last 25 layers, should ideally be an identity mapping.

Since a neural network is composed of nonlinear layers, learning an identity mapping is actually difficult. Because of the limitations of the optimization algorithm, these "redundant" layers end up learning parameters that are far from an identity mapping.

2. The Residual Structure and the Residual Network

2.1 What Is a Residual?

The statistical definition of a residual: the difference between an observed value and the estimated (fitted) value.

If some $k$-layer network $F$ is currently the best network, then we can construct a deeper network whose last few layers are merely identity mappings of the output of the $k$-th layer of $F$; this deeper network achieves results identical to $F$.

If $k$ is not yet the optimal depth, the deeper network should be able to achieve even better results. Therefore, if a deeper network performs worse than a shallower one, it means the newly added layers are hard to learn.

If they are hard to learn directly, we can take a "divide-and-conquer" approach and learn the identity part and the non-identity part separately:

  • $\mathbf{x}$ represents what the preceding shallow network has already learned;
  • $F(\mathbf{x})$ represents the residual between what has been learned and what still needs to be learned.

Now we only need to learn $F(\mathbf{x})$ and combine it with $\mathbf{x}$:

$$H(\mathbf{x})=F(\mathbf{x})+\mathbf{x},\qquad F(\mathbf{x})=H(\mathbf{x})-\mathbf{x}$$

Let $\mathbf{x}$ be carried by an identity mapping (the shortcut); then only the residual $F(\mathbf{x})$ has to be learned as the non-identity part.
The residual here refers to the difference between the desired mapping $H(\mathbf{x})$ and the shortcut connection $\mathbf{x}$, i.e., $F(\mathbf{x})$.

2.2 The Residual Block

Based on this, we design a residual block (Residual Block) with the following structure:
[Figure: a residual block, i.e., a stack of weight layers with a shortcut connection that adds the input back to their output]

$$\mathbf{y}_{l}=h(\mathbf{x}_{l})+\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l}),\qquad \mathbf{x}_{l+1}=f(\mathbf{y}_{l})$$
In the network implementation, $h(\mathbf{x}_{l})=\mathbf{x}_{l}$ and $f=\mathrm{ReLU}$; for the analysis below, $f$ is treated as an identity so that $\mathbf{x}_{l+1}\equiv\mathbf{y}_{l}$. The resulting expression for the residual block is:
$$\mathbf{x}_{l+1}=\mathbf{x}_{l}+\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l})$$
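
In code, the whole idea reduces to one addition on the forward path. A minimal sketch (not the torchvision implementation; branch stands in for the convolutional branch $\mathcal{F}$):

import torch.nn as nn

class ToyResidualBlock(nn.Module):
    """Sketch of x_{l+1} = x_l + F(x_l, W_l), with F an arbitrary branch of layers."""
    def __init__(self, branch: nn.Module):
        super().__init__()
        self.branch = branch          # F(x, W)

    def forward(self, x):
        return x + self.branch(x)     # shortcut + residual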

2.3 The Basic Blocks: BasicBlock and Bottleneck

In the residual network, the basic residual module consists of two 3×3 convolutional layers together with ReLU activations and BatchNorm layers. Its structure is as follows (taking a 64-channel input as an example):

BasicBlock(
    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
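
For reference, the printout above can be reproduced directly from the torchvision model (assuming torchvision is installed):

import torchvision

model = torchvision.models.resnet18()
print(model.layer1[0])   # the first BasicBlock of layer1 (64 channels)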

When the network becomes very deep, the authors propose a new block design, the Bottleneck, to keep the training cost under control. The original two 3×3 convolutional layers are replaced by two 1×1 convolutional layers and one 3×3 convolutional layer: the two 1×1 convolutions reduce and then restore the channel dimension, while the 3×3 convolution performs the "real" convolution on the reduced representation. Its structure is shown on the right side of the figure below. The Bottleneck block has a smaller time complexity.
[Figure: BasicBlock (left) and Bottleneck (right) block designs]
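
To see why, here is a rough parameter count (illustrative, ignoring BatchNorm and biases) comparing two 3×3 convolutions at 256 channels with the 1×1 → 3×3 → 1×1 Bottleneck that reduces to 64 channels, following the 256-channel example in the paper:

# Two 3x3 convolutions at 256 channels (a BasicBlock-style pair, for comparison):
basic = 2 * (3 * 3 * 256 * 256)                           # 1,179,648 weights
# Bottleneck: 1x1 reduce 256->64, 3x3 at 64, 1x1 restore 64->256:
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256
print(basic, bottleneck)                                  # 1179648 69632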

2.4 Design of the Residual Network ResNet

In the paper, the authors give the network designs ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152, whose names correspond to the number of weighted layers. Networks with 50 or more layers are built with the Bottleneck block.
[Table: ResNet architectures for 18, 34, 50, 101, and 152 layers, with the number of blocks per stage]
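
In torchvision these variants are available as ready-made constructors; the block type and per-stage block counts noted below follow the architecture table in the paper:

from torchvision.models import resnet18, resnet34, resnet50, resnet101, resnet152

# resnet18 : BasicBlock, [2, 2, 2, 2]     resnet34 : BasicBlock, [3, 4, 6, 3]
# resnet50 : Bottleneck, [3, 4, 6, 3]     resnet101: Bottleneck, [3, 4, 23, 3]
# resnet152: Bottleneck, [3, 8, 36, 3]
model = resnet50()
print(sum(p.numel() for p in model.parameters()))   # roughly 25.6 million parameters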

2.4.1 Connecting the identity mapping to the residual

In the network design, special attention is needed at the boundaries between stages, for example between conv2_x and conv3_x, where the output and input sizes differ, as indicated by the dotted shortcuts in the figure below.

For the residual branch, this is easily handled with a strided convolution. For the identity mapping, the authors considered the following options:

  1. Pad the identity mapping $\mathbf{x}$ with zeros to expand its dimensions.
  2. Downsample it with a 1×1 convolution (a projection shortcut).

[Figure: dotted shortcuts at stage boundaries where the feature-map size and channel count change]

In the code implementation, the 1×1-convolution downsampling method is used to transform the identity mapping; both options are sketched below.
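
Both options can be sketched in a few lines; option A (zero-padding) does not appear in the torchvision code, so the padding call below is only an illustration, while option B mirrors the downsample module built later in _make_layer:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)

# Option A (zero-padding, illustrative): halve the spatial size by striding,
# then pad the channel dimension with zeros from 64 up to 128.
identity_a = F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, 0, 64))

# Option B (projection, as in torchvision): 1x1 convolution with stride 2 plus BN.
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
identity_b = downsample(x)

print(identity_a.shape, identity_b.shape)  # both torch.Size([1, 128, 28, 28])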

3. Forward/Backward Propagation

3.1 Forward Propagation

Traditional network (taking each layer to be $F(x,w)=xw$ for simplicity):
$$x_{L}=F(x_{L-1},w_{L-1})=F(F(x_{L-2},w_{L-2}),w_{L-1})=\cdots=x_{1}\prod_{i=1}^{L-1}w_{i}$$

Residual network:
$$x_{2}=x_{1}+F(x_{1},w_{1})$$
$$x_{3}=x_{2}+F(x_{2},w_{2})=x_{1}+F(x_{1},w_{1})+F(x_{2},w_{2})$$
$$\cdots$$
$$x_{L}=x_{1}+\sum_{i=1}^{L-1}F(x_{i},w_{i})$$

3.2 Backward Propagation

Traditional network: suppose the shallow network computes $g(x)$, and after adding layers it becomes $f(g(x))$. Then
$$\frac{\partial f(g(x))}{\partial x}=\frac{\partial f(g(x))}{\partial g(x)}\frac{\partial g(x)}{\partial x}$$
Residual network:
$$\frac{\partial \big(f(g(x))+g(x)\big)}{\partial x}=\frac{\partial f(g(x))}{\partial g(x)}\frac{\partial g(x)}{\partial x}+\frac{\partial g(x)}{\partial x}$$

It can be seen that when computing the gradient, the residual network has one extra additive term compared with the traditional network, which helps against vanishing gradients and makes the network train faster.

Now take the gradient of the loss with respect to layer $l$ of the network.
Traditional network:
$$\frac{\partial Loss}{\partial x_{l}}=\frac{\partial Loss}{\partial x_{L}}\frac{\partial x_{L}}{\partial x_{l}}=\frac{\partial Loss}{\partial x_{L}}\prod_{i=l}^{L-1}w_{i}$$
Residual network:
$$\frac{\partial Loss}{\partial x_{l}}=\frac{\partial Loss}{\partial x_{L}}\frac{\partial x_{L}}{\partial x_{l}}=\frac{\partial Loss}{\partial x_{L}}\left(1+\frac{\partial \sum_{i=l}^{L-1}F(x_{i},w_{i})}{\partial x_{l}}\right)$$

It can be seen that in the residual network the chain of multiplications is replaced by a sum with a constant term 1, so the gradient reaching shallow layers can never be driven to zero by the product alone; this effectively alleviates gradient vanishing and gradient explosion.
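
The effect can be checked with a small, purely illustrative experiment: a deep stack of tanh layers with and without shortcut connections, comparing the gradient magnitude that reaches the input (the exact numbers depend on initialization):

import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = [nn.Linear(dim, dim) for _ in range(depth)]

def input_grad(use_residual):
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out   # shortcut vs. plain stacking
    h.sum().backward()
    return x.grad.abs().mean().item()

print("plain   :", input_grad(False))   # typically vanishingly small
print("residual:", input_grad(True))    # typically orders of magnitude larger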

4. Code analysis

ResNet is now integrated into torchvision and can be called directly. The source code is here:
https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py

Basic convolution layers:

import torch
import torch.nn as nn


def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

In these convolution layers, two mechanisms control the spatial resolution (compare the output shapes in the snippet below):

  • The first uses stride: for downsampling, stride=2 is set, i.e., the kernel moves two pixels per step, so the output feature map becomes smaller.
  • The second uses dilation (atrous convolution): gaps are inserted between the kernel's sampling positions, which enlarges the receptive field without shrinking the output; in torchvision it can replace the stride when replace_stride_with_dilation is set. There is a very vivid and intuitive animation at https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md.
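
The difference between the two shows up directly in the output shapes (a quick illustrative check):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False)
dilated = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=2, dilation=2, bias=False)

print(strided(x).shape)  # torch.Size([1, 64, 28, 28]) -> spatial size halved
print(dilated(x).shape)  # torch.Size([1, 64, 56, 56]) -> size kept, larger receptive field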

In BasicBlock, expansion represents the factor by which the number of channels changes after a block. Here the output channel count equals planes, so expansion is 1.

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

Bottleneck
The output channel count of a Bottleneck block is 4 times its planes argument (and thus 4 times that of the corresponding BasicBlock stage), so expansion = 4.

class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
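
A quick shape check (illustrative, using the classes defined above) confirms that the output has planes * expansion channels:

block = Bottleneck(inplanes=256, planes=64)   # no downsample needed here: 64 * 4 == 256
x = torch.randn(1, 256, 56, 56)
print(block(x).shape)                         # torch.Size([1, 256, 56, 56])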

When constructing each stage of ResNet, note that a downsampling layer (a 1×1 convolution followed by BatchNorm) is added to transform the identity mapping whenever the stride is not 1 or the number of output channels differs from the number of input channels.

class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x):
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x):
        return self._forward_impl(x)
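
Putting it together, the torchvision constructors simply pass a block type and the per-stage block counts to this class. A minimal usage sketch that builds ResNet-18 by hand and runs a forward pass:

model = ResNet(BasicBlock, [2, 2, 2, 2], num_classes=1000)   # same config as torchvision's resnet18
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)                                        # torch.Size([1, 1000])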

5. Identity mapping

Make a simple modification to the shortcut:
$$\mathbf{x}_{l+1}=\lambda_{l}\mathbf{x}_{l}+\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l})$$
Then:
$$\mathbf{x}_{L}=\left(\prod_{i=l}^{L-1}\lambda_{i}\right)\mathbf{x}_{l}+\sum_{i=l}^{L-1}\hat{\mathcal{F}}(\mathbf{x}_{i},\mathcal{W}_{i})$$
where $\hat{\mathcal{F}}$ absorbs the scaling factors into the residual terms. Taking the gradient:
$$\frac{\partial Loss}{\partial \mathbf{x}_{l}}=\frac{\partial Loss}{\partial \mathbf{x}_{L}}\frac{\partial \mathbf{x}_{L}}{\partial \mathbf{x}_{l}}=\frac{\partial Loss}{\partial \mathbf{x}_{L}}\left(\prod_{i=l}^{L-1}\lambda_{i}+\frac{\partial \sum_{i=l}^{L-1}\hat{\mathcal{F}}(\mathbf{x}_{i},\mathcal{W}_{i})}{\partial \mathbf{x}_{l}}\right)$$

It can be seen that when $\lambda$ is greater than 1 the accumulated product makes the gradient explode, and when it is less than 1 it makes the gradient vanish; only the pure identity shortcut ($\lambda=1$) avoids both.
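
A two-line check (illustrative) shows how quickly the product of scaling factors explodes or dies out across, say, 50 blocks:

for lam in (1.1, 1.0, 0.9):
    print(lam, lam ** 50)   # 1.1 -> ~117.4 (explodes), 1.0 -> 1.0 (stable), 0.9 -> ~0.005 (vanishes)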

6. Analyzing Residual Connections

The authors examine several variants of the shortcut connection:

  • Original: $\mathbf{x}_{l+1}=\mathbf{x}_{l}+\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l})$
  • Constant scaling: $\mathbf{x}_{l+1}=\lambda_{1}\mathbf{x}_{l}+\lambda_{2}\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l})$
  • Exclusive gating: $\mathbf{x}_{l+1}=(1-g(\mathbf{x}_{l}))\mathbf{x}_{l}+g(\mathbf{x}_{l})\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l})$
  • Shortcut-only gating: $\mathbf{x}_{l+1}=(1-g(\mathbf{x}_{l}))\mathbf{x}_{l}+\mathcal{F}(\mathbf{x}_{l},\mathcal{W}_{l})$
  • Others: the 1×1 convolutional shortcut and the dropout shortcut.

The schematic diagram of these types of residual connections is as follows:
[Figure: schematic diagrams of the shortcut variants]
The experimental results given by the author are as follows:
[Table: classification error of each shortcut variant on CIFAR-10]
It can be seen that the original identity shortcut works best, and the effect of exclusive gating depends strongly on the initialization of the gate bias.
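
For concreteness, exclusive gating could be sketched as follows (an illustration, not the paper's exact code; the 1×1 convolutional gate and its bias initialization are assumptions based on the paper's description):

import torch
import torch.nn as nn

class ExclusiveGating(nn.Module):
    """Sketch of x_{l+1} = (1 - g(x)) * x + g(x) * F(x)."""
    def __init__(self, channels, residual_branch, gate_bias=0.0):
        super().__init__()
        self.branch = residual_branch                               # F(x, W), e.g. two 3x3 convs
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)    # g(x) = sigmoid(Wx + b)
        nn.init.constant_(self.gate.bias, gate_bias)                # results are sensitive to this bias

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return (1.0 - g) * x + g * self.branch(x)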

7. Residual modules with different structures

The authors next analyze how different designs of the residual module affect performance.
[Figure: residual block variants: (a) original, (b) BN after addition, (c) ReLU before addition, (d) ReLU-only pre-activation, (e) full pre-activation]

  • In (b), a BN layer is placed after the addition, so $\mathbf{x}_{l+1}=f(\mathbf{y}_{l})$ is no longer close to an identity mapping on the shortcut path, which hampers signal propagation and hurts the performance of the residual network.
  • In (c), moving the final ReLU of the residual branch before the addition makes the output of the residual non-negative. However, both mathematically and empirically the residual should be able to take any value in $(-\infty,+\infty)$; a non-negative residual hurts model performance.
  • In (d) and (e), the authors adopt a pre-activation idea (see the sketch after this list).
    In the original design, the activation $f$ affects both branches of the next residual module:
    $$\mathbf{y}_{l+1}=f(\mathbf{y}_{l})+\mathcal{F}(f(\mathbf{y}_{l}),\mathcal{W}_{l+1})$$
    Pre-activation makes $f$ act only on the residual branch and leaves the identity mapping untouched:
    $$\mathbf{y}_{l+1}=\mathbf{y}_{l}+\hat{\mathcal{F}}(f(\mathbf{y}_{l}),\mathcal{W}_{l+1})$$
    In the network design, the structure looks as follows:
    [Figure: the original post-activation block vs. the pre-activation block (BN → ReLU → conv)]
    In fact, when the BN layer is also placed before the convolution layer (full pre-activation), performance improves further; this can be interpreted as BN providing an additional regularization effect.
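
A full pre-activation block in the style of design (e) could be sketched as follows (the class name PreActBasicBlock is hypothetical, and this mirrors the idea rather than any official implementation):

import torch.nn as nn

class PreActBasicBlock(nn.Module):
    """Full pre-activation block: BN -> ReLU -> conv, twice, plus an untouched shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # No BN or ReLU after the addition, so the shortcut stays a clean identity mapping.
        return x + self.branch(x)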

Original post: blog.csdn.net/d33332/article/details/128725870