Using PyTorch to implement the MobileNet networks

Table of contents

1. MobileNet v1 network

2. MobileNet v2 network

2.1 Inverted Residual

2.2 Linear Bottlenecks

2.3 Network structure of MobileNet v2 model

3. MobileNet v3 network

3.1 Attention mechanism

3.2 Redesigning the time-consuming layer structure

3.3 Redesigning the activation function

3.4 Network structure of MobileNet v3 model

4. Using PyTorch to implement MobileNet

4.1 MobileNet v1 and v2

4.2 MobileNet v3

5. Training Results


1. MobileNet v1 network

Traditional convolutional neural networks have a relatively large number of parameters and very high demands on computing power, so they are difficult to run on mobile and embedded devices. MobileNet makes it possible to run deep learning networks on such devices.

The MobileNet network was proposed by the Google team in 2017. It is a lightweight CNN aimed at mobile and embedded devices. Compared with traditional CNNs, it greatly reduces the number of parameters and the amount of computation at the cost of only a small drop in accuracy. The network has the following two highlights:

  • Depthwise Convolution (DW convolution), which greatly reduces the number of parameters and the amount of computation
  • Two added hyperparameters: \alpha, which controls the number of convolution kernels, and \beta, which controls the size of the input image

In a traditional convolution operation, the number of channels of each convolution kernel is equal to the number of channels of the input feature matrix, and the number of channels of the output feature matrix is equal to the number of convolution kernels used:

For DW convolution, each convolution kernel has a single channel, and the number of channels of the input feature matrix is equal to the number of convolution kernels, which is also equal to the number of channels of the output feature matrix:

Depthwise Separable Convolution consists of a DW convolution followed by a PW (pointwise) convolution. PW convolution is the same as a traditional convolution except that the kernel size is 1×1. Theoretically, the computation of a traditional convolution is about 8 to 9 times that of a depthwise separable convolution.
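
The saving can be checked with a minimal sketch (not from the original post; the channel sizes below are arbitrary examples) that builds both versions from plain PyTorch layers:

import torch
from torch import nn

# standard 3x3 convolution: every kernel spans all input channels
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)

# depthwise separable convolution: a 3x3 DW convolution (groups = in_channels,
# so each kernel has a single channel) followed by a 1x1 PW convolution
depthwise_separable = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32, bias=False),  # DW
    nn.Conv2d(32, 64, kernel_size=1, bias=False),                        # PW
)

x = torch.randn(1, 32, 56, 56)
assert standard(x).shape == depthwise_separable(x).shape  # both give (1, 64, 56, 56)

n_std = sum(p.numel() for p in standard.parameters())             # 3*3*32*64 = 18432
n_dws = sum(p.numel() for p in depthwise_separable.parameters())  # 3*3*32 + 32*64 = 2336
print(n_std / n_dws)  # roughly 8x fewer parameters; the FLOP ratio is similar, matching the 8-9x figure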


2. MobileNet v2 network

When MobileNet v1 is actually used, it turns out that many of the DW convolution kernel parameters are 0, i.e., these kernels contribute nothing. This problem is addressed in MobileNet v2. The network has the following two highlights:

  • Inverted Residual (inverted residual structure)
  • Linear Bottlenecks

2.1 Inverted Residual

The traditional residual structure first uses a 1×1 convolution to reduce the dimension, then a 3×3 convolution, and finally a 1×1 convolution to restore the dimension, forming a bottleneck that is wide at both ends and narrow in the middle. The inverted residual structure does the opposite: it first uses a 1×1 convolution to increase the dimension, then a 3×3 DW convolution, and finally a 1×1 convolution to reduce the dimension.

The ordinary residual structure uses the ReLU activation function, while the inverted residual structure uses the ReLU6 activation function:

  • ReLU activation function: inputs less than 0 are set to 0, and inputs greater than 0 are passed through unchanged
  • ReLU6 activation function: inputs less than 0 are set to 0, inputs between 0 and 6 are passed through unchanged, and inputs greater than 6 are clipped to 6. The formula is

y=ReLU6(x)=\min(\max(x,0),6)
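
This behaviour can be checked with a small sketch (not from the original post) using nn.ReLU6:

import torch
from torch import nn

relu6 = nn.ReLU6()
x = torch.tensor([-2.0, 0.0, 3.0, 6.0, 8.0])
print(relu6(x))  # tensor([0., 0., 3., 6., 6.]): negatives go to 0, values above 6 are clipped to 6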


2.2 Linear Bottlenecks

In the last convolutional layer of the inverted residual structure, a linear activation function is used instead of ReLU. The authors of the original paper ran an experiment showing that the ReLU activation function causes a large loss of information for low-dimensional features, while the loss for high-dimensional features is very small. Since the inverted residual structure is thin at both ends and thick in the middle, its output is a low-dimensional feature vector, so using ReLU there would cause a relatively large loss; a linear activation function is therefore used instead.

The block of MobileNet v2 is shown in the figure below:

Note that there is a shortcut connection only when stride = 1 and the shapes of the input and output feature matrices are the same (the case on the left).


2.3 Network structure of MobileNet v2 model

t is the expansion factor, i.e., how many times the input depth is expanded;

c is the depth of the output feature matrix;

n is the number of times the bottleneck is repeated;

s is the stride of the first bottleneck in the group; the other layers use a stride of 1.
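
As an illustration (a sketch that mirrors the loop used in section 4.1; the input depth 16 is just an assumed value for the previous stage), the row t=6, c=24, n=2, s=2 expands into two bottlenecks, and only the first one uses the given stride:

t, c, n, s = 6, 24, 2, 2   # one row of the MobileNet v2 table
in_ch = 16                 # assumed output depth of the previous stage
blocks = []
for i in range(n):
    stride = s if i == 0 else 1           # only the first repeat uses stride s
    blocks.append((in_ch, c, stride, t))  # (in_channel, out_channel, stride, expand_ratio)
    in_ch = c                             # the output of this block feeds the next one
print(blocks)  # [(16, 24, 2, 6), (24, 24, 1, 6)]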


3. MobileNet v3 network

MobileNet v3 has the following three improvements:

  • Updated block (bneck): small changes to the inverted residual structure
  • Uses NAS (Neural Architecture Search) to search for network parameters
  • Redesigned the structure of some time-consuming layers

The main highlights of the updated block are that an SE module has been added and that the activation function has been updated. The structure diagram is as follows:


3.1 Attention mechanism

Average pooling is performed on each channel of the output matrix, giving a one-dimensional vector whose number of elements equals the number of channels. The first fully connected layer has a number of nodes equal to 1/4 of the number of channels and uses the ReLU activation function; the second fully connected layer has a number of nodes equal to the number of channels and uses the hard-sigmoid activation function. The final output vector gives a weight for each channel of the matrix: important channels are assigned relatively large weights. The schematic diagram is as follows:
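
As a concrete shape walk-through (a sketch with an arbitrary channel count of 512, not from the original post):

import torch
from torch import nn
from torch.nn import functional as F

channels = 512
x = torch.randn(1, channels, 7, 7)

pooled = F.adaptive_avg_pool2d(x, 1)                         # (1, 512, 1, 1): one value per channel
fc1 = F.relu(nn.Conv2d(channels, channels // 4, 1)(pooled))  # (1, 128, 1, 1): first FC layer + ReLU
fc2 = nn.Conv2d(channels // 4, channels, 1)(fc1)             # (1, 512, 1, 1): second FC layer
weights = F.hardsigmoid(fc2)                                 # channel weights in [0, 1]
out = x * weights                                            # each channel is rescaled by its weight
print(out.shape)                                             # torch.Size([1, 512, 7, 7])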


3.2 Redesigning the time-consuming layer structure

The number of convolution kernels in the first convolutional layer is reduced from 32 to 16 without affecting accuracy, and the Last Stage is streamlined. The model before streamlining is as follows:

The streamlined model:


3.3 Redesigning the activation function

At the time, a commonly used activation function was the swish activation function:

swish(x)=x\cdot \sigma(x)

where

\sigma(x)=\frac{1}{1+e^{-x}}

The computation and differentiation of this activation function are quite complex. Moreover, mobile devices apply quantization to speed up inference, and the swish activation function is awkward to quantize. To address these two problems, the h-swish activation function was proposed:

h\text{-}swish(x)=x\cdot \frac{ReLU6(x+3)}{6}

The second half of the formula is the h-sigmoid activation function.

As can be seen from the figure, the two activation functions are very similar, so h-swish can be used in place of swish, which also simplifies the quantization operation.
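
A small numerical check (a sketch, not from the original post) shows how close the two functions are; PyTorch ships both as nn.SiLU (swish) and nn.Hardswish:

import torch
from torch import nn

x = torch.linspace(-6, 6, 7)
swish = nn.SiLU()         # x * sigmoid(x)
h_swish = nn.Hardswish()  # x * ReLU6(x + 3) / 6

print(swish(x))
print(h_swish(x))
# the two curves nearly coincide, but h-swish only needs ReLU6, which is cheap and quantization-friendly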


3.4 Network structure of MobileNet v3 model


4. Using PyTorch to implement MobileNet

4.1 MobileNet v1 and v2

In the MobileNet v2 network, almost every convolutional layer consists of convolution + BN + ReLU6 activation. First define this combination:

import torch
from torch import nn


class ConvBNReLU(nn.Sequential):
    # in_channel / out_channel: depth of the input / output feature matrix; kernel_size; stride;
    # groups=1 means ordinary convolution, groups=in_channel means DW convolution
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        # padding parameter
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            # convolution: input depth, output depth, kernel size, stride, padding, groups, no bias
            nn.Conv2d(in_channel, out_channel, kernel_size, stride, padding, groups=groups, bias=False),
            # BN layer, whose input is the output of the convolution
            nn.BatchNorm2d(out_channel),
            # activation function
            nn.ReLU6(inplace=True)
        )

Inverted residual structure:

# inverted residual structure
class InvertedResidual(nn.Module):
    # in_channel / out_channel: depth of the input / output feature matrix; stride; expand_ratio: expansion factor
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        # the first convolutional layer expands the depth
        hidden_channel = in_channel * expand_ratio
        # boolean flag: the shortcut branch is used when stride is 1 and the input and output depths are equal
        self.use_shortcut = stride == 1 and in_channel == out_channel

        # list of layers
        layers = []
        # when t = 1 the input feature matrix is not expanded, so the first convolutional layer is not needed
        if expand_ratio != 1:
            # 1x1 pointwise conv
            layers.append(ConvBNReLU(in_channel, hidden_channel, kernel_size=1))
        # extend can insert many elements at once
        layers.extend([
            # 3x3 depthwise conv; for DW convolution, groups equals the input feature matrix depth
            ConvBNReLU(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointwise conv (linear)
            nn.Conv2d(hidden_channel, out_channel, kernel_size=1, bias=False),
            # BN layer
            nn.BatchNorm2d(out_channel),
        ])

        # see http://t.csdn.cn/CnTEA for an explanation of the * unpacking
        self.conv = nn.Sequential(*layers)

    # forward pass: decide whether to use the shortcut branch
    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)
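
The MobileNetV2 class below (and the SE module in section 4.2) calls a helper _make_divisible that is not shown in the post; a common implementation (the one used in torchvision) rounds a channel count to the nearest multiple of a divisor so that the layer widths stay hardware-friendly:

def _make_divisible(ch, divisor=8, min_ch=None):
    # round the channel number ch to the nearest multiple of divisor
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # make sure that rounding down does not reduce the channels by more than 10%
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch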

Define the network structure of MobileNet v2:

# network structure
class MobileNetV2(nn.Module):
    # num_classes: number of classes; alpha: the width hyperparameter from the v1 network
    def __init__(self, num_classes=1000, alpha=1.0, round_nearest=8):
        super(MobileNetV2, self).__init__()
        # the block class; note this is an assignment, not an instantiation
        block = InvertedResidual
        # depth of the input feature matrix
        input_channel = _make_divisible(32 * alpha, round_nearest)
        # number of output channels
        last_channel = _make_divisible(1280 * alpha, round_nearest)
        # the four parameter values of the network structure table
        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]

        features = []
        # the first convolutional layer
        features.append(ConvBNReLU(3, input_channel, stride=2))
        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            # adjust the output feature matrix depth of each block
            output_channel = _make_divisible(c * alpha, round_nearest)
            # the concrete structure of each block
            for i in range(n):
                # s is the stride of the first layer; all other layers use 1
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t))
                # the output channel becomes the input channel of the next layer
                input_channel = output_channel
        # the third-to-last convolutional layer
        features.append(ConvBNReLU(input_channel, last_channel, 1))
        # end of the feature extraction layers
        # wrap the feature extraction layers
        self.features = nn.Sequential(*features)

        # classifier
        # average pooling downsampling layer; the output feature matrix is 1x1 in height and width
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        # Dropout layer and fully connected layer
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(last_channel, num_classes)
        )

        # initialize weights
        for m in self.modules():
            # traverse every submodule
            if isinstance(m, nn.Conv2d):
                # initialize the weights of convolutional layers
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                # set the bias to 0 if there is one
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                # for BN layers, set the weight to 1 and the bias to 0
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                # for fully connected layers, use a normal distribution with mean 0 and std 0.01, and zero bias
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    # forward pass
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        # flatten the output
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
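
A quick smoke test of the model (a sketch with arbitrary settings, not part of the original post):

model = MobileNetV2(num_classes=5)   # e.g. a 5-class flower dataset
dummy = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
print(model(dummy).shape)            # torch.Size([1, 5])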

4.2 MobileNet v3

Compared with v2, the MobileNet v3 network additionally contains the SE module:

from typing import Callable, List, Optional
from functools import partial

import torch
from torch import nn, Tensor
from torch.nn import functional as F


# attention module
class SqueezeExcitation(nn.Module):
    # squeeze_factor: the first fully connected layer has 1/4 as many nodes as the input channels
    def __init__(self, input_c: int, squeeze_factor: int = 4):
        super(SqueezeExcitation, self).__init__()
        squeeze_c = _make_divisible(input_c // squeeze_factor, 8)
        # fully connected layers (implemented as 1x1 convolutions)
        self.fc1 = nn.Conv2d(input_c, squeeze_c, 1)
        self.fc2 = nn.Conv2d(squeeze_c, input_c, 1)

    def forward(self, x: Tensor):
        # adaptive average pooling: the data of each channel is pooled down to a 1x1 value
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = F.relu(scale, inplace=True)
        scale = self.fc2(scale)
        # hardsigmoid activation function
        scale = F.hardsigmoid(scale, inplace=True)
        # multiply the channel weights with the input
        return scale * x

The structure of the whole block:

# the whole block structure
class InvertedResidual(nn.Module):
    def __init__(self,
                 cnf: InvertedResidualConfig,
                 norm_layer: Callable[..., nn.Module]):
        super(InvertedResidual, self).__init__()

        # the stride can only be 1 or 2
        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")

        # decide whether to use the shortcut branch
        self.use_res_connect = (cnf.stride == 1 and cnf.input_c == cnf.out_c)

        layers: List[nn.Module] = []
        # choose the activation function
        activation_layer = nn.Hardswish if cnf.use_hs else nn.ReLU

        # expand
        # only the first bneck has no 1x1 expansion convolution
        if cnf.expanded_c != cnf.input_c:
            layers.append(ConvBNActivation(cnf.input_c,
                                           cnf.expanded_c,
                                           kernel_size=1,
                                           norm_layer=norm_layer,
                                           activation_layer=activation_layer))

        # depthwise
        layers.append(ConvBNActivation(cnf.expanded_c,
                                       cnf.expanded_c,
                                       kernel_size=cnf.kernel,
                                       stride=cnf.stride,
                                       groups=cnf.expanded_c,
                                       norm_layer=norm_layer,
                                       activation_layer=activation_layer))

        if cnf.use_se:
            layers.append(SqueezeExcitation(cnf.expanded_c))

        # project
        layers.append(ConvBNActivation(cnf.expanded_c,
                                       cnf.out_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       # linear activation, i.e. no extra processing
                                       activation_layer=nn.Identity))

        self.block = nn.Sequential(*layers)
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1

    def forward(self, x: Tensor):
        result = self.block(x)
        # whether to use the shortcut branch
        if self.use_res_connect:
            result += x

        return result
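
The code above assumes two helpers that are not shown in the post: ConvBNActivation (convolution + BN + activation with configurable norm and activation layers) and InvertedResidualConfig (a container for one row of the v3 table). A sketch consistent with the attributes used above, following the torchvision-style implementation and reusing _make_divisible from section 4.1:

class ConvBNActivation(nn.Sequential):
    # like ConvBNReLU from section 4.1, but the norm layer and the activation layer are configurable
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1,
                 norm_layer=None, activation_layer=None):
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.ReLU6
        super(ConvBNActivation, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            norm_layer(out_planes),
            activation_layer(inplace=True))


class InvertedResidualConfig:
    # one row of the MobileNet v3 configuration table
    def __init__(self, input_c: int, kernel: int, expanded_c: int, out_c: int,
                 use_se: bool, activation: str, stride: int, width_multi: float):
        self.input_c = self.adjust_channels(input_c, width_multi)
        self.kernel = kernel
        self.expanded_c = self.adjust_channels(expanded_c, width_multi)
        self.out_c = self.adjust_channels(out_c, width_multi)
        self.use_se = use_se
        self.use_hs = activation == "HS"  # "HS" means hard-swish, "RE" means ReLU
        self.stride = stride

    @staticmethod
    def adjust_channels(channels: int, width_multi: float):
        return _make_divisible(channels * width_multi, 8)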

MobileNet v3 network structure:

class MobileNetV3(nn.Module):
    # inverted_residual_setting: list of structural parameters; last_channel: number of output nodes of the
    # penultimate fully connected layer; num_classes: number of classes; block: the InvertedResidual module
    def __init__(self,
                 inverted_residual_setting: List[InvertedResidualConfig],
                 last_channel: int,
                 num_classes: int = 1000,
                 block: Optional[Callable[..., nn.Module]] = None,
                 norm_layer: Optional[Callable[..., nn.Module]] = None):
        super(MobileNetV3, self).__init__()

        # sanity checks on the input
        if not inverted_residual_setting:
            raise ValueError("The inverted_residual_setting should not be empty.")
        elif not (isinstance(inverted_residual_setting, List) and
                  all([isinstance(s, InvertedResidualConfig) for s in inverted_residual_setting])):
            raise TypeError("The inverted_residual_setting should be List[InvertedResidualConfig]")

        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            # see http://t.csdn.cn/k80JE for the usage of partial
            norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.01)

        layers: List[nn.Module] = []

        # building first layer
        # channel output by the first convolutional layer
        firstconv_output_c = inverted_residual_setting[0].input_c
        layers.append(ConvBNActivation(3,
                                       firstconv_output_c,
                                       kernel_size=3,
                                       stride=2,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            layers.append(block(cnf, norm_layer))

        # building last several layers
        # output channel of the last bneck structure
        lastconv_input_c = inverted_residual_setting[-1].out_c
        # 160 * 6
        lastconv_output_c = 6 * lastconv_input_c
        # the last convolutional layer
        layers.append(ConvBNActivation(lastconv_input_c,
                                       lastconv_output_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        self.features = nn.Sequential(*layers)
        # pooling layer and two fully connected layers
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(nn.Linear(lastconv_output_c, last_channel),
                                        nn.Hardswish(inplace=True),
                                        nn.Dropout(p=0.2, inplace=True),
                                        nn.Linear(last_channel, num_classes))

        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor):
        x = self.features(x)
        x = self.avgpool(x)
        # flatten
        x = torch.flatten(x, 1)
        x = self.classifier(x)

        return x

    def forward(self, x: Tensor):
        return self._forward_impl(x)
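
To instantiate the network, the list of InvertedResidualConfig rows has to be supplied. Below is a possible factory for the large variant following the layer settings of the original MobileNet v3 paper (a sketch in the torchvision style; it relies on the helpers sketched earlier):

def mobilenet_v3_large(num_classes: int = 1000, width_multi: float = 1.0):
    bneck_conf = partial(InvertedResidualConfig, width_multi=width_multi)
    adjust_channels = partial(InvertedResidualConfig.adjust_channels, width_multi=width_multi)

    inverted_residual_setting = [
        # input_c, kernel, expanded_c, out_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, False, "RE", 1),
        bneck_conf(16, 3, 64, 24, False, "RE", 2),
        bneck_conf(24, 3, 72, 24, False, "RE", 1),
        bneck_conf(24, 5, 72, 40, True, "RE", 2),
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 3, 240, 80, False, "HS", 2),
        bneck_conf(80, 3, 200, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 480, 112, True, "HS", 1),
        bneck_conf(112, 3, 672, 112, True, "HS", 1),
        bneck_conf(112, 5, 672, 160, True, "HS", 2),
        bneck_conf(160, 5, 960, 160, True, "HS", 1),
        bneck_conf(160, 5, 960, 160, True, "HS", 1),
    ]
    last_channel = adjust_channels(1280)  # output size of the penultimate fully connected layer

    return MobileNetV3(inverted_residual_setting=inverted_residual_setting,
                       last_channel=last_channel,
                       num_classes=num_classes)

# e.g. net = mobilenet_v3_large(num_classes=5)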

5. Training Results

MobileNet v2 with pre-trained weights, training only the fully connected layer:

using cuda:0 device.
Using 4 dataloader workers every process
using 3306 images for training, 364 images for validation.
train epoch[1/5] loss:1.007: 100%|██████████| 207/207 [00:09<00:00, 22.49it/s]
valid epoch[1/5]: 100%|██████████| 23/23 [00:03<00:00,  6.31it/s]
[epoch 1] train_loss: 1.245  val_accuracy: 0.794
train epoch[2/5] loss:0.813: 100%|██████████| 207/207 [00:07<00:00, 27.63it/s]
valid epoch[2/5]: 100%|██████████| 23/23 [00:03<00:00,  6.26it/s]
[epoch 2] train_loss: 0.864  val_accuracy: 0.824
train epoch[3/5] loss:0.580: 100%|██████████| 207/207 [00:07<00:00, 28.38it/s]
valid epoch[3/5]: 100%|██████████| 23/23 [00:03<00:00,  6.28it/s]
[epoch 3] train_loss: 0.716  val_accuracy: 0.865
train epoch[4/5] loss:0.818: 100%|██████████| 207/207 [00:07<00:00, 28.45it/s]
valid epoch[4/5]: 100%|██████████| 23/23 [00:03<00:00,  6.46it/s]
[epoch 4] train_loss: 0.626  val_accuracy: 0.857
train epoch[5/5] loss:0.488: 100%|██████████| 207/207 [00:07<00:00, 28.32it/s]
valid epoch[5/5]: 100%|██████████| 23/23 [00:03<00:00,  6.34it/s]
[epoch 5] train_loss: 0.587  val_accuracy: 0.857
Finished Training

MobileNet v3 large with pre-trained weights, training only the fully connected layer:

using cuda:0 device.
Using 4 dataloader workers every process
using 3306 images for training, 364 images for validation.
train epoch[1/5] loss:0.831: 100%|██████████| 207/207 [00:09<00:00, 22.47it/s]
valid epoch[1/5]: 100%|██████████| 23/23 [00:03<00:00,  6.40it/s]
[epoch 1] train_loss: 0.889  val_accuracy: 0.868
train epoch[2/5] loss:0.824: 100%|██████████| 207/207 [00:07<00:00, 27.67it/s]
valid epoch[2/5]: 100%|██████████| 23/23 [00:03<00:00,  6.50it/s]
[epoch 2] train_loss: 0.508  val_accuracy: 0.887
train epoch[3/5] loss:0.370: 100%|██████████| 207/207 [00:07<00:00, 27.11it/s]
valid epoch[3/5]: 100%|██████████| 23/23 [00:03<00:00,  6.56it/s]
[epoch 3] train_loss: 0.451  val_accuracy: 0.901
train epoch[4/5] loss:0.841: 100%|██████████| 207/207 [00:07<00:00, 27.61it/s]
valid epoch[4/5]: 100%|██████████| 23/23 [00:03<00:00,  6.34it/s]
[epoch 4] train_loss: 0.411  val_accuracy: 0.904
train epoch[5/5] loss:0.195: 100%|██████████| 207/207 [00:07<00:00, 27.60it/s]
valid epoch[5/5]: 100%|██████████| 23/23 [00:03<00:00,  6.32it/s]
[epoch 5] train_loss: 0.378  val_accuracy: 0.904
Finished Training
