Study notes: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

https://zhuanlan.zhihu.com/p/35405071
Paper: MobileNet v1
Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861 (2017).

1. Introduction

MobileNet v1 is a network architecture released by Google in 2017. It aims to make full use of the limited resources of mobile devices and embedded applications, maximizing model accuracy under those resource constraints. Like other popular backbones (such as VGG and ResNet), MobileNet v1 can extract image convolution features for tasks such as classification, detection, embedding, and segmentation.

First, the core idea: depthwise separable convolution = depthwise convolution + pointwise convolution

[Figure: standard convolution]
The figure above shows ordinary convolution:
each kernel's channel count = input channel count; the kernel scans the input image (H x W), multiplying corresponding points and summing
output channels = number of kernels
Multiplication count: F = [Ci x 3 x 3] x (H x W) x Co   # Co here is equivalent to the kernel count
[Figure: depthwise separable convolution]
The figure above shows depthwise separable convolution: each channel is convolved separately (depthwise), producing an H x W x C feature map, which is then passed through a 1x1 convolution (pointwise) for channel fusion.
Multiplication counts:

  • 1 Depthwise convolution (left figure): F1 = [Ci x 3 x 3] x (H x W)
  • 2 Pointwise 1x1 convolution (right figure): F2 = [Ci x 1 x 1] x (H x W) x Co

    [depthwise + pointwise] / ordinary CNN = (F1 + F2) / F = 1/Co + 1/9
    (Here Co is the number of output channels.) Note that the depthwise separable block ends up with one more BN + ReLU stage than the ordinary convolution it replaces.
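
As a quick sanity check on this arithmetic, here is a small sketch in plain Python (the layer shape values are illustrative choices of mine, not from the paper):

# Multiplication counts for one 3x3 conv layer: standard vs. depthwise separable
Ci, Co, H, W, K = 64, 128, 56, 56, 3   # illustrative layer shape

F_standard  = (Ci * K * K) * (H * W) * Co   # F  = [Ci x 3 x 3] x (H x W) x Co
F_depthwise = (Ci * K * K) * (H * W)        # F1 = [Ci x 3 x 3] x (H x W)
F_pointwise = (Ci * 1 * 1) * (H * W) * Co   # F2 = [Ci x 1 x 1] x (H x W) x Co

print((F_depthwise + F_pointwise) / F_standard)   # ~0.1189
print(1 / Co + 1 / K**2)                          # identical: 1/Co + 1/9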

2. Model network

Local structure
From the stacked blocks below, it can be seen that the architecture is similar to VGG-style stacking:
[Figure: MobileNet v1 architecture]
VGG16, for comparison:
[Figure: VGG16 architecture]
Two hyperparameters are provided to control the model's size and amount of computation:

  • Width multiplier α: controls the number of channels; when α < 1 the model becomes narrower and computation shrinks roughly to α² of the original
  • Resolution multiplier ρ: controls the size of the feature maps; applying it to each feature map reduces computation correspondingly (see the combined cost formula below)

A hyperparameter α ∈ (0, 1] is added to control the number of feature map channels: the smaller α, the smaller the model. Its role is to scale the number of input and output channels, reducing the number of feature maps and making the network thinner.
  Of course, compressing the network's computation comes at a price. The figure below shows MobileNet v1's performance on ImageNet at different widths α: even at α = 0.5, MobileNet v1 still reaches 63.7% accuracy on ImageNet.
  [Figure: ImageNet accuracy at different width multipliers]
A hyperparameter ρ is added to control the input image resolution: the smaller ρ, the smaller the input image and all subsequent feature maps.
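
Putting the two multipliers together, the paper writes the cost of one depthwise separable layer (kernel Dk x Dk, feature map Df x Df, M input channels, N output channels) as:

F(α, ρ) = [Dk x Dk] x αM x (ρDf x ρDf) + αM x αN x (ρDf x ρDf)

The pointwise term dominates, so total computation falls roughly quadratically in both α and ρ.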

Results

1 Classification

Under the premise that computation and parameter count are reduced many times over, accuracy is roughly equivalent to that of GoogLeNet and VGGNet at the same input image size.
[Tables: ImageNet comparison with GoogLeNet and VGGNet]

2 Object detection: mAP is noticeably lower than that of the large networks, but the amount of computation does drop a lot

[Table: object detection results]

3 Face classification

Knowledge distillation: FaceNet teaches MobileNet to recognize faces.
The FaceNet model is a state of the art face recognition model [25]. It builds face embeddings based on the triplet loss. To build a mobile FaceNet model we use distillation to train by minimizing the squared differences of the output of FaceNet and MobileNet on the training data. Results for very small MobileNet models can be found in table 14.
[Table 14: results for very small distilled MobileNet models]

Interlude: what is the knowledge distillation that Hinton proposed in 2015?

Paper: Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
What knowledge distillation is: ordinarily we train a new model so that its softmax distribution matches the true labels; in distillation we instead require the new model's softmax distribution to match the original model's softmax distribution for a given input (the new function approximates the original function).
If the logits produced by the original model are v and the logits produced by the new model are z, training drives the gap between their temperature-softened softmax distributions toward 0.
Back to the name: in chemistry, distillation is an effective method to separate components with different boiling points. The general steps are to first raise the temperature to vaporize the low-boiling components, then cool and condense them to isolate the target substance. In the training process above we likewise first raise the temperature T, then restore the "low temperature" (T = 1) at test time, thereby extracting the knowledge from the original model, so calling it distillation is quite apt.
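
A minimal PyTorch sketch of this idea (my own illustrative code, not from either paper; note the FaceNet-to-MobileNet distillation quoted above minimizes squared output differences instead, and T = 4.0 is just an example value):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # "Raise the temperature": soften both softmax distributions with T > 1,
    # then pull the student's distribution toward the teacher's.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # The T*T factor keeps gradient magnitudes comparable across temperatures
    # (Hinton et al., 2015). At test time the student uses plain softmax (T = 1).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T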

import torch
import torch.nn as nn

class MobileNetv1(nn.Module):
    def __init__(self):
        super(MobileNetv1, self).__init__()

        # Standard 3x3 convolution + BN + ReLU (used only for the first layer)
        def conv_bn(dim_in, dim_out, stride):
            return nn.Sequential(
                nn.Conv2d(dim_in, dim_out, 3, stride, 1, bias=False),
                nn.BatchNorm2d(dim_out),
                nn.ReLU(inplace=True)
            )

        # Depthwise separable block: 3x3 depthwise conv (groups=dim_in, one
        # filter per channel) followed by a 1x1 pointwise conv, each with BN + ReLU
        def conv_dw(dim_in, dim_out, stride):
            return nn.Sequential(
                nn.Conv2d(dim_in, dim_in, 3, stride, 1, groups=dim_in, bias=False),
                nn.BatchNorm2d(dim_in),
                nn.ReLU(inplace=True),
                nn.Conv2d(dim_in, dim_out, 1, 1, 0, bias=False),
                nn.BatchNorm2d(dim_out),
                nn.ReLU(inplace=True),
            )
        self.model = nn.Sequential(
            conv_bn(  3,  32, 2),
            conv_dw( 32,  64, 1),
            conv_dw( 64, 128, 2),
            conv_dw(128, 128, 1),
            conv_dw(128, 256, 2),
            conv_dw(256, 256, 1),
            conv_dw(256, 512, 2),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 1024, 2),
            conv_dw(1024, 1024, 1),
            nn.AvgPool2d(7),  # global average pooling: 7x7 -> 1x1 (assumes 224x224 input)
        )
        self.fc = nn.Linear(1024, 20)  # classification head: 20 classes (GHIM-20)

    def forward(self, x):
        x = self.model(x)       # (N, 1024, 1, 1) for a 224x224 input
        x = x.view(-1, 1024)    # flatten
        x = self.fc(x)
        return x
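
A quick shape check (illustrative):

model = MobileNetv1()
model.eval()  # use running BN statistics for the smoke test
x = torch.randn(1, 3, 224, 224)   # the 224x224 input that AvgPool2d(7) assumes
print(model(x).shape)             # torch.Size([1, 20])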

Summary

  1. Depthwise plus pointwise convolutions implement the depthwise separation, reducing both computation and model size
    (low latency and small model size are mentioned repeatedly in the paper)
  2. The architecture is fairly retro: VGG-like stacking, with no residuals, feature fusion, or other newer techniques
  3. Disadvantages: in depthwise convolution each channel is processed independently and each kernel has very few parameters, so each output feature draws on only a few input features; with ReLU these easily go to zero, the kernel then extracts nothing useful, and the convolution kernels become redundant

My own experiment results and PyTorch code (trained from scratch, not finetuned)

Experiment 6: MobileNet v1 on an RTX 2080; train batch size 64, val batch size 64, lr 0.01, 100 epochs

Dataset: GHIM-20 classification. Training basically converged by iteration 8000, i.e. 8000 x 64 / 9000 ≈ 57 epochs.
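
For reference, a minimal training setup matching those hyperparameters might look like the sketch below. This is an illustration, not my original script: the dataset path, transforms, and momentum value are assumptions.

import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

# Assumed ImageFolder layout for GHIM-20 (20 classes); the path is a placeholder
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = torchvision.datasets.ImageFolder("GHIM-20/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # t-bs: 64

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MobileNetv1().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # lr from the log; momentum assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):  # 100 epochs, as in the log above
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()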

In addition, an experiment on the factors that influence training a classification backbone (the effect of batch size, kernel size, and learning rate on model quality) is written up in the following blog post:
https://blog.csdn.net/weixin_44523062/article/details/105457045
