Records of the impact of different parameters on the performance of the classification model

The parameters that mainly affect model performance are:

1) The size and stride of the convolution kernel

2) The learning rate and its decay strategy

3) The optimization method

4) The regularization factor

5) The network depth

Test networks: MobileNet, ALLconv6. Test data sets: CHIM-10K, Place20.

1) Basic network unit introduction:

a: ALLconv6 (the 6 means 5 convolutional layers plus 1 fully connected classification layer)

b: MobileNet

        MobileNet describes an efficient network architecture and two hyperparameters (a width multiplier and a resolution multiplier) that allow very small, low-latency models to be built directly for mobile and embedded devices.

        The performance of MobileNet is comparable to that of VGG16, but its computation is only a small fraction of VGG16's (roughly 1/30 of the multiply-adds). The network has 28 layers: 27 convolutional layers and 1 fully connected layer. There are two general approaches to making a model faster: one is to compress an already-trained complex model into a small one; the other is to directly design a small model and train it. Either way, the goal is to reduce the model size (number of parameters) while maintaining accuracy and increasing speed (low latency).

       The basic unit of MobileNet is the depthwise separable convolution. This structure had in fact already been used in the Inception models. A depthwise separable convolution is a factorized convolution: it decomposes a standard convolution into two smaller operations, a depthwise convolution and a pointwise convolution.

1) What are depthwise convolution and pointwise convolution?

        Depthwise convolution differs from standard convolution. In standard convolution, each kernel operates on all input channels, whereas in depthwise convolution a different kernel is used for each input channel: one kernel corresponds to one input channel, so depthwise convolution is a per-channel (depth-level) operation. Pointwise convolution is just an ordinary convolution with a 1x1 kernel. Both operations are shown more clearly in Figure 2. A depthwise separable convolution first applies depthwise convolution to each input channel separately and then uses pointwise convolution to combine the outputs. The overall effect is similar to a standard convolution, but the computation and the number of model parameters are greatly reduced.

                                           Figure 1 Depthwise separable convolution

                                     Figure 2 Depthwise convolution and pointwise convolution

The computation and parameter cost when using this type of convolution:
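The exact cost comparison is an image in the original post; based on the MobileNet paper (arXiv:1704.04861, also referenced in the code below), it can be written as follows, where D_K is the kernel size, D_F the spatial size of the feature map, M the number of input channels and N the number of output channels:

Standard convolution cost: D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F multiply-adds

Depthwise separable convolution cost: D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F multiply-adds

Reduction ratio: \frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}

With the 3x3 kernels used throughout MobileNet, this works out to roughly an 8x to 9x reduction in computation.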

        The depthwise separable convolution described above is the basic building block of MobileNet, but in real networks batch normalization (BN) is added and the ReLU activation function is used, so the basic structure of the depthwise separable convolution block is as shown in Figure 3.

                               Figure 3 depthwise separable convolution with BN and ReLU

2) The MobileNet network structure

                                                    Table 1 Network structure of MobileNet 

           The network structure of MobileNet is shown in the table. It starts with a 3x3 standard convolution, followed by a stack of depthwise separable convolutions; some of the depthwise convolutions downsample via strides=2. Average pooling then reduces the feature map to 1x1, a fully connected layer sized to the number of predicted classes is added, and finally a softmax layer.

      If depthwise convolution and pointwise convolution are counted as separate layers, the entire network has 28 layers (Avg Pool and Softmax are not counted here). We can also analyze how the parameters and computation are distributed over the network, as shown in Table 2. It can be seen that the computation is concentrated almost entirely in the 1x1 convolutions. If you are familiar with the underlying implementation of convolution, you will know that it is usually implemented via im2col, which requires reorganizing memory; for a 1x1 kernel, however, this step is unnecessary and the lower layers can use a faster implementation. The parameters are also mainly concentrated in the 1x1 convolutions, with the fully connected layer accounting for part of the rest.

                                             Table 2 Calculation and parameter distribution of MobileNet network
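Table 2 itself is an image in the original post. As a rough illustration of why the 1x1 convolutions dominate, the following small Python sketch counts the multiply-adds of one depthwise separable block; the layer shape used is an assumed example in the spirit of Table 1, not a value read from the table:

# Rough multiply-add count for one depthwise separable block (stride 1 assumed).
def separable_cost(dk, df, m, n):
    depthwise = dk * dk * m * df * df   # per-channel spatial filtering
    pointwise = m * n * df * df         # 1x1 channel mixing
    return depthwise, pointwise

# Example block from the middle of the network (assumed shape: 14x14, 512 -> 512 channels).
dw, pw = separable_cost(dk=3, df=14, m=512, n=512)
print(dw, pw, pw / (dw + pw))   # the pointwise part accounts for about 98% of this block's cost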

TensorFlow implementation of MobileNet:

import tensorflow as tf

class MobileNet(object):
    def __init__(self, inputs, num_classes=1000, is_training=True,
                 width_multiplier=1, scope="MobileNet"):
        """
        The implement of MobileNet(ref:https://arxiv.org/abs/1704.04861)
        :param inputs: 4-D Tensor of [batch_size, height, width, channels]
        :param num_classes: number of classes
        :param is_training: Boolean, whether or not the model is training
        :param width_multiplier: float, controls the size of model
        :param scope: Optional scope for variables
        """
        self.inputs = inputs
        self.num_classes = num_classes
        self.is_training = is_training
        self.width_multiplier = width_multiplier
 
        # construct model
        with tf.variable_scope(scope):
            # conv1
            net = conv2d(inputs, "conv_1", round(32 * width_multiplier), filter_size=3,
                         strides=2)  # ->[N, 112, 112, 32]
            net = tf.nn.relu(batchnorm(net, "conv_1/bn", is_training=self.is_training))
            net = self._depthwise_separable_conv2d(net, 64, self.width_multiplier,
                                "ds_conv_2") # ->[N, 112, 112, 64]
            net = self._depthwise_separable_conv2d(net, 128, self.width_multiplier,
                                "ds_conv_3", downsample=True) # ->[N, 56, 56, 128]
            net = self._depthwise_separable_conv2d(net, 128, self.width_multiplier,
                                "ds_conv_4") # ->[N, 56, 56, 128]
            net = self._depthwise_separable_conv2d(net, 256, self.width_multiplier,
                                "ds_conv_5", downsample=True) # ->[N, 28, 28, 256]
            net = self._depthwise_separable_conv2d(net, 256, self.width_multiplier,
                                "ds_conv_6") # ->[N, 28, 28, 256]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_7", downsample=True) # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_8") # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_9")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_10")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_11")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_12")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 1024, self.width_multiplier,
                                "ds_conv_13", downsample=True) # ->[N, 7, 7, 1024]
            net = self._depthwise_separable_conv2d(net, 1024, self.width_multiplier,
                                "ds_conv_14") # ->[N, 7, 7, 1024]
            net = avg_pool(net, 7, "avg_pool_15")
            net = tf.squeeze(net, [1, 2], name="SpatialSqueeze")
            self.logits = fc(net, self.num_classes, "fc_16")
            self.predictions = tf.nn.softmax(self.logits)
 
    def _depthwise_separable_conv2d(self, inputs, num_filters, width_multiplier,
                                    scope, downsample=False):
        """depthwise separable convolution 2D function"""
        num_filters = round(num_filters * width_multiplier)
        strides = 2 if downsample else 1
 
        with tf.variable_scope(scope):
            # depthwise conv2d
            dw_conv = depthwise_conv2d(inputs, "depthwise_conv", strides=strides)
            # batchnorm
            bn = batchnorm(dw_conv, "dw_bn", is_training=self.is_training)
            # relu
            relu = tf.nn.relu(bn)
            # pointwise conv2d (1x1)
            pw_conv = conv2d(relu, "pointwise_conv", num_filters)
            # bn
            bn = batchnorm(pw_conv, "pw_bn", is_training=self.is_training)
            return tf.nn.relu(bn)
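The helper ops conv2d, batchnorm, depthwise_conv2d, avg_pool and fc used above are not shown in the original post. The following is a minimal TensorFlow 1.x-style sketch with matching signatures; the padding choices, initializers and the simplified batch normalization are assumptions, not the author's original helpers:

import tensorflow as tf

def conv2d(inputs, scope, num_filters, filter_size=1, strides=1):
    """Standard convolution; defaults to a 1x1 (pointwise) convolution."""
    in_channels = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        weights = tf.get_variable("weights",
                                  [filter_size, filter_size, in_channels, num_filters],
                                  initializer=tf.truncated_normal_initializer(stddev=0.01))
        return tf.nn.conv2d(inputs, weights, strides=[1, strides, strides, 1], padding="SAME")

def depthwise_conv2d(inputs, scope, filter_size=3, channel_multiplier=1, strides=1):
    """Depthwise convolution: one filter per input channel."""
    in_channels = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        weights = tf.get_variable("weights",
                                  [filter_size, filter_size, in_channels, channel_multiplier],
                                  initializer=tf.truncated_normal_initializer(stddev=0.01))
        return tf.nn.depthwise_conv2d(inputs, weights, strides=[1, strides, strides, 1],
                                      padding="SAME")

def batchnorm(inputs, scope, is_training=True, epsilon=1e-5):
    """Simplified batch normalization using batch statistics.
    (A full implementation would track moving averages for is_training=False.)"""
    params_shape = inputs.get_shape().as_list()[-1:]
    with tf.variable_scope(scope):
        beta = tf.get_variable("beta", params_shape, initializer=tf.zeros_initializer())
        gamma = tf.get_variable("gamma", params_shape, initializer=tf.ones_initializer())
        mean, variance = tf.nn.moments(inputs, axes=[0, 1, 2])
        return tf.nn.batch_normalization(inputs, mean, variance, beta, gamma, epsilon)

def avg_pool(inputs, pool_size, scope):
    """Average pooling over the full spatial extent (e.g. 7x7 -> 1x1)."""
    with tf.variable_scope(scope):
        return tf.nn.avg_pool(inputs, ksize=[1, pool_size, pool_size, 1],
                              strides=[1, pool_size, pool_size, 1], padding="VALID")

def fc(inputs, num_outputs, scope):
    """Fully connected layer on a flattened [N, C] tensor."""
    in_dim = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        weights = tf.get_variable("weights", [in_dim, num_outputs],
                                  initializer=tf.truncated_normal_initializer(stddev=0.01))
        biases = tf.get_variable("biases", [num_outputs], initializer=tf.zeros_initializer())
        return tf.matmul(inputs, weights) + biases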

         This article briefly introduces MobileNet, the mobile model proposed by Google. Its core is the depthwise separable convolution, which reduces both the computational cost and the size of the model.

2) The impact of the learning rate and its strategy on the classification model

The learning rate is a very important hyperparameter. The learning rate, its decay strategy and the optimization method all affect the convergence of the model. Its influence is verified here using the ALLconv6 network described above.

  Commonly used and effective learning rate strategies are Step and Multistep, and the two have similar effects. If the initial learning rate is too small, SGD converges slowly, so the learning rate should not be too small at the beginning. In Yousan's book, an initial value of 0.001 leads to slow convergence at the start and is not conducive to optimization.
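For reference, a minimal sketch of Step- and Multistep-style schedules in TensorFlow 1.x; the boundaries and rates below are illustrative values, not the settings used in the book:

import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")

# "Step": multiply the learning rate by a fixed factor every N iterations.
step_lr = tf.train.exponential_decay(learning_rate=0.1, global_step=global_step,
                                     decay_steps=10000, decay_rate=0.1, staircase=True)

# "Multistep": drop the learning rate at hand-picked iteration boundaries.
multistep_lr = tf.train.piecewise_constant(global_step,
                                           boundaries=[30000, 60000],
                                           values=[0.1, 0.01, 0.001])

optimizer = tf.train.MomentumOptimizer(multistep_lr, momentum=0.9)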

3) Effect of regularization factor on classification model

The regularization factor is very important for reducing overfitting of the model. What is regularization?

a) Question introduction

 

         When the hypothesis contains too many high-order terms, the algorithm will produce a distorted curve in order to minimize the cost, that is, the difference between predictions and labels.

          Although this distorted curve fits all the data points perfectly, it is clear that such a curve does not generalize: it cannot accurately predict future data. This situation is called overfitting.

         In general, when the amount of data is small and the number of features is large, overfitting is more likely to occur.

What is regularization?

    Regularization means adding rules or constraints. What is a rule? You may not consult books in a closed-book exam; that is a rule, a restriction. In the same way, regularization adds constraints to the loss function so that, in subsequent iterations, the parameters are kept in check and do not inflate without bound.

Why is regularization needed?


        Let's first review the process of model training. Training the model parameters is an iterative search for an equation that fits the data set. At this point, however, we only know that we need to fit the training set; we have not yet discussed what the best degree of fitting is. Look at the fits of the regression model below and see what you notice.

The leftmost picture: the degree of fitting is low, and this is obviously not what we want. Since even the training-set accuracy is low, the test-set accuracy cannot be high either; the generalization ability of the model is poor.

The rightmost picture: the degree of fitting is very high, and every point is matched exactly. Is this what we want? No. There is unavoidably a lot of noise in the data set, and ideally we would like that noise to have zero impact on training. If the model describes every point in the training set exactly, it has clearly absorbed many noise points, and the accuracy obtained later on the test set will not be high. Moreover, such an overly complex function is not smooth; it twists and turns as shown in the figure and cannot be used for prediction, so the regression model loses its predictive (generalization) ability, which is obviously not what we want.

The middle picture: this shows the most suitable degree of fitting, neither too complicated nor too simple, and the trend of the function can be predicted intuitively. Although it does not fit the training set as closely as the rightmost picture, it achieves the highest accuracy on the test set, and its generalization ability is the strongest.

 Similarly, overfitting and underfitting also occur in classification models.

In conclusion:

1. Underfitting: poor generalization ability, low accuracy on the training set, low accuracy on the test set.
2. Overfitting: poor generalization ability, high accuracy on the training set, low accuracy on the test set.
3. Appropriate fitting: strong generalization ability, high accuracy on the training set, high accuracy on the test set.


Reasons for underfitting:

1. The number of training samples is too small.
2. The model complexity is too low.
3. Training stops before the parameters have converged.


Solutions for underfitting:

1. Increase the number of samples
2. Increase model parameters and increase model complexity
3. Increase the number of cycles
4. Check whether the learning rate is too high, preventing the model from converging.


Reasons for overfitting:

1. The data is too noisy
2. There are too many features
3. The model is too complex

Solutions for overfitting:

1. Clean the data
2. Reduce the model parameters and reduce the model complexity
3. Add a penalty term (regularization): keep all the features, but reduce the magnitude of the parameters.

     

 Through this analysis, we can see that regularization is a means of preventing the model from overfitting: we add a constraint to the cost function that limits the size of its higher-order parameters. Take the regression model as an example again:

 It is the high-order terms that lead to overfitting, so if we can make the coefficients of these high-order terms close to 0, we can fit the data well. Therefore, we modify the cost function J(\theta), for example by penalizing the two high-order parameters with large coefficients:

J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 10000\,\theta_4^2

We have added two penalty terms to the equation, restricting \theta_3 and \theta_4 so that they cannot become too large. Intuitively, to minimize J(\theta), the squared-error term must still fit the data, while at the same time \theta_3 and \theta_4 must be kept small.

 This is the meaning of regularization: it helps us prevent the model from overfitting during training.

b) How to regularize the model?

        We now generalize the previous discussion. If we have many features, we do not know in advance which parameters should be penalized, so we penalize all of them and let the optimization of the cost function decide how strongly each one is suppressed. Accordingly, the cost functions of the linear regression model and of the logistic regression model are modified to their standard regularized forms:

Linear regression: J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]

Logistic regression: J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

 Here \lambda is called the regularization parameter (Regularization Parameter). The larger \lambda is, the stronger the penalty and the stronger the regularizing effect. But beware: bigger is not always better. If \lambda is too large, all parameters are driven toward zero, the model degenerates towards a constant h_\theta(x) = \theta_0, and underfitting results. Therefore \lambda needs to be chosen reasonably.

c) Different regularization terms

 

    Some readers may ask: why is the regularization term squared? Let us compare regularization terms of different powers, taking the first and second powers as examples, i.e. the L1 term \lambda\sum_j|\theta_j| and the L2 term \lambda\sum_j\theta_j^2. In two dimensions their constraint regions are a diamond (a square rotated 45°) and a circle respectively, as shown in the figure:

So what difference does this make? Draw the whole cost function (data term plus regularization term) and it becomes clear.

       Intuitively, minimizing the loss function means finding the minimum of the sum of the two terms (the blue contours of the data term plus the red contour of the regularization term), and in many cases this minimum lies where the two surfaces touch.

Here you can see the advantage of the quadratic (L2) regularization term: it is differentiable everywhere and easy to optimize. The corners of the L1 diamond, on the other hand, are exactly what makes L1 solutions tend to land on the axes, i.e. produce exact zeros (sparsity).

 Comparison of L1 regularization and L2 regularization in practical applications
     L1 works well in scenarios where a sparse model is genuinely needed, and there its effect is far better than L2. When the number of model features is much greater than the number of training samples, and we know in advance that only a small number of features are actually relevant (i.e. have non-zero parameters), with that number smaller than the number of training samples, then L1 performs much better than L2. Note, however, that when the number of relevant features is much larger than the number of training samples, neither L1 nor L2 achieves good results.
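A quick way to see the difference is to apply a single penalty (proximal) update to the same weight vector: the L1 penalty pushes small weights exactly to zero (sparsity), while the L2 penalty only shrinks them. This toy numpy sketch is an illustration added here, not from the original post:

import numpy as np

w = np.array([0.8, 0.05, -0.03, 1.2])
lam, lr = 0.1, 1.0

# L2 ("ridge") proximal step: uniform shrinkage, nothing becomes exactly zero.
w_l2 = w / (1.0 + lr * lam)

# L1 ("lasso") proximal step: soft-thresholding, small weights are zeroed out.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

print(w_l2)   # [ 0.727  0.045 -0.027  1.091] -- all entries shrunk, none exactly zero
print(w_l1)   # [ 0.7    0.    -0.     1.1  ] -- small entries become exactly zero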

 Reference blog post:

An article to fully understand regularization (Regularization)

 d) Effect of regularization factor on classification model

     As discussed above, \lambda is the regularization parameter: the larger it is, the stronger the penalty. But larger is not always better. If \lambda is chosen too large, all parameters are suppressed, the model degenerates towards a constant and underfitting results; if it is too small, overfitting remains. Therefore, for the classification model, \lambda needs to be chosen reasonably.
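For a CNN classifier such as ALLconv6, this penalty is usually applied as L2 weight decay added to the data loss. A minimal TensorFlow 1.x sketch; the logits and labels tensors and the "weights" variable-naming convention are assumptions, not taken from the original post:

import tensorflow as tf

# Assume `logits` and `labels` already exist in the graph.
data_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

# L2 regularization term: sum of squared weights (biases are usually excluded).
weight_decay = 5e-4   # the regularization factor lambda, to be tuned
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()
                    if "weights" in v.name])

total_loss = data_loss + weight_decay * l2_loss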

 

4) The influence of the optimization method on the classification model

     Optimization methods mainly use the derivatives of the objective function to iteratively solve an unconstrained optimization problem. A summary of different optimization methods can be found in the blog post:

Optimization methods in deep learning (Qianmeng's study notes)

      According to the experiments in Yousan's book, SGD with momentum works best, followed by Nesterov, and then the tuned Adam and AdaGrad.
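As a reference point, SGD with momentum and its Nesterov variant differ by one flag in TensorFlow 1.x; the learning rates below are common defaults, not the exact settings from the book:

import tensorflow as tf

# Plain SGD with momentum, reported above as the best choice in the experiments.
momentum_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# Nesterov accelerated gradient: the same optimizer with use_nesterov=True.
nesterov_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                          use_nesterov=True)

# Adaptive methods compared in the text.
adam_opt = tf.train.AdamOptimizer(learning_rate=0.001)
adagrad_opt = tf.train.AdagradOptimizer(learning_rate=0.01)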

5) The influence of network depth on the classification model

 a) Using the ALLconv networks to test depth on the CHIM-10K data set

        ALLconv5, ALLconv6, ALLconv7_1, ALLconv7_2, ALLconv8_1, ALLconv8_2 and ALLconv8_3 are tested. The total batch_size is 16*4: four GPUs are used, each with a batch_size of 16.

      Through the experiments in Yousan's book, the following conclusions can be drawn:

1. As the network deepens, model performance improves, but the improvement becomes smaller and smaller and gradually approaches a limit. The appropriate network depth can be found from the classification accuracy; performance does not keep improving indefinitely as the network gets deeper. The effect of network depth on model performance has a critical point.

2. In the shallow layers of the network, the larger the feature-map resolution, the better the model accuracy.


Origin blog.csdn.net/YOULANSHENGMENG/article/details/120848240