CNN image recognition

1. Image recognition

  • Image recognition technology is an important technology of the information age; its purpose is to let computers process large amounts of physical information in place of humans. As computer technology has developed, our understanding of image recognition has steadily deepened.
  • Image recognition technology is defined as the technology of using computers to process, analyze, and understand images in order to identify targets and objects of different patterns.
  • The image recognition process is divided into information acquisition, preprocessing, feature extraction and selection, classifier design, and classification decision.

1.1 Pattern Recognition

  • Pattern recognition is an important part of artificial intelligence and information science. It refers to the process of analyzing and processing different forms of information that represent things or phenomena, in order to describe, identify, and classify them.
  • Computer image recognition simulates the human image recognition process, and pattern recognition is essential to it.
  • Pattern recognition was originally a basic human ability. With the development of computers and the rise of artificial intelligence, human pattern recognition alone could no longer meet the demands of daily life, so people sought to use computers to replace or extend part of human mental work. This is how computer pattern recognition came about.
  • Simply put, pattern recognition is the classification of data. It is a science tightly integrated with mathematics, and the ideas it uses are mostly from probability and statistics.

1.2 Process of Image Recognition

The process of image recognition technology is divided into the following steps:

  1. Information acquisition: converting information such as light or sound into electrical signals through sensors; that is, obtaining the basic information of the research object and transforming it, by some method, into information the machine can recognize.
  2. Preprocessing: operations such as denoising, smoothing, and transformation in image processing, which enhance the important features of the image (image enhancement).
  3. Feature extraction and selection: extracting and selecting the features needed for pattern recognition. This is one of the key techniques in the image recognition process.
  4. Classifier design: obtaining a recognition rule through training; with this rule a feature classification can be obtained, allowing the image recognition system to achieve a high recognition rate.
  5. Classification decision: classifying the identified objects in the feature space, so as to determine which category the object under study belongs to.

1.3 Application of Image Recognition

  1. Image classification
  2. Web search
  3. Search by image
  4. Smart home
  5. E-commerce shopping: "find similar items" (photo recognition / scan recognition)
  6. Agriculture and forestry: forest surveys
  7. Finance
  8. Security
  9. Medical imaging
  10. Entertainment and content moderation

2. Deep learning development

2.1 Why deep learning rose

  • Data Scale Drives Deep Learning Progress

2.2 Classification and detection

  • When we face a picture, the most basic task is to determine what it shows: is it a landscape or a portrait, a building or food? This is classification.
  • Once the category of the image is known, the next step is detection. For example, knowing that an image contains a face, detection asks where the face is and whether it can be framed with a bounding box.
  • Object classification and detection are widely used in many fields: face recognition, pedestrian detection, intelligent video analysis, and pedestrian tracking in the security field; object recognition in traffic scenes, vehicle counting, wrong-way detection, and license plate detection and recognition in the traffic field; and content-based image retrieval and automatic photo album classification on the Internet.

2.3 Common Convolutional Neural Networks


3. VGG

The reason VGG is a classic is that it was the first to make deep learning truly "deep", reaching 16-19 layers, while using very "small" convolution kernels (3×3).

3.1 VGG16

The inspiration VGG gives us is that we can make the network deeper, while paying attention to the problem of overfitting as we do so.

3.2 Structure of VGG16:

  1. The original image is resized to (224,224,3).
  2. Conv1: two [3,3] convolutions with 64 output channels; the output is (224,224,64). Then 2×2 max pooling gives (112,112,64).
  3. Conv2: two [3,3] convolutions with 128 output channels; the output is (112,112,128). Then 2×2 max pooling gives (56,56,128).
  4. Conv3: three [3,3] convolutions with 256 output channels; the output is (56,56,256). Then 2×2 max pooling gives (28,28,256).
  5. Conv4: three [3,3] convolutions with 512 output channels; the output is (28,28,512). Then 2×2 max pooling gives (14,14,512).
  6. Conv5: three [3,3] convolutions with 512 output channels; the output is (14,14,512). Then 2×2 max pooling gives (7,7,512).
  7. Two fully connected layers simulated with convolutions (the effect is equivalent); the output of each is (1,1,4096).
  8. One more convolution-simulated fully connected layer; the output is (1,1,1000), the prediction for each class.

VGG uses relatively small convolution kernels, such as 3×3, while AlexNet uses relatively large ones, such as 11×11. The advantage of small kernels is that finer-grained information can be captured.
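
As a quick sanity check (plain arithmetic, assuming C input and C output channels and ignoring biases), two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 kernel while using fewer parameters:

C = 64
params_5x5 = 5 * 5 * C * C                # one 5x5 convolution: 102400 weights
params_3x3_stack = 2 * (3 * 3 * C * C)    # two stacked 3x3 convolutions: 73728 weights
print(params_5x5, params_3x3_stack)       # the stack needs ~28% fewer parameters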

3.3 Using convolutional layers instead of fully connected layers

The difference between a convolutional layer and a fully connected layer: the convolutional layer is locally connected, while the fully connected layer uses the global information of the image. One can think of it this way: when the local receptive field grows to cover the entire input, "local" becomes "global". This already suggests the feasibility of replacing fully connected layers with convolutional layers.

e.g.: if the input has 50×4×4 neuron nodes and the output has 500 nodes, a fully connected layer needs 50×4×4×500 = 400,000 weight parameters W and 500 bias parameters b.

Both the convolutional layer and the fully connected layer perform dot-product operations, so their functions have the same form, and a fully connected layer can be converted into an equivalent convolutional layer. We only need to make the convolution kernel the same size as the input feature map (h, w), which gives the convolution exactly as many parameters as the fully connected layer.

For example, in VGG16 the input of the first fully connected layer is 7×7×512 and the output is 4096. This can be represented equivalently by a convolutional layer with a 7×7 kernel, stride 1, no padding, and 4096 output channels: its output is 1×1×4096, which is equivalent to the fully connected layer. Subsequent fully connected layers can likewise be replaced with 1×1 convolutions.

In short, the rule for converting a fully connected layer to a convolutional layer is: set the size of the convolution kernel to the size of the input space.
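
A minimal PyTorch sketch of this rule (illustrative, reusing the VGG16 shapes above; not the original author's code): copying the weights of a fully connected layer into a 7×7 convolution reproduces its output on a 7×7×512 feature map.

import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)  # stride 1, no padding -> 1x1 spatial output

# Reuse the fully connected weights as convolution weights.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))        # shape (1, 4096)
out_conv = conv(x).flatten(1)    # shape (1, 4096)
print(torch.allclose(out_fc, out_conv, atol=1e-4))  # True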

3.4 The role of 1×1 convolution

1×1 convolutions raise or lower the dimensionality of the feature channels: by controlling the number of convolution kernels, the number of channels can be scaled up or down. A pooling layer, by contrast, can only change the height and width; it cannot change the number of channels.
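
For illustration (the shapes here are arbitrary), a 1×1 convolution rescales only the channel dimension, while pooling changes only the height and width:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)

reduce = nn.Conv2d(256, 64, kernel_size=1)    # channel reduction: 256 -> 64
expand = nn.Conv2d(64, 512, kernel_size=1)    # channel expansion: 64 -> 512
pool = nn.MaxPool2d(2)                        # halves H and W, keeps channels

print(reduce(x).shape)          # torch.Size([1, 64, 28, 28])
print(expand(reduce(x)).shape)  # torch.Size([1, 512, 28, 28])
print(pool(x).shape)            # torch.Size([1, 256, 14, 14])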

3.5 VGG16 Code Example

import torch.nn as nn

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG16, self).__init__()

        # VGG16 has five convolutional blocks.
        # Below we define the convolutional layers inside each block.

        self.features = nn.Sequential(
            # Block 1: two convolutional layers, each with 64 3x3 kernels
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 2: two convolutional layers, each with 128 3x3 kernels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 3: three convolutional layers, each with 256 3x3 kernels
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 4: three convolutional layers, each with 512 3x3 kernels
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 5: also three convolutional layers, each with 512 3x3 kernels
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # The three fully connected layers of VGG16
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), # 7x7 is the feature map size (assuming a 224x224 input image)
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1) # flatten the tensor
        x = self.classifier(x)
        return x

# Instantiate the VGG16 model
model = VGG16()
print(model)

4. Residual model - Resnet

Residual net (residual network): the output of an earlier layer skips over several layers and is fed directly into the input of a later layer.

Residual neural unit: suppose the input of a piece of the network is x and the expected output is H(x). If we pass the input x directly through to the output as the initial result, then what remains to be learned is F(x) = H(x) - x. This is a residual neural unit. It changes the learning target: instead of learning the complete output H(x), the unit learns only the difference between output and input, H(x) - x, i.e., the residual.
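
A minimal sketch of such a unit (simplified for illustration; the full bottleneck version appears in the ResNet50 code in 4.3):

import torch.nn as nn

class ResidualUnit(nn.Module):
    # Minimal residual unit; assumes input and output shapes match.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)  # H(x) = F(x) + x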


4.1 Resnet

  • The biggest difference between an ordinary feedforward convolutional network and ResNet is that ResNet has many bypass branches that connect the input directly to later layers, so those layers can learn the residual directly. This structure is called a shortcut, or skip connection.
  • When traditional convolutional or fully connected layers pass information along, some information is inevitably lost or degraded. ResNet alleviates this problem: by routing the input directly to the output, it protects the integrity of the information, and the network only has to learn the difference between input and output, which simplifies the learning target and its difficulty.

ResNet50 has two basic blocks, the Conv Block and the Identity Block. The input and output dimensions of the Conv Block differ, so it cannot be stacked consecutively; its function is to change the network's dimensions. The Identity Block has identical input and output dimensions, so it can be stacked in series to deepen the network.

4.2 BatchNormalization

BatchNormalization:

  • All output data are normalized to a distribution with a mean close to 0 and a standard deviation close to 1, which keeps activations in the sensitive region of the activation function, avoids vanishing gradients, and speeds up convergence.
  • Accelerate the convergence speed of the model, and has a certain generalization ability.
  • Can reduce the use of dropout.
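
A small numeric check of this behavior (illustrative only; in training mode BatchNorm2d normalizes each channel over the batch and spatial dimensions):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)
x = torch.randn(8, 64, 14, 14) * 5 + 3   # activations far from zero mean / unit std

y = bn(x)
print(y.mean().item(), y.std().item())   # close to 0 and 1 after normalization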

4.3 ResNet50 Code Example

import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    def __init__(self, in_channels, filters, kernel_size, stride):
        super(IdentityBlock, self).__init__()

        filters1, filters2, filters3 = filters

        # First component of the main path
        self.conv1 = nn.Conv2d(in_channels, filters1, (1, 1))
        self.bn1 = nn.BatchNorm2d(filters1)
        self.relu1 = nn.ReLU()

        # Second component of the main path
        self.conv2 = nn.Conv2d(filters1, filters2, kernel_size, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(filters2)
        self.relu2 = nn.ReLU()

        # Third component of the main path
        self.conv3 = nn.Conv2d(filters2, filters3, (1, 1))
        self.bn3 = nn.BatchNorm2d(filters3)

        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)

        x = self.conv3(x)
        x = self.bn3(x)

        x += identity

        x = self.relu(x)
        return x

class ConvBlock(nn.Module):
    def __init__(self, in_channels, filters, kernel_size, stride):
        super(ConvBlock, self).__init__()

        filters1, filters2, filters3 = filters

        # First component of the main path
        self.conv1 = nn.Conv2d(in_channels, filters1, (1, 1), stride=stride)
        self.bn1 = nn.BatchNorm2d(filters1)
        self.relu1 = nn.ReLU()

        # Second component of the main path
        self.conv2 = nn.Conv2d(filters1, filters2, kernel_size, padding=1)
        self.bn2 = nn.BatchNorm2d(filters2)
        self.relu2 = nn.ReLU()

        # Third component of the main path
        self.conv3 = nn.Conv2d(filters2, filters3, (1, 1))
        self.bn3 = nn.BatchNorm2d(filters3)

        # Shortcut path
        self.shortcut_conv = nn.Conv2d(in_channels, filters3, (1, 1), stride=stride)
        self.shortcut_bn = nn.BatchNorm2d(filters3)

        self.relu = nn.ReLU()

    def forward(self, x):
        identity = self.shortcut_conv(x)
        identity = self.shortcut_bn(identity)

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)

        x = self.conv3(x)
        x = self.bn3(x)

        x += identity
        x = self.relu(x)
        return x

class ResNet50(nn.Module):
    def __init__(self, num_classes=1000):
        super(ResNet50, self).__init__()

        self.pad = nn.ZeroPad2d((3, 3, 3, 3))
        self.conv1 = nn.Conv2d(3, 64, (7, 7), stride=(2, 2))
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d((3, 3), stride=(2, 2))

        # Stage 2
        self.stage2a = ConvBlock(64, [64, 64, 256], 3, stride=(1, 1))
        self.stage2b = IdentityBlock(256, [64, 64, 256], 3, stride=(1, 1))
        self.stage2c = IdentityBlock(256, [64, 64, 256], 3, stride=(1, 1))

        # Stage 3
        self.stage3a = ConvBlock(256, [128, 128, 512], 3, stride=(2, 2))
        self.stage3b = IdentityBlock(512, [128, 128, 512], 3, stride=(1, 1))
        self.stage3c = IdentityBlock(512, [128, 128, 512], 3, stride=(1, 1))
        self.stage3d = IdentityBlock(512, [128, 128, 512], 3, stride=(1, 1))

        # Stage 4
        self.stage4a = ConvBlock(512, [256, 256, 1024], 3, stride=(2, 2))
        self.stage4b = IdentityBlock(1024, [256, 256, 1024], 3, stride=(1, 1))
        self.stage4c = IdentityBlock(1024, [256, 256, 1024], 3, stride=(1, 1))
        self.stage4d = IdentityBlock(1024, [256, 256, 1024], 3, stride=(1, 1))
        self.stage4e = IdentityBlock(1024, [256, 256, 1024], 3, stride=(1, 1))
        self.stage4f = IdentityBlock(1024, [256, 256, 1024], 3, stride=(1, 1))

        # Stage 5 (final feature dimension 2048 matches the fc layer below)
        self.stage5a = ConvBlock(1024, [512, 512, 2048], 3, stride=(2, 2))
        self.stage5b = IdentityBlock(2048, [512, 512, 2048], 3, stride=(1, 1))
        self.stage5c = IdentityBlock(2048, [512, 512, 2048], 3, stride=(1, 1))

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.pad(x)
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        # Stage 2
        x = self.stage2a(x)
        x = self.stage2b(x)
        x = self.stage2c(x)

        # Stage 3
        x = self.stage3a(x)
        x = self.stage3b(x)
        x = self.stage3c(x)
        x = self.stage3d(x)

        # Stage 4
        x = self.stage4a(x)
        x = self.stage4b(x)
        x = self.stage4c(x)
        x = self.stage4d(x)
        x = self.stage4e(x)
        x = self.stage4f(x)

        # Stage 5
        x = self.stage5a(x)
        x = self.stage5b(x)
        x = self.stage5c(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

model = ResNet50()
print(model)

5. Transfer Learning & Inception

5.1 Convolutional Neural Network Transfer Learning: fine-tuning

  • In practice, few people train networks from scratch, because datasets are rarely large enough. A common practice is to take a network pre-trained on a large dataset (such as a 1000-class network trained on ImageNet) and fine-tune it, or use it as a feature extractor.
  • Simply put, transfer learning moves a convolutional neural network model trained on one dataset to another dataset through simple adjustments.
  • As the number of layers and the complexity of a model increase, its error rate decreases. But training a complex convolutional neural network requires a great deal of labeled data and takes days or even weeks. Transfer learning addresses both the labeling and the training-time problem.

Two common types of transfer learning scenarios:

  1. Convolutional network as a feature extractor. Take a network pre-trained on ImageNet, remove the last fully connected layer, and use the rest as a feature extractor (for example, AlexNet yields a 4096-dimensional feature vector before the final classifier). Features extracted this way are called CNN codes. Once you have such features, you can classify images with a linear classifier (linear SVM, Softmax, etc.).
  2. Fine-tuning the convolutional network. Replace the input layer (data) of the network and continue training on the new data. When fine-tuning, you can choose to fine-tune all layers or only some of them. Usually the earlier layers extract generic image features (e.g. edge detection, color detection) that are useful for many tasks, while later layers extract features tied to specific categories, so it is often enough to fine-tune only the later layers (a minimal sketch follows this list).
  • Some papers show that all convolutional-layer parameters of a trained Inception model can be kept, replacing only the last fully connected layer. The part of the network before that last fully connected layer is called the bottleneck layer.
  • Rationale: in the trained Inception model, the output of the bottleneck layer passes through a single fully connected layer and can already distinguish 1000 image classes very well, so the vector output by the bottleneck layer can be regarded as an expressive feature vector for any image. On a new dataset, the trained network can therefore be used directly to extract image features, and the extracted feature vectors become the input for training a new single-layer fully connected network for the new classification problem.
  • Generally speaking, when data is plentiful, transfer learning does not perform as well as full retraining; but the training time and training samples it requires are far smaller than for training a complete model.
  • As for the Inception model itself, it is a convolutional neural network with a structure quite different from AlexNet: in AlexNet, convolutional layers are connected in series, whereas the Inception structure combines different convolutional layers in parallel.
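
A minimal fine-tuning sketch in PyTorch (assumptions: a recent torchvision with a VGG16 pre-trained on ImageNet, and a hypothetical 10-class target task; only the classifier is retrained):

import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained network and freeze the convolutional layers,
# using them as a fixed feature extractor.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final fully connected layer to match the new task (10 classes here);
# training then only updates this layer on the new dataset.
model.classifier[6] = nn.Linear(4096, 10)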

5.2 Inception

The Inception network is an important milestone in the history of CNN development. Before Inception, most popular CNNs simply stacked more and more convolutional layers, making the network deeper and deeper in the hope of better performance. But this runs into the following problems:

  1. The size of the salient parts of an image varies greatly.
  2. Because the location of the information varies so much, it is hard to choose an appropriate convolution kernel size: images where the information is distributed globally favor larger kernels, while images where it is distributed locally favor smaller kernels.
  3. Very deep networks are more prone to overfitting, and it is hard to propagate gradient updates through the whole network.
  4. Simply stacking large convolutional layers is computationally very expensive.

5.3 Inception module

Solution:
Why not run filters of several sizes at the same level? The network essentially becomes slightly "wider" rather than "deeper". This is why the authors designed the Inception module.

Inception module: it convolves the input with 3 filters of different sizes (1x1, 3x3, 5x5) and additionally applies max pooling. The outputs of all branches are concatenated and sent to the next Inception module.

On the one hand this increases the width of the network; on the other, it increases the network's adaptability to scale.


Inception module with dimensionality reduction:
As mentioned earlier, deep neural networks consume a lot of computing resources. To reduce the computational cost, the authors add an extra 1x1 convolutional layer before the 3x3 and 5x5 convolutions to limit the number of input channels. Adding an extra convolution may seem counter-intuitive, but 1x1 convolutions are far cheaper than 5x5 convolutions, and the reduced number of input channels further cuts the cost.
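
To make the saving concrete (illustrative channel counts: 192 input channels, a 16-channel 1x1 bottleneck, and 32 output channels on the 5x5 branch):

c_in, c_mid, c_out = 192, 16, 32

naive = 5 * 5 * c_in * c_out                               # direct 5x5 branch: 153600 weights
bottleneck = 1 * 1 * c_in * c_mid + 5 * 5 * c_mid * c_out  # 3072 + 12800 = 15872 weights
print(naive, bottleneck)                                   # roughly a 10x reduction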

5.4 InceptionV1–Googlenet

  1. GoogLeNet adopts a modular Inception structure (9 modules), 22 layers in total;
  2. To avoid vanishing gradients, the network adds 2 auxiliary softmax classifiers that inject extra gradient signal into the middle of the network.

5.5 InceptionV2

Inception V2 adds BatchNormalization when inputting:

  • All output data are normalized to a distribution with a mean close to 0 and a standard deviation close to 1, keeping activations in the sensitive region of the activation function, avoiding vanishing gradients, and speeding up convergence.
  • Accelerate the convergence speed of the model, and has a certain generalization ability.
  • Can reduce the use of dropout.

The v2 structural change:

  • The authors propose that a small network of two stacked 3x3 convolutional layers (stride=1) can replace a single 5x5 convolutional layer. This is the InceptionV2 structure.
  • A 5x5 convolution kernel has 25/9 ≈ 2.78 times as many parameters as a 3x3 kernel.


In addition, the authors factorize the n×n convolution into a 1×n convolution followed by an n×1 convolution.
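
Both factorizations trade one large kernel for a stack of small ones; the parameter savings are easy to verify (assuming C input and C output channels, biases ignored):

C = 64

# Two stacked 3x3 convolutions vs. one 5x5 convolution
print(2 * (3 * 3 * C * C), 5 * 5 * C * C)   # 73728 vs 102400

# A 1x7 followed by a 7x1 convolution vs. one 7x7 convolution
print(2 * (1 * 7 * C * C), 7 * 7 * C * C)   # 57344 vs 200704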

The first three principles are used to build three different types of Inception modules.

5.6 InceptionV3-Network Structure Diagram

InceptionV3 integrates all the upgrades mentioned for Inception v2, and in addition uses 7x7 convolutions.

5.7 Inception V3

Inception V3 design ideas and Trick:

  1. Factorization into small convolutions is very effective: it reduces the number of parameters, reduces overfitting, and increases the network's nonlinear expressive power.
  2. From input to output, the convolutional network should gradually shrink the spatial size of the feature maps and gradually increase the number of output channels, i.e., simplify the spatial structure and transform spatial information into high-level abstract feature information.
  3. The Inception Module uses multiple branches to extract high-level features at different levels of abstraction, which is very effective and enriches the network's expressive power.

5.8 Inception V4


  1. The left picture is the basic Inception v2/v3 module, which uses two 3x3 convolutions instead of a 5x5 convolution and uses average pooling; this module mainly processes 35x35 feature maps.
  2. The middle picture's module uses 1xn and nx1 convolutions instead of nxn convolutions, and also uses average pooling; this module mainly processes 17x17 feature maps.
  3. The right picture replaces the 3x3 convolution with a 1x3 convolution and a 3x1 convolution.

In general, the basic Inception module in Inception v4 still follows the Inception v2/v3 structure, but it looks more concise and unified, more Inception modules are used, and the experimental results are better.

5.9 Summary of Inception Structure

Inception model advantages:

  1. It uses 1x1 convolution kernels, which are cost-effective: with little extra computation they add a layer of feature transformation and nonlinearity.
  2. It proposes Batch Normalization, which pulls the input distribution of each layer's neurons back to a normal distribution with mean 0 and variance 1, so that inputs fall into the sensitive region of the activation function, avoiding vanishing gradients and speeding up convergence.
  3. Introduce the Inception module, a structure combining 4 branches.

5.10 Convolutional Neural Network Transfer Learning

  • The structures most commonly used in engineering today are VGG, ResNet, and Inception. Designers usually train the data once on the original models, select the one that performs best, and then fine-tune and shrink it.
  • Models used in engineering must be fast as well as accurate.
  • Common model-reduction methods include reducing the number of convolution channels and reducing the number of ResNet modules.

5.11 InceptionV3 example

#-------------------------------------------------------------#
#   InceptionV3 network definition
#-------------------------------------------------------------#
from __future__ import print_function
from __future__ import absolute_import

import numpy as np

from keras.models import Model
from keras import layers
from keras.layers import Activation,Dense,Input,BatchNormalization,Conv2D,MaxPooling2D,AveragePooling2D
from keras.layers import GlobalAveragePooling2D
from keras.applications.imagenet_utils import decode_predictions
from keras.preprocessing import image


def conv2d_bn(x,
              filters,
              num_row,
              num_col,
              strides=(1, 1),
              padding='same',
              name=None):
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None
    x = Conv2D(
        filters, (num_row, num_col),
        strides=strides,
        padding=padding,
        use_bias=False,
        name=conv_name)(x)
    x = BatchNormalization(scale=False, name=bn_name)(x)
    x = Activation('relu', name=name)(x)
    return x


def InceptionV3(input_shape=[299,299,3],
                classes=1000):


    img_input = Input(shape=input_shape)

    x = conv2d_bn(img_input, 32, 3, 3, strides=(2, 2), padding='valid')
    x = conv2d_bn(x, 32, 3, 3, padding='valid')
    x = conv2d_bn(x, 64, 3, 3)
    x = MaxPooling2D((3, 3), strides=(2, 2))(x)

    x = conv2d_bn(x, 80, 1, 1, padding='valid')
    x = conv2d_bn(x, 192, 3, 3, padding='valid')
    x = MaxPooling2D((3, 3), strides=(2, 2))(x)

    #--------------------------------#
    #   Block1 35x35
    #--------------------------------#
    # Block1 part1
    # 35 x 35 x 192 -> 35 x 35 x 256
    branch1x1 = conv2d_bn(x, 64, 1, 1)

    branch5x5 = conv2d_bn(x, 48, 1, 1)
    branch5x5 = conv2d_bn(branch5x5, 64, 5, 5)

    branch3x3dbl = conv2d_bn(x, 64, 1, 1)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)

    branch_pool = AveragePooling2D((3, 3), strides=(1, 1), padding='same')(x)
    branch_pool = conv2d_bn(branch_pool, 32, 1, 1)
    
    # 64+64+96+32 = 256  nhwc-0123
    x = layers.concatenate(
        [branch1x1, branch5x5, branch3x3dbl, branch_pool],
        axis=3,
        name='mixed0')

    # Block1 part2
    # 35 x 35 x 256 -> 35 x 35 x 288
    branch1x1 = conv2d_bn(x, 64, 1, 1)

    branch5x5 = conv2d_bn(x, 48, 1, 1)
    branch5x5 = conv2d_bn(branch5x5, 64, 5, 5)

    branch3x3dbl = conv2d_bn(x, 64, 1, 1)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)

    branch_pool = AveragePooling2D((3, 3), strides=(1, 1), padding='same')(x)
    branch_pool = conv2d_bn(branch_pool, 64, 1, 1)
    
    # 64+64+96+64 = 288 
    x = layers.concatenate(
        [branch1x1, branch5x5, branch3x3dbl, branch_pool],
        axis=3,
        name='mixed1')

    # Block1 part3
    # 35 x 35 x 288 -> 35 x 35 x 288
    branch1x1 = conv2d_bn(x, 64, 1, 1)

    branch5x5 = conv2d_bn(x, 48, 1, 1)
    branch5x5 = conv2d_bn(branch5x5, 64, 5, 5)

    branch3x3dbl = conv2d_bn(x, 64, 1, 1)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)

    branch_pool = AveragePooling2D((3, 3), strides=(1, 1), padding='same')(x)
    branch_pool = conv2d_bn(branch_pool, 64, 1, 1)
    
    # 64+64+96+64 = 288 
    x = layers.concatenate(
        [branch1x1, branch5x5, branch3x3dbl, branch_pool],
        axis=3,
        name='mixed2')

    #--------------------------------#
    #   Block2 17x17
    #--------------------------------#
    # Block2 part1
    # 35 x 35 x 288 -> 17 x 17 x 768
    branch3x3 = conv2d_bn(x, 384, 3, 3, strides=(2, 2), padding='valid')

    branch3x3dbl = conv2d_bn(x, 64, 1, 1)
    branch3x3dbl = conv2d_bn(branch3x3dbl, 96, 3, 3)
    branch3x3dbl = conv2d_bn(
        branch3x3dbl, 96, 3, 3, strides=(2, 2), padding='valid')

    branch_pool = MaxPooling2D((3, 3), strides=(2, 2))(x)
    x = layers.concatenate(
        [branch3x3, branch3x3dbl, branch_pool], axis=3, name='mixed3')

    # Block2 part2
    # 17 x 17 x 768 -> 17 x 17 x 768
    branch1x1 = conv2d_bn(x, 192, 1, 1)

    branch7x7 = conv2d_bn(x, 128, 1, 1)
    branch7x7 = conv2d_bn(branch7x7, 128, 1, 7)
    branch7x7 = conv2d_bn(branch7x7, 192, 7, 1)

    branch7x7dbl = conv2d_bn(x, 128, 1, 1)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 128, 7, 1)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 128, 1, 7)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 128, 7, 1)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 192, 1, 7)

    branch_pool = AveragePooling2D((3, 3), strides=(1, 1), padding='same')(x)
    branch_pool = conv2d_bn(branch_pool, 192, 1, 1)
    x = layers.concatenate(
        [branch1x1, branch7x7, branch7x7dbl, branch_pool],
        axis=3,
        name='mixed4')

    # Block2 part3 and part4
    # 17 x 17 x 768 -> 17 x 17 x 768 -> 17 x 17 x 768
    for i in range(2):
        branch1x1 = conv2d_bn(x, 192, 1, 1)

        branch7x7 = conv2d_bn(x, 160, 1, 1)
        branch7x7 = conv2d_bn(branch7x7, 160, 1, 7)
        branch7x7 = conv2d_bn(branch7x7, 192, 7, 1)

        branch7x7dbl = conv2d_bn(x, 160, 1, 1)
        branch7x7dbl = conv2d_bn(branch7x7dbl, 160, 7, 1)
        branch7x7dbl = conv2d_bn(branch7x7dbl, 160, 1, 7)
        branch7x7dbl = conv2d_bn(branch7x7dbl, 160, 7, 1)
        branch7x7dbl = conv2d_bn(branch7x7dbl, 192, 1, 7)

        branch_pool = AveragePooling2D(
            (3, 3), strides=(1, 1), padding='same')(x)
        branch_pool = conv2d_bn(branch_pool, 192, 1, 1)
        x = layers.concatenate(
            [branch1x1, branch7x7, branch7x7dbl, branch_pool],
            axis=3,
            name='mixed' + str(5 + i))

    # Block2 part5
    # 17 x 17 x 768 -> 17 x 17 x 768
    branch1x1 = conv2d_bn(x, 192, 1, 1)

    branch7x7 = conv2d_bn(x, 192, 1, 1)
    branch7x7 = conv2d_bn(branch7x7, 192, 1, 7)
    branch7x7 = conv2d_bn(branch7x7, 192, 7, 1)

    branch7x7dbl = conv2d_bn(x, 192, 1, 1)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 192, 7, 1)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 192, 1, 7)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 192, 7, 1)
    branch7x7dbl = conv2d_bn(branch7x7dbl, 192, 1, 7)

    branch_pool = AveragePooling2D((3, 3), strides=(1, 1), padding='same')(x)
    branch_pool = conv2d_bn(branch_pool, 192, 1, 1)
    x = layers.concatenate(
        [branch1x1, branch7x7, branch7x7dbl, branch_pool],
        axis=3,
        name='mixed7')

    #--------------------------------#
    #   Block3 8x8
    #--------------------------------#
    # Block3 part1
    # 17 x 17 x 768 -> 8 x 8 x 1280
    branch3x3 = conv2d_bn(x, 192, 1, 1)
    branch3x3 = conv2d_bn(branch3x3, 320, 3, 3,
                          strides=(2, 2), padding='valid')

    branch7x7x3 = conv2d_bn(x, 192, 1, 1)
    branch7x7x3 = conv2d_bn(branch7x7x3, 192, 1, 7)
    branch7x7x3 = conv2d_bn(branch7x7x3, 192, 7, 1)
    branch7x7x3 = conv2d_bn(
        branch7x7x3, 192, 3, 3, strides=(2, 2), padding='valid')

    branch_pool = MaxPooling2D((3, 3), strides=(2, 2))(x)
    x = layers.concatenate(
        [branch3x3, branch7x7x3, branch_pool], axis=3, name='mixed8')

    # Block3 part2 part3
    # 8 x 8 x 1280 -> 8 x 8 x 2048 -> 8 x 8 x 2048
    for i in range(2):
        branch1x1 = conv2d_bn(x, 320, 1, 1)

        branch3x3 = conv2d_bn(x, 384, 1, 1)
        branch3x3_1 = conv2d_bn(branch3x3, 384, 1, 3)
        branch3x3_2 = conv2d_bn(branch3x3, 384, 3, 1)
        branch3x3 = layers.concatenate(
            [branch3x3_1, branch3x3_2], axis=3, name='mixed9_' + str(i))

        branch3x3dbl = conv2d_bn(x, 448, 1, 1)
        branch3x3dbl = conv2d_bn(branch3x3dbl, 384, 3, 3)
        branch3x3dbl_1 = conv2d_bn(branch3x3dbl, 384, 1, 3)
        branch3x3dbl_2 = conv2d_bn(branch3x3dbl, 384, 3, 1)
        branch3x3dbl = layers.concatenate(
            [branch3x3dbl_1, branch3x3dbl_2], axis=3)

        branch_pool = AveragePooling2D(
            (3, 3), strides=(1, 1), padding='same')(x)
        branch_pool = conv2d_bn(branch_pool, 192, 1, 1)
        x = layers.concatenate(
            [branch1x1, branch3x3, branch3x3dbl, branch_pool],
            axis=3,
            name='mixed' + str(9 + i))
    # Global average pooling, then the fully connected classifier.
    x = GlobalAveragePooling2D(name='avg_pool')(x)
    x = Dense(classes, activation='softmax', name='predictions')(x)


    inputs = img_input

    model = Model(inputs, x, name='inception_v3')

    return model

def preprocess_input(x):
    x /= 255.
    x -= 0.5
    x *= 2.
    return x

if __name__ == '__main__':
    model = InceptionV3()

    model.load_weights("inception_v3_weights_tf_dim_ordering_tf_kernels.h5")
    
    img_path = 'elephant.jpg'
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)

    x = preprocess_input(x)

    preds = model.predict(x)
    print('Predicted:', decode_predictions(preds))


6. Model lightweight - Mobilenet

The MobileNet model is a lightweight deep neural network proposed by Google for embedded devices such as mobile phones. Its core idea is the depthwise separable convolution.

6.1 Mobilenet—depthwise separable convolution

Consider an ordinary convolution first:
suppose a 3×3 convolutional layer has 16 input channels and 32 output channels. Each of the 32 convolution kernels of size 3×3 traverses the data of all 16 channels to produce the 32 output channels, so the required number of parameters is 16×32×3×3 = 4608.

With depthwise separable convolution, 16 convolution kernels of size 3×3 traverse the 16 channels separately, producing 16 feature maps; then 32 kernels of size 1×1 traverse those 16 feature maps. The required number of parameters is 16×3×3 + 16×32×1×1 = 656.

Clearly, depthwise separable convolution substantially reduces the number of model parameters.
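
A minimal PyTorch sketch verifying this parameter arithmetic (16 input channels, 32 output channels, biases omitted):

import torch.nn as nn

standard = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16, bias=False)  # one 3x3 filter per channel
pointwise = nn.Conv2d(16, 32, kernel_size=1, bias=False)                        # 1x1 convolution across channels

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(standard))                      # 4608 = 16*32*3*3
print(count(depthwise) + count(pointwise))  # 656  = 16*3*3 + 16*32*1*1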


6.2 MobileNet code example

#-------------------------------------------------------------#
#   MobileNet network definition
#-------------------------------------------------------------#
import warnings
import numpy as np

from keras.preprocessing import image

from keras.models import Model
from keras.layers import DepthwiseConv2D,Input,Activation,Dropout,Reshape,BatchNormalization,GlobalAveragePooling2D,GlobalMaxPooling2D,Conv2D
from keras.applications.imagenet_utils import decode_predictions
from keras import backend as K


def MobileNet(input_shape=[224,224,3],
              depth_multiplier=1,
              dropout=1e-3,
              classes=1000):


    img_input = Input(shape=input_shape)

    # 224,224,3 -> 112,112,32
    x = _conv_block(img_input, 32, strides=(2, 2))

    # 112,112,32 -> 112,112,64
    x = _depthwise_conv_block(x, 64, depth_multiplier, block_id=1)

    # 112,112,64 -> 56,56,128
    x = _depthwise_conv_block(x, 128, depth_multiplier,
                              strides=(2, 2), block_id=2)
    # 56,56,128 -> 56,56,128
    x = _depthwise_conv_block(x, 128, depth_multiplier, block_id=3)

    # 56,56,128 -> 28,28,256
    x = _depthwise_conv_block(x, 256, depth_multiplier,
                              strides=(2, 2), block_id=4)
    
    # 28,28,256 -> 28,28,256
    x = _depthwise_conv_block(x, 256, depth_multiplier, block_id=5)

    # 28,28,256 -> 14,14,512
    x = _depthwise_conv_block(x, 512, depth_multiplier,
                              strides=(2, 2), block_id=6)
    
    # 14,14,512 -> 14,14,512
    x = _depthwise_conv_block(x, 512, depth_multiplier, block_id=7)
    x = _depthwise_conv_block(x, 512, depth_multiplier, block_id=8)
    x = _depthwise_conv_block(x, 512, depth_multiplier, block_id=9)
    x = _depthwise_conv_block(x, 512, depth_multiplier, block_id=10)
    x = _depthwise_conv_block(x, 512, depth_multiplier, block_id=11)

    # 14,14,512 -> 7,7,1024
    x = _depthwise_conv_block(x, 1024, depth_multiplier,
                              strides=(2, 2), block_id=12)
    x = _depthwise_conv_block(x, 1024, depth_multiplier, block_id=13)

    # 7,7,1024 -> 1,1,1024
    x = GlobalAveragePooling2D()(x)
    x = Reshape((1, 1, 1024), name='reshape_1')(x)
    x = Dropout(dropout, name='dropout')(x)
    x = Conv2D(classes, (1, 1),padding='same', name='conv_preds')(x)
    x = Activation('softmax', name='act_softmax')(x)
    x = Reshape((classes,), name='reshape_2')(x)

    inputs = img_input

    model = Model(inputs, x, name='mobilenet_1_0_224_tf')
    model_name = 'mobilenet_1_0_224_tf.h5'
    model.load_weights(model_name)

    return model

def _conv_block(inputs, filters, kernel=(3, 3), strides=(1, 1)):
    x = Conv2D(filters, kernel,
               padding='same',
               use_bias=False,
               strides=strides,
               name='conv1')(inputs)
    x = BatchNormalization(name='conv1_bn')(x)
    return Activation(relu6, name='conv1_relu')(x)


def _depthwise_conv_block(inputs, pointwise_conv_filters,
                          depth_multiplier=1, strides=(1, 1), block_id=1):

    x = DepthwiseConv2D((3, 3),
                        padding='same',
                        depth_multiplier=depth_multiplier,
                        strides=strides,
                        use_bias=False,
                        name='conv_dw_%d' % block_id)(inputs)

    x = BatchNormalization(name='conv_dw_%d_bn' % block_id)(x)
    x = Activation(relu6, name='conv_dw_%d_relu' % block_id)(x)

    x = Conv2D(pointwise_conv_filters, (1, 1),
               padding='same',
               use_bias=False,
               strides=(1, 1),
               name='conv_pw_%d' % block_id)(x)
    x = BatchNormalization(name='conv_pw_%d_bn' % block_id)(x)
    return Activation(relu6, name='conv_pw_%d_relu' % block_id)(x)

def relu6(x):
    return K.relu(x, max_value=6)

def preprocess_input(x):
    x /= 255.
    x -= 0.5
    x *= 2.
    return x

if __name__ == '__main__':
    model = MobileNet(input_shape=(224, 224, 3))

    img_path = 'elephant.jpg'
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    print('Input image shape:', x.shape)

    preds = model.predict(x)
    print(np.argmax(preds))
    print('Predicted:', decode_predictions(preds,1))  # show only the top-1 prediction


7. Design tips for convolutional neural networks

Problem background:

  • Becoming proficient at training neural networks is not easy. As with machine learning in general, the details make the difference, but training a neural network has even more details to handle. What are your data and hardware limitations? What kind of network should you start with? How many convolutional layers should you build? How should you set your activation functions?
  • The learning rate is the most important hyperparameter in neural network training, and also one of the hardest to tune. Too small, and you may never reach a solution; too large, and you may overshoot the optimum. With adaptive learning rate methods, you pay for that convenience in hardware resources and compute.
  • Design choices and hyperparameter settings greatly affect the training and performance of CNNs, but for newcomers to deep learning, the resources needed to build architectural design intuition can be scarce and scattered.


1) Architecture Follows Applications
You may be fascinated by the shiny new models invented by imaginative labs like Google Brain or DeepMind, but many of them are either impossible or impractical for your needs. Perhaps you should use the model that makes the most sense for your particular application; it may be very simple but still powerful, such as VGG.

2) Proliferation of paths
Every year the ImageNet Challenge winner has used a deeper network than the previous year's winner. From AlexNet to Inception to ResNets, the trend is that "the number of paths in the network grows exponentially".

3) Go for simplicity
Bigger isn't necessarily better.

4) Increase symmetry
Whether in architecture or in biology, symmetry is considered a sign of quality and craftsmanship.

5) Pyramid shape
You are always making a trade-off between representational power and the removal of redundant or useless information. CNNs usually downsample the activations and increase the number of channels from the input layer to the final layer.

6) Overtraining
Another trade-off is between training accuracy and generalization. Using regularization methods such as drop-out or drop-path to improve generalization is an important strength of neural networks. Train the network on problems harder than the actual use case to improve its generalization performance.

7) Covering the space of the problem
To expand the training data and improve generalization, add noise and artificially enlarge the training set. Examples include random rotation, random cropping, and other image augmentation operations.
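
For example, with torchvision (one common way to do this; the parameter values are arbitrary):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(15),                         # random rotation up to +/-15 degrees
    transforms.RandomResizedCrop(224),                     # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(),                     # mirror half of the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric noise
    transforms.ToTensor(),
])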

8) Incremental functional structures
As architectures become successful, they simplify the "job" of each layer. In very deep neural networks, each layer only modifies the input incrementally; in ResNets, the output of each layer may be similar to its input. So, in practice, use short skip lengths in ResNet.

9) Normalize the input of the layer
Normalization is a shortcut that can make the work of the calculation layer easier, and in practice can improve the accuracy of training. Normalization puts the input samples of all layers on an equal footing (similar to unit conversion), which allows backpropagation to train more efficiently.

10) Use a fine-tuned pre-trained network (fine tuning)
Mike Tung, CEO of the machine learning company Diffbot, says: "If your visual data is similar to ImageNet, then using a pre-trained network will help you learn faster." The lower layers of a CNN can often be reused, because they mostly detect common patterns such as lines and edges. For example, replace the classification layer with a layer of your own design, and use your specific data to train only the last few layers.

11) Using cyclical learning rates
Experimenting with learning rates consumes a lot of time and invites mistakes. Adaptive learning rates can be computationally expensive, but cyclical learning rates are not. With a cyclical learning rate, you set maximum and minimum bounds and let the learning rate vary within that range.
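
PyTorch ships a ready-made scheduler for this; a minimal sketch (the bounds and step size are placeholders to tune):

import torch
from torch.optim.lr_scheduler import CyclicLR

model = torch.nn.Linear(10, 2)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# The learning rate sweeps back and forth between base_lr and max_lr.
scheduler = CyclicLR(optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=2000)

# Inside the training loop, after each batch:
# optimizer.step()
# scheduler.step()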
