(2) Convolutional Neural Network - AlexNet

Overview

Due to the limited computing power of the time, LeNet did not attract much attention even though it achieved good results in image classification. It was not until 2012, when the AlexNet network proposed by Alex Krizhevsky and others won the ImageNet competition by a large margin over the second place, that convolutional neural networks, and deep learning in general, attracted widespread attention again.

Alex Krizhevsky et al. trained a large convolutional neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 competition into 1000 different categories. On the test data it achieved very good accuracy (top-1 and top-5 error rates of 37.5% and 17.0%). An improved version of the network won the ImageNet LSVRC-2012 competition, with an accuracy far ahead of the second place (top-5 test error rate of 15.3% versus 26.2% for the runner-up). This caused a sensation in academia and opened the era of deep learning. Although many convolutional network architectures that are faster and more accurate than AlexNet have appeared since, AlexNet, as a pioneer, still offers a lot to learn from; it set the tone for later CNNs and even for networks such as R-CNN. So below we start from AlexNet to understand the general structure of convolutional neural networks.

AlexNet Features

The AlexNet network contains 60 million parameters and 650,000 neurons: 5 convolutional layers, some of which are followed by pooling layers, then 3 fully connected layers, and finally a softmax output layer.

AlexNet deepens the network structure on the basis of LeNet and learns richer and higher-dimensional image features. Features of AlexNet:

1. A deeper network structure
2. Stacked convolutional layers (convolution + convolution + pooling) to extract image features
3. Dropout to suppress overfitting
4. Data augmentation to suppress overfitting
5. ReLU instead of the previous sigmoid as the activation function
6. Multi-GPU training

Activation function

In the original perceptron model, the relationship between input and output is as follows:
y = w1·x1 + w2·x2 + … + wn·xn + b
The above formula is only a simple linear relationship, and such a network structure has great limitations. Even if many layers of this kind are stacked, the output is still a linear function of the input, so the network cannot model input-output relationships that are nonlinear.

Therefore, a nonlinear transformation is applied to the output of each neuron: the result of the weighted summation above is fed into a nonlinear function, the activation function. With the activation function introduced, a stack of several network layers is no longer a simple linear transformation and has much stronger expressive power.

Sigmoid activation function

Initially, sigmoid and tanh functions were the most commonly used activation functions.
The sigmoid function is σ(x) = 1/(1+e^(−x)); tanh is a similarly shaped curve that squashes its input into (−1, 1).
When the number of network layers is small, the sigmoid function serves the role of an activation function well: it compresses a real number into the range 0 to 1. When the input is a very large positive number, the result is close to 1; when the input is a very large negative number, the result is close to 0.

This property nicely models whether a neuron fires and passes information on after being stimulated (an output near 0 means barely activated; an output near 1 means fully activated).

A big problem with sigmoid is gradient saturation. Looking at the sigmoid curve, when the input is very large (or very small) the function value barely changes, so its derivative becomes very small. In a network with many layers, backpropagation multiplies many such small sigmoid derivatives together, so the resulting gradient tends toward zero and the weights update very slowly.

ReLU activation function

ReLU(x) = max(0, x)
To address the slow training convergence caused by sigmoid's gradient saturation, ReLU was adopted in AlexNet. ReLU is a piecewise linear function: if the input is less than or equal to 0, the output is 0; if the input is greater than 0, the output equals the input.

Compared with sigmoid, ReLU has the following advantages (a short numerical check follows this list):
1. Lower computational cost: sigmoid's forward pass requires exponentiation and division, while ReLU's output is a simple threshold; in backpropagation, sigmoid's derivative again involves exponentials, while ReLU's derivative is simply 1 on the positive part.
2. No gradient saturation in the positive region, which avoids the vanishing-gradient problem described above.
3. Sparsity: ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting.
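
A minimal numerical check of the saturation argument in NumPy (illustrative only; the function names are not from the original post):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative of sigmoid, at most 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # derivative of ReLU: 1 for positive inputs, 0 otherwise

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # ~[4.5e-05, 0.20, 0.24, 4.5e-05]: vanishes for large |x|
print(relu_grad(x))     # [0., 0., 1., 1.]: stays 1 wherever the unit is active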

There is a question here. As mentioned earlier, the activation function should be nonlinear in order to make the network more expressive, yet ReLU is essentially a piecewise linear function. How does it perform a nonlinear transformation?

Here, the neural network is viewed as a huge transformation matrix M, whose input is the matrix A composed of all training samples, and the output is the matrix B, then: B=M⋅A. If M here is a linear transformation, then all training samples A are linearly transformed and output as B.

So for ReLU, because it is piecewise, the zero part corresponds to neurons that are not activated; different inputs activate different sets of neurons, so the effective transformation matrix formed by the active neurons differs from input to input.

Suppose there are two training samples a1 and a2, and the transformations the network applies to them during training are M1 and M2. Since the neurons activated under M1 differ from those activated under M2, M1 and M2 are in fact two different linear transformations. In other words, the linear transformation Mi applied to each training sample is different, so over the whole space of training samples the network performs a nonlinear transformation.

Put simply, the same feature appearing in different training samples is processed by different sets of neurons (neurons whose activation value is 0 are not activated). In this way, the final output is effectively a nonlinear transformation of the input samples.

A single training sample only undergoes a linear transformation, but the linear transformation differs from sample to sample, so over the whole training set the mapping is nonlinear.
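
A small sketch of this argument (the weights and inputs below are arbitrary, not from AlexNet): for a fixed weight matrix W, ReLU acts on each sample as a sample-dependent diagonal 0/1 mask times W, so the effective linear map changes from sample to sample.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                  # one linear layer with fixed weights
a1, a2 = rng.normal(size=3), rng.normal(size=3)

def relu(z):
    return np.maximum(z, 0.0)

# For each sample, ReLU(W @ a) equals (D @ W) @ a, where D is a diagonal 0/1
# matrix selecting the neurons activated by that particular sample.
for a in (a1, a2):
    z = W @ a
    D = np.diag((z > 0).astype(float))       # sample-dependent activation pattern
    assert np.allclose(relu(z), D @ W @ a)
    print(np.diag(D))                         # different mask => different effective linear map D @ W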

Data augmentation

Due to the large number of training parameters and strong expressive ability of the neural network, it needs a relatively large amount of data, otherwise it is easy to overfit. When the training data is limited, some new data can be generated from the existing training data set through some transformations to quickly expand the training data. For image data sets, some deformation operations can be performed on the image: flipping, random cropping, translation, color and light transformation...

The following operations are performed on the data in AlexNet (see the sketch after this list):

1. Random cropping: randomly crop the 256×256 image to 227×227, then flip it horizontally.

2. At test time, take five crops (upper left, upper right, lower left, lower right and center), flip each of them as well, for a total of 10 crops, and average the predictions.

3. Perform PCA (Principal Component Analysis) on the RGB channels and add a Gaussian perturbation with mean 0 and standard deviation 0.1 along the principal components, i.e. jitter the color and lighting. This reduced the error rate by 1%.
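
A rough sketch of the first two operations using tf.image (an illustrative modern equivalent, not the original implementation; the PCA color jitter of step 3 is omitted):

import tensorflow as tf

def augment(image):
    # Training-time augmentation: random 227x227 crop from a 256x256x3 image, then random flip.
    image = tf.image.random_crop(image, size=[227, 227, 3])
    return tf.image.random_flip_left_right(image)

def ten_crop(image, size=227):
    # Test-time: four corners + center, each also flipped, for 10 crops in total.
    h, w = 256, 256
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [tf.image.crop_to_bounding_box(image, y, x, size, size) for y, x in offsets]
    crops += [tf.image.flip_left_right(c) for c in crops]
    return tf.stack(crops)   # predictions on these 10 crops are averaged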

Overlapping pooling

Pooling in LeNet is non-overlapping, that is, the pooling window size and step size are equal.
The pooling used in AlexNet is overlapping: the stride of each move is smaller than the pooling window size. AlexNet uses 3×3 pooling windows with a stride of 2, so adjacent windows overlap. Compared with the non-overlapping scheme s = 2, z = 2, which produces output of the same dimensions, overlapping pooling reduces the top-5 error rate by 0.3% and helps suppress overfitting to a certain extent.
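
As a quick shape check in Keras (an illustrative snippet, not from the original post), overlapping 3×3/stride-2 pooling of a 55×55 map gives the same 27×27 output size as non-overlapping 2×2/stride-2 pooling:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 55, 55, 96))   # e.g. the C1 feature map
overlapping = layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(x)
non_overlapping = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)
print(overlapping.shape, non_overlapping.shape)   # both (1, 27, 27, 96)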

Local Response Normalization (LRN)

LRN normalizes each activation by the activity of neighboring kernel maps at the same spatial position. Denoting by a^i_{x,y} the ReLU output of kernel i at position (x, y), the normalized response is

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})² )^β

where N is the total number of kernels in the layer and the sum runs over n adjacent kernel maps. AlexNet uses k=2, n=5, α=10^−4, β=0.75; this scheme reduced the top-1 and top-5 error rates by 1.4% and 1.2% respectively.
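
TensorFlow still exposes this operation directly; a minimal sketch with the hyperparameters above (depth_radius corresponds to n/2 and bias to k; the input tensor is just an example):

import tensorflow as tf

x = tf.random.normal((1, 27, 27, 96))   # e.g. a pooled C1 feature map
y = tf.nn.local_response_normalization(x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)
print(y.shape)   # (1, 27, 27, 96): LRN rescales values but keeps the shape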

Dropout

This is a more commonly used method to suppress overfitting.

The main purpose of introducing Dropout is to prevent overfitting. In a neural network, Dropout is realized by modifying the structure of the network itself: for a given layer, each neuron is set to 0 with a defined probability, and such a neuron takes no part in forward or backward propagation, as if it had been deleted from the network, while the numbers of neurons in the input and output layers are kept unchanged. The parameters are then updated according to the usual learning procedure. In the next iteration, a different random subset of neurons is deleted (set to 0), and this repeats until training ends.

Dropout should be regarded as a great innovation in AlexNet, and it is now one of the standard components of neural networks. Dropout can also be viewed as a form of model combination: the network structure generated at each iteration is different, and combining these many models effectively reduces overfitting. Dropout achieves an effect similar to averaging many models at only about twice the training time, which is very efficient.
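
A minimal NumPy sketch of the mechanism (illustrative only; it uses the "inverted dropout" convention of modern frameworks, which rescales at training time, whereas the original paper instead scales activations at test time):

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    # At test time the full network is used, so the input passes through unchanged.
    if not training:
        return activations
    # Zero each neuron with probability p_drop; rescale the survivors so the
    # expected activation stays the same as without dropout.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h))                   # roughly half the entries are 0, the rest are 2.0
print(dropout(h, training=False))   # unchanged at inference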

AlexNet network structure

(Figure: the AlexNet architecture, drawn as two parallel towers, one per GPU.)
The network consists of 8 layers with weights; the first 5 are convolutional layers and the remaining 3 are fully connected layers. The output of the last fully connected layer is fed into a 1000-way softmax, which produces a distribution over the 1000 class labels.

It can be clearly seen from the figure that the network is split into an upper and a lower part, because the network is distributed across two GPUs. The main reason is that a single NVIDIA GTX 580 GPU has only 3GB of memory and cannot hold such a large network.

It should be noted that although the AlexNet network is usually drawn as in the figure above, the size of the input image is not 224×224×3 but should be 227×227×3. If you try to derive the layer sizes with 224, the boundary padding works out to a fractional value, which is obviously wrong, so the derivation is not repeated here.

The parameters and structure of each layer of AlexNet are as follows:

Input layer: 227x227x3
C1: 96x11x11x3 (number of convolution kernels/height/width/depth)
C2: 256x5x5x48 (number of convolution kernels/height/width/depth)
C3: 384x3x3x256 (number of convolution kernels/height/width/depth)
C4: 384x3x3x192 (number of convolution kernels/height/width/depth)
C5: 256x3x3x192 (number of convolution kernels/height/width/depth)

Network structure analysis

1. Convolution layer C1

The processing flow of this layer is: Convolution –> ReLU –> Pooling –> Normalization.

Convolution, the input is 227×227×3, convolved with 96 kernels of size 11×11×3 at a stride of 4; the resulting FeatureMap is 55×55×96 ((227−11)/4+1=55).

ReLU, input the FeatureMap output by the convolutional layer into the ReLU function.

Pooling, using 3×3 pooling units with a step size of 2 (overlapped pooling, the step size is smaller than the width of the pooling unit), the output is 27×27×96 ((55−3)/2+1=27)

Local response normalization, using k=2, n=5, α=10^−4, β=0.75; the output is still 27×27×96, split into two groups, each of size 27×27×48
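
For reference, a small helper (illustrative, not part of the original post) for the spatial-size arithmetic used throughout this analysis, output = (input − kernel + 2·padding) / stride + 1:

def conv_output_size(input_size, kernel, stride=1, padding=0):
    # Spatial output size of a convolution or pooling operation: (W - F + 2P) / S + 1.
    return (input_size - kernel + 2 * padding) // stride + 1

print(conv_output_size(227, 11, stride=4))           # C1 convolution: 55
print(conv_output_size(55, 3, stride=2))             # C1 pooling: 27
print(conv_output_size(27, 5, stride=1, padding=2))  # C2 convolution: 27
print(conv_output_size(13, 3, stride=2))             # C5 pooling: 6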

2. Convolution layer C2

The processing flow of this layer is: Convolution –> ReLU –> Pooling –> Normalization

Convolution, the input is 2 groups of 27×27×48. 2 groups are used, each with 128 convolution kernels of size 5×5×48, with edge padding=2 and a convolution stride of 1. The output FeatureMap is 2 groups, each of size 27×27×128. ((27+2×2−5)/1+1=27)

ReLU, input the FeatureMap output by the convolutional layer into the ReLU function

The size of the pooling operation is 3×3, the step size is 2, the size of the pooled image is (27−3)/2+1=13, and the output is 13×13×256

Local response normalization, using k=2, n=5, α=10^−4, β=0.75; the output is still 13×13×256, split into 2 groups, each of size 13×13×128

3. Convolution layer C3

The processing flow of this layer is: Convolution –> ReLU

Convolution, the input is 13×13×256. 384 convolution kernels of size 3×3×256 are used (computed as 2 groups of 192 across the GPUs), with edge padding=1 and a convolution stride of 1. The output FeatureMap is 13×13×384

ReLU, input the FeatureMap output by the convolutional layer into the ReLU function

4. Convolution layer C4

The processing flow of this layer is: Convolution –> ReLU

This layer is similar to C3.

Convolution, the input is 13×13×384, split into two groups of 13×13×192. 2 groups are used, each with 192 convolution kernels of size 3×3×192, edge padding=1, and a convolution stride of 1. The output FeatureMap is 13×13×384, split into two groups of 13×13×192

ReLU, input the FeatureMap output by the convolutional layer into the ReLU function

5. Convolution layer C5

The processing flow of this layer is: convolution–>ReLU–>pooling

Convolution, the input is 13×13×384, split into two groups of 13×13×192. 2 groups are used, each with 128 convolution kernels of size 3×3×192, edge padding=1, and a convolution stride of 1. The output FeatureMap is 13×13×256

ReLU, input the FeatureMap output by the convolutional layer into the ReLU function

Pooling, the size of the pooling operation is 3×3, the step size is 2, the size of the pooled image is (13−3)/2+1=6, that is, the output after pooling is 6×6×256

6. Fully connected layer FC6

The process of this layer is: (convolution) full connection -->ReLU -->Dropout

Convolution -> Full Connection: the input is 6×6×256, and this layer has 4096 convolution kernels, each of size 6×6×256. Since each kernel is exactly the same size as the feature map it processes, every coefficient in the kernel is multiplied by exactly one pixel of the input feature map, in one-to-one correspondence; this is why the layer is called fully connected. Because the kernel has the same size as the feature map, each convolution produces a single value, so the output of this layer is 4096×1×1, i.e. 4096 neurons.

ReLU, the 4096 results are passed through the ReLU activation function to produce 4096 values

Dropout, randomly deactivating (disconnecting) some neurons to suppress overfitting

7. Fully connected layer FC7

The process is: full connection–>ReLU–>Dropout

Fully connected, the input is a vector of 4096

ReLU, the 4096 results are passed through the ReLU activation function to produce 4096 values

Dropout, randomly deactivating (disconnecting) some neurons to suppress overfitting

8. Output layer

The 4096 outputs of the seventh layer are fully connected to the 1000 neurons of the eighth layer; after training, the layer outputs 1000 float values, which constitute the prediction.

Parameters and numbers of each layer of AlexNet

Parameters of a convolutional layer = number of kernels × kernel size + number of biases

C1: 96 convolution kernels of 11×11×3, 96×11×11×3+96=34944

C2: 2 groups, each with 128 convolution kernels of 5×5×48, (128×5×5×48+128)×2=307456

C3: 384 convolution kernels of 3×3×256, 3×3×256×384+384=885120

C4: 2 groups, each group has 192 convolution kernels of 3×3×192, (3×3×192×192+192)×2=663936

C5: 2 groups, each with 128 convolution kernels of 3×3×192, (3×3×192×128+128)×2=442624

FC6: 4096 convolution kernels of 6×6×256, 6×6×256×4096+4096=37752832

FC7: 4096∗4096+4096=16781312

output: 4096×1000+1000=4097000

The kernels in convolutional layers C2, C4 and C5 are connected only to the FeatureMaps of the previous layer on the same GPU. As the figures above show, most of the parameters are concentrated in the fully connected layers, while the convolutional layers have far fewer weights thanks to weight sharing.
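
The same arithmetic can be checked with a small helper (illustrative, not from the original post; note that the single-tower Keras model at the end of this article does not use grouping, so its counts for C2, C4 and C5 differ):

def conv_params(num_kernels, k_h, k_w, k_depth, groups=1):
    # Weights + biases of a (possibly grouped) convolutional layer.
    per_group = k_h * k_w * k_depth * (num_kernels // groups) + (num_kernels // groups)
    return per_group * groups

print(conv_params(96, 11, 11, 3))             # C1: 34944
print(conv_params(256, 5, 5, 48, groups=2))   # C2: 307456
print(conv_params(384, 3, 3, 256))            # C3: 885120
print(conv_params(384, 3, 3, 192, groups=2))  # C4: 663936
print(conv_params(256, 3, 3, 192, groups=2))  # C5: 442624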

Model framework structure

Since AlexNet is trained on two graphics cards, its network structure is effectively grouped; moreover, the kernels in C2, C4 and C5 are only connected to the kernels of the previous layer on the same GPU. For a single graphics card this grouping is unnecessary. The implementation in this article is based on Keras, ignores the dual-GPU grouped structure, and replaces local response normalization with batch normalization.

About the dataset

The data set used in the experiment is ImageNet. ImageNet is a dataset of more than 15 million labeled high-resolution images, with approximately 22,000 categories. The images were collected from around the web and labeled using Amazon's Mechanical Turk crowdsourcing service.

Since 2010, the ILSVRC competition has been held, using a subset of ImageNet with about 1000 images per category and 1000 categories in total. Altogether there are approximately 1.2 million training images, 50,000 validation images, and 150,000 test images. The ImageNet competition reports two error rates, top-1 and top-5. The top-5 error rate is the fraction of test images for which the correct label is not among the five categories the model considers most probable.

ImageNet consists of images of variable resolution, while the network expects a fixed input size. Therefore the images are downsampled to a fixed resolution of 256×256: each image is first rescaled so that its shorter side has length 256, and then the central 256×256 patch is cropped out. Apart from that, the images are not preprocessed in any other way; the network is trained on the raw RGB pixel values.
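
A sketch of this preprocessing using tf.image (an illustrative modern equivalent, not the original pipeline):

import tensorflow as tf

def rescale_and_center_crop(image, target=256):
    # Rescale so the shorter side is `target` pixels, then crop the central target x target patch.
    shape = tf.cast(tf.shape(image)[:2], tf.float32)
    scale = target / tf.reduce_min(shape)
    new_size = tf.cast(tf.round(shape * scale), tf.int32)
    image = tf.image.resize(image, new_size)
    return tf.image.resize_with_crop_or_pad(image, target, target)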

Training details

The model is trained with stochastic gradient descent, with 128 images per batch. The weight update formula is as follows:
v_{i+1} = 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w | w_i⟩_{D_i}
w_{i+1} = w_i + v_{i+1}

Here ⟨∂L/∂w | w_i⟩_{D_i} denotes the gradient of the objective with respect to w, evaluated at w_i and averaged over the i-th batch D_i.

Where i is the iteration index, v is the momentum, 0.9 is the momentum parameter, ε is the learning rate, and 0.0005 is the weight decay coefficient, which not only plays a role of regularization, but also reduces the training error of the model.

All weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. The biases of the 2nd, 4th and 5th convolutional layers and of all fully connected layers are initialized to 1, and the biases of the other layers are initialized to 0. The learning rate ε=0.01 is used for all layers; during training, whenever the error rate stops decreasing, the learning rate is divided by 10, and it was reduced three times before training terminated. The network was trained for roughly 90 passes over the 1.2 million training images, which took five to six days in total.
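
In Keras, a roughly equivalent training setup could look like the sketch below (illustrative only: weight decay is handled by the L2 regularizers already in the model, the plateau callback stands in for the manual "divide by 10 when the error stops improving" rule, and train_ds / val_ds are placeholder datasets):

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau

opt = SGD(learning_rate=0.01, momentum=0.9)                                  # epsilon = 0.01, momentum = 0.9
reduce_lr = ReduceLROnPlateau(monitor="val_accuracy", factor=0.1, patience=5)  # divide lr by 10 on plateau

# model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=90, callbacks=[reduce_lr])  # batch size set when building train_ds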

Keras code example

from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Activation, BatchNormalization,
                                     MaxPooling2D, Dropout, Flatten, Dense)
from tensorflow.keras.regularizers import l2


class AlexNet:
    @staticmethod
    def build(width, height, depth, classes, reg=0.0002):
        model = Sequential()
        inputShape = (height, width, depth)
        chanDim = -1

        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1

        # Block 1 (C1): Conv -> ReLU -> BN -> overlapping pool (BN replaces the original LRN)
        model.add(Conv2D(96, (11, 11), strides=(4, 4), input_shape=inputShape,
                         padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))

        # Block 2 (C2): Conv -> ReLU -> BN -> overlapping pool
        model.add(Conv2D(256, (5, 5), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))

        # Block 3 (C3-C5): three stacked conv layers followed by pooling
        model.add(Conv2D(384, (3, 3), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(384, (3, 3), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(256, (3, 3), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))            # missing in the original listing; C5 is conv -> ReLU -> pool
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))

        # FC6: fully connected -> ReLU -> Dropout
        model.add(Flatten())
        model.add(Dense(4096, kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.25))

        # FC7: fully connected -> ReLU -> Dropout
        model.add(Dense(4096, kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.25))

        # Output layer: fully connected -> softmax over the class labels
        model.add(Dense(classes, kernel_regularizer=l2(reg)))
        model.add(Activation("softmax"))

        return model
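
For completeness, a minimal usage sketch of the class above (the compile settings here are illustrative):

model = AlexNet.build(width=227, height=227, depth=3, classes=1000)
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()   # prints per-layer output shapes and parameter counts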

Reference:
https://blog.csdn.net/lcczzu/article/details/91991725
https://www.cnblogs.com/wangguchangqing/p/10333370.html
https://www.cnblogs.com/zyly/p/8781224.html
https://blog.csdn.net/chaipp0607/article/details/72847422
