Plain-Language Super-Detailed Explanation (2) ----- AlexNet

1. Introduction to AlexNet

In 2012, Alex Krizhevsky proposed AlexNet, which can be regarded as a deeper and wider version of LeNet and can be used to learn more complex objects.
Key points of AlexNet:

  • 1. Use the ReLU nonlinear function as the activation function to introduce nonlinearity
  • 2. Use the dropout trick to randomly ignore some hidden-layer neurons during training, which slows down overfitting of the model
  • 3. Use overlapping max pooling for the pooling operations, avoiding the blurring (averaging) effect of average pooling
  • 4. Train on GPUs to reduce training time; a GPU is roughly 10 times faster than CPU processing, so larger datasets and images become practical

2. AlexNet details

2.1 Local response normalization (LRN)

Recommended reading: https://blog.csdn.net/program_developer/article/details/79430119
There is a concept of "lateral inhibition" in neurobiology, which means that an activated neuron suppresses its neighboring neurons. The purpose of normalization here is exactly this kind of "inhibition": local response normalization borrows the idea of lateral inhibition to achieve local suppression. This is especially useful when we use ReLU.
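As a minimal sketch (not the original implementation), PyTorch's nn.LocalResponseNorm can be used with the hyperparameters reported in the AlexNet paper (n=5, k=2, alpha=1e-4, beta=0.75); the input tensor below is made up for illustration:

import torch
import torch.nn as nn

lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)  # paper hyperparameters
x = torch.relu(torch.randn(1, 96, 55, 55))  # e.g. activations after the first conv + ReLU
y = lrn(x)
print(y.shape)  # torch.Size([1, 96, 55, 55]); the shape is unchanged, only the values are rescaled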

2.2 Data Augmentation

There is a view that neural networks are fed by data: if you can increase the amount of training data and provide massive data for training, you can effectively improve the accuracy of the algorithm, because more data effectively reduces overfitting, which in turn allows the network to be made larger and deeper. When training data is limited, new data can be generated from the existing training set through transformations, quickly expanding the training data.
The simplest and most widely used image transformations are: horizontal flipping, random cropping from the original image, translation, color transformation, and illumination transformation.
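As a rough illustration (not AlexNet's exact pipeline), several of these transformations can be expressed with torchvision.transforms; the crop size and jitter strengths below are arbitrary:

from PIL import Image
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(224),         # random cropping from the original image
    transforms.RandomHorizontalFlip(),  # horizontal flipping
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color / illumination changes
    transforms.ToTensor(),
])
img = Image.new("RGB", (256, 256))      # dummy image standing in for a real photo
print(train_transform(img).shape)       # torch.Size([3, 224, 224])
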
AlexNet handles data augmentation as follows:

  • (1) Random cropping: 256×256 images are randomly cropped to 224×224 and then horizontally flipped, which increases the number of samples by a factor of ((256-224)^2)*2 = 2048.
  • (2) At test time, five crops are taken (upper left, upper right, lower left, lower right, and center), each is also flipped for a total of 10 crops, and the results are averaged. The authors note that without random cropping, large networks basically overfit.
  • (3) PCA (Principal Component Analysis) is performed on the RGB pixel values, and a Gaussian perturbation drawn from N(0, 0.1) is applied along the principal components, i.e., the color and illumination are perturbed. This reduces the error rate by 1%.
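The ten-crop evaluation in (2) can be sketched with torchvision's TenCrop transform (four corners + center, each also flipped), averaging the ten predictions; the blank image and the untrained torchvision AlexNet below are placeholders:

import torch
import torchvision
from torchvision import transforms
from PIL import Image

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # 4 corners + center, plus their horizontal flips
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])
img = Image.new("RGB", (256, 256))
crops = ten_crop(img)                                  # [10, 3, 224, 224]
model = torchvision.models.alexnet(pretrained=False).eval()
with torch.no_grad():
    avg_logits = model(crops).mean(dim=0)              # average the 10 predictions
print(avg_logits.shape)                                # torch.Size([1000])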

2.3 Dropout

Recommended reading: https://blog.csdn.net/program_developer/article/details/80737724

2.3.1 Introduction to Dropout

The main purpose of introducing dropout is to prevent overfitting. In a neural network, dropout is implemented by modifying the structure of the network itself: each neuron in a given layer is set to 0 with a defined probability (usually 0.5), so that it takes no part in forward or backward propagation, as if it had been deleted from the network, while the numbers of neurons in the input and output layers stay unchanged. The parameters are then updated according to the usual learning procedure. In the next iteration, a new random set of neurons is deleted (set to 0), and this repeats until training ends.
Dropout should be regarded as a major innovation in AlexNet. Dropout can also be viewed as a form of model combination: the network structure generated in each iteration is different, and combining these many models effectively reduces overfitting. Dropout achieves this model-combination effect at only about twice the training time, which is very efficient.
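A minimal sketch of dropout between fully connected layers, in the spirit of AlexNet's classifier (the layer sizes here are illustrative, not the real AlexNet dimensions):

import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden activation is zeroed with probability 0.5 during training
    nn.Linear(128, 10),
)
x = torch.randn(4, 256)
classifier.train()           # dropout active: a different random subnetwork on each forward pass
print(classifier(x).shape)   # torch.Size([4, 10])
classifier.eval()            # dropout disabled at inference time
print(classifier(x).shape)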

2.3.2 The specific workflow of Dropout

Suppose we want to train a standard fully connected neural network whose input is x and output is y. The normal process is: we first propagate x forward through the network, then propagate the error backward to decide how to update the parameters so that the network learns. After introducing Dropout, the process becomes the following:

  • 1. First, randomly (and temporarily) delete half of the hidden neurons in the network, keeping the input and output neurons unchanged
  • 2. Then propagate the input x forward through the modified network, and propagate the resulting loss backward through the modified network. After running this process on a mini-batch of training samples, update the parameters (w, b) of the neurons that were not deleted using stochastic gradient descent.
  • Then repeat the following process:
    • Restore the temporarily deleted neurons (the deleted neurons return to their original state, while the neurons that were not deleted keep their updated parameters)
    • Randomly select another half-sized subset of the hidden-layer neurons and temporarily delete them (backing up the parameters of the deleted neurons)
    • For a mini-batch of training samples, run the forward pass, backpropagate the loss, and update the parameters (w, b) with stochastic gradient descent (only the neurons that were not deleted are updated; the deleted neurons keep their parameters from before deletion)
  • Keep repeating this process.
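A minimal sketch of this loop for a single hidden layer, assuming a keep probability of 0.5; the layer sizes and data are made up. Because deleted neurons output 0, the weights attached to them receive zero gradient, so only the surviving neurons are updated, matching the description above:

import torch
import torch.nn as nn

torch.manual_seed(0)
fc1, fc2 = nn.Linear(20, 50), nn.Linear(50, 1)
opt = torch.optim.SGD(list(fc1.parameters()) + list(fc2.parameters()), lr=0.1)
for step in range(3):                                # each iteration "deletes" a fresh random half
    x, target = torch.randn(8, 20), torch.randn(8, 1)
    h = torch.relu(fc1(x))
    mask = (torch.rand(h.shape[1]) < 0.5).float()    # 0 marks a temporarily deleted hidden neuron
    h = h * mask                                     # deleted neurons output 0 in the forward pass
    loss = ((fc2(h) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()                                  # weights tied to deleted neurons get zero gradient
    opt.step()                                       # so only the surviving neurons are updated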

2.3.3 Use of Dropout in neural networks

(1) In the training phase, a probabilistic step is added to each unit of the training network.
The corresponding formula changes are as follows:

  • Network calculation formulas without Dropout:
    z_i^(l+1) = w_i^(l+1) · y^(l) + b_i^(l+1),  y_i^(l+1) = f(z_i^(l+1))
  • Network calculation formulas with Dropout:
    r_j^(l) ~ Bernoulli(p),  ỹ^(l) = r^(l) * y^(l),  z_i^(l+1) = w_i^(l+1) · ỹ^(l) + b_i^(l+1),  y_i^(l+1) = f(z_i^(l+1))
    The Bernoulli function in the formulas above generates a random vector r of 0s and 1s, i.e., a random binary mask.
    At the code level, making a neuron "stop working" with a given probability simply means setting its activation value to 0 with that probability. For example, if a layer of the network has 1000 neurons whose activation outputs are y1, y2, y3, ..., y1000, and we set the dropout ratio to 0.5, then after this layer goes through Dropout, roughly 500 of these neurons will be set to 0.
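A tiny numerical check of this description (the activation values are random placeholders):

import torch

y = torch.randn(1000)                                   # activations y1 ... y1000
keep_prob = 0.5                                         # dropout ratio 0.5
mask = torch.bernoulli(torch.full_like(y, keep_prob))   # random vector r of 0s and 1s
y_dropped = y * mask
print(int((y_dropped == 0).sum()))                      # roughly 500 activations become 0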

(2) In the test phase, when using the model for prediction, the weight parameters of each neural unit are multiplied by the retention probability p.
Dropout formula in the test phase: w_test^(l) = p * W^(l)
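A minimal sketch of this test-time rule, assuming the "vanilla" dropout formulation in which activations are not rescaled during training (the layer and input are made up). Note that PyTorch's nn.Dropout instead uses "inverted" dropout: it rescales the kept activations by 1/keep-probability during training, so no scaling is needed at test time.

import torch
import torch.nn as nn

p = 0.5                    # retention probability used during training
fc = nn.Linear(1000, 10)   # layer whose 1000 inputs were subject to dropout during training
with torch.no_grad():
    fc.weight.mul_(p)      # w_test = p * w: scale the outgoing weights by p
x = torch.randn(1, 1000)
print(fc(x).shape)         # torch.Size([1, 10])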

2.4 Overlapping Pooling

Ordinary pooling does not overlap: the window size of the pooling region is the same as the stride.
The pooling used in AlexNet overlaps: when pooling, the stride of each move is smaller than the pooling window size. AlexNet uses a 3*3 pooling window with a stride of 2, so adjacent windows overlap. Overlapping pooling helps avoid overfitting; this strategy contributes a 0.3% reduction in the Top-5 error rate.
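A minimal sketch of overlapping pooling with PyTorch's nn.MaxPool2d (the input tensor is made up):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2)  # 3x3 window moved by 2, so neighboring windows overlap
x = torch.randn(1, 96, 55, 55)                # e.g. feature maps after the first conv layer
print(pool(x).shape)                          # torch.Size([1, 96, 27, 27]); (55 - 3) // 2 + 1 = 27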

3. AlexNet network structure

3.1 Overall network structure

The AlexNet model is introduced as follows:

  • 1. AlexNet has 8 layers, consisting of 5 convolutional layers and 3 fully connected layers
  • 2. Each layer uses the ReLU function as the activation function
  • 3. AlexNet has a special computational layer, the LRN layer, and pioneered the use of LRN (local response normalization)

The torchvision package that ships with PyTorch contains an official implementation of AlexNet. We can use it directly to look at the general structure of the network:
import torchvision
model = torchvision.models.alexnet(pretrained=False)
print(model)


3.2 Calculation

The structure of each layer and the calculation of its input and output sizes can be worked out from the layer hyperparameters, as sketched below.
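The spatial output size of a convolution or pooling layer follows output = floor((input + 2*padding - kernel) / stride) + 1; as a small sketch, applying it to the first two layers of the torchvision AlexNet printed above (Conv2d with kernel 11, stride 4, padding 2, followed by the 3×3, stride-2 max pool):

def out_size(size, kernel, stride, padding=0):
    # out = (in + 2*padding - kernel) // stride + 1
    return (size + 2 * padding - kernel) // stride + 1

conv1 = out_size(224, kernel=11, stride=4, padding=2)  # 224x224 input -> 55x55 feature maps
pool1 = out_size(conv1, kernel=3, stride=2)            # overlapping 3x3 / stride-2 pooling -> 27x27
print(conv1, pool1)                                    # 55 27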
