[Image Classification] [Deep Learning] [Pytorch Version] Detailed Explanation of AlexNet Model Algorithm

Preface

AlexNet was proposed by Alex Krizhevsky and colleagues at the University of Toronto in "ImageNet Classification with Deep Convolutional Neural Networks" (NIPS 2012) [paper address]. The core idea of the model is to use a deep convolutional neural network for image classification and to apply dropout regularization to reduce overfitting.
The emergence of AlexNet marked the revival of neural networks and the rise of deep learning: it demonstrated the effectiveness of deep neural networks on large-scale image classification tasks and gave strong momentum to subsequent deep learning research.


AlexNet Explained

The role of convolutional layers

In a convolutional neural network, the main function of the convolutional layer is to extract features from the input data through convolution operations. It uses a set of learnable convolution kernels (also called filters or convolution matrices), convolving each kernel with the input data to generate an output feature map. Convolution is key to extracting higher-level abstract features and improving the performance of the model.

The following are several important meanings of convolutional layers in deep learning:

  1. Feature extraction: The convolutional layer learns to extract local features of the input data. By convolving kernels with the input, it can identify features such as edges, textures, and shapes in an image.
  2. Parameter sharing: In a convolutional layer, the parameters of each convolution kernel are shared across every location of the input data. This greatly reduces the number of parameters in the network, improves the efficiency of the model, and makes it more robust and generalizable (see the parameter-count comparison after this list).
  3. Spatial locality: The convolutional layer processes input data through local receptive fields. Each neuron connects only to a small local region of the input rather than being globally connected, which better captures the spatial locality of the input data.
  4. Downsampling: Convolutional layers are usually combined with pooling layers to perform downsampling. Pooling layers reduce the size of feature maps and extract more abstract features.

By stacking multiple convolutional layers, a deep convolutional neural network (CNN) can be built to solve various computer vision tasks (image classification, object detection, and image segmentation).
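To make the parameter-sharing point concrete, the following PyTorch comparison (an illustrative sketch, not from the original post) contrasts the parameter count of a small convolutional layer with that of a fully connected layer mapping between the same tensor shapes:

```python
import torch.nn as nn

# A 3x3 conv mapping a 3-channel 32x32 image to 16 channels:
# the same kernel weights are reused at every spatial position.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())  # 3*16*3*3 + 16 = 448

# A fully connected layer between the same flattened shapes:
# every input pixel gets its own weight for every output unit.
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)
fc_params = sum(p.numel() for p in fc.parameters())  # ~50 million

print(conv_params, fc_params)
```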

The convolution process

In a convolutional neural network, what the convolutional layer actually implements is the cross-correlation operation defined in mathematics, which is the basic operation of the convolutional layer. The calculation process is shown in the figure:

The light purple matrix on the left represents the 7×7 input image (or feature map); the pink matrix in the middle represents the 3×3 convolution kernel; and the light cyan matrix on the right represents the feature map output by the convolutional layer.

The steps of the convolution operation are as follows:

  1. Take the element-wise product of the convolution kernel and a local region of the input matrix;
  2. Sum the products to obtain one element of the output matrix;
  3. Move the convolution kernel by one stride over the input matrix and repeat the above steps until the entire input matrix is covered;
  4. Repeat this process for each convolution kernel, so that each kernel produces its own output matrix (see the sketch below);
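The steps above can be sketched in a few lines of NumPy (a minimal single-channel, no-padding, illustrative implementation; the name `cross_correlate` is not from the original post):

```python
import numpy as np

def cross_correlate(x, k, stride=1):
    """Valid cross-correlation of a 2D input x with a 2D kernel k."""
    kh, kw = k.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Steps 1-2: element-wise product with a local region, then sum.
            region = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (region * k).sum()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)  # 7x7 input, as in the figure above
k = np.ones((3, 3))                           # 3x3 kernel
print(cross_correlate(x, k).shape)            # (5, 5)
```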

The convolution process above is the mathematical operation that convolution performs in deep learning. The table below lists the important concepts involved.

| Concept | Description |
| --- | --- |
| Convolution kernel | The values in the kernel are the weights applied to image sub-blocks of the same size as the kernel during the convolution calculation. |
| Convolution calculation | Pixels in an image have strong spatial dependence; convolution processes the image based on this spatial dependence. |
| Feature map | The output of a convolution kernel. |
| Padding | Extra border elements (usually 0) added around the edges of the input matrix to control the size of the output matrix. |
| Stride | The distance the convolution kernel moves each step over the input matrix. |

Feature map size calculation formula

The size of the output feature map (matrix) of the convolutional layer is determined by the following factors:

  • Input matrix size: $F_{h\_prev} \times F_{w\_prev}$
  • Convolution kernel size: $k_h \times k_w$
  • Stride: $s$
  • Padding: $pad$

The basic formula is:
Output feature map height: $F_h = \frac{F_{h\_prev} - k_h + 2 \times pad}{s} + 1$; output feature map width: $F_w = \frac{F_{w\_prev} - k_w + 2 \times pad}{s} + 1$.
The following shows a 6×6 input feature matrix with padding of 1 (8×8 after padding). After a 3×3 convolution kernel with a stride of 2, an output feature matrix of size 3×3 is produced:
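The formula can be sanity-checked directly in PyTorch (a minimal sketch matching the example above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 6, 6)                                 # one 6x6 input channel
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)  # k=3, s=2, pad=1
print(conv(x).shape)  # torch.Size([1, 1, 3, 3]): (6 - 3 + 2*1) // 2 + 1 = 3
```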

Similarly, the size of the pooling layer output feature map (matrix) is determined by the following factors:

  • Input matrix size: $F_{h\_prev} \times F_{w\_prev}$
  • Pooling kernel size: $k_h \times k_w$
  • Stride: $s$

The pooling layer applies no padding. Its basic formula is:
Output feature map height: $F_h = \frac{F_{h\_prev} - k_h}{s} + 1$; output feature map width: $F_w = \frac{F_{w\_prev} - k_w}{s} + 1$.
The following shows a 4×4 input feature matrix. After a 3×3 pooling kernel with a stride of 1, an output feature matrix of size 2×2 is produced:

The size of the pooling kernel in a convolutional neural network is usually 2×2
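The pooling formula can be checked the same way (a minimal sketch matching the 4×4 example above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)                   # one 4x4 input channel
pool = nn.MaxPool2d(kernel_size=3, stride=1)  # k=3, s=1, no padding
print(pool(x).shape)  # torch.Size([1, 1, 2, 2]): (4 - 3) // 1 + 1 = 2
```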

The role of Dropout

Dropout is a regularization technique used in deep neural networks. Its core idea is to randomly drop (zero out) neurons during training, that is, to randomly turn off some neurons (or nodes) and their connections to the next layer, in order to reduce co-adaptation between neurons and thereby improve the generalization ability of the model.

Features of Dropout:

  • Randomly turning off neurons: Each neuron is turned off with the same probability, set by a hyperparameter.
  • Off during forward propagation: During forward propagation, the output of a dropped neuron is set to zero.
  • Off during backpropagation: During backpropagation, the weights of dropped neurons are not adjusted; only the neurons that remained active participate in the weight update.
  • Ensemble learning effect: Dropout acts as a form of ensemble learning, because each iteration randomly creates a different sub-model, namely the network with some neurons dropped. Averaging the predictions of these many sub-models reduces the model's variance and improves generalization performance.
  • Reduced co-adaptation: Neurons in a neural network tend to adapt to one another and process the input jointly through each other's activation patterns. This co-adaptation can lead to overfitting, i.e. performing well on training data but generalizing poorly to test data, especially when the training data is sparse or small. By randomly discarding neurons, dropout forces the network to learn the data with different subsets of neurons, which reduces co-adaptation and helps improve generalization (see the sketch below).
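A minimal sketch of this behavior in PyTorch (illustrative, not from the original post): `nn.Dropout` zeroes activations at random in training mode and is a no-op in evaluation mode; PyTorch additionally rescales the surviving activations by $1/(1-p)$ during training so the expected output magnitude stays unchanged:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: roughly half of the elements are zeroed,
print(drop(x))  # and the survivors are scaled by 1 / (1 - p) = 2.0

drop.eval()     # evaluation mode: dropout is disabled
print(drop(x))  # all ones, unchanged
```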

AlexNet model structure

The following figure is a detailed schematic diagram of the AlexNet model structure, drawn by the blogger based on the original paper:

| layer_name | kernel_size | kernel_num | padding | stride | input_size |
| --- | --- | --- | --- | --- | --- |
| Conv1 | 11 | 96 | 2 | 4 | 3×224×224 |
| Maxpool1 | 3 | - | 0 | 2 | 96×55×55 |
| Conv2 | 5 | 256 | 2 | 1 | 96×27×27 |
| Maxpool2 | 3 | - | 0 | 2 | 256×27×27 |
| Conv3 | 3 | 384 | 1 | 1 | 256×13×13 |
| Conv4 | 3 | 384 | 1 | 1 | 384×13×13 |
| Conv5 | 3 | 256 | 1 | 1 | 384×13×13 |
| Maxpool3 | 3 | - | 0 | 2 | 256×13×13 |
| FC1 | - | 2048 | - | - | 256×6×6 |
| FC2 | - | 2048 | - | - | 2048×1×1 |
| FC3 | - | 1000 | - | - | 2048×1×1 |

AlexNet can be divided into two parts: the first part (the backbone) consists mainly of convolutional and pooling layers, and the second part (the classifier) consists of fully connected layers.

| Highlights of AlexNet | Description |
| --- | --- |
| ReLU activation function | Alleviates the vanishing-gradient problem that Sigmoid suffers from in deep networks and converges faster than tanh during training. |
| Overlapping pooling | The pooling stride is smaller than the kernel size, so outputs overlap; this lets information interact, preserves the necessary connections between adjacent pixels, enriches the features, and avoids the blurring effect of average pooling. |
| Dropout | Sets the output of each hidden-layer neuron to 0 with probability 0.5, selectively removing some neurons; this reduces complex co-adaptation between neurons and prevents overfitting. |

AlexNet PyTorch code

Backbone part

# Backbone part
# Convolutional layer group: Conv2d + ReLU
self.features = nn.Sequential(
            # input[3, 224, 224]  output[96, 55, 55]
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            # input[96, 55, 55] output[96, 27, 27]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # input[96, 27, 27] output[256, 27, 27]
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            # input[256, 27, 27] output[256, 13, 13]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # input[256, 13, 13] output[384, 13, 13]
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input[384, 13, 13] output[384, 13, 13]
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input[384, 13, 13] output[256, 13, 13]
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input[256, 13, 13] output[256, 6, 6]
            nn.MaxPool2d(kernel_size=3, stride=2),
        )

Classifier part

# Classifier part: Dropout + FC + ReLU
self.classifier = nn.Sequential(
	# Selectively set hidden-layer neuron outputs to zero with probability 0.5
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 2048),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(2048, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, num_classes),
)

Complete code

import torch.nn as nn
import torch
from torchsummary import summary

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000, init_weights=False):
        super(AlexNet, self).__init__()
        # Backbone part
        # Convolutional layer group: Conv2d + ReLU
        self.features = nn.Sequential(
            # input[3, 224, 224]  output[96, 55, 55]
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            # input[96, 55, 55] output[96, 27, 27]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # input[96, 27, 27] output[256, 27, 27]
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            # input[256, 27, 27] output[256, 13, 13]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # input[256, 13, 13] output[384, 13, 13]
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input[384, 13, 13] output[384, 13, 13]
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input[384, 13, 13] output[256, 13, 13]
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input[256, 13, 13] output[256, 6, 6]
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # 分类器部分:Dropout+FC+ReLU
        self.classifier = nn.Sequential(
            # Selectively set hidden-layer neuron outputs to zero with probability 0.5
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 2048),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(2048, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, num_classes),
        )
        # Initialize the model weights
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # Conv2d weights: normal distribution with mean 0 and std sqrt(2 / fan_out)
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    # Conv2d bias set to 0
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                # FC weights: normal distribution with the specified mean and std
                nn.init.normal_(m.weight, 0, 0.01)
                # FC bias set to 0
                nn.init.constant_(m.bias, 0)

if __name__ == '__main__':
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = AlexNet().to(device)
    summary(model, input_size=(3, 224, 224))              

summary prints the network structure and parameter counts, making it easy to inspect the constructed network.
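If torchsummary is not available, the parameter count alone can be obtained with plain PyTorch (a minimal sketch, meant to be appended to the `__main__` block above):

```python
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total:,}  Trainable params: {trainable:,}")
```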


Summary

This article introduced the principle of convolution and the convolution process as simply and thoroughly as possible, and explained the structure of the AlexNet model together with its PyTorch code.

Origin blog.csdn.net/yangyu0515/article/details/134242680