Image feature extraction (detailed explanation of the convolution process of VGG and ResNet feature extraction)

Chapter 1 Image Feature Extraction Cognition

1.1 Principles and performance of common algorithms

As we all know, computers do not perceive images directly; they only process numbers. To enable a computer to "understand" images and thus gain real "vision", this chapter studies how to extract useful data or information from an image and obtain a "non-image" representation or description of it, such as numerical values, vectors, and symbols. This process is feature extraction, and the extracted "non-image" representations or descriptions are the features. With these numerical or vector features, we can teach the computer to understand them through training, giving it the ability to recognize images.


1.2 What are image features

Features are the (essential) characteristics that distinguish one class of objects from other classes, or a collection of such characteristics. A feature is data that can be extracted through measurement or processing. Every image has characteristics that distinguish it from images of other classes. Some are natural features that can be perceived intuitively, such as brightness, edges, texture, and color; others can only be obtained through transformation or processing, such as moments, histograms, and principal components.

1.3 Eigenvectors and their geometric interpretation

We often combine multiple characteristics of a class of objects into a feature vector to represent that class. If there is only a single numerical feature, the feature vector is one-dimensional; a combination of n characteristics gives an n-dimensional feature vector. Such feature vectors are commonly used as the input of a recognition system. In fact, an n-dimensional feature vector is a point in an n-dimensional space, and the task of recognition and classification is to find a partition of that space. For example, to distinguish three different iris species, you can choose sepal length and sepal width as features, so that a 2-dimensional feature vector such as (5.1, 3.5) represents one plant; adding petal length and petal width represents each iris by a 4-dimensional feature vector, such as (5.1, 3.5, 1.4, 0.2).
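As a minimal sketch of this geometric view, the snippet below (with illustrative values only) represents two iris samples as points in 4-D feature space and measures how far apart they lie:

```python
# Each iris sample is a point in 4-D feature space; classification amounts
# to partitioning that space. Values are illustrative, in the order
# (sepal length, sepal width, petal length, petal width).
import numpy as np

sample_a = np.array([5.1, 3.5, 1.4, 0.2])  # one iris as a 4-D feature vector
sample_b = np.array([6.3, 3.3, 6.0, 2.5])  # a second, hypothetical sample

# Similar objects lie close together in feature space; Euclidean distance
# between the two points is one simple measure of that.
print(np.linalg.norm(sample_a - sample_b))
```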

(Figures: iris feature measurements and samples plotted as points in feature space; images unavailable.)

1.4 General Principles of Feature Extraction

Image recognition is essentially a classification process: to identify the category of an image, we must distinguish it from images of other categories. This requires that the selected features not only describe the images well but, more importantly, separate images of different categories well. We want features that vary little between images of the same class (small intra-class distance) and vary greatly between images of different classes (large inter-class distance); such features are called the most discriminative features. In addition, prior knowledge plays an important role in feature extraction, and how to use prior knowledge to guide feature selection is a recurring concern later on.

Chapter 2 Common Feature Extraction Algorithms

2.1 Common image feature extraction algorithms

SIFT
HOG
ORB
HAAR
Deep Learning (neural network feature extraction)

SIFT (Scale Invariant Feature Transform)

The essence of SIFT feature extraction is to find key points (feature points) across different scale spaces and compute their orientations. The key points SIFT finds are highly salient points that do not change under lighting, affine transformation, noise, and similar factors, such as corner points, edge points, bright spots in dark areas, and dark spots in bright areas.
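A minimal OpenCV sketch of SIFT in practice (assuming opencv-python 4.4+, where SIFT is available in the main module, and a placeholder image path):

```python
# Detect SIFT keypoints and compute their descriptors.
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint has a location, scale, and orientation; each descriptor
# is a 128-dimensional vector characterizing the local gradient pattern.
print(len(keypoints), descriptors.shape)  # e.g. N, (N, 128)
```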

HOG (Histogram of Oriented Gradients)

The essence of HOG feature extraction is to build features by computing and accumulating histograms of gradient orientations over local regions of the image. HOG features combined with an SVM classifier have been widely used in image recognition, with particular success in pedestrian detection.
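A similar sketch for HOG, using OpenCV's HOGDescriptor with its default 64×128 person-detection window (the image path is a placeholder):

```python
# Compute a HOG descriptor over the default 64x128 window.
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (64, 128))   # default HOGDescriptor window size
hog = cv2.HOGDescriptor()          # 9 bins, 8x8 cells, 16x16 blocks
feature = hog.compute(img)         # flattened block-normalized histograms
print(feature.size)                # 3780 for the 64x128 default window
```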

Comparison of SIFT and HOG

What they have in common: both build features from histograms of gradient orientations in the image.
Differences: SIFT features are usually used together with interest points obtained by the SIFT detector. Each interest point is associated with a specific orientation and scale, and the SIFT descriptor of a square region is computed after transforming it to that orientation and scale. HOG cells are small, so some spatial resolution is preserved, and block normalization makes the feature insensitive to local contrast changes.

ORB

The ORB descriptor is much faster to compute than SIFT and can be used for real-time feature detection. ORB builds on FAST corner detection and description; it is invariant to scale and rotation and also robust to noise and perspective/affine changes. This good performance makes ORB applicable to a very wide range of feature-description scenarios.
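And the ORB counterpart, under the same assumptions as the sketches above:

```python
# Detect ORB keypoints (FAST corners + rotated BRIEF descriptors).
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # binary descriptors: (N, 32) bytes
```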

HAAR

The most classic algorithm for face detection is Haar-like features + AdaBoost. It is the most commonly used object-detection method (originally designed for face detection) and remains the most widely applied.
Training process: input image -> image preprocessing -> extract features -> train classifier (binary classification) -> obtain trained model. Test process: input image -> image preprocessing -> extract features -> load model -> binary classification (is it the object to be detected or not).
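A minimal sketch of that test process, using the pretrained Haar cascade shipped with opencv-python (the image path is a placeholder):

```python
# Classic Haar + AdaBoost face detection with OpenCV's bundled cascade.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
img = cv2.imread("example.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is the binary decision "face / not face" applied to a
# sliding window at multiple scales, exactly the test process above.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(faces)  # array of (x, y, w, h) boxes
```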

Chapter 3 Deep Learning to Extract Features

3.1 Neural Network Extraction of Image Features

Small graphic patches can be composed from basic edges; more structured, complex patterns require higher-level feature representations, and high-level representations are built from low-level ones. As shown in the figure, these are the features extracted by each layer of a neural network. Since they are learned automatically by the network, this is also an unsupervised feature-learning process. Intuitively, the network finds small patches that "make sense", combines them to obtain the features of the next layer, and learns features recursively upward. When training on different objects, the resulting edge bases are very similar, but the object parts and models are completely different.

(Figure: features learned by successive layers of a neural network; image unavailable.)

3.2 Understanding VGG network and Resnet network

VGGNet : In 2014, the Visual Geometry Group at the University of Oxford, together with researchers from Google DeepMind, developed a new deep convolutional neural network, VGGNet, which took second place in the ILSVRC 2014 classification task (first place went to GoogLeNet, proposed the same year) and first place in the localization task. Clearly, VGGNet performs very well. VGGNet explored the relationship between the depth of a convolutional network and its performance, successfully built 16- to 19-layer deep convolutional networks, and showed that increasing network depth affects final performance to a certain extent, substantially reducing the error rate. At the same time, it is highly extensible and generalizes well when transferred to other image data.

Resnet : ResNet won the 2015 ImageNet competition. Its main innovation is a residual structure built with skip connections, which lets the network reach great depth while improving performance. As the number of layers increases, a plain neural network exhibits a degradation problem: the deeper network performs worse than a shallower one, and this is not caused by overfitting, because the gap already appears on the training set. Hence the intuition: if the layers of a network can easily realize an identity mapping between layers, i.e. the input of a block equals its output, then a deeper network should perform no worse than a shallower one and there should be no degradation problem. Dr. Kaiming He et al. therefore designed the skip-connection structure, which gives the network a stronger identity-mapping capability, allowing the depth of the network to be extended and its performance improved.

3.3 VGG16

(Figure: VGG16 architecture; image unavailable.)

VGG16 consists of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. 13 + 3 = 16, hence the name VGG16.
The 5 pooling layers are not counted because they perform no learnable computation; why they are not counted is explained in the later section on pooling. The diagram shows the features extracted from an image as it passes through the multi-layer network.

3.4 VGG network series

(Figure: configurations of the VGG network series; image unavailable.)

3.5.1 How Convolutional Layers Work

Three-dimensional view of convolution

(Figure: three-dimensional view of a multi-channel convolution; image unavailable.)

A color image has three RGB channels: a 6×6 image has dimensions (6×6×3), where 3 is the number of channels. To convolve such an image, the filter's channel count must match. For example, a filter of dimension (3×3×3) produces a convolved output of dimension (4×4); with two such filters the final output dimension is (4×4×2), and in general, with y filters of this dimension, the convolved output is (4×4×y).
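A quick PyTorch check of this shape arithmetic (a sketch, not code from the original post):

```python
# A 6x6 RGB input, two 3x3x3 filters, stride 1, no padding -> 4x4x2 output.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 6, 6)                     # (batch, channels, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=2,
                 kernel_size=3, stride=1, padding=0, bias=False)
y = conv(x)
print(y.shape)                                  # torch.Size([1, 2, 4, 4])
```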

3.5.2 Plane Diagram of Convolution Process

(Figures: step-by-step 2-D view of the sliding-window convolution; images unavailable.)

Convolutional layer calculation : the computation is an element-wise product. At each sliding position, each element of the convolution kernel is multiplied by the corresponding element of the input, and the results are summed to produce one unit of the output feature map.
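A minimal NumPy sketch of this rule (strictly speaking, cross-correlation, which is what deep-learning frameworks implement as "convolution"):

```python
# Slide the kernel over the input; at each position, multiply element-wise
# and sum to produce one output unit.
import numpy as np

def conv2d(x, k):
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)  # dot product
    return out

x = np.arange(16).reshape(4, 4).astype(float)
k = np.array([[1., 0.], [0., -1.]])   # a toy 2x2 kernel
print(conv2d(x, k))                   # (3, 3) output feature map
```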

Changes in shape and size before and after convolution :

(Figure: input and output shapes before and after convolution; image unavailable.)

3.5.3 How the pooling layer works

There are usually two kinds of pooling: max pooling and average pooling; VGG16 uses max pooling. The pooling layer has no learnable parameters. Its role is to reduce the spatial dimensions by taking the maximum over each window as it moves, which is why VGG16 counts 13 + 3 layers rather than 13 + 3 + 5. VGG16 uses a 2×2 window with a stride of 2.
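A small PyTorch check of this behavior:

```python
# VGG16-style max pooling: a 2x2 window with stride 2 halves the spatial
# size and keeps the largest value in each window. No learnable parameters.
import torch
import torch.nn as nn

x = torch.tensor([[[[1.,  2.,  5.,  6.],
                    [3.,  4.,  7.,  8.],
                    [9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))  # tensor([[[[ 4.,  8.], [12., 16.]]]])
```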

(Figures: max pooling with a 2×2 window and stride 2; images unavailable.)

3.5.4 Working principle of fully connected layer

A fully connected layer can be viewed as a special convolutional layer. Each node of a fully connected layer is connected to every node of the previous layer, integrating the features output by that layer, so this layer has the most weight parameters. In VGG16, for example, the first fully connected layer FC1 has 4096 nodes, and the preceding layer POOL2 has 7×7×512 = 25088 nodes. Since the fully connected layer is a special case of the convolutional layer, the POOL2-to-FC1 connection can be expressed as a convolution: arrange POOL2's outputs as a vector of 25088 dimensions, each of size 1×1, and use a convolution with num_filters = 4096, channel = 25088, kernel_size = 1, stride = 1, and no padding.
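A sketch verifying this equivalence in PyTorch (hypothetical tensors; the two layers share the same weights):

```python
# Flatten POOL2's 7x7x512 = 25088 outputs to a 25088x1x1 tensor; a 1x1
# convolution with 4096 filters then computes exactly what a
# 25088 -> 4096 fully connected layer does.
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)                   # POOL2 output
flat = x.reshape(1, 512 * 7 * 7, 1, 1)          # 25088 "channels", each 1x1

fc = nn.Linear(25088, 4096, bias=False)
conv = nn.Conv2d(25088, 4096, kernel_size=1, stride=1, padding=0, bias=False)
conv.weight.data = fc.weight.data.reshape(4096, 25088, 1, 1)  # share weights

y_fc = fc(x.reshape(1, -1))
y_conv = conv(flat).reshape(1, -1)
print(torch.allclose(y_fc, y_conv, atol=1e-5))  # True
```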

(Figures: VGG16's fully connected layers viewed as convolutions; images unavailable.)

3.5.5 VGG16 Network Performance Evaluation

VGG advantages :
The structure of VGGNet is very simple: the entire network uses the same convolution kernel size (3x3) and max pooling size (2x2).
A stack of several small-filter (3x3) convolutional layers outperforms a single large-filter (5x5 or 7x7) convolutional layer, and VGG verified that performance can be improved by continually deepening the network structure.

Disadvantages of VGG :
VGG consumes more computing resources and uses many more parameters (the 3x3 convolutions are not to blame), resulting in high memory usage (about 140M parameters). Most of the parameters come from the first fully connected layer; VGG has three fully connected layers. PS: tests show that removing these fully connected layers has no impact on performance while significantly reducing the parameter count. Note: many pretraining approaches use VGG models (mainly VGG16 and VGG19). Because VGG's parameter space is large compared with other methods, training a VGG model from scratch usually takes longer; fortunately, public pre-trained models make it very convenient to use.
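A quick way to see where the parameters live, using torchvision's VGG16 definition (assumes a recent torchvision; no pretrained weights needed):

```python
# Count parameters in the convolutional stack vs the fully connected head.
from torchvision.models import vgg16

model = vgg16(weights=None)
conv_params = sum(p.numel() for p in model.features.parameters())
fc_params = sum(p.numel() for p in model.classifier.parameters())
total = conv_params + fc_params
print(f"conv: {conv_params/1e6:.1f}M, fc: {fc_params/1e6:.1f}M, "
      f"total: {total/1e6:.1f}M")
# The first FC layer alone holds 25088 * 4096 ~= 102.8M weights,
# roughly three quarters of the ~138M total.
```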

3.6 Research and Discovery of Increased Neural Network Depth

In computer vision, network depth is an important factor in achieving good results: the "level" of the extracted features rises as the depth of the network increases. However, as depth keeps growing, vanishing and/or exploding gradients become an obstacle to training deep networks, preventing them from converging. Normalized initialization and per-layer input normalization made it possible for networks roughly ten times deeper to converge; yet even once such networks converge, they begin to degrade (adding layers leads to larger error), as shown in the figure below:

(Figure: deeper plain networks showing higher training and test error than shallower ones; image unavailable.)

3.6.1 Resnet network

Resnet's residual unit:

ResNet was proposed in 2015 and won first place in the ImageNet classification task. Because it is both simple and effective, many later methods in detection, segmentation, recognition, and other fields were built on ResNet50 or ResNet101 and are widely used. ResNet uses a connection called a "shortcut connection", which, as the name suggests, skips over layers. The following is ResNet's residual structure:

(Figure: a residual unit with its shortcut connection; image unavailable.)

The residual unit computes

$$y = F(x, \{W_i\}) + x, \qquad F(x, \{W_i\}) = W_2\,\sigma(W_1 x)$$

where σ represents the activation function ReLU.

When the input and output dimensions need to change (such as changing the number of channels), a linear transformation $W_s$ can be applied to x on the shortcut, as follows:

$$y = F(x, \{W_i\}) + W_s x$$
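A minimal PyTorch sketch of such a residual unit, with the optional $W_s$ projection on the shortcut (batch normalization added, as in the published architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # W_s: only needed when input and output dimensions differ.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # sigma(W1 x)
        out = self.bn2(self.conv2(out))         # W2 sigma(W1 x)
        return F.relu(out + self.shortcut(x))   # F(x) + x (or + W_s x)

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64, 128, stride=2)(x).shape)  # [1, 128, 28, 28]
```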

3.6.2 Resnet residual unit

(Figure: the two building blocks, the basic block (left) and the bottleneck block (right); image unavailable.)

These two structures correspond to ResNet34 (left) and ResNet50/101/152 (right); each is generally called a "building block", and the one on the right is also called a "bottleneck design". Its purpose is to reduce the number of parameters: considering computational cost, the residual block is optimized by replacing the two 3x3 convolutional layers with a 1x1 + 3x3 + 1x1 stack, as shown on the right. The first 1x1 convolution reduces the 256-dimensional channel to 64 dimensions, the middle 3x3 convolution then operates at this reduced dimensionality, and a final 1x1 convolution restores the 256 dimensions; this maintains accuracy while reducing the amount of computation.

Total number of parameters: 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69632. Without the bottleneck, two 3x3, 256-channel convolutions would require 3x3x256x256x2 = 1179648 parameters, about 16.94 times as many. The plain (two-layer) residual block is used in networks of 34 layers or fewer, while the bottleneck design is typically used in deeper networks such as ResNet101, in order to reduce computation and parameters.
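The arithmetic, verified:

```python
# Weight counts only (biases ignored), as in the text above.
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256
plain = 3 * 3 * 256 * 256 * 2
print(bottleneck, plain, plain / bottleneck)  # 69632 1179648 ~16.94
```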

3.6.3 Convolution parameters of various models of Resnet

(Tables: per-layer convolution configurations of the Resnet variants (18/34/50/101/152); images unavailable.)

3.6.4 Resnet50 network composition

The Resnet50 network contains 49 convolutional layers and 1 fully connected layer. As shown in the figure below, the Resnet50 structure can be divided into seven parts. The first part contains no residual blocks; it mainly performs convolution, normalization, activation, and max pooling on the input. The second through fifth parts all contain residual blocks. In Resnet50 each residual block has three convolutional layers, so the network has 1 + 3×(3+4+6+3) = 49 convolutional layers in total; adding the final fully connected layer gives 50 layers, which is the origin of the name Resnet50. The network input is 224×224×3; after the convolutions of the first five parts, the output is 7×7×2048. A pooling layer converts this into a feature vector, and finally a classifier processes the feature vector and outputs class probabilities.
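A quick sanity check against torchvision's ResNet50 (assumes torchvision; note the module count also includes the four 1x1 projection-shortcut convolutions on top of the 49 main-path layers):

```python
# Count Conv2d and Linear modules in torchvision's ResNet50.
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None)
convs = sum(1 for m in model.modules() if isinstance(m, nn.Conv2d))
fcs = sum(1 for m in model.modules() if isinstance(m, nn.Linear))
print(convs, fcs)  # 53 1: the 49 main-path convs plus 4 shortcut 1x1 convs
```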

(Figure: Resnet50 network structure; image unavailable.)

3.6.5 Resnet50 convolution and pooling

The output spatial size follows from the input size $i$, kernel size $k$, padding $p$, and stride $s$:

$$o_{\text{conv}} = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1, \qquad o_{\text{pool}} = \left\lceil \frac{i + 2p - k}{s} \right\rceil + 1$$

Note : when the size is not evenly divisible, convolution rounds the output size down and pooling rounds it up.
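Two small helper functions encoding this convention (a sketch; frameworks differ, e.g. PyTorch defaults to floor-rounding for pooling and compensates with padding):

```python
import math

def conv_out(size, kernel, stride, pad=0):
    # Convolution rounds down when the stride does not divide evenly.
    return math.floor((size + 2 * pad - kernel) / stride) + 1

def pool_out(size, kernel, stride, pad=0):
    # Pooling rounds up, so no input pixel is dropped.
    return math.ceil((size + 2 * pad - kernel) / stride) + 1

print(conv_out(224, 7, 2, pad=3))  # 112: Resnet50's first convolution
print(pool_out(112, 3, 2))         # 56: ceil((112 - 3) / 2) + 1
```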

3.6.6 Comparison between Resnet and ordinary neural network (forward propagation and error backward propagation)

For a residual network, stacking residual units gives a directly additive forward pass:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

Forward process: the final result shows that the forward pass is a chain of additions (each residual unit is treated as one unit; inside a unit there are still two layers of multiplication), i.e. one can go directly from layer l to layer L, whereas a traditional network composes multiplications layer after layer; the amount of computation is clearly different (from multiplication to addition).

$$x_L = \left(\prod_{i=l}^{L-1} W_i\right) x_l \quad \text{(plain network, ignoring nonlinearities)}$$

For the residual unit the forward process is additive: a later input equals the earlier input plus the sum of the outputs of the intervening residual units, whereas in a plain network each layer's convolution multiplies into the next. This additive forward pass is the first major feature of the residual network; in the backward pass, the same structure solves the vanishing-gradient problem.

3.6.7 Why the residual network can solve gradient vanishing

Consider a BP neural network that uses the sigmoid function as its activation function; the second figure below shows its derivative, whose maximum value is 0.25.

(Figures: the sigmoid function and its derivative, which peaks at 0.25; images unavailable.)

During error backpropagation in a BP network, the chain rule differentiates through the activation function at every layer. As the number of layers grows, these factors multiply together; even at the maximum value, a product of many 0.25s slowly approaches 0, so the gradient essentially vanishes in the earlier layers, the weights stop changing, and no matter how many more layers there are, nothing further can be learned.
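The effect is easy to see numerically:

```python
# The chain-rule product: with sigmoid, each factor is at most 0.25,
# so the product shrinks geometrically with depth.
for depth in (5, 10, 20, 30):
    print(depth, 0.25 ** depth)
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
# 30 8.673617379884035e-19
```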

3.6.8 How Resnet solves gradient vanishing

Resnet error back propagation process:

$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\cdot\frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

It can be seen that the backpropagated gradient consists of two terms:

1. The derivative through the identity shortcut, $\partial x_l / \partial x_l = 1$.
2. The derivative through the stacked layers, which behaves like that of a multi-layer plain network:

$$\frac{\partial E}{\partial x_L}\cdot\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)$$

Even if the plain-network term approaches 0, the total gradient is the sum of the two terms, and with the added 1 it cannot be 0. The error can therefore propagate effectively to the deep layers, so the residual structure avoids the vanishing-gradient problem.
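A small autograd sketch illustrating this (under simplified assumptions: a stack of 30 sigmoid MLP layers rather than convolutions), contrasting the same stack with and without skip connections:

```python
# Compare the gradient reaching the input with and without y = F(x) + x.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Sequential(nn.Linear(16, 16), nn.Sigmoid()) for _ in range(30)]

def run(residual):
    x = torch.randn(1, 16, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if residual else layer(h)  # F(x) + x vs F(x)
    h.sum().backward()
    return x.grad.abs().mean().item()

print("plain:   ", run(False))   # essentially zero
print("residual:", run(True))    # a usable gradient, thanks to the "+1" term
```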

3.6.9 Resnet network performance evaluation

In a network that is too deep, the gradient easily vanishes during backpropagation: once the derivative at some step is less than 1, continued backpropagation keeps shrinking it, meaning the shallow layers can learn almost nothing. This is why making a network too deep reduces its effectiveness. After adding the shortcut structure in ResNet, backpropagation passes not only the gradient between every two blocks but also adds in the gradient from before the differentiation, which effectively increases the gradient passed backward through each block. This reduces the chance of the gradient vanishing, so deeper features can be learned well.

Feature redundancy : one view is that in the forward pass, each layer's convolution extracts only part of the image's information, so the deeper the layer, the more of the original image information is lost, with only a small fraction of the original features extracted. This plainly produces something like underfitting. Adding the shortcut structure is equivalent to handing each block all of the previous layer's information, retaining more of the original information to some extent.
In general, every convolution (including the corresponding activation) discards some information: the randomness (blindness) of the convolution kernel parameters, the suppression by the activation function, and so on. The shortcut in ResNet is equivalent to bringing the less-processed earlier information along to be processed together now, which mitigates this loss. This is why learning deep features works comparatively well.

3.7 Using Resnet50 to extract features and compare image similarity

(Figures: Resnet50 feature-extraction code and similarity results; images unavailable.)
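A minimal sketch of such a pipeline, using torchvision's pretrained ResNet50 and cosine similarity (image paths are placeholders; assumes torchvision and Pillow are installed):

```python
# Strip ResNet50's final FC layer, use the 2048-D pooled output as the
# image feature, and compare two images with cosine similarity.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Identity()        # keep everything up to the 2048-D feature
model.eval()

@torch.no_grad()
def extract(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(x).squeeze(0)  # (2048,) feature vector

feat_a = extract("image_a.jpg")
feat_b = extract("image_b.jpg")
similarity = torch.cosine_similarity(feat_a, feat_b, dim=0)
print(f"cosine similarity: {similarity.item():.4f}")
```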
