Principles of Convolutional Neural Networks in Computer Vision

A simple neuron: the three inputs (excitations) on the left are each multiplied by their corresponding weights; the products are summed, the bias is added, and the result is passed through an activation function to produce the final output y.
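As a minimal sketch (assuming a sigmoid activation, since the text does not name one; all values are illustrative):

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum of the excitations plus bias, passed through
    # a sigmoid activation function (one possible choice).
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # three input excitations
w = np.array([0.5, -0.3, 0.1])   # corresponding weights
y = neuron(x, w, b=0.2)          # final output y
```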
If we arrange neurons in columns and fully connect adjacent columns, we obtain a BP (back-propagation) neural network. The BP algorithm consists of two parts: forward propagation of the signal and backward propagation of the error. In the forward pass, an output value is computed from left to right; comparing it with the expected value gives the error. Computing the partial derivative at each node yields that node's error gradient, and applying the loss value to these gradients propagates the error backward through the network.
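The forward and backward passes described above can be sketched for a single sigmoid neuron with a squared-error loss (all values here are illustrative):

```python
import numpy as np

# One training step for a single sigmoid neuron (forward + backward pass).
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.3, 0.1])
b, target, lr = 0.2, 1.0, 0.1

# Forward propagation: left to right.
z = np.dot(w, x) + b
y = 1.0 / (1.0 + np.exp(-z))
loss = 0.5 * (y - target) ** 2

# Error back propagation: the chain rule gives each parameter's gradient.
dy = y - target            # dLoss/dy
dz = dy * y * (1.0 - y)    # dLoss/dz via the sigmoid derivative
dw = dz * x                # dLoss/dw for each weight
db = dz

# Apply the error gradients back to the parameters (gradient step).
w -= lr * dw
b -= lr * db
```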

The purpose of convolution is to extract image features. Because it operates as a sliding window, it has a local perception mechanism, and because the convolution kernel does not change while sliding, it also has a weight-sharing mechanism. The following figure illustrates the benefits of the weight-sharing mechanism:

[figure: benefit of the weight-sharing mechanism]

Must understand:

Convolving one picture yields a result that can itself be understood as another picture. Convolution here is quite different from convolution in digital signal processing. In signal-processing textbooks, before the element-wise products are summed, the convolution kernel is first flipped (mirrored both horizontally and vertically, not rotated), and only then are the products summed; computer vision skips this mirroring step. The code is simpler and the neural network still works. Technically speaking, what is actually computed is cross-correlation rather than convolution, but in the deep-learning literature it is called a convolution operation by convention.
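The difference can be checked directly with SciPy, which offers both operations: `correlate2d` computes what deep learning calls "convolution", while `convolve2d` flips the kernel first (the example values are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

img = np.arange(9, dtype=float).reshape(3, 3)
k = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# What deep-learning frameworks call "convolution":
cross = correlate2d(img, k, mode='valid')
# True (textbook) convolution flips the kernel before summing:
true_conv = convolve2d(img, k, mode='valid')
# Flipping the kernel by hand makes the two operations agree:
assert np.allclose(correlate2d(img, k[::-1, ::-1], mode='valid'), true_conv)
```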

! The vertical detection kernel detects vertical edges. If the original image is small, the detected edge comes out wide; when the image is large, vertical edges are detected cleanly. The same applies to horizontal edges.
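A small NumPy sketch of vertical edge detection (the image and kernel values are illustrative):

```python
import numpy as np

# A 6x6 image: bright left half, dark right half (a vertical edge in the middle).
img = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

# Classic vertical edge-detection kernel.
k = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]], dtype=float)

def conv2d_valid(image, kernel):
    # Plain sliding-window cross-correlation, stride 1, no padding.
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edges = conv2d_valid(img, k)
# The middle columns respond strongly; on this small image the detected
# edge is two pixels wide, matching the note above about small images.
```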
! In deep learning, detecting the edges of complex shapes does not require using the nine numbers (the convolution template) chosen by researchers. Instead, these nine numbers are treated as nine parameters, and the back-propagation algorithm learns them automatically from data. The network can thus learn low-level features such as edges and detect, for example, the edge of a book at any angle. This is exactly what a convolutional neural network is meant to do.

! The purpose of the convolutional layer: parameter sharing (research found that a kernel which extracts one feature can be applied across the entire image, so the same parameters are reused in different regions of the image) and sparse connections (each output value depends only on its local input region; other pixels do not affect it). Both allow training with far fewer parameters and help prevent overfitting.

! Padding: convolving an n×n image with an f×f edge-detection kernel yields an image of size (n−f+1)×(n−f+1). This has two disadvantages: ① the image shrinks after every convolution operation, and we do not want to lose the features of our image (for example, after a 100-layer network the final feature map would be uselessly small); ② corner and edge pixels are used less often in the output, so much information near the border of the image is lost. Solution: pad the image before the convolution operation, customarily with zeros. If p is the number of pixel layers added around the border, the output size becomes (n+2p−f+1)×(n+2p−f+1). For the amount of padding, there are usually two choices: Valid convolution and Same convolution. Valid convolution uses no padding, so the output shrinks; Same convolution pads so that the output size equals the input size.
Note: in computer vision, f is usually odd; the kernel then has a central pixel, and no asymmetric padding occurs.

! Convolution stride: the distance the kernel moves horizontally and vertically, denoted s. The output size becomes (⌊(n+2p−f)/s⌋+1) × (⌊(n+2p−f)/s⌋+1); if the quotient is not an integer, it is rounded down.
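Both the padding and stride formulas can be wrapped in one helper (a sketch; the example sizes are illustrative):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1, rounding down when the quotient
    # is not an integer.
    return math.floor((n + 2 * p - f) / s) + 1

valid = conv_output_size(6, 3)          # no padding: 6x6 shrinks to 4x4
same = conv_output_size(6, 3, p=1)      # p = (f-1)/2 keeps the size: 6x6
strided = conv_output_size(7, 3, s=2)   # stride 2: (7-3)/2 + 1 = 3
```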

! Three-dimensional convolution: to detect the features of RGB color images rather than only grayscale images, you must use a three-dimensional convolution kernel, whose number of channels must match the depth of the image (the reason comes later). Imagine the 3-D kernel as a cube that is translated across the original image while multiplying and summing; the output of one three-dimensional convolution therefore has only one channel. If you only want to detect edges in one channel, set the first slice of the kernel to an edge operator and the other two slices to zero, so the kernel responds only to the red channel; different parameter choices give different feature detectors. By computer-vision convention, the width and height of the kernel may differ from those of the input image, but the number of channels must be the same. In principle, focusing on only one channel is feasible.
To summarize:
1. The number of channels of the convolution kernel must equal the number of channels of the input feature map.
2. The number of channels of the output feature map equals the number of convolution kernels.
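A sketch of these two rules in NumPy (the image values are random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.random((6, 6, 3))          # an RGB image: height x width x 3 channels

def conv3d_single(image, kernel):
    # Kernel channels must match image channels; one kernel -> one output channel.
    assert image.shape[2] == kernel.shape[2]
    f = kernel.shape[0]
    h, w = image.shape[0] - f + 1, image.shape[1] - f + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * kernel)
    return out

# Vertical edge operator on the red channel only: green/blue slices stay zero.
k = np.zeros((3, 3, 3))
k[:, :, 0] = [[1, 0, -1]] * 3

red_edges = conv3d_single(rgb, k)                 # single output channel
two_maps = np.stack([conv3d_single(rgb, k),
                     conv3d_single(rgb, -k)], axis=-1)
# Output channels equal the number of kernels applied (here 2).
```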

! What if you want to use several convolution kernels at once, i.e. detect edges in several directions? Stack the images output by the different edge-detection kernels. In summary: if the input image is n×n×a and each kernel is f×f×a, the matrix output by the convolution is (n−f+1)×(n−f+1)×(number of kernels), assuming stride 1 and no padding. The number of output channels therefore equals the number of features to be detected.

! How to build one layer of a convolutional network: the outputs of convolving the original image with different kernels form one convolutional layer. Python's broadcasting mechanism adds the same bias to every element of each output matrix, and then a nonlinear activation function is applied; each kernel yields a different matrix of the same size, and stacking these matrices gives the layer's output. Denote the first kernel by W1: during convolution, multiplying by each of its entries plays the role of W1·a[0], to which the bias b1 is added; the second kernel gives W2·a[0]+b2. This is exactly z = wx + b in an ordinary neural network, and stacking all the kernel outputs after the nonlinearity forms the output a[1]. Each element of a kernel is a weight, and each kernel shares a single bias, which makes it easy to count parameters: regardless of the size of the picture, the number of parameters is fixed once the kernels are chosen. This feature of convolutional neural networks helps avoid overfitting.
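The parameter-counting argument can be written out directly (the kernel counts below are illustrative):

```python
def conv_layer_params(f, in_channels, num_kernels):
    # Each kernel: f*f*in_channels weights + one shared bias.
    return (f * f * in_channels + 1) * num_kernels

# 10 kernels of size 3x3 over an RGB input: 280 parameters,
# whether the image is 64x64 or 1000x1000.
n_params = conv_layer_params(3, 3, 10)
```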

! Notation, if layer l is a convolutional layer: let f^[l] denote the kernel size (the superscript l denotes the l-th layer), p^[l] the amount of padding, and s^[l] the stride. The input of this layer is data of dimension n_H^[l−1] × n_W^[l−1] × n_c^[l−1] (height × width × number of channels of the previous layer).
! Example of building a network: from a 39×39×3 input image, extract 7×7×40 features, i.e. 1960 features; the output of the last convolutional layer is then flattened into 1960 units for subsequent processing. What you need to master is the pattern: as the network gets deeper, the height and width stay roughly constant for a while and then gradually shrink, while the number of channels keeps increasing. This is a common trend in many convolutional neural networks.
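Tracing the shapes with the output-size formula (the kernel sizes, strides, and kernel counts below are hypothetical choices that happen to reach 7×7×40):

```python
import math

def conv_out(n, f, p, s):
    # floor((n + 2p - f) / s) + 1
    return math.floor((n + 2 * p - f) / s) + 1

# Hypothetical layer settings taking a 39x39x3 input to 7x7x40.
n, channels = 39, 3
for f, p, s, num_kernels in [(3, 0, 1, 10),   # -> 37 x 37 x 10
                             (5, 0, 2, 20),   # -> 17 x 17 x 20
                             (5, 0, 2, 40)]:  # ->  7 x  7 x 40
    n = conv_out(n, f, p, s)
    channels = num_kernels

features = n * n * channels   # 7 * 7 * 40 = 1960 features to flatten
```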
! Although convolutional layers alone can build a good network, most architects also add pooling layers and fully connected layers, which are easier to design than convolutional layers. The purpose of the pooling layer is to sparsify the feature maps, which reduces the amount of matrix computation, increases calculation speed, and improves the robustness of the extracted features. There are two types of pooling: max pooling and average pooling; the latter is used less often.

! Max pooling: the input can be regarded as a collection of features, and each element of the output is the maximum value of its corresponding region. A large value means some specific feature may have been extracted; if a feature is absent, the maximum of the corresponding region stays small (as in the upper-right region of the example, where the stride is 2).
It must be admitted that the main reason people use max pooling is that it works well in many experiments, even though the intuitive explanation above is often quoted. The output size of max pooling is given by the same formula, (⌊(n+2p−f)/s⌋+1) × (⌊(n+2p−f)/s⌋+1). Its filter size f and stride s are hyperparameters, commonly set to f=2 and s=2, which halves the height and width. In addition, padding is rarely used with max pooling, i.e. p=0. Moreover, there are no parameters to learn during pooling; these are just static attributes of that layer of the neural network.
In general, the pool size and the stride are the same.
Extension: the output has as many channels as the input, because max pooling performs the computation just described on each channel separately.
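A per-channel max-pooling sketch in NumPy (input values are random and illustrative):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    # Each channel is pooled independently, so the number of channels
    # is unchanged; f=2, s=2 halves the height and width.
    h = (x.shape[0] - f) // s + 1
    w = (x.shape[1] - f) // s + 1
    out = np.zeros((h, w, x.shape[2]))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max(axis=(0, 1))
    return out

x = np.random.default_rng(1).random((4, 4, 3))
y = max_pool(x)    # 4x4x3 input -> 2x2x3 output
```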

! Average pooling: take the average value of each region instead of the maximum; max pooling is used more commonly than average pooling. Note that in the literature a pooling layer and a convolutional layer are often each counted as one layer, but in some cases only the weighted layers are counted in the network depth.

! Fully connected layer: the preceding convolution and pooling are equivalent to feature engineering, and the fully connected part that follows is equivalent to feature weighting; convolution can be seen as a deliberately weakened full connection. Inspired by the local receptive field, connections outside the local region are simply zeroed out, and going one step further, different positions are forced to use the same parameters. This weakening reduces the number of parameters and saves computation, specializing in local regions rather than greedily covering everything; forcing shared parameters reduces the parameter count even further. Each neuron in the fully connected layer is connected to all neurons of the previous layer, so the fully connected layer can integrate the class-discriminative local information from the convolutional or pooling layers. The output of the last fully connected layer is passed to an output that can be classified by softmax logistic regression (softmax regression); this layer can also be called a softmax layer. Usually, CNNs are also trained with the BP algorithm.

! For convolutional neural networks: the pooling layers have no parameters; the convolutional layers have relatively few parameters; the bulk of the parameters live in the fully connected layers. As the network deepens, the number of activation values gradually decreases; if it drops too fast, the performance of the network suffers (dropping too fast means too few features are extracted). Regarding the parameter count: for a two-dimensional kernel of size f=5, one kernel has 5×5+1 = 26 parameters (including its bias), and 6 such kernels have 6×26 = 156 parameters.
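The comparison of parameter counts, using the f=5 example above plus a hypothetical fully connected layer (the 14×14×6 → 120 sizes are illustrative):

```python
# Parameter counts for the three layer types (hypothetical sizes):
f, in_ch, kernels = 5, 1, 6
conv_params = (f * f * in_ch + 1) * kernels   # 26 per kernel -> 156 total
pool_params = 0                               # pooling learns nothing

# Fully connected: flattening a 14x14x6 volume into 120 units.
fc_params = (14 * 14 * 6) * 120 + 120         # dwarfs the conv layer
```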

Note: Why is the activation function introduced?
To introduce nonlinear factors so the network can solve nonlinear problems. Once a ReLU unit is deactivated (its input stays negative), it can no longer be activated, so it is not recommended to start training with a particularly large learning rate, which may deactivate many neurons.
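A sketch of ReLU and its gradient, illustrating why a "dead" unit stops learning:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
a = relu(z)                      # negative inputs are clamped to zero
# ReLU gradient: 1 where z > 0, else 0. A neuron whose input stays
# negative receives zero gradient and can never recover ("dying ReLU"),
# which is why a very large initial learning rate is risky.
grad = (z > 0).astype(float)
```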

Origin: blog.csdn.net/qq_42308217/article/details/109609140