[Deep Learning Theory] (2) Convolutional Neural Network

Hello everyone, I recently finished Stanford's CS231n computer vision open course. It was so good that I'd like to share my notes with you.

As shown in the figure below, suppose we have an image of shape 32x32x3. To use a fully connected neural network, the image is first stretched into a one-dimensional vector of shape [None, 3072]. Ten linear classifiers are then applied; each classifier is a one-dimensional vector containing 3072 weights. Taking the dot product of each classifier with the image yields one value per classifier, so the output feature vector has shape [None, 10].
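As a minimal NumPy sketch (not from the original post) of the fully connected classifier just described, with an assumed batch size of 4:

```python
import numpy as np

batch = np.random.rand(4, 32, 32, 3)   # a batch of 32x32x3 images
x = batch.reshape(4, -1)               # flatten -> [None, 3072]
W = np.random.rand(3072, 10)           # 10 classifiers, 3072 weights each
b = np.zeros(10)
scores = x @ W + b                     # one value per classifier -> [None, 10]
print(scores.shape)                    # (4, 10)
```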

The disadvantage of using a fully connected network for image classification is that flattening a two-dimensional image into a one-dimensional vector discards the spatial information of the image. The convolution operation is introduced to solve this problem.


1. Convolutional Neural Network Composition

A convolutional neural network consists of convolutional layers, pooling layers (downsampling), and fully connected layers. The convolutional layers extract features; the pooling layers post-process the features, shrinking the feature maps so that the image becomes smaller and blurrier while the main features are kept; the fully connected layers fuse the different features and output the final result.


2. Convolution

The convolution operation can be understood as a convolution kernel sliding over the original image: at each position, the image pixels are multiplied element-wise by the kernel weights at the corresponding positions, the products are summed, and the result is written into the corresponding location of the generated feature map.

2.1 Single channel image

Take a single-channel image as an example. In the figure below, the left image shows the original pixels, with the green area marking the receptive field; the middle image is the convolution kernel; each convolution produces a feature map, shown on the right. The kernel weights do not change as the kernel slides. Within the receptive field, the original pixels are multiplied by the corresponding kernel values, the products are summed, and the result is filled into the corresponding position of the feature map.
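The sliding-window computation above can be sketched in NumPy (a toy example I've added, not code from the post), with an assumed 5x5 image and 3x3 averaging kernel:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide the kernel over the image (stride 1, no padding);
    each output value is the sum of element-wise products."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            field = image[i:i+k, j:j+k]        # receptive field
            out[i, j] = np.sum(field * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                 # simple averaging kernel
fmap = conv2d_single(image, kernel)
print(fmap.shape)                              # (3, 3)
```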

There are as many feature maps as there are convolution kernels: one kernel produces one feature map, and changing to a second set of kernel weights produces a second feature map. Stacking the feature maps produced by multiple kernels yields a multi-channel feature map, which serves as the input to the next convolutional layer.

There is, however, a small problem. During convolution, a pixel in the upper-left corner of the original image participates in only one convolution operation, while a pixel in the middle participates in many. As a result, edge information is lost during convolution.

To solve this, we can surround the edge of the original image with a ring of zeros, known as the padding operation. (1) With zero padding, edge pixels participate in more convolution operations. (2) Without zero padding, the feature maps shrink at every convolution; with zero padding, the feature map after convolution has the same size as before convolution.
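Padding one ring of zeros is a single call in NumPy; here is a small sketch (my example, with an assumed 5x5 image) showing that a 3x3 convolution over the padded image would then return a 5x5 output:

```python
import numpy as np

image = np.random.rand(5, 5)
# pad one ring of zeros around the image
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)   # (7, 7): a 3x3 kernel now yields a 5x5 map, same as the input
```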

If the convolution kernel moves over the image with stride = (1, 1), it slides one pixel at a time in the horizontal and vertical directions. If it moves with stride = (2, 2), it slides two pixels at a time in each direction.


2.2 Multi-channel images

Real-world images are generally color images with three RGB channels; the case introduced above had only a single channel.

In the convolution operation, the kernel has as many channels as the input image. As shown in the figure below, the three channels of the kernel slide over the three channels of the original image, the corresponding elements are multiplied, and all the products are summed; one kernel still produces one feature map.

As shown in the figure below, the input image has three channels, so a convolution kernel also has three channels. Each channel of the kernel corresponds to one channel of the original image, and the weights of the kernel's channels differ from one another. The kernel values are multiplied by the corresponding image values, the products are summed, a bias is added, and the result is filled into the corresponding position of the feature map.
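The multi-channel case can be sketched by extending the single-channel loop (again my own toy example, with an assumed 6x6x3 input and one 3x3x3 kernel):

```python
import numpy as np

def conv2d_multichannel(image, kernel, bias=0.0):
    """One multi-channel kernel over a multi-channel image -> one feature map."""
    H, W, C = image.shape                     # kernel has shape (k, k, C)
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply element-wise across all channels, sum everything, add bias
            out[i, j] = np.sum(image[i:i+k, j:j+k, :] * kernel) + bias
    return out

image = np.random.rand(6, 6, 3)    # RGB image
kernel = np.random.rand(3, 3, 3)   # one 3-channel kernel
fmap = conv2d_multichannel(image, kernel, bias=0.5)
print(fmap.shape)                  # (4, 4)
```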


2.3 The purpose of convolution

The purpose of convolution is to extract, from the original image, the features that the convolution kernel defines.

It can be understood this way: the convolution kernel defines a feature, and the convolution operation extracts that feature from the original image. Where pixel values in the original image match the kernel's feature, the corresponding values in the feature map are large; elsewhere they are small. Different kernels extract different features from the original image, as the following two kernels show.


2.4 Summary of Convolution Process

An input image of shape [32, 32, 3] convolved with 6 kernels of size 5x5x3 yields a feature map of shape [28, 28, 6]; convolving that with 10 kernels of size 5x5x6 yields a feature map of shape [24, 24, 10].


3. Pooling

The convolutional layers output feature maps. As the number of kernels grows, the number of feature-map channels grows, and with it the number of parameters. Pooling selects representative pixel values from each feature map, so the network does not have to carry all of that complexity forward.

The selection methods are: max pooling, which takes the maximum value in each pooling window; and average pooling, which takes the average value of each pooling window.
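Both methods can be sketched with one small NumPy function (my example, assuming a non-overlapping 2x2 window, i.e. stride equal to the window size):

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping pooling over one channel (stride == window size)."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[ 1.,  2.,  5.,  6.],
                 [ 3.,  4.,  7.,  8.],
                 [ 9., 10., 13., 14.],
                 [11., 12., 15., 16.]])
print(pool2d(fmap, mode="max"))    # [[ 4.  8.] [12. 16.]]
print(pool2d(fmap, mode="avg"))    # [[ 2.5  6.5] [10.5 14.5]]
```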

The role of pooling :

(1) It gives the convolutional neural network translation invariance. No matter where an object sits in the original image, small translations are filtered out by the pooling layer.

(2) It reduces the number of parameters, by extracting representative pixel values from the feature map.

(3) It helps prevent overfitting. The feature map is reduced to a smaller size, the main features are retained, and noise is filtered out.

The kernel in a convolution operation has weights; the pooling window has no weights and simply selects the corresponding value.


4. Fully connected

The fully connected layer flattens the output feature maps into a long vector, fuses the features, and produces the final result.


5. Features of Convolutional Neural Networks

(1) Local receptive fields. Each neuron is connected only to a region of the input, called its receptive field.

This differs from a fully connected network, where every neuron processes all of the previous layer's output; in a convolutional network, each neuron attends only to the small area inside its receptive field.

(2) Weight sharing. The convolution kernel is shared: during one convolution pass, the weights of a kernel do not change (though the weights of a kernel's different channels can differ).

(3) Pooling (downsampling) reduces parameters, prevents overfitting, and introduces translation invariance .


6. Supplement small knowledge points

6.1 Size change of feature map

If the input image has size (N, N) and the convolution kernel has size (F, F), the output feature map has size (N - F)/stride + 1.

As shown in the figure below, when stride = 3 the output feature-map size works out to a fraction, which should be avoided. Padding the original image with zeros makes the output size an integer.

After padding, the output feature-map size formula becomes: (N + 2P - F)/stride + 1.

P is the number of rings of zero padding around the original image. If P = (F - 1)/2 and stride = 1, the output feature map has the same size as the input.
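The formula can be written as a small helper (my sketch), which also checks the shapes from the summary in section 2.4:

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Feature-map size: (N + 2P - F) / stride + 1."""
    return (n + 2 * padding - f) // stride + 1

# The shape chain from section 2.4: 32x32 -> 28x28 -> 24x24 with 5x5 kernels
print(conv_output_size(32, 5))               # 28
print(conv_output_size(28, 5))               # 24
# "same" convolution: P = (F-1)/2 = 2 and stride = 1 keep the size unchanged
print(conv_output_size(32, 5, padding=2))    # 32
```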


6.2 1x1 Convolution

A 1x1 convolution is one whose kernel has height and width equal to 1. As shown in the figure below, the input feature map has shape [64, 64, 192] and one kernel has shape [1, 1, 192]; multiplying corresponding elements and summing gives an output feature map of shape [64, 64, 1].

Varying the number of 1x1 kernels raises or lowers the channel dimension. With 128 1x1 kernels, an input of shape [64, 64, 192] is reduced to [64, 64, 128].
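A 1x1 convolution is just a per-pixel linear map across channels, so the 192-to-128 reduction above can be sketched as one matrix multiplication (my example, with assumed random weights):

```python
import numpy as np

fmap = np.random.rand(64, 64, 192)    # input feature map
kernels = np.random.rand(128, 192)    # 128 kernels, each of shape 1x1x192
out = fmap @ kernels.T                # applied independently at every pixel
print(out.shape)                      # (64, 64, 128)
```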

The role of 1x1 convolution : (1) dimensionality reduction or increase; (2) cross-channel information fusion; (3) reduce the amount of parameters; (4) increase the depth of the model and improve the nonlinear representation ability.

As shown in figure (a) below, convolving directly with 5x5x256 kernels takes about 200,000 parameters and about 160.5M multiply operations: each value in the output feature map comes from one convolution, and each convolution costs as many multiplications as the kernel has weights (5x5x256).

As shown in figure (b), a 1x1 convolution first reduces the channel dimension, and a 5x5 convolution then extracts features. Each 1x1 kernel has only 1x1x256 parameters, which greatly reduces the amount of computation.
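The post's figures don't state the spatial size or kernel counts, so the arithmetic below assumes a 28x28x256 input, 32 output channels, and a 1x1 bottleneck of 64 channels; with those assumptions the direct path works out to the roughly 200K parameters and 160.5M multiplications quoted above:

```python
H = W = 28                       # assumed spatial size of the feature map
C_in, C_out, C_mid, k = 256, 32, 64, 5   # assumed channel counts, 5x5 kernel

# (a) direct 5x5 convolution
direct_params = C_out * k * k * C_in       # 204,800 weights (~200K)
direct_ops = H * W * C_out * k * k * C_in  # 160,563,200 multiplications (~160.5M)

# (b) 1x1 reduction to C_mid channels, then 5x5 convolution
bottleneck_ops = H * W * C_mid * C_in + H * W * C_out * k * k * C_mid
print(direct_params, direct_ops, bottleneck_ops)
```

With these assumed sizes, the bottleneck path needs roughly a third of the multiplications of the direct path.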

Origin blog.csdn.net/dgvv4/article/details/123519574