Convolutional Neural Networks (CNNs), a foundation of machine learning

A Convolutional Neural Network (CNN) consists of an input layer, convolutional layers, activation functions, pooling layers, and fully connected layers, i.e. INPUT-CONV-RELU-POOL-FC.

(1) Convolutional layer: used for feature extraction, as follows:

The input image is 32*32*3, where 3 is its depth (i.e. the R, G, B channels), and the convolutional layer uses a 5*5*3 filter (receptive field). Note: the depth of the filter must be the same as the depth of the input image. Convolving one filter with the input image yields a 28*28*1 feature map; the picture above uses two filters to obtain two feature maps.

We usually stack multiple convolutional layers to obtain deeper feature maps, as follows:

The process of convolution is illustrated as follows:

Each element of the filter is multiplied by the input element at the corresponding position, the products are summed, and finally the bias b is added to obtain one element of the feature map. As shown in the figure, the first depth slice of filter w0 is multiplied element-wise with the blue box of the input image and summed to give 0; the other two depth slices give 2 and 0; then 0+2+0+1 = 3 (the +1 being the bias b) is the first element of the feature map on the right. After this, the blue box slides across the input image with stride = 2, as follows:
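The multiply-sum-plus-bias step described above can be sketched in plain NumPy. This is an illustrative helper (the function name and signature are my own, not a library API):

```python
import numpy as np

def conv2d_single(image, w, b, stride=2):
    """Naive convolution of one multi-channel image with one filter.

    image: (H, W, C) input; w: (F, F, C) filter; b: scalar bias.
    Illustrative sketch only, not an optimized implementation.
    """
    H, W, C = image.shape
    F = w.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the receptive field under the sliding window,
            # multiply element-wise by the filter across all depths,
            # sum everything, then add the bias b
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F, :]
            out[i, j] = np.sum(patch * w) + b
    return out
```

With a 7*7*3 input, a 3*3*3 filter, and stride 2, this produces the 3*3*1 feature map described below.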

As shown in the figure above, completing the convolution yields a 3*3*1 feature map. One more point to note here is zero padding: a border of zeros is added around the image (it has no effect on the original input). Common choices are:

F=3 => zero pad with 1

F=5 => zero pad with 2

F=7 => zero pad with 3. The border width is an empirical value; zero padding is added so that the input image and the convolved feature map have the same spatial size. For example:

If the input is 5*5*3, the filter is 3*3*3, and the zero padding is 1, then the padded input is 7*7*3, and the feature map after convolution is 5*5*1 ((7-3)/1+1 = 5), the same spatial size as the input image.

The spatial size of the feature map is calculated as output = (W - F + 2P)/S + 1, where W is the input size, F the filter size, P the zero padding, and S the stride.
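This size formula, consistent with the (7-3)/1+1 arithmetic above, can be checked with a small sketch (the function name is my own, for illustration):

```python
def conv_output_size(W, F, P, S):
    """Spatial size of a conv feature map: (W - F + 2P) / S + 1.

    W: input width/height, F: filter size, P: zero padding, S: stride.
    """
    return (W - F + 2 * P) // S + 1
```

For example, a 32*32 input with a 5*5 filter, no padding, and stride 1 gives a 28*28 feature map, matching the example at the top of this section.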

Another characteristic of convolutional layers is the principle of "weight sharing", as shown below:

Without this principle, consider an output of 10 feature maps of size 32*32*1, i.e. 1024 neurons per feature map, where each neuron corresponds to a 5*5*3 region of the input image. Each neuron then has 75 connections to that region, i.e. 75 weight parameters, giving 75*1024*10 = 768,000 weights in total, which is far too many. Convolutional networks therefore introduce the weight-sharing principle: the 75 weights are shared by every neuron on the same feature map, so only 75*10 = 750 weights are needed. The bias of each feature map is also shared, requiring 10 biases, for a total of 750 + 10 = 760 parameters.

Additional notes:

(1) A 1*1 convolution on a multi-channel image multiplies each channel of the input by a coefficient and sums the results, which is equivalent to "mixing" the originally independent channels of the image together;
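A minimal sketch of this channel-mixing view of 1*1 convolution (function name is my own, single output channel for simplicity):

```python
import numpy as np

def conv1x1(image, weights, b=0.0):
    """1*1 convolution producing one output channel: each output pixel
    is a weighted sum of the input channels at that same position.

    image: (H, W, C); weights: (C,) coefficients, one per channel.
    """
    # contract the channel axis of the image with the weight vector
    return np.tensordot(image, weights, axes=([2], [0])) + b
```

Note that the spatial size is unchanged; only the channels are combined.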

(2) Weight sharing applies separately within each channel of each filter;

 

Pooling layer: compresses the input feature map. On one hand, it makes the feature map smaller and simplifies the network's computation; on the other hand, it compresses the features, extracting the main ones. As follows:

There are generally two types of pooling operations: average pooling (Avg Pooling) and max pooling, as follows:

Here a 2*2 filter is used; max pooling takes the maximum value in each region, with stride = 2, extracting the main features of the original feature map to give the image on the right.

(Average pooling is not used much now; it sums the elements of each 2*2 region and divides by 4 to extract the main features.) The filter is usually 2*2, at most 3*3, with stride 2, compressing the feature map to 1/4 of its original size.
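The 2*2, stride-2 max pooling described above can be sketched as follows (function name is my own, single-channel map for simplicity):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Max pooling on a single-channel feature map (H, W):
    take the maximum of each size*size region, sliding by stride."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out
```

With size = stride = 2, each output dimension is halved, so the feature map is compressed to 1/4 of its original area.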

Note: pooling shrinks the feature map, which may affect the accuracy of the network; this can be compensated by increasing the depth (number) of feature maps (here the depth is doubled).

 

Fully connected layer: connects all features and sends the output value to a classifier (such as a softmax classifier).
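As a sketch of this final stage (helper names are mine): the feature maps are flattened into one vector, passed through a fully connected layer Wx + b, and the scores are turned into class probabilities by softmax:

```python
import numpy as np

def fully_connected(features, W, b):
    """Flatten all feature maps into one vector and apply W @ x + b.

    features: any-shaped feature tensor; W: (n_classes, n_features);
    b: (n_classes,). Returns raw class scores."""
    x = features.ravel()
    return W @ x + b

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

In a real network W and b are learned; here they only illustrate the data flow from feature maps to class probabilities.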

The overall structure is roughly as follows:

In addition: in a CNN, the first few convolutional layers account for a small share of the parameters but a large share of the computation, while the fully connected layers that follow are exactly the opposite; most CNNs have this property. Therefore, when optimizing for computational speed, focus on the convolutional layers; when optimizing parameters or pruning weights, focus on the fully connected layers.
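This parameter/computation asymmetry can be made concrete with rough multiply-add counts (helper names and the example layer sizes are my own, chosen to match the 5*5*3 conv example earlier in the article):

```python
def conv_cost(H_out, W_out, F, C_in, C_out):
    """Conv layer: shared weights make params small, but each filter
    is applied at every output position, so compute is large."""
    params = (F * F * C_in + 1) * C_out
    flops = H_out * W_out * F * F * C_in * C_out  # multiply-adds
    return params, flops

def fc_cost(n_in, n_out):
    """FC layer: every weight is used exactly once per forward pass,
    so params and compute are both n_in * n_out."""
    params = n_in * n_out + n_out
    flops = n_in * n_out
    return params, flops
```

For example, a conv layer producing 28*28 maps from 5*5*3 filters with 10 outputs has only 760 parameters but hundreds of thousands of multiply-adds, while a modest 1024-to-10 FC layer has far more parameters than multiply-adds per parameter reuse, illustrating why speed optimization targets conv layers and parameter pruning targets FC layers.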
