Convolutional Neural Network (CNN) Basics Notes

1. Development history of convolutional neural network (CNN)

  • In 1986, Rumelhart, Hinton and others proposed the back-propagation algorithm (Back Propagation, BP)
  • In 1998, LeCun used the BP algorithm to train the LeNet-5 network, marking the true advent of CNNs (though the hardware of the time made the network difficult to train)
  • In 2006, Hinton first proposed the concept of Deep Learning in a Science paper
  • In 2012, Hinton's student Alex Krizhevsky trained a deep learning model on GPUs in his dormitory and won the ILSVRC 2012 computer-vision competition in one fell swoop. On the million-image ImageNet dataset, it far exceeded traditional methods (accuracy rose from roughly 70% to over 80%)

The accompanying videos are from Bilibili, uploader: PILIBALA Wz.

2. Fully connected layer

A fully connected layer is composed of many neurons connected to one another.

  • x1, x2, x3 are the three inputs (excitations) of the neuron, and w1, w2, w3 are the weights corresponding to these inputs.
  • -1 is the bias term of the neuron.

 1. Back Propagation (BP)

The BP algorithm consists of two processes: forward propagation of the signal and back propagation of the error. That is, the error is computed in the "input → output" direction, while the weights and biases are adjusted in the "output → input" direction.

 Example: using a BP neural network for license-plate character recognition

 1. First read a color RGB image; each pixel contains 3 values (the RGB components).

  • First convert it to grayscale (the middle picture), so each pixel value has only one component.
  • Then binarize it to obtain a black-and-white image.

 2. Slide a window of 5 rows and 3 columns over the binarized black-and-white image. At each position, compute the proportion of white pixels among all pixels covered by the window.

  • When the window slides to the far right and there are not enough columns, either zero-pad or, just before the window would cross the boundary, temporarily shrink it to a 5-row, 2-column window.
  • Traversing the entire image this way yields a 5*5 matrix.

 3. Flatten the resulting 5*5 matrix row by row into a row vector (1 row, 25 columns); this row vector serves as the input layer of the neural network.
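The sliding-window feature extraction above can be sketched as follows; the image size (25 rows by 15 columns, so that 5*3 windows tile it exactly) and the non-overlapping stride are assumptions for illustration:

```python
import numpy as np

def window_features(img, win_h=5, win_w=3):
    """Slide a win_h x win_w window over a binary image (0 = black,
    1 = white) and record, at each position, the proportion of white
    pixels among the pixels covered by the window."""
    rows = img.shape[0] // win_h
    cols = img.shape[1] // win_w
    feats = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = img[i*win_h:(i+1)*win_h, j*win_w:(j+1)*win_w]
            feats[i, j] = patch.mean()   # white-pixel proportion
    return feats

binary = np.random.randint(0, 2, (25, 15))  # stand-in for a binarized character
features = window_features(binary)          # the 5*5 matrix from step 2
x = features.reshape(1, -1)                 # the 1x25 row vector from step 3
print(features.shape, x.shape)              # (5, 5) (1, 25)
```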

4. With the input layer established, consider the output layer. One-hot encoding is a commonly used way to encode labels.

  • The picture above shows the one-hot code corresponding to each digit from 0 to 9; no two codes repeat.
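A minimal sketch of one-hot encoding for the digit labels 0-9:

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Label k -> length-10 vector with a 1 at position k and 0 elsewhere;
    no two digit labels share the same code."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(3))   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```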

 5. With input and output defined, the neural network can be trained. In actual training, set the number of input nodes to 25 and the number of output nodes to 10, and choose the hidden layer size according to the actual situation.
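A minimal forward-pass sketch of such a network; the hidden width of 16 and the random weights are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16                                  # free choice, per "actual situation"
W1 = rng.standard_normal((25, hidden)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, 10)) * 0.1
b2 = np.zeros(10)

x = rng.random((1, 25))                      # the 1x25 row vector input
h = np.maximum(0, x @ W1 + b1)               # hidden layer with ReLU
y = h @ W2 + b2                              # 10 output nodes, one per digit
print(y.shape)                               # (1, 10)
```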

2. Convolution layer

The convolutional layer is the network structure unique to convolutional neural networks.

Convolution: a sliding window slides over the feature map and a computation is performed at each position (multiply each value of the convolution kernel with the corresponding value of the feature map and sum the products to obtain one value of the output matrix; each sliding step yields one value, and together they form the convolution result).

 The calculation method of convolution, as shown in the orange box on the left side of the picture above:

(1*1)+(0*0)+(0*1)+(1*0)+(1*1)+(0*0)+(1*1)+(1*0)+(1*1) = 4
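This sliding computation can be sketched in code. The figure itself is not reproduced here, so the 5*5 input and 3*3 kernel below are the widely used textbook example whose sum-of-products matches the arithmetic above (one window position yields 4):

```python
import numpy as np

def conv2d(x, k):
    """Valid cross-correlation (what CNN layers actually compute): slide
    the kernel over the feature map, multiply element-wise, and sum to
    produce one output value per position."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i+kh, j:j+kw] * k).sum()
    return out

feature_map = np.array([[1, 1, 1, 0, 0],
                        [0, 1, 1, 1, 0],
                        [0, 0, 1, 1, 1],
                        [0, 0, 1, 1, 0],
                        [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d(feature_map, kernel))
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```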

The purpose of convolution is to extract image features.

Characteristics of convolution

  • Local perception: computations are performed over a local sliding window on the feature map, so each output value perceives only a local region.
  • Weight sharing: the values of the convolution kernel do not change during sliding, so the same weights are shared across all positions.

1. Advantages of weight sharing (compared with a BP / fully connected network)

  • The parameters here refer to the weights of the neurons.
  • Weight sharing greatly reduces the number of parameters of the convolutional neural network.
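A rough sense of the savings, with made-up illustrative numbers (a 32*32 single-channel input, 1000 outputs, 5*5 kernels):

```python
# Fully connected: every pixel of a 32x32 input connects to each of
# 1000 neurons, one weight per connection.
fc_params = 32 * 32 * 1000
# Convolution: 1000 shared 5x5 kernels, each reused at every position.
conv_params = 1000 * 5 * 5
print(fc_params, conv_params)   # 1024000 25000
```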

2. The process of convolution

In practical applications, convolution is usually performed on multi-channel feature matrices.

The depth of the convolution kernel must match that of the input feature matrix (depth here means the number of channels); both are three-dimensional. The output matrix is obtained by convolving each channel of the three-channel input with the corresponding channel of the three-channel kernel, then summing the per-channel results.
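A sketch of this multi-channel convolution (with the bias addition from the questions below included); the shapes are illustrative:

```python
import numpy as np

def conv2d(x, k):
    # Single-channel valid cross-correlation.
    out = np.empty((x.shape[0] - k.shape[0] + 1, x.shape[1] - k.shape[1] + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i+k.shape[0], j:j+k.shape[1]] * k).sum()
    return out

def conv_layer(x, kernels, biases):
    """x: (C, H, W); kernels: (N, C, kh, kw); biases: (N,).
    Kernel depth C matches the input channel count; per-channel results
    are summed, the bias is added to every element, and the N kernels
    produce an N-channel output."""
    maps = []
    for n in range(kernels.shape[0]):
        m = sum(conv2d(x[c], kernels[n, c]) for c in range(x.shape[0]))
        maps.append(m + biases[n])
    return np.stack(maps)

x = np.random.rand(3, 5, 5)          # three-channel input feature matrix
k = np.random.rand(2, 3, 3, 3)       # two kernels, each of depth 3
out = conv_layer(x, k, np.zeros(2))
print(out.shape)                     # (2, 3, 3): a two-channel output
```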

 3. Summary

1. The number of channels of the convolution kernel equals the number of channels of the input feature layer (three channels in the example).

2. The number of channels of the output feature matrix equals the number of convolution kernels (kernel 1 and kernel 2 together yield a two-channel output feature matrix).

4. Questions to consider

(1) How is the result computed when a bias is added?

Simply add the bias to every element of the final convolution output matrix.

 (2) How is the activation function applied?

Commonly used activation functions:

  • Why use an activation function? To introduce nonlinearity into the otherwise linear computation, giving the network the ability to solve nonlinear problems.
  • The ReLU activation function zeroes out all negative values and leaves positive values unchanged; it is the common choice in practice.
  • When back-propagating errors with the sigmoid activation function, computing the derivative is comparatively troublesome.
  • With ReLU, once a neuron enters the inactive (zero-output) state its weights can no longer be updated. During training it is therefore advisable not to begin with a particularly large learning rate, which can easily deactivate a large number of neurons.
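A minimal sketch of the two activation functions mentioned above:

```python
import numpy as np

def sigmoid(x):
    # Smooth squashing to (0, 1); its derivative is comparatively
    # expensive to work with during backpropagation.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero out negatives, pass positives through unchanged. The gradient
    # is 0 on the negative side, which is what can leave neurons "dead".
    return np.maximum(0, x)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(z))        # [0. 0. 0. 1. 3.]
print(sigmoid(0.0))   # 0.5
```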

(3) What should be done if the kernel would cross the boundary during convolution?

  • Usually, padding is used: zeros are padded directly around the image. After zero padding, the convolution proceeds normally and no out-of-bounds situation occurs.
  • Padding size p: in practice both sides are normally zero-padded symmetrically (left/right, top/bottom). The general output size is N = (W - F + 2P)/S + 1, where W is the input size, F the kernel size, P the padding on each side, and S the stride.
  • In the picture above only one side is padded, so only a single p is added.
  • N = (4 - 3 + 1)/2 + 1 = 2, so a 2*2 feature matrix is finally obtained.
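The size bookkeeping can be checked in code; `output_size` is a hypothetical helper implementing the general formula N = (W - F + 2P)/S + 1:

```python
import numpy as np

def output_size(W, F, P, S):
    # N = (W - F + 2P) / S + 1, with W = input size, F = kernel size,
    # P = zeros padded on each side, S = stride.
    return (W - F + 2 * P) // S + 1

# The note's one-sided case: W = 4, F = 3, one extra column of zeros,
# stride S = 2 -> N = (4 - 3 + 1)/2 + 1 = 2, a 2x2 feature matrix.
print((4 - 3 + 1) // 2 + 1)              # 2
# Symmetric padding can keep the size unchanged ("same" convolution):
print(output_size(W=5, F=3, P=1, S=1))   # 5
# np.pad performs the actual zero padding:
x = np.arange(16).reshape(4, 4)
print(np.pad(x, 1).shape)                # (6, 6): one ring of zeros added
```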

3. Pooling layer

 The purpose of the pooling layer is to sparsify the feature map and reduce the amount of computation.

Max-pooling downsampling (maxpooling): take the maximum value within the region covered by the pooling kernel.

Average-pooling downsampling (averagepooling): take the average value within the region covered by the pooling kernel.

Characteristics of the pooling layer

1. It has no trainable parameters; it only computes a maximum or an average over the feature map.

2. It only changes the width (w) and height (h) of the feature matrix, not the depth (channels).

3. Generally the pooling kernel size (poolsize) equals the stride, so the feature map shrinks by a fixed ratio and computation is simplified (this is the usual case, not an absolute rule).
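A minimal max-pooling sketch with poolsize = stride = 2 (average pooling would just replace `.max()` with `.mean()`):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max-pooling: take the maximum in each size x size window.
    No trainable parameters; width and height shrink, channels don't."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 1, 8]])
print(max_pool(x))   # [[6. 4.]
                     #  [7. 9.]]
```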

4. Calculation of errors

Explained using a three-layer BP neural network.

 Parameters:

  • First layer (input layer): two nodes, x_1 and x_2
  • Second layer (hidden layer): three nodes; \sigma is the activation function, w_{11}^{(1)} and w_{21}^{(1)} are the corresponding weights, and b_1^{(1)} is the bias

Taking the middle node as an example: a_1 = \sigma(x_1 w_{11}^{(1)} + x_2 w_{21}^{(1)} + b_1^{(1)})

The outputs y_1 and y_2 are then found in the same way.

 

1. Calculation process of the Softmax activation function

 Why use the Softmax activation function?

Because we want the outputs y_1 and y_2 to conform to a probability distribution: softmax maps them to o_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}, so every output is positive and all outputs sum to 1.
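A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something required by the formula:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())   # max-subtraction: numerical stability only
    return e / e.sum()        # positive values that sum to 1

y = np.array([2.0, 1.0, 0.1])
p = softmax(y)
print(round(p.sum(), 6))      # 1.0
print(p.argmax())             # 0: the largest input keeps the largest probability
```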

 

2. Cross Entropy Loss (the cross-entropy loss function)

1. For multi-class problems (softmax output, where all output probabilities sum to 1):

H = -\sum_{i} o_{i}^{*} \log(o_{i})

2. For binary / multi-label problems (sigmoid output, where the output nodes are independent of each other):

H = -\frac{1}{N}\sum_{i=1}^{N} \left[ o_{i}^{*} \log o_{i} + (1-o_{i}^{*}) \log(1-o_{i}) \right]

Here o_{i}^{*} is the true (label) value and o_{i} is the predicted output. A Softmax output is consistent with a probability distribution (all probabilities sum to 1); a Sigmoid output is not.

From these formulas we obtain the cross-entropy Loss used for training.
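Both loss formulas can be sketched directly; the variable names are illustrative:

```python
import numpy as np

def cross_entropy_multiclass(o_true, o_pred):
    # H = -sum_i o*_i log(o_i); o_pred is a softmax output, o_true one-hot.
    return -np.sum(o_true * np.log(o_pred))

def cross_entropy_binary(o_true, o_pred):
    # H = -(1/N) sum_i [o*_i log o_i + (1 - o*_i) log(1 - o_i)]
    # for N independent sigmoid outputs.
    return -np.mean(o_true * np.log(o_pred) + (1 - o_true) * np.log(1 - o_pred))

y_true = np.array([0.0, 1.0, 0.0])    # one-hot label
y_pred = np.array([0.1, 0.8, 0.1])    # softmax output, sums to 1
print(round(cross_entropy_multiclass(y_true, y_pred), 4))   # 0.2231
```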

 

5. Back propagation of errors

Compute the error gradient of w_{11}^{(2)} (i.e., the partial derivative of the loss with respect to it) using the chain rule: compute each partial derivative in the yellow box separately.

The values obtained are equivalent to back-propagating the error to every node, yielding the loss gradient at each node.

 

6. Weight update

The expression for updating the weights is simple: w_{t+1} = w_{t} - \alpha \, g_{t}, where \alpha is the learning rate and g_{t} is the gradient. However, we cannot be sure that the gradient direction found is the direction that reduces the loss the fastest.

In actual training, we usually train in batches. After each batch is trained, the Loss and the gradient of that batch are computed. This gradient is optimal for that batch, but not necessarily for the whole dataset. Therefore, to perform gradient updates better under batch training, the concept of the optimizer is introduced.
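A toy sketch of batch SGD on the quadratic loss w^2; the noise term stands in for batch-to-batch gradient variation:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0])
lr = 0.1
for step in range(100):
    g = 2 * w + rng.normal(0, 0.1)   # batch gradient of loss w^2, plus noise
    w = w - lr * g                   # plain SGD step: w <- w - lr * g
print(w)                             # close to the minimum at 0
```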

 

7.  Optimizer

 

Make the following optimization: add a momentum term.

Besides the gradient of the current batch, the accumulated gradient of previous batches is also taken into account:

g_{t+1}^{*} = g_{t+1} + \eta \, g_{t}^{*}

where g_{t+1} is the gradient of this batch, g_{t}^{*} is the accumulated gradient of the previous batch, g_{t+1}^{*} is the gradient actually used for this batch's update, and \eta is the momentum coefficient.

Adding momentum effectively suppresses the interference of sample noise.
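A sketch of the momentum update in the note's notation; the toy loss w^2 and eta = 0.9 are illustrative choices:

```python
import numpy as np

def momentum_step(w, g, g_star, lr=0.1, eta=0.9):
    # g*_(t+1) = g_(t+1) + eta * g*_(t): mix the current batch gradient
    # with the previous update direction, damping batch-to-batch noise.
    g_star = g + eta * g_star
    return w - lr * g_star, g_star

w, g_star = np.array([5.0]), np.zeros(1)
for _ in range(100):
    g = 2 * w                        # gradient of the toy loss w^2
    w, g_star = momentum_step(w, g, g_star)
print(w)                             # near the minimum at 0
```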

 

Adagrad improves the learning rate: squared gradients are accumulated into s_t, and each update is scaled by \frac{\alpha}{\sqrt{s_t + \varepsilon}}.

As batches progress, \frac{\alpha}{\sqrt{s_t + \varepsilon}} keeps getting smaller -> the strength of the updates keeps decreasing, as if the learning rate used in the update were gradually shrinking.

Disadvantage: the learning rate drops too fast, and training may stop before convergence.

RMSProp also adjusts the learning rate and is an improved version of the Adagrad optimizer. Compared with Adagrad, RMSProp adds two coefficients to control the attenuation of the learning rate.
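Sketches of both update rules; the alpha, rho, and eps values are common illustrative defaults, and the toy loss is again w^2:

```python
import numpy as np

def adagrad_step(w, g, s, alpha=0.5, eps=1e-8):
    s = s + g * g                                # squared gradients only grow,
    return w - alpha * g / np.sqrt(s + eps), s   # so alpha/sqrt(s) only shrinks

def rmsprop_step(w, g, s, alpha=0.1, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g * g              # exponential moving average,
    return w - alpha * g / np.sqrt(s + eps), s   # so the decay is set by rho

w, s = np.array([5.0]), np.zeros(1)
for _ in range(100):
    w, s = rmsprop_step(w, 2 * w, s)             # gradient of the toy loss w^2
print(w)                                         # near the minimum at 0
```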

 

 

6. Comparison of the effects of several optimizers 

 

  • Although SGD is slow, its gradient-update direction is ideal.
  • SGD with Momentum has a bad start but quickly finds the right path.
  • Adagrad and RMSProp are both correct in direction and fast.
  • The effect of the Adam optimizer is not shown here.

 

In actual projects, the most commonly used optimizers are:

  1. SGD with Momentum
  2. Adam

Many people choose the Adam optimizer because it performs better, but in papers many authors still use the SGD optimizer with momentum. Which to choose depends on the actual situation.

Origin blog.csdn.net/weixin_45897172/article/details/128345978