Deep learning MNIST handwritten digit recognition: training a convolutional neural network model and building a GUI window for online recognition (basic knowledge)

I. Basic knowledge preparation

1. Convolutional layer (CONV)

As the name implies, the convolutional layer is composed of a set of convolutional units (also known as "convolution kernels"). These units can be understood as filters, and each filter extracts one specific feature.

Understanding the function of the convolutional layer

  • In an image, nearby pixels are strongly correlated while distant pixels are only weakly correlated. Each neuron therefore does not need to perceive the whole image: it only perceives a local region, and higher layers integrate this local information to obtain global information.
  • Given an input image, a convolution kernel scans across it. The numbers in the kernel are called weights. Every position in the image is scanned by the same kernel, so the weights are identical everywhere. This is weight sharing.
  • As weight sharing implies, one convolution kernel extracts only one kind of feature, which may not capture all the features, so we introduce multi-kernel convolution: different kernels learn different weights and extract different features from the original image. Note that within one multi-kernel convolution, all kernels must have the same size.
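The scanning and weight sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation; the image values and the two example kernels are made up for demonstration:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one kernel over the image ("valid" convolution, no padding).
    The same weights are used at every position: this is weight sharing."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the kernel and the local patch, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3, 0],
                  [0, 1, 2, 3],
                  [3, 0, 1, 2],
                  [2, 3, 0, 1]], dtype=float)

# Multi-kernel convolution: each kernel extracts a different feature.
edge_x = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # responds to vertical edges
edge_y = edge_x.T                             # responds to horizontal edges

print(conv2d(image, edge_x))  # one 2x2 feature map per kernel
print(conv2d(image, edge_y))
```

A 3×3 kernel over a 4×4 image yields a 2×2 feature map, and each kernel yields its own map.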

 

The following dynamic picture vividly shows the calculation process of the convolutional layer:

 

2. Pooling layer (POOL) 

 

The filters of the convolutional layer are responsible for finding patterns in the image. More filters mean more parameters, so the output of the convolutional layer can become very high-dimensional. We need a way to reduce the dimensionality; this is the role of the pooling layer (also called the "downsampling layer") in a convolutional network.

Understanding the functions of the pooling layer

  • Reduce the image size, which increases the receptive field. The receptive field is the area of the original image that corresponds to one number in the feature map. Pooling selects one number from a region to represent all the pixel values in that region; some image information is lost, but robustness increases.
  • Increase translation invariance. A small shift of a target in the image should not change the recognition result. Pooling captures whether a feature of the target is present, not its exact location, so translation invariance increases.
  • Improve training speed, because the image size is reduced while the feature information is retained.

There are three main forms of pooling: general pooling, overlapping pooling and pyramid pooling.

(1) General pooling

The pooling window is n*n and is generally square, and the stride equals n, so the pooling windows do not overlap. For windows that extend beyond the input matrix, either only the values inside the matrix are used, or the out-of-range positions are filled with 0. General pooling is divided into maximum pooling and average pooling.

  • Maximum pooling

The maximum value within the pooling window is used as the sampled output value.
For example, if the input is a 4×4 matrix and we apply 2×2 max pooling with stride 2, the execution is very simple: split the 4×4 input into four 2×2 regions (mark them with different colors if it helps). For the 2×2 output, each output element is the maximum value of its corresponding region.

For each 2 * 2 window, the largest number becomes the corresponding element of the output matrix. For example, if the largest number in the first 2 * 2 window of the input matrix is 6, then the first element of the output matrix is 6, and so on.
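The procedure above can be sketched in NumPy as follows (the input matrix is made up for illustration; note the 6 in the first window, matching the example in the text):

```python
import numpy as np

def max_pool(x, n=2):
    """Non-overlapping n*n max pooling (stride = n)."""
    out = np.zeros((x.shape[0] // n, x.shape[1] // n))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The largest number in each n*n window becomes the output element.
            out[i, j] = x[i * n:(i + 1) * n, j * n:(j + 1) * n].max()
    return out

x = np.array([[1, 3, 2, 1],
              [6, 2, 1, 0],
              [3, 4, 5, 2],
              [1, 2, 8, 7]], dtype=float)

# First 2*2 window is [[1, 3], [6, 2]] -> 6.
print(max_pool(x))  # [[6. 2.]
                    #  [4. 8.]]
```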

  • Mean pooling

Ordinary mean pooling uses the average value within the pooling window as the sampled output value. This kind of pooling is not as common as max pooling.
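Mean pooling only differs from max pooling in the reduction applied to each window. A minimal sketch, reusing the same made-up 4×4 input as above:

```python
import numpy as np

def avg_pool(x, n=2):
    """Non-overlapping n*n mean pooling (stride = n)."""
    out = np.zeros((x.shape[0] // n, x.shape[1] // n))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The mean of each n*n window becomes the output element.
            out[i, j] = x[i * n:(i + 1) * n, j * n:(j + 1) * n].mean()
    return out

x = np.array([[1, 3, 2, 1],
              [6, 2, 1, 0],
              [3, 4, 5, 2],
              [1, 2, 8, 7]], dtype=float)

print(avg_pool(x))  # [[3.  1. ]
                    #  [2.5 5.5]]
```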

 

3. Activation layer (RELU)

The activation layer applies a non-linear mapping to the output of the convolutional layer.

 

 

The activation function used in CNNs is generally ReLU (the Rectified Linear Unit). It converges fast and its gradient is simple to compute, but it is relatively fragile (neurons can stop updating if they get stuck in the zero region).

 

Practical advice for the activation layer:
① Don't use sigmoid! Don't use sigmoid! Don't use sigmoid!
② Try ReLU first, because it is fast, but be careful
③ If ② fails, use Leaky ReLU or Maxout
④ In some cases tanh gives good results, but rarely
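ReLU and Leaky ReLU from the advice above are one-liners in NumPy; a minimal sketch (the sample input values are made up):

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x) element-wise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keeps a small slope alpha for negative inputs,
    which mitigates ReLU's fragility (stuck-at-zero neurons)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # negative values are scaled by alpha instead of zeroed
```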

4. Fully connected layer (FC)

Every neuron in one layer is connected by a weight to every neuron in the next layer, and the fully connected layer usually sits at the tail of the convolutional neural network. That is, its neurons are connected in the same way as in a traditional neural network:

 

The general CNN structure is:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

That is, N convolution + ReLU blocks optionally followed by a pooling layer, with this group repeated M times, then K fully connected + ReLU layers and a final fully connected output layer.
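To make the pattern concrete, here is a small helper (names and parameters are my own, purely for illustration) that expands the structure formula into a flat list of layer names:

```python
def cnn_structure(N, M, K, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
    into a flat list of layer names. `pool` toggles the optional POOL."""
    layers = ["INPUT"]
    for _ in range(M):
        layers += ["CONV", "RELU"] * N   # N conv + ReLU blocks
        if pool:
            layers.append("POOL")        # POOL? is optional
    layers += ["FC", "RELU"] * K         # K fully connected + ReLU layers
    layers.append("FC")                  # final output layer
    return layers

print(" -> ".join(cnn_structure(N=1, M=2, K=1)))
# INPUT -> CONV -> RELU -> POOL -> CONV -> RELU -> POOL -> FC -> RELU -> FC
```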

5. The padding formula of the convolutional layer

Convolving the original image directly causes two problems. First, the image (feature map) shrinks after every convolution, so only a limited number of convolutions can be applied. Second, compared with pixels in the middle of the image, pixels at the edge take part in fewer convolution computations, so edge information is easily lost.
To solve this, we can use padding: before each convolution, we add a border of zeros around the image, so that the image after convolution is as large as the original, and the original edge pixels are computed more often.

 

 

For example, if we pad an (8,8) image to (10,10), then after a (3,3) filter it is (8,8) again, unchanged.
This guarantees that the input data and output data have the same spatial size. Assuming the number of zero padding is p, the convolution kernel is f * f, and the kernel sliding stride is s = 1, then p should be set to

p = (f - 1) / 2

 

Formula to calculate the output size of the convolutional layer

Assuming that the original input image is m * m, the output image is n * n, the number of zero padding is p, the convolution kernel is f * f, and the kernel sliding stride is s, the output size is

n = (m + 2p - f) / s + 1 (rounded down)
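Both the output-size formula and the "same" padding formula are easy to check in code; a minimal sketch (function names are my own):

```python
def conv_out_size(m, f, p=0, s=1):
    """Output size of a conv layer: n = (m + 2p - f) / s + 1, floored."""
    return (m + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the size unchanged for stride 1: p = (f - 1) / 2
    (f is usually odd, so p is an integer)."""
    return (f - 1) // 2

# The (8,8) -> padded (10,10) -> (8,8) example from the text:
print(conv_out_size(8, 3, p=same_padding(3)))  # 8
# A 5x5 kernel on a 28x28 MNIST image, no padding, stride 1:
print(conv_out_size(28, 5))                    # 24
```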

Calculate the number of parameters after convolution

Assuming that the input image is (m, m, d), where d is the image depth (number of channels), the convolution kernel is f * f, and the number of convolution kernels is n, then the number of parameters is

weights: f * f * d * n
biases: n
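A quick sketch of the parameter count (the example kernel counts are made up for illustration):

```python
def conv_param_count(d, f, n):
    """Parameters of a conv layer: each of the n kernels has f*f*d weights,
    plus one bias per kernel."""
    weights = f * f * d * n
    biases = n
    return weights + biases

# e.g. 32 kernels of size 3*3 on an RGB image (d = 3):
print(conv_param_count(3, 3, 32))  # 3*3*3*32 + 32 = 896
```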

Formula to calculate the output size of the pooling layer

The pooling layer rarely uses zero padding. Assuming that the original input image is m * m, the output image is n * n, the pooling window is f * f, and the window sliding stride is s, the output size is

n = (m - f) / s + 1
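The pooling output size is the convolution formula with p = 0; a minimal sketch:

```python
def pool_out_size(m, f, s):
    """Pooling output size (no zero padding): n = (m - f) / s + 1, floored."""
    return (m - f) // s + 1

print(pool_out_size(4, 2, 2))   # 2: a 4*4 input with 2*2 windows, stride 2
print(pool_out_size(28, 2, 2))  # 14: 2*2 pooling halves a 28*28 MNIST image
```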

The full code has been organized and will be shared in my next article.

Origin blog.csdn.net/weixin_47440593/article/details/109412527