This is probably the most detailed explanation of the LeNet-5 neural network!

Hello everyone, I am Red Stone!

When talking about deep learning algorithms for computer vision, we have to mention the LeNet-5 network. LeNet-5, proposed by LeCun et al. in 1998, is a very efficient convolutional neural network for handwritten character recognition, introduced in the paper "Gradient-Based Learning Applied to Document Recognition".

Paper link:

http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

1. Network structure

[Figure: the LeNet-5 network architecture]

LeNet-5 is a fairly simple convolutional neural network. The figure above shows its structure: the input two-dimensional image (single channel) first passes through alternating convolutional and pooling layers, then through fully connected layers, and finally reaches the output layer. Overall: input layer -> convolutional layer -> pooling layer -> activation function -> convolutional layer -> pooling layer -> activation function -> convolutional layer -> fully connected layer -> fully connected layer -> output layer.

The entire LeNet-5 network includes a total of 7 layers (excluding the input layer), namely: C1, S2, C3, S4, C5, F6, OUTPUT.
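Before walking through the layers one by one, here is a minimal PyTorch sketch of this 7-layer structure, for orientation only. It takes the usual modern shortcuts rather than the paper's exact formulation: plain average pooling in place of the trainable S2/S4 subsampling described below, a C3 that convolves over all 6 input channels instead of the original partial connection table, and a linear OUTPUT layer instead of the original RBF units.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5)         # 1x32x32 -> 6x28x28
        self.s2 = nn.AvgPool2d(kernel_size=2, stride=2)  # -> 6x14x14
        self.c3 = nn.Conv2d(6, 16, kernel_size=5)        # -> 16x10x10
        self.s4 = nn.AvgPool2d(kernel_size=2, stride=2)  # -> 16x5x5
        self.c5 = nn.Conv2d(16, 120, kernel_size=5)      # -> 120x1x1
        self.f6 = nn.Linear(120, 84)
        self.output = nn.Linear(84, num_classes)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.s2(self.c1(x)))  # C1 -> S2 -> sigmoid
        x = self.act(self.s4(self.c3(x)))  # C3 -> S4 -> sigmoid
        x = self.c5(x).flatten(1)          # C5: a 120-dim vector per sample
        x = self.act(self.f6(x))           # F6
        return self.output(x)              # OUTPUT

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```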

A few preliminary notes:

Layer naming convention:

  • a letter plus a number

  • the letter indicates the layer type: C → convolutional layer, S → subsampling (pooling) layer, F → fully connected layer

  • the number indicates the layer's position in the whole network, not its index among the convolutional (or pooling) layers

Explanation of terms:

  • Parameters → weights w and biases b

  • Number of connections → the number of links between units in adjacent layers

  • Parameter counting: each convolution kernel has one bias b, and the kernel size determines the number of weights w (pay special attention to the number of channels)

2. Input layer (INPUT)

The input layer (INPUT) is a 32x32-pixel image; note that the number of channels is 1.

3. C1 layer

The C1 layer is a convolutional layer, which uses 6 convolution kernels of 5×5 size, padding=0, stride=1, and obtains 6 feature maps of 28×28 size: 32-5+1=28.

[Figure: the C1 convolution]

Number of parameters: (5*5+1)*6=156, where 5*5 is the 25 weights w of each convolution kernel and 1 is the bias term b.

Number of connections: 156*28*28=122304, where 156 is the number of connections for a single convolution position and 28*28 is the size of the output feature map; each output pixel comes from one convolution, i.e. 28*28 convolutions in total.
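As a quick sanity check, the C1 figures above can be reproduced with a standard PyTorch convolution (a sketch under the settings just stated):

```python
import torch
import torch.nn as nn

c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)  # padding=0, stride=1 by default
print(sum(p.numel() for p in c1.parameters()))  # 156 = (5*5 + 1) * 6 parameters
print(c1(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 6, 28, 28])
print(156 * 28 * 28)                            # 122304 connections
```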

4. S2 layer

The S2 layer is a downsampling layer, which uses 6 pooling kernels of 2×2 size, padding=0, stride=2, and obtains 6 feature maps of 14×14 size: 28/2=14.

716bd9306e183202277b1373af1486d8.png

The S2 layer is effectively a downsampling layer plus an activation layer: first downsample, then apply the sigmoid nonlinearity. Concretely, it sums each 2x2 receptive field of the C1 layer, scales the sum by a trainable weight, adds a bias, and feeds the result into the activation function, namely:

$y = \mathrm{sigmoid}\left( w \cdot (x_1 + x_2 + x_3 + x_4) + b \right)$

Number of parameters: (1+1)*6=12, where the first 1 is the weight w that multiplies the sum over the 2*2 receptive field, and the second 1 is the bias b.

Number of connections: (2*2+1)*6*14*14=5880. Although only the sum over each 2*2 receptive field is taken, there are still 2*2 input connections plus 1 bias connection; 14*14 is the size of the output feature map, and each output pixel comes from one pooling operation, i.e. 14*14 operations in total.
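PyTorch has no built-in layer for this trainable subsampling, so here is one possible sketch of it, assuming (as described above) one coefficient w and one bias b per feature map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Subsample(nn.Module):
    """Sum each 2x2 window, scale by a trainable w, add a trainable b, apply sigmoid."""
    def __init__(self, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(1, channels, 1, 1))   # one coefficient per map
        self.b = nn.Parameter(torch.zeros(1, channels, 1, 1))  # one bias per map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        summed = F.avg_pool2d(x, kernel_size=2, stride=2) * 4  # average * 4 = sum of the 2x2 window
        return torch.sigmoid(self.w * summed + self.b)

s2 = Subsample(6)
print(s2(torch.randn(1, 6, 28, 28)).shape)      # torch.Size([1, 6, 14, 14])
print(sum(p.numel() for p in s2.parameters()))  # 12 = (1 + 1) * 6
```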

5. C3 layer

The C3 layer is a convolutional layer, which uses 16 convolution kernels of 5×5×n size, padding=0, stride=1, and obtains 16 feature maps of 10×10 size: 14-5+1=10.

Not every one of the 16 convolution kernels is convolved with all 6 channel layers of S2. As shown in the figure below, the first 6 feature maps of C3 (0, 1, 2, 3, 4, 5) each take 3 adjacent feature maps of S2 as input, so the corresponding kernel size is 5x5x3; the next 6 feature maps (6, 7, 8, 9, 10, 11) each take 4 adjacent feature maps of S2 as input, with kernel size 5x5x4; the following 3 feature maps (12, 13, 14) each take 4 non-adjacent feature maps of S2 as input, again with kernel size 5x5x4; the last feature map (15) takes all 6 feature maps of S2 as input, with kernel size 5x5x6.

[Figure: the C3–S2 connection table]

It is worth noting that a 5×5 kernel with 3 channels has a different set of 5×5 weights for each channel, which is why 5*5 is multiplied by 3, 4, or 6 in the calculation below. This is how multi-channel convolution works.

[Figure: multi-channel convolution]

Number of parameters: (5*5*3+1)*6 + (5*5*4+1)*6 + (5*5*4+1)*3 + (5*5*6+1)*1 = 1516.

Number of connections: 1516*10*10 = 151600. 10*10 is the size of the output feature map, and each output pixel comes from one convolution, i.e. 10*10 convolutions in total.
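The C3 parameter and connection counts follow directly from the four kernel groups above; a few lines of Python reproduce them:

```python
# (input channels per kernel, number of such kernels), per the grouping above
groups = [(3, 6), (4, 6), (4, 3), (6, 1)]
params = sum((5 * 5 * c + 1) * n for c, n in groups)
print(params)            # 1516
print(params * 10 * 10)  # 151600 connections
```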

6. S4 layer

The S4 layer is a downsampling layer like S2, which uses 16 pooling kernels of 2×2 size, padding=0, stride=2, and obtains 16 feature maps of 5×5 size: 10/2=5.

Number of parameters : (1+1)*16=32.

Number of connections : (2*2+1)*16*5*5= 2000.

7. C5 layer

The C5 layer is a convolutional layer, which uses 120 convolution kernels of 5×5×16 size, padding=0, stride=1, and obtains 120 feature maps of 1×1 size: 5-5+1=1. It is therefore equivalent to a fully connected layer with 120 neurons.

It is worth noting that, unlike the C3 layer, each of the 120 convolution kernels here is convolved with all 16 channel layers of S4.

Number of parameters : (5*5*16+1)*120=48120.

Number of connections : 48120*1*1=48120.
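In PyTorch terms, C5 is an ordinary convolution whose kernel exactly covers its 5x5x16 input, which is why it behaves like a fully connected layer of 120 neurons (a quick sketch):

```python
import torch
import torch.nn as nn

c5 = nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5)
print(c5(torch.randn(1, 16, 5, 5)).shape)       # torch.Size([1, 120, 1, 1])
print(sum(p.numel() for p in c5.parameters()))  # 48120 = (5*5*16 + 1) * 120
```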

8. F6 layer

F6 is a fully connected layer with 84 neurons, fully connected to the C5 layer: each neuron is connected to all 120 feature maps of C5. Each neuron computes the dot product between its input vector and weight vector, adds a bias, and outputs the result through the sigmoid function.

The F6 layer has 84 nodes, corresponding to a 7x12 bitmap in which -1 means white and 1 means black, so the black-and-white pattern of each symbol's bitmap corresponds to a code. The number of trainable parameters and connections in this layer is (120+1)x84=10164. The ASCII bitmaps are shown below:

[Figure: the 7x12 ASCII character bitmaps]

Number of parameters : (120+1)*84=10164.

Number of connections : (120+1)*84=10164.
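F6 maps directly onto a linear layer followed by a sigmoid; a minimal sketch that matches the counts above:

```python
import torch
import torch.nn as nn

f6 = nn.Linear(120, 84)                         # dot product with weights + bias
y = torch.sigmoid(f6(torch.randn(1, 120)))      # sigmoid output, shape (1, 84)
print(sum(p.numel() for p in f6.parameters()))  # 10164 = (120 + 1) * 84
```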

9. OUTPUT layer

The final OUTPUT layer is also a fully connected layer ("Gaussian Connections"). It uses RBF units (radial basis functions based on Euclidean distance) to compute the Euclidean distance between the input vector and a parameter vector (nowadays this is usually replaced by Softmax).

The OUTPUT layer has 10 nodes in total, representing the digits 0 through 9. Let x be the input from the previous layer and y be the RBF output; the RBF output is computed as:

$y_i = \sum_{j=0}^{83} (x_j - w_{ij})^2$

In the formula above, i ranges from 0 to 9, j ranges from 0 to 7*12-1 = 83, and w is a parameter. The closer the RBF output y_i is to 0, the closer the input is to the ASCII bitmap of i, meaning the network recognizes the current input as the character i.

The figure below shows the recognition process of the number 3:

[Figure: the recognition process for the digit 3]

Number of parameters : 84*10=840.

Number of connections : 84*10=840.
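Here is a minimal sketch of this RBF output, assuming a 10x84 parameter matrix w; in the paper these rows encode the 7x12 ASCII bitmaps, but random values are used below purely for illustration:

```python
import torch

w = torch.randn(10, 84)  # one 84-dim parameter vector per digit (ASCII bitmap codes in the paper)
x = torch.randn(1, 84)   # output of F6
y = ((x.unsqueeze(1) - w) ** 2).sum(dim=2)  # y_i = sum_j (x_j - w_ij)^2, shape (1, 10)
print(y.argmin(dim=1))   # smallest distance -> recognized digit
```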

10. Visualization links

http://yann.lecun.com/exdb/lenet/a35.html

http://scs.ryerson.ca/~aharley/vis/conv/flat.html

http://scs.ryerson.ca/~aharley/vis/conv/

Summary

LeNet-5 still differs from today's general-purpose convolutional neural networks in some structural details. For example, LeNet-5 uses the sigmoid activation function, whereas modern networks generally use tanh, ReLU, or Leaky ReLU; its pooling layers are also handled differently from today's; and the final multi-class output layer nowadays generally uses softmax, unlike LeNet-5.

LeNet-5 is a very efficient convolutional neural network for handwritten character recognition. A CNN can learn an effective representation of the raw image, which allows it to recognize visual patterns directly from raw pixels with minimal preprocessing. However, due to the lack of large-scale training data and insufficient computing power at the time, LeNet-5 did not perform well on more complex problems.

Finally, Red Stone has prepared LeCun's 46-page LeNet-5 paper "Gradient-Based Learning Applied to Document Recognition" for everyone. If you need it, reply [lenet5] in the background of this official account to get it!

In the next article, I will reproduce the LeNet-5 network in PyTorch and walk through a complete demo. See you next time!



