Articles and code are archived in the [GitHub repository: https://github.com/timerring/dive-into-AI ]; alternatively, reply "Neural Network Basics" to the public account [AIShareLab] to obtain them.
CNN
History of Convolutional Neural Networks
Convolutional neural networks (CNN)
CNN is a neural network architecture proposed for tasks in the image domain. After several generations of development, CNNs have dominated most image tasks since 2012, such as image classification, image segmentation, object detection, and image retrieval.
The CNN structure was inspired by the visual system. In 1962, biologists Torsten Wiesel and David H. Hubel (winners of the 1981 Nobel Prize in Physiology or Medicine) studied the cat's visual system and discovered for the first time that it has a hierarchical structure. They also identified two important cell types, simple cells and complex cells; different cell types handle visual perception at different levels of abstraction.
cat visual system experiment
- Opened a 3 mm hole in the cat's skull and inserted electrodes
- Showed the cat light bars of various shapes, positions, brightnesses, and movements
- Observed the activation of visual neurons in the brain
Neurons have a local receptive field; that is, each neuron responds only to stimuli within a limited region of the visual field.
Receptive fields differ between cells: for example, C cells and D cells respond in opposite regions (in the figure, X marks a response and a triangle marks no response).
Cells are selective for orientation: as shown, the cell responds most strongly to vertical light bars.
Cells are selective for the direction of movement (as shown in the figure, direction a elicits the stronger response).
Inspiration for CNNs:
- The visual system processes information hierarchically, from low-level features to high-level abstractions → modeled by stacking convolution and pooling layers
- Neurons have local receptive fields, i.e. they are locally sensitive → modeled by locally connected neurons
The first prototype of the convolutional neural network - Neocognitron
In 1980, the Japanese scholar Kunihiko Fukushima, drawing on the conclusions of the cat visual-system experiments, proposed a hierarchically structured neural network, the Neocognitron, which stacks two structures analogous to S cells and C cells. These can be compared to the convolution and pooling layers of modern CNNs.
Disadvantage: there was no backpropagation algorithm to update the weights, so model performance was limited.
Kunihiko Fukushima homepage: http://personalpage.flsi.or.jp/fukushima/index-e.html
The first large-scale commercial convolutional neural network - LeNet-5
In 1989, LeCun et al. began studying LeNet; in 1998, they proposed LeNet-5, which was successfully applied at scale to handwritten ZIP code recognition in the US Postal Service.
Disadvantage: it lacked large datasets and high-performance computing resources.
The first stunning convolutional neural network - AlexNet
In 2012, AlexNet won the ILSVRC classification task championship by a margin of 10.9 percentage points over the second-place entry, opening the era in which convolutional neural networks dominate image tasks.
- Data: ImageNet
- Computing power: GPU (GTX580 * 2)
- Algorithm: AlexNet
Convolution Operation
Convolutional Layer
Properties of image features:
- Features are local: for example, the tiger's distinctive feature, the "king" (王) pattern, appears only in the head region → the convolution kernel connects only a K×K region at a time, where K×K is the kernel size;
- Features may appear anywhere → the kernel's parameters are reused (parameter sharing) as it slides over the image (example image source: https://github.com/vdumoulin/conv_arithmetic)
0×0 + 1×1 + 3×2 + 4×3 = 19
- Downsampling an image does not change its subject
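The sliding-window computation above (0×0 + 1×1 + 3×2 + 4×3 = 19) can be sketched in plain Python. The 4×4 input and 2×2 kernel below are assumed values, chosen so that the first window reproduces that arithmetic:

```python
def conv2d(image, kernel):
    """Slide a k x k kernel over a 2-D image (stride 1, no padding)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]

# Assumed toy values: the top-left 2x2 window of `image` is [[0, 1], [3, 4]].
image = [[0, 1, 2, 3],
         [3, 4, 5, 6],
         [6, 7, 8, 9],
         [9, 10, 11, 12]]
kernel = [[0, 1],
          [2, 3]]
feature_map = conv2d(image, kernel)
print(feature_map[0][0])  # 0*0 + 1*1 + 3*2 + 4*3 = 19
```

Every output element reuses the same four kernel weights (parameter sharing), and each element depends only on a local 2×2 window (local connectivity).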
Convolution Kernel
Convolution kernel: an operator with learnable parameters, used to extract features from the input image; the output is usually called a feature map.
The process can be illustrated with the first edge-detection kernel shown. In an image region where neighboring pixels differ little, the convolution computes eight times the center pixel minus the eight surrounding pixels, giving a value near 0, which is displayed as black. Where there is a clear edge, the value remains large after the subtraction and is displayed as white, so the edge outline emerges.
In 2012, visualizations of AlexNet's first convolutional layer showed kernels exhibiting characteristic patterns of edges, frequencies, and colors.
Padding : Add extra rows/columns around the input image
Effects:
- Keeps the image resolution unchanged after convolution and makes changes in feature-map size easy to compute
- Compensates for "lost" boundary information
Stride: the number of rows/columns the convolution kernel moves per step. It controls the size of the output feature map, which shrinks by roughly a factor of 1/s.
The output size is rounded down: border positions that cannot fit a full kernel step are discarded (even if the border contains information, positions that do not satisfy the stride are dropped).
Output feature map size calculation:
$\mathrm{F}_{\mathrm{o}}=\left\lfloor\frac{\mathrm{F}_{\text{in}}-\mathrm{k}+2\mathrm{p}}{\mathrm{s}}\right\rfloor+1$
$\left\lfloor\frac{4-3+2\times 0}{1}\right\rfloor+1=2$
$\left\lfloor\frac{6-3+2\times 1}{2}\right\rfloor+1=3$
$\left\lfloor\frac{5-3+2\times 1}{1}\right\rfloor+1=5$
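The formula and the three examples above can be checked with a small helper (a sketch; the floor accounts for border positions discarded by the stride):

```python
import math

def conv_output_size(f_in, k, p=0, s=1):
    """Feature-map size: floor((F_in - k + 2p) / s) + 1."""
    return math.floor((f_in - k + 2 * p) / s) + 1

print(conv_output_size(4, 3, p=0, s=1))  # 2
print(conv_output_size(6, 3, p=1, s=2))  # 3  (5 / 2 is floored to 2)
print(conv_output_size(5, 3, p=1, s=1))  # 5  (p=1 keeps the resolution)
```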
Multi-channel convolution: an RGB image is three-dimensional data of shape 3×h×w; the first dimension, 3, is the number of channels.
Accordingly, a convolution kernel is a 3-D tensor whose first dimension matches the number of input channels.
Note: The convolution kernel size usually refers to height and width
As above, the full kernel tensor has size 2×3×3×3 (output channels × input channels × height × width). The operation is still essentially a two-dimensional convolution: each kernel slides over the two spatial dimensions and sums across the input channels.
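A sketch of the multi-channel case with assumed toy shapes: 2 output channels, 3 input channels, 3×3 kernels, i.e. a weight tensor of size 2×3×3×3. Each output channel convolves every input channel in 2-D and sums the results:

```python
def multichannel_conv(image, weights):
    """image: (C_in, H, W) nested lists; weights: (C_out, C_in, k, k)."""
    c_in, k = len(image), len(weights[0][0])
    h, w = len(image[0]), len(image[0][0])
    # One 2-D output plane per output-channel kernel, summed across channels.
    return [[[sum(image[c][i + di][j + dj] * kern[c][di][dj]
                  for c in range(c_in)
                  for di in range(k) for dj in range(k))
              for j in range(w - k + 1)]
             for i in range(h - k + 1)]
            for kern in weights]

# Assumed all-ones data: a 3x4x4 "RGB" image and a 2x3x3x3 kernel tensor.
image = [[[1] * 4 for _ in range(4)] for _ in range(3)]
weights = [[[[1] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]

out = multichannel_conv(image, weights)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2 -> (C_out, H_out, W_out)
print(out[0][0][0])  # 27: 3 channels x 9 ones, summed across channels
```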
Pooling Operation
Properties of image features:
- Downsampling an image does not change its subject → pooling reduces computation and feature redundancy
Pooling: one pixel represents the pixel values of a whole region, reducing the image resolution.
How a region of pixels is replaced by a single pixel:
- Method 1: Max Pooling, take the maximum value
- Method 2: Average Pooling, taking the average
Many current models rarely use pooling, instead replacing it with a stride-2 convolution, which also reduces the image resolution. (Pooling can itself be understood as a special convolution: max pooling as a kernel with weight 1 at the maximum position and 0 elsewhere, and average pooling as a kernel whose weights all equal 1/(k×k).)
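The two pooling methods can be sketched as follows (no learnable parameters; each k×k window collapses to a single value; the input values are assumed toy data):

```python
def pool2d(image, k=2, s=2, mode="max"):
    """Replace each k x k window with one value (max or average)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - k + 1, s):
        row = []
        for j in range(0, w - k + 1, s):
            window = [image[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

image = [[1, 3, 2, 0],
         [4, 6, 5, 7],
         [8, 2, 1, 3],
         [0, 9, 4, 2]]
print(pool2d(image, mode="max"))  # [[6, 7], [9, 4]]
print(pool2d(image, mode="avg"))  # [[3.5, 3.5], [4.75, 2.5]]
```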
Therefore, the output size is computed just as for the convolution operation (note: the pooling layer has no learnable parameters):
$\mathrm{F}_{\mathrm{o}}=\left\lfloor\frac{\mathrm{F}_{\text{in}}-\mathrm{k}+2\mathrm{p}}{\mathrm{s}}\right\rfloor+1$
Pooling effects:
- Alleviates the convolutional layer's over-sensitivity to position. In the reference figure, the first row is the original matrix, the second the matrix after convolution, and the third the matrix after pooling; comparing left and right shows that a perturbation changes the convolution result but leaves the pooling result unchanged. Reference: https://zhuanlan.zhihu.com/p/103350961
- Reduces feature redundancy
- Reduces the image resolution, thereby reducing the number of parameters
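A one-dimensional sketch of the position-tolerance argument: shifting a feature by one pixel changes where the activation sits, but max pooling over a window that still contains it gives the same output (assumed toy values):

```python
def maxpool1d(xs, k=2, s=2):
    """Max pooling over non-overlapping windows of size k."""
    return [max(xs[i:i + k]) for i in range(0, len(xs) - k + 1, s)]

a = [9, 0, 0, 0]   # feature response at position 0
b = [0, 9, 0, 0]   # the same feature shifted by one pixel
print(maxpool1d(a), maxpool1d(b))  # [9, 0] [9, 0] -> pooled outputs agree
```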
LeNet-5 and the history of CNN structure evolution
1998-Lecun-Gradient-Based Learning Applied to Document Recognition
Feature extractors: C1, S2, C3, S4
- C1 layer: convolution kernels K1 = (6, 1, 5, 5), p=0, s=1, output = (6, 28, 28)
- S2 layer: pooling (subsampling) layer, window = (2, 2), s=2, output = (6, 14, 14)
- C3 layer: convolution kernels K3 = (16, 6, 5, 5), p=0, s=1, output = (16, 10, 10)
- S4 layer: pooling (subsampling) layer, window = (2, 2), s=2, output = (16, 5, 5)
Classifier: 3 FC layers
- Three FC layers map the extracted features to the final classification output
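The layer shapes above can be traced with the output-size formula; this sketch assumes the paper's 32×32 input with unpadded (p=0, s=1) convolutions and 2×2, stride-2 pooling:

```python
def out_size(f, k, p=0, s=1):
    """floor((F_in - k + 2p) / s) + 1"""
    return (f - k + 2 * p) // s + 1

# C1: 5x5 conv | S2: 2x2 pool, s=2 | C3: 5x5 conv | S4: 2x2 pool, s=2
sizes = [32]                                  # assumed 32x32 input image
for k, s in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    sizes.append(out_size(sizes[-1], k, s=s))
print(sizes)  # [32, 28, 14, 10, 5] -> spatial sizes of C1, S2, C3, S4
```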
CNN Evolution History
- 1980 Neocognitron, Kunihiko Fukushima
- 1998 LeNet-5, Yann LeCun
- 2012 AlexNet, Alex Krizhevsky
- 2014 GoogLeNet, Google
- 2014 VGG-Net, Oxford VGG group
- 2015 ResNet Kaiming He
- 2017 DenseNet Gao Huang
- 2017 SE-Net Jie Hu
Reference
Source of all convolution example images: https://github.com/vdumoulin/conv_arithmetic