The basics of getting started with convolutional neural networks

Articles and code are archived in the GitHub repository: https://github.com/timerring/dive-into-AI. They can also be obtained from the public account [AIShareLab] by replying "Neural Network Basics".

CNN

History of Convolutional Neural Networks

Convolutional neural networks (CNN)
CNN is a neural network proposed for tasks in the image domain. After several generations of development, most image tasks have been dominated by CNNs since 2012, such as image classification, image segmentation, object detection, and image retrieval.

The CNN structure is inspired by the visual system: in 1962, the biologists Torsten Wiesel and David H. Hubel (winners of the 1981 Nobel Prize in Physiology or Medicine) studied the cat's visual system and discovered for the first time that it has a hierarchical structure. They also found two important types of cells, simple cells and complex cells; different types of cells undertake visual perception at different levels of abstraction.

cat visual system experiment

  1. Open a 3 mm hole in the cat's skull and insert electrodes
  2. Show the cat light bars of various shapes, positions, brightnesses, and movements
  3. Observe the activation of visual neurons in the brain

Neurons have a local receptive field: each neuron responds only to stimuli within a limited region of the visual field.

Cells differ in their receptive fields: for example, C cells and D cells are opposite (in the figure, X indicates a response and a triangle indicates no response)


Cells are selective for orientation: as shown, this cell responds most strongly to vertical light bars.


Cells are selective for the direction of movement (as shown in the figure, the cell is more sensitive to movement in direction a)

Inspiration for CNN:

  1. The visual system processes information hierarchically, from low-level features to high-level abstractions → mirrored in CNNs by stacking convolution and pooling layers
  2. Neurons have local receptive fields, i.e. they are locally sensitive → mirrored by the local connections of neurons

The first prototype of convolutional neural network - Neocognitron

In 1980, the Japanese scholar Kunihiko Fukushima drew on the conclusions of the cat visual system experiments and proposed a neural network with a hierarchical structure, the Neocognitron, which stacks two structures analogous to S cells and C cells. S cells and C cells can be compared to the convolution and pooling layers of modern CNNs.

Disadvantage: there was no backpropagation algorithm to update the weights, so model performance was limited.

Kunihiko Fukushima homepage: http://personalpage.flsi.or.jp/fukushima/index-e.html

The first large-scale commercial convolutional neural network - LeNet-5

In 1989, LeCun et al. began to study LeNet; in 1998, they proposed LeNet-5, which was successfully applied at scale to handwritten ZIP code recognition in the US postal system.

Disadvantage: large-scale data and high-performance computing resources were not yet available.

The first stunning convolutional neural network - AlexNet

In 2012, AlexNet won the ILSVRC classification task championship with a score 10.9 percentage points higher than the runner-up, opening the era in which convolutional neural networks dominated image tasks.

  • Data: ImageNet
  • Computing power: GPU (GTX580 * 2)
  • Algorithm: AlexNet

convolution operation

Convolutional Layer

Image recognition features:

  • Features are local: for example, an important feature of a tiger, the "王" (king) pattern on its forehead, appears only in the head region - so the convolution kernel connects only a K×K region at a time, where K×K is the kernel size;

  • Features may appear anywhere - so the kernel's parameters are reused (parameter sharing) as it slides over the image (example image source: https://github.com/vdumoulin/conv_arithmetic)


One sliding step of the kernel: 0×0 + 1×1 + 3×2 + 4×3 = 19

  • The image can be downsampled without changing the target it contains
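The sliding-window computation above can be sketched in a few lines of NumPy. The 3×3 input and 2×2 kernel values are illustrative choices matching the conv_arithmetic-style example, where the first window gives 0×0 + 1×1 + 3×2 + 4×3 = 19:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image (local connection +
    parameter sharing) and take a weighted sum at every position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(9).reshape(3, 3)    # [[0,1,2],[3,4,5],[6,7,8]]
kernel = np.arange(4).reshape(2, 2)   # [[0,1],[2,3]]
out = conv2d(image, kernel)
print(out)  # first window: 0*0 + 1*1 + 3*2 + 4*3 = 19
```

Every entry of the 2×2 output (the feature map) is produced by the same four kernel weights, which is exactly the parameter sharing described above.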

convolution kernel

Convolution kernel: an operator with learnable parameters used to extract features from an input image; the output is usually called a feature map.

The process can be illustrated with an edge-detection kernel whose center weight is 8 and whose eight surrounding weights are −1. In a region where neighboring pixels differ little, the convolution computes roughly 8 times the center pixel minus the eight surrounding pixels, which is close to 0 and is displayed as black. Where there is a clear edge, the difference remains large after the subtraction and is displayed as white, so the outline of the edge emerges.
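A small NumPy illustration of this behavior; the 8/−1 kernel weights and the two test images are hypothetical choices, not taken from any trained network:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive valid convolution, used only for this illustration."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Center weight 8, surrounding weights -1 (the weights sum to 0)
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

flat = np.full((5, 5), 7.0)             # uniform region
step = np.zeros((5, 5))
step[:, 2:] = 10.0                      # vertical edge at column 2

print(conv2d(flat, edge_kernel))  # all zeros -> displayed black
print(conv2d(step, edge_kernel))  # large values along the edge -> white
```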

In 2012, the first convolutional layer of AlexNet was visualized; its kernels show characteristic patterns of edges, frequencies, and colors.

Padding : Add extra rows/columns around the input image

effect:

  • Keep the resolution of the image unchanged after convolution, making it easier to track feature-map size changes
  • Compensate for "lost" boundary information
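A short sketch of padding using NumPy's `np.pad`; the 5×5 input values are an arbitrary example:

```python
import numpy as np

x = np.arange(1, 26).reshape(5, 5)
# p = 1: add one extra row/column of zeros on every side
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(x.shape, "->", x_padded.shape)  # (5, 5) -> (7, 7)
# A 3x3 kernel with s=1 on the padded input gives (7 - 3)/1 + 1 = 5,
# so the output resolution matches the original 5x5 image.
```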

Stride: the number of rows/columns the convolution kernel moves per slide. It controls the output feature-map size, which shrinks roughly by a factor of 1/s.

The output size is rounded down: if the remaining border pixels are not enough for a full kernel step, they are discarded (even if they contain edge information).

Output feature map size calculation:
$\mathrm{F}_{\mathrm{o}}=\left\lfloor\dfrac{\mathrm{F}_{\text{in}}-\mathrm{k}+2\mathrm{p}}{\mathrm{s}}\right\rfloor+1$


$\left\lfloor\dfrac{4-3+2\times 0}{1}\right\rfloor+1=2$

$\left\lfloor\dfrac{6-3+2\times 1}{2}\right\rfloor+1=3$

$\left\lfloor\dfrac{5-3+2\times 1}{1}\right\rfloor+1=5$
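The three examples above can be checked with a small helper function (the name `conv_output_size` is my own):

```python
import math

def conv_output_size(f_in, k, p, s):
    """Feature-map size after convolution: floor((F_in - k + 2p) / s) + 1.
    Border pixels that cannot fit a full stride are discarded (floor)."""
    return math.floor((f_in - k + 2 * p) / s) + 1

print(conv_output_size(4, 3, 0, 1))  # 2
print(conv_output_size(6, 3, 1, 2))  # 3
print(conv_output_size(5, 3, 1, 1))  # 5
```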

Multi-channel convolution: an RGB image is three-dimensional data of shape 3×h×w; the first dimension (3) is the number of channels.

A convolution kernel is a 3-D tensor whose first dimension matches the number of input channels.

Note: The convolution kernel size usually refers to height and width

As above, the kernel tensor has size 2×3×3×3: 2 output channels, 3 input channels, and a 3×3 spatial kernel. It is still essentially a two-dimensional convolution, because the kernel slides only along the height and width.
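A minimal NumPy sketch of multi-channel convolution; the function name and the all-ones input/weights are illustrative choices:

```python
import numpy as np

def conv2d_multichannel(x, w):
    """x: (C_in, H, W) input; w: (C_out, C_in, k, k) kernels.
    Each output channel sums the per-channel correlations over all
    input channels -- the kernel slides only along H and W, which is
    why this still counts as a 2-D convolution."""
    c_out, c_in, k, _ = w.shape
    _, h, wid = x.shape
    oh, ow = h - k + 1, wid - k + 1
    out = np.zeros((c_out, oh, ow))
    for o in range(c_out):
        for i in range(oh):
            for j in range(ow):
                out[o, i, j] = np.sum(x[:, i:i+k, j:j+k] * w[o])
    return out

x = np.ones((3, 5, 5))     # e.g. a 5x5 RGB patch
w = np.ones((2, 3, 3, 3))  # 2 kernels, each of size 3x3x3
out = conv2d_multichannel(x, w)
print(out.shape)  # (2, 3, 3); every value = 3*3*3 = 27
```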

pooling operation

Image recognition features

  • The image can be downsampled without changing the target it contains - reducing computation and feature redundancy

Pooling: one pixel represents the pixel values of a region, reducing the image resolution.

How a region of pixels is replaced by a single pixel:

  • Method 1: Max Pooling, take the maximum value
  • Method 2: Average Pooling, taking the average
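Both methods as a minimal sketch (the function name `pool2d` and the 4×4 input are my own choices):

```python
import numpy as np

def pool2d(x, k=2, s=2, mode="max"):
    """Replace each k x k window (moved with stride s) by one value."""
    h, w = x.shape
    oh, ow = (h - k) // s + 1, (w - k) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i*s:i*s+k, j*s:j*s+k]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]])
print(pool2d(x, mode="max"))      # [[4, 8], [12, 16]]
print(pool2d(x, mode="average"))  # [[2.5, 6.5], [10.5, 14.5]]
```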

Many current models rarely use pooling, instead replacing it with a stride-2 convolution, which also reduces the image resolution. (Pooling can be understood as a special kind of convolution: max pooling corresponds to a kernel whose weight is 1 at the maximum position and 0 elsewhere, and average pooling corresponds to a kernel whose weights are all equal to the average, i.e. 1/k².)

Therefore, the output size calculation is similar to the convolution operation: (Note: the pooling layer has no learnable parameters)
$\mathrm{F}_{\mathrm{o}}=\left\lfloor\dfrac{\mathrm{F}_{\text{in}}-\mathrm{k}+2\mathrm{p}}{\mathrm{s}}\right\rfloor+1$

Effects of pooling:

  1. Alleviate the over-sensitivity of convolutional layers to position

    The first row is the original matrix, the second row the matrix after convolution, and the third row the matrix after pooling. Comparing left and right shows that the convolution result changes after a disturbance is added, but the pooling result does not. Reference: https://zhuanlan.zhihu.com/p/103350961

  2. reduce redundancy

  3. Reduce the image resolution, thereby reducing the number of parameters

Lenet-5 and CNN structure evolution history

1998-Lecun-Gradient-Based Learning Applied to Document Recognition

Feature extractors: C1, S2, C3, S4

  • C1 layer: convolution kernel K1=(6, 1, 5, 5), p=0, s=1, output=(6, 28, 28)
  • S2 layer: max pooling layer, pooling window=(2, 2), s=2, output=(6, 14, 14)
  • C3 layer: convolution kernel K3=(16, 6, 5, 5), p=0, s=1, output=(16, 10, 10)
  • S4 layer: max pooling layer, pooling window=(2, 2), s=2, output=(16, 5, 5)

Classifier: 3 FC layers

  • FC layers: three fully-connected layers produce the final classification output
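The feature-extractor shapes above can be traced with the output-size formula. This sketch assumes a 1×32×32 input and no padding (p=0), the standard LeNet-5 setup:

```python
import math

def conv_out(f, k, p, s):
    """floor((F_in - k + 2p) / s) + 1"""
    return math.floor((f - k + 2 * p) / s) + 1

f = 32
f = conv_out(f, k=5, p=0, s=1)   # C1: 6 kernels of 1x5x5
c1 = (6, f, f)
f = conv_out(f, k=2, p=0, s=2)   # S2: 2x2 pooling, stride 2
s2 = (6, f, f)
f = conv_out(f, k=5, p=0, s=1)   # C3: 16 kernels of 6x5x5
c3 = (16, f, f)
f = conv_out(f, k=2, p=0, s=2)   # S4: 2x2 pooling, stride 2
s4 = (16, f, f)
print(c1, s2, c3, s4)  # (6, 28, 28) (6, 14, 14) (16, 10, 10) (16, 5, 5)
```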

CNN Evolution History

  1. 1980 Neocognitron Kunihiko Fukushima
  2. 1998 LeNet-5 LeCun
  3. 2012 AlexNet Alex Krizhevsky
  4. 2014 GoogLenet Google
  5. 2014 VGG-Net VGG
  6. 2015 ResNet Kaiming He
  7. 2017 DenseNet Gao Huang
  8. 2017 SE-Net Jie Hu

Reference

Source of all convolution example images: https://github.com/vdumoulin/conv_arithmetic

Origin blog.csdn.net/m0_52316372/article/details/131451806