[In-depth and simple study notes] Li Mu's "Hands-on Learning Deep Learning 2.0" Basics of Convolutional Neural Networks

Fundamentals of Convolutional Neural Networks 05.15-05.22

This article collects my notes from Li Mu's online course "Hands-on Learning Deep Learning 2.0".

Video address: https://zhuanlan.zhihu.com/p/29125290
Full textbook: https://zh-v2.d2l.ai/
Textbook for this course section: https://zh-v2.d2l.ai/chapter_preliminaries/pandas.html
Notes repository: https://gitee.com/lhm8013609/mldl_-learning-notes.git

1 From fully connected layer to convolution

1.1 Two principles of neural network design:

Translation invariance, locality

  • A feature detector should not depend on factors such as where a pattern happens to appear in the image.
  • Only a local region of the image needs to be examined.
1.1.1 Translation invariance:

The bottom layer of the neural network should respond similarly to the same image area regardless of where it appears in the image. This principle is called "translation invariance".

1.1.2 Locality:

The bottom layer of the neural network should only explore local areas in the input image, without considering the content of distant areas of the image. This is the "locality" principle. Eventually, these local features can be integrated to make predictions at the entire image level.

1.2 Re-examination of the fully connected layer

First, assume that both the input $X$ and the hidden representation $H$ are two-dimensional images with the same shape. That is, not only the input but also the hidden representation has spatial structure.

  • Treat the input and output as matrices indexed by (width, height)
  • The weights become a 4-D tensor, mapping positions $(h, w)$ to $(h', w')$
    • Previously, the weight matrix mapped an input vector index to an output vector index (a change of length)
    • Now it maps an input (width, height) position to an output (width, height) position
1.2.1 How do the subscripts correspond? Re-indexing $w$ as $v$

$$h_{i,j} = \sum_{k,l} w_{i,j,k,l}\,x_{k,l} = \sum_{a,b} v_{i,j,a,b}\,x_{i+a,j+b}$$

$i, j$ are the height and width indices of the output; $k, l$ are the height and width indices of the input.

Make the change of variables $k = i + a$, $l = j + b$; then $v$ is the re-indexed version of $w$: $v_{i,j,a,b} = w_{i,j,i+a,j+b}$, where the offsets $a$ and $b$ may be positive or negative. For any given position $(i, j)$ in the hidden representation, the pixel value $[H]_{i,j}$ is obtained by a weighted sum over the pixels of $x$ centered at $(i, j)$, with weights $v_{i,j,a,b}$.

This re-indexing can lead to convolution operations.

1.2.2 Use translation invariance and locality on the fully connected layer to obtain the convolutional layer
1.2.2.1 Translation invariance:

From the formula above, a translation of the input $x$ leads to a translation of the hidden representation $h$; translation invariance, however, requires that $v$ should not depend on $(i, j)$.

Solution: remove the dependence on the first two indices, $v_{i,j,a,b} = v_{a,b}$, which gives:

$$h_{i,j} = \sum_{a,b} v_{a,b}\,x_{i+a,j+b}$$
This is 2-dimensional convolution, which is actually called 2-dimensional cross-correlation in mathematics.

In effect, we use the coefficients $[V]_{a,b}$ to take a weighted sum over the pixels $(i+a, j+b)$ near position $(i, j)$ and obtain $[H]_{i,j}$. Note that $[V]_{a,b}$ has far fewer parameters than $[V]_{i,j,a,b}$, because it no longer depends on the position within the image.

Two-dimensional convolution is still a fully connected layer (a matrix multiplication), but the weights are constrained to repeat, so not every element can vary freely. Restricting the values a model may take reduces its complexity, since the number of free parameters decreases. (In other words, convolution is a fully connected layer with weight sharing.)
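A minimal sketch of this weight-sharing view (all numbers below are illustrative): build the dense weight matrix that corresponds to a 2×2 kernel on a 3×3 input and check that it reproduces the cross-correlation result.

```python
import torch

X = torch.arange(9.0).reshape(3, 3)          # 3x3 input
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])   # 2x2 kernel

# Direct cross-correlation: output shape (3-2+1) x (3-2+1) = 2x2
Y = torch.zeros(2, 2)
for i in range(2):
    for j in range(2):
        Y[i, j] = (X[i:i + 2, j:j + 2] * K).sum()

# Equivalent dense weight matrix W (4 outputs x 9 inputs); every row reuses
# the same 4 kernel values -> weight sharing, most other entries stay zero.
W = torch.zeros(4, 9)
for i in range(2):
    for j in range(2):
        for a in range(2):
            for b in range(2):
                W[i * 2 + j, (i + a) * 3 + (j + b)] = K[a, b]

Y_fc = (W @ X.reshape(-1)).reshape(2, 2)
print(torch.allclose(Y, Y_fc))  # True: the shared-weight dense layer matches
```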

1.2.2.2 Locality:
  • When evaluating $h_{i,j}$, we should only use inputs $x$ in a neighborhood of $(i, j)$
  • Solution: set $v_{a,b} = 0$ whenever $|a| > \Delta$ or $|b| > \Delta$

$$h_{i,j}=\sum^{\Delta}_{a=-\Delta}\sum^{\Delta}_{b=-\Delta}v_{a,b}\,x_{i+a,j+b}$$

$a$ and $b$ only vary between $-\Delta$ and $\Delta$.

The formula above defines a convolutional layer, and a convolutional neural network is a special type of neural network that contains convolutional layers. In the deep learning research community, $V$ is called a convolution kernel or filter, and it is a learnable weight.
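As a quick PyTorch illustration (the shapes here are just for demonstration), the kernel $V$ corresponds to the learnable `weight` of `nn.Conv2d`:

```python
import torch
from torch import nn

# A single-input, single-output convolutional layer with a 3x3 kernel.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
print(conv.weight.shape)   # torch.Size([1, 1, 3, 3]) -- learned during training
X = torch.rand(1, 1, 6, 6)
print(conv(X).shape)       # torch.Size([1, 1, 4, 4]), i.e. 6 - 3 + 1 = 4 per side
```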

1.2.2.3 Summary

The convolutional layer is obtained by applying translation invariance and locality to the fully connected layer: two dimensions of the weight tensor are dropped, and the range of $a$ and $b$ is limited.


1.3 What is convolution?

  • Convolution definition in mathematics

$$(f*g)(x)= \int f(z)\,g(x-z)\,dz$$

That is, the convolution measures the overlap between $f$ and $g$ when one of the functions is "flipped" and shifted by $x$.

  • When we work with discrete objects (i.e. the domain is $\mathbb{Z}$), the integral becomes a sum, and we get the following definition:

$$(f*g)(i)= \sum_a f(a)\,g(i-a)$$

  • For two-dimensional tensors, it is the corresponding sum over $f$ at $(a, b)$ and $g$ at $(i-a, j-b)$:

$$(f*g)(i,j)= \sum_a\sum_b f(a,b)\,g(i-a,j-b)$$

The only difference from the cross-correlation formula above is that the indices here use differences $(i-a, j-b)$ rather than sums $(i+a, j+b)$; the two operations are essentially equivalent, since one kernel is just a flipped copy of the other.

1.4 Summary

  • The translation invariance of images lets us treat all local regions in the same way, i.e. with the same convolution kernel (the same weights). This is weight sharing, and a convolutional layer can be seen as a weight-shared fully connected layer.
  • Locality means that only a small fraction of local image pixels are needed to compute the corresponding hidden representation.
  • In image processing, convolutional layers usually require fewer parameters than fully connected layers.
  • Convolutional neural network (CNN) is a special type of neural network that can contain multiple convolutional layers.
  • Multiple input and output channels allow the model to capture multi-faceted features of the image at each spatial location.

2 Image convolution

2.1 Cross-correlation operation

The input is a 2D tensor with height 3 and width 3 (i.e. shape 3×3). The height and width of the convolution kernel are both 2, and the shape of the convolution kernel window (or convolution window) is determined by the height and width of the kernel (i.e. 2×2).

[Figure 2.1.1: Two-dimensional cross-correlation operation]
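A short sketch of this cross-correlation, in the spirit of the textbook's `corr2d` routine (the sample values match the 3×3 input / 2×2 kernel setup described above):

```python
import torch

def corr2d(X, K):
    """Single-channel 2-D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Element-wise multiply the window by the kernel and sum
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.arange(9.0).reshape(3, 3)           # 3x3 input
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])    # 2x2 kernel
print(corr2d(X, K))                           # tensor([[19., 25.], [37., 43.]])
```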

3 Padding and stride

3.1 Concept

Padding: add extra rows/columns around the input


Stride: the step size by which the convolution window slides across rows/columns

3.2 Padding

  • If we pad a total of $p_h$ rows and $p_w$ columns, the output shape is $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$ (see the shape check after this list)
  • Usually we take $p_h=k_h-1$, $p_w=k_w-1$
    • When $k_h$ is odd: pad $p_h/2$ rows on the top and $p_h/2$ rows on the bottom. For example, a 3×3 kernel gets 1 row/column of padding on every side.
    • When $k_h$ is even ($p_h$ is odd): pad $\lceil p_h/2\rceil$ rows on the top (one more) and $\lfloor p_h/2\rfloor$ rows on the bottom (one less). For example, a 4×4 kernel has $p_h=3$: 2 rows on the top and 1 row on the bottom.
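A quick shape check in PyTorch (note that `nn.Conv2d`'s `padding` argument counts rows/columns per side, so `padding=1` corresponds to $p_h=p_w=2$ in the formula above):

```python
import torch
from torch import nn

# 8x8 input, 3x3 kernel, 1 row/column of padding on every side:
# output size = 8 - 3 + 2 + 1 = 8, so the spatial shape is preserved.
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(1, 1, 8, 8)    # (batch, channels, height, width)
print(conv2d(X).shape)        # torch.Size([1, 1, 8, 8])
```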

3.3 Stride

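A brief illustrative shape check for stride (example values are my own): with a 3×3 kernel, padding 1 per side, and stride 2, an 8×8 input yields a 4×4 output, i.e. $\lfloor(8-3+2+2)/2\rfloor = 4$ along each dimension.

```python
import torch
from torch import nn

# kernel 3, padding 1 per side (2 rows/columns in total), stride 2:
# output = floor((8 - 3 + 2 + 2) / 2) = 4 along height and width.
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
X = torch.rand(1, 1, 8, 8)
print(conv2d(X).shape)    # torch.Size([1, 1, 4, 4])
```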

4 Multiple input and multiple output channels

4.1 Multiple input channels

  • Each input channel has its own convolution kernel, and the result is the sum of the per-channel convolution results.

  • Input $X$: each channel is cross-correlated with its own kernel slice, and the results are summed into a single output channel (see the sketch below)
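A small sketch of this, along the lines of the textbook's multi-input-channel example (the sample values below are my own choice):

```python
import torch

def corr2d(X, K):
    """Single-channel 2-D cross-correlation (as in Section 2)."""
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    # X: (c_i, n_h, n_w), K: (c_i, k_h, k_w) -> cross-correlate per channel, then sum
    return sum(corr2d(x, k) for x, k in zip(X, K))

X = torch.stack([torch.arange(9.0).reshape(3, 3),
                 torch.arange(1.0, 10.0).reshape(3, 3)])   # 2 input channels
K = torch.stack([torch.tensor([[0.0, 1.0], [2.0, 3.0]]),
                 torch.tensor([[1.0, 2.0], [3.0, 4.0]])])  # one 2x2 kernel per channel
print(corr2d_multi_in(X, K))   # tensor([[ 56.,  72.], [104., 120.]])
```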

4.2 Multiple output channels

4.3 Multiple input and output channels
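A brief sketch of the standard approach to multiple input and output channels (the shapes below are illustrative): each output channel has its own kernel spanning all input channels, and the per-output results are stacked. `torch.nn.functional.conv2d` (which computes cross-correlation) takes a weight tensor of shape $(c_o, c_i, k_h, k_w)$.

```python
import torch
import torch.nn.functional as F

X = torch.rand(1, 2, 8, 8)     # batch=1, c_i=2 input channels, 8x8
W = torch.rand(4, 2, 3, 3)     # c_o=4 output channels, each with a (2, 3, 3) kernel
Y = F.conv2d(X, W)             # cross-correlation, no padding, stride 1
print(Y.shape)                 # torch.Size([1, 4, 6, 6])  (8 - 3 + 1 = 6)
```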

5 Pooling layer

5.2 Two-dimensional max pooling

  • Returns the maximum value in the sliding window


  • Tolerant to a 1-pixel shift of the pattern in the input (see the sketch below)
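A minimal sketch of 2-D max pooling without padding or stride, in the same spirit as `corr2d` above (sample values are illustrative):

```python
import torch

def pool2d(X, pool_size):
    """2-D max pooling with a (p_h, p_w) window, no padding, stride 1."""
    p_h, p_w = pool_size
    Y = torch.zeros(X.shape[0] - p_h + 1, X.shape[1] - p_w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Take the maximum of each window
            Y[i, j] = X[i:i + p_h, j:j + p_w].max()
    return Y

X = torch.arange(9.0).reshape(3, 3)
print(pool2d(X, (2, 2)))   # tensor([[4., 5.], [7., 8.]])
```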

5.3 Padding, strides, multiple channels

  • Example: padding 1, stride 2 (see the sketch after this list)
  • The pooling window has no learnable parameters
  • Pooling is applied to each input channel independently, producing the corresponding output channel
  • Number of output channels = number of input channels
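A short shape check (values illustrative): `nn.MaxPool2d` with a 3×3 window, padding 1, and stride 2 halves the spatial size and leaves the channel count unchanged.

```python
import torch
from torch import nn

X = torch.rand(1, 3, 8, 8)    # batch=1, 3 channels, 8x8
pool = nn.MaxPool2d(kernel_size=3, padding=1, stride=2)
print(pool(X).shape)          # torch.Size([1, 3, 4, 4]): channels unchanged, 8 -> 4
```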

5.4 Average pooling layer

  • Max pooling layer: the strongest pattern signal in each window
  • Average pooling layer: replace "maximum" with "average" in the max pooling layer (brief example below)
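For instance (illustrative values), `nn.AvgPool2d` averages each window instead of taking its maximum:

```python
import torch
from torch import nn

X = torch.arange(16.0).reshape(1, 1, 4, 4)
# AvgPool2d(2) uses a 2x2 window with stride 2 by default
print(nn.AvgPool2d(2)(X))   # tensor([[[[ 2.5,  4.5], [10.5, 12.5]]]])
```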

5.5 Summary

  • The pooling layer returns the maximum or average value in the window
  • Alleviates the sensitivity of convolutional layers to position
  • Also has window size, padding, and stride as hyperparameters

6 Summary

  1. Hyperparameters: the relative importance of kernel size, padding, and stride

    • Kernel size matters most; padding usually takes its conventional default; stride is chosen according to how much the model's complexity (and the amount of downsampling) needs to be controlled
  2. The kernel side length is usually odd, so that the padding is symmetric on both sides.

  3. In general, you do not design kernel sizes yourself; most work starts from classic network architectures such as ResNet and only fine-tunes them.

  4. Can hyperparameters be learned during training? See NAS (Neural Architecture Search).

  5. For automatic hyperparameter tuning of neural networks, see AutoGluon.

  6. Smaller kernels compute faster: a 3×3 kernel is faster than a 5×5, although in terms of effect the two can substitute for each other when convolutions are stacked over multiple layers.

[Long Yi's Programming Life] This official account mainly shares study notes and projects on artificial intelligence, embedded development, and related topics, including but not limited to Python, C/C++, DL, ML, CV, ARM, and Linux; sharing resources, learning together, and having fun together!

Origin blog.csdn.net/weixin_43658159/article/details/127290939