Andrew Ng Convolutional Neural Network Chapter Notes (1) - Convolutional Neural Networks


Video course link:
https://www.bilibili.com/video/BV1FT4y1E74V?
Note reference link:
https://blog.csdn.net/weixin_36815313/article/details/105728919

1. Edge Detection Example

1.1 Convolution operation

The convolution operation is the most basic building block of a convolutional neural network.

(figure)

Suppose we have a 6×6 grayscale image. Because it is grayscale rather than RGB, it has only one channel, so it is a 6×6×1 matrix (as shown above). To detect vertical edges in the image, you can construct a 3×3 matrix, called a filter (or kernel) in convolutional neural network terminology.
(figure)

Construct a 3×3 filter (as shown above) and convolve it with the 6×6 image; the convolution operation is denoted by ∗.

(figure)

The output of this convolution operation is a 4×4 matrix, which you can think of as a 4×4 image (as above).
(figure)

To compute the first element of the 4×4 matrix, cover the top-left corner of the input image with the 3×3 filter and take the element-wise products:

[3×1 0×0 1×(−1); 1×1 5×0 8×(−1); 2×1 7×0 2×(−1)] = [3 0 −1; 1 0 −8; 2 0 −2]

Finally, sum every entry of that matrix to get the top-left output element: 3+1+2+0+0+0+(−1)+(−8)+(−2) = −5.
(figure)

To compute the second element, shift the 3×3 filter one step to the right (as above). Do the same element-wise multiplication and sum: 0×1+5×1+7×1+1×0+8×0+2×0+2×(−1)+9×(−1)+5×(−1) = −4. Keep shifting the 3×3 filter to the right until all the elements of this row have been computed.
(figure)

To compute the next row of elements, move the 3×3 filter down one step (as above). Repeating the element-wise multiplication and summation gives −10 at this position. The remaining elements of the matrix are computed in the same way.
(figure)

Convolving the 6×6 matrix with the 3×3 matrix therefore yields a 4×4 matrix. These are matrices of different dimensions, but the matrix on the left is naturally read as a picture, the one in the middle as a filter, and the one on the right as another picture. This is a vertical edge detector.
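The worked calculation above can be sketched in NumPy. Note that the "convolution" used in deep learning is really a cross-correlation (no kernel flip); the image values here are the ones shown in the lecture figure:

```python
import numpy as np

def conv2d(image, kernel):
    # Deep-learning style "convolution": element-wise product of the
    # kernel with each patch, then sum (no kernel flipping).
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# The 6x6 grayscale image and 3x3 vertical-edge filter from the figure
image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

result = conv2d(image, kernel)
print(result[0, 0])  # -5.0, matching the worked calculation above
print(result[0, 1])  # -4.0, the second element
```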

1.2 An Application of the Vertical Edge Detector

(figure)

The picture above is a simple 6×6 image whose left half has pixel value 10 and whose right half has pixel value 0. Viewed as a picture, the left part (pixel value 10) is brighter and the right part (pixel value 0) is darker, so there is a very noticeable vertical edge in the middle of the picture: the transition line from light to dark.
When you convolve it with the 3×3 filter, that filter can itself be viewed as a picture: brighter on the left, a transition in the middle, and darker on the right.

(figure)

The convolution operation produces the matrix on the right. If this 4×4 matrix is viewed as an image, it has a brighter region in the middle and darker regions on both sides, indicating that there is a particularly obvious vertical edge in the middle of the original image.
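A minimal NumPy check of this example, assuming the 10/0 image described above:

```python
import numpy as np

# Left half bright (10), right half dark (0): a vertical edge in the middle
image = np.hstack([np.full((6, 3), 10), np.zeros((6, 3))])
kernel = np.array([[1, 0, -1]] * 3)  # vertical-edge filter

out = np.array([[np.sum(image[i:i+3, j:j+3] * kernel)
                 for j in range(4)] for i in range(4)])
print(out[0])  # each row is [0, 30, 30, 0]: the bright band marks the edge
```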

2. More Edge Detection

In computer vision, several conventional choices for a 3×3 filter are:

  • Vertical edge filter: [1 0 −1; 1 0 −1; 1 0 −1]
  • Horizontal edge filter: [1 1 1; 0 0 0; −1 −1 −1]
  • Sobel filter: it increases the weight of the middle row, which makes the result more robust: [1 0 −1; 2 0 −2; 1 0 −1]
  • Scharr filter: [3 0 −3; 10 0 −10; 3 0 −3]

In deep learning, rather than hand-picking these nine numbers, we treat them as nine parameters and use backpropagation to let the neural network learn whatever 3×3 filter it needs, applying it across the whole image. The goal is to learn these nine parameters.

3. Padding

In order to build a deep neural network, one of the basic convolution operations you need to learn to use is padding.
In the example in the previous chapter, convolving a 3×3 filter with a 6×6 image produces a 4×4 output, because there are only 4×4 possible positions for the 3×3 filter inside the 6×6 matrix. More generally, if an n×n image is convolved with an f×f filter, the dimension of the output image is (n−f+1)×(n−f+1).
However, there are two disadvantages in this process:
1. Disadvantage 1: Output shrinkage

  • Every convolution operation shrinks the output image, for example from 6×6 to 4×4 in the previous example. In a deep neural network, if the image shrinks every time it passes through a layer, you end up with a very small picture.

2. Disadvantage 2: Information loss at the edge of the image

  • In the 6×6 image, a pixel at a corner (marked with green shading) is covered by the 3×3 filter only once, i.e. it is used only once in the output, since it sits at the corner of that 3×3 region. A pixel in the middle (marked by the red box), however, is covered by many overlapping 3×3 regions. So pixels in corner or edge areas are used less in the output, which means much of the information near the edges of the image is lost.
    (figure)

In order to solve the above two problems, there are usually the following solutions:

  • Pad the image before the convolution operation. You can pad an extra layer of pixels along the edge of the image, so the 6×6 image becomes an 8×8 image. Convolving a 3×3 filter with this 8×8 image then gives a 6×6 output. In general, the padding value is 0.
    (figure)

Because we padded one pixel all around, the padding amount is p = 1, and the output size becomes (n+2p−f+1)×(n+2p−f+1). This mitigates the disadvantage that pixels at corners or edges contribute less to the output.
In fact, more pixels can be padded. As for how many pixels to pad, there are usually two choices, called Valid convolution and Same convolution.

  • Valid convolution: no padding. Convolving an n×n image with an f×f filter gives an (n−f+1)×(n−f+1) output.
  • Same convolution: pad so that the output size equals the input size. If an n×n image is padded with p pixels on each edge, the output size is (n+2p−f+1)×(n+2p−f+1). Setting n+2p−f+1 = n, so that input and output sizes are equal, gives p = (f−1)/2.

Traditionally, in computer vision, f is usually odd. There are a couple of possible reasons for this:

  1. If f were even, you could only use asymmetric padding. Only when f is odd does same convolution have natural padding: we can pad the same amount on all sides, instead of padding more on the left and less on the right.
  2. A filter with odd dimensions, say 3×3 or 5×5, has a center point. In computer vision it is sometimes convenient to have a central pixel to refer to the filter's position.

4. Strided Convolutions

Strided convolution is another basic building block of convolutional neural networks.
(figure)

Suppose we convolve a 3×3 filter with this 7×7 image. As before, take the element-wise products over the top-left 3×3 region and sum them; the result is 91.
(figure)

The difference from before is that we set the stride to 2. So next the 3×3 filter moves two steps to the right; multiplying and summing the elements gives 100.
(figure)

Then move the 3×3 filter another two steps to the right; the result is 83.

(figure)

After completing the first row, move to the next row, again stepping down by two (as shown in the figure above). Multiplying and summing the elements in the same way gives 69.
(figure)

And so on, finally yielding a 3×3 output (as shown above). In general, if an f×f filter is convolved with an n×n image with padding p and stride s, the size of the output is ⌊(n+2p−f)/s + 1⌋ × ⌊(n+2p−f)/s + 1⌋.
Note: the symbol ⌊ ⌋ means round down, that is, if the computed quotient is not an integer, it is rounded down to the nearest integer.
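The output-size formula can be captured in a small helper; Python's integer division `//` implements the ⌊·⌋ for these non-negative values:

```python
def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4  (valid convolution, earlier example)
print(conv_output_size(6, 3, p=1))   # 6  (same convolution, p = (f-1)/2)
print(conv_output_size(7, 3, s=2))   # 3  (the strided example above)
```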

5. Convolutions Over Volumes

The previous chapter covered convolution of two-dimensional images; this chapter covers convolution over three-dimensional volumes.

5.1 Calculation of 3D convolution

(figure)

Suppose we have a 6×6×3 RGB color image, where the 3 refers to the three color channels. To detect edges or other features of the image, we convolve it with a 3×3×3 filter; this filter also has three layers, corresponding to the red, green, and blue channels.
(figure)

To compute the output of this convolution operation, place the 3×3×3 filter at the top-left corner. The filter has 27 numbers. Take each of these 27 numbers in turn and multiply it by the number of the corresponding channel in the input image: first the 9 numbers of the filter's red channel times the corresponding 9 numbers of the image's red channel, then the 9 green-channel numbers times the corresponding green-channel numbers, and finally the 9 blue-channel numbers times the corresponding blue-channel numbers. Adding up all 27 products gives the first number of the output.
(figure)

Then slide the cube by one unit, again multiply the 27 numbers by the numbers in the 3 corresponding channels and add them up to get the next output value, and so on.
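A sketch of this volume convolution in NumPy, using hypothetical random values for the image and filter:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(6, 6, 3))   # hypothetical 6x6x3 RGB values
filt = rng.integers(-1, 2, size=(3, 3, 3))    # one hypothetical 3x3x3 filter

# One output value: all 27 products summed across height, width and channels
out00 = np.sum(image[0:3, 0:3, :] * filt)

# Full 4x4 output (stride 1, no padding): note the result is flat, not 3D
out = np.array([[np.sum(image[i:i+3, j:j+3, :] * filt)
                 for j in range(4)] for i in range(4)])
print(out.shape)  # (4, 4)
```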

5.2 3D Convolution with Multiple Filters

In actual image processing, a single filter can hardly extract all the complex features of an image, so in general we convolve with more than one filter.
(figure)

Convolving with the first filter gives a 4×4 output; convolving with the second filter gives another 4×4 output. Stacking the second output behind the first forms a 4×4×2 output, where the 2 is the number of filters.
In general, if an n×n×n_c input image is convolved with f×f×n_c filters, you get an (n−f+1)×(n−f+1)×n_c′ output, where n_c′ is the number of filters and also the number of channels of the next layer.
Note: here it is assumed that the stride is 1 and there is no padding.
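Stacking one output map per filter can be sketched as follows, again with hypothetical random values:

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.standard_normal((6, 6, 3))        # hypothetical 6x6x3 input
filters = rng.standard_normal((2, 3, 3, 3))   # 2 hypothetical 3x3x3 filters

def conv_volume(image, filt):
    # Convolve one 3D filter over a 3D input, producing a 2D map
    n, f = image.shape[0], filt.shape[0]
    m = n - f + 1
    return np.array([[np.sum(image[i:i+f, j:j+f, :] * filt)
                      for j in range(m)] for i in range(m)])

# Stack one 4x4 map per filter along the last axis -> 4x4x2
out = np.stack([conv_volume(image, f) for f in filters], axis=-1)
print(out.shape)  # (4, 4, 2)
```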

6. One Layer of a Convolutional Network

6.1 A simple example of a single-layer convolutional network

(figure)

In the example from the previous chapter, a three-dimensional image is convolved with two filters, producing two different 4×4 matrices. To the first output we add a bias b₁ (a real number) and then apply a nonlinear activation function (ReLU here), giving a 4×4 matrix.
To the second output we add a different bias b₂ (also a real number) and apply the same nonlinear activation, giving another 4×4 matrix. Stacking the two matrices yields a 4×4×2 volume.
The forward propagation of a single convolutional layer is

z^[l] = W^[l] a^[l−1] + b^[l]
a^[l] = g(z^[l])

where W^[l] a^[l−1] corresponds to convolving the 6×6×3 input image with the 3×3×3 filters to get the 4×4 output matrices, and z^[l] corresponds to those 4×4 output matrices with the bias added (as shown below).
(figure)

This is the evolution from a^[l] to a^[l+1]: first the linear function (the convolution plus the bias), then the activation function ReLU. In this way, the 6×6×3 input a^[0] evolves into the 4×4×2 activation a^[1]; this is one layer of a convolutional neural network.
Suppose each filter is 3×3×3, so each filter has 27 parameters, plus one bias b, for a total of 28 parameters. With ten such 3×3×3 filters, that adds up to 28×10 = 280 parameters.
No matter how big the input image is, the number of parameters stays 280. These 10 filters can extract features such as vertical edges, horizontal edges, and others. Even for very large images the parameter count stays very small; this property of convolutional neural networks helps avoid overfitting.
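A quick check of this parameter count:

```python
f, n_c_prev, num_filters = 3, 3, 10           # ten 3x3x3 filters
params_per_filter = f * f * n_c_prev + 1      # 27 weights + 1 bias = 28
total = params_per_filter * num_filters
print(total)  # 280, independent of the input image size
```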

6.2 Symbol Definition

Take the l-th convolutional layer of a convolutional neural network as an example and define the notation for a convolutional layer:

  • f^[l] denotes the filter size of layer l;
  • p^[l] denotes the amount of padding of layer l; it can be specified as a valid convolution (no padding) or a same convolution (padding so the output size matches the input size);
  • s^[l] denotes the stride of layer l;
  • The input image of layer l has dimension n_H^[l−1] × n_W^[l−1] × n_c^[l−1], i.e. the activation values of the previous layer;
  • The output image of layer l has dimension n_H^[l] × n_W^[l] × n_c^[l];
  • n_H^[l] denotes the height of the output image of layer l, that is, n_H^[l] = ⌊(n_H^[l−1] + 2p^[l] − f^[l]) / s^[l] + 1⌋;
  • n_W^[l] denotes the width of the output image of layer l, that is, n_W^[l] = ⌊(n_W^[l−1] + 2p^[l] − f^[l]) / s^[l] + 1⌋;
  • n_c^[l] denotes the number of channels of the output image;
  • A single filter has dimension f^[l] × f^[l] × n_c^[l−1]; the number of channels of the filter must match the number of channels of the input image;
  • For a single example, the activation a^[l] is a three-dimensional volume of dimension n_H^[l] × n_W^[l] × n_c^[l]; for m examples, i.e. the set of m activations a^[l], the output A^[l] has dimension m × n_H^[l] × n_W^[l] × n_c^[l];
  • The weight parameter W^[l] is the set of all filters, with dimension f^[l] × f^[l] × n_c^[l−1] × n_c^[l];
  • The bias parameter b^[l] is usually represented in code as a 1 × 1 × 1 × n_c^[l] four-dimensional tensor; each filter has one bias parameter, which is a real number.
    (figure)
    insert image description here

7. A Simple Convolution Network Example

(figure)

Suppose you input a picture, denoted x, for image classification or image recognition. The input image has size 39×39×3: n_H^[0] = n_W^[0] = 39, i.e. height and width are both 39, and n_c^[0] = 3, i.e. the layer-0 channel count is 3.
Suppose we use ten 3×3×3 filters to extract features, i.e. filter size f^[1] = 3 and number of filters n_c^[1] = 10, with a valid convolution: padding p^[1] = 0 and stride s^[1] = 1.
Therefore the output image, i.e. the first layer's activation a^[1], has dimension 37×37×10, where 37 comes from the formula (n+2p−f)/s + 1, i.e. (39+0−3)/1 + 1 = 37. This completes the construction of the first convolutional layer.
(figure)

Next comes the second convolutional layer. This time we use 5×5 filters, i.e. filter size f^[2] = 5. Because the previous layer's output has n_c^[1] = 10 channels, this layer's filter dimension is 5×5×10. In addition, the stride is s^[2] = 2, the padding p^[2] = 0, and there are 20 such filters, so the output of this layer has dimension 17×17×20, i.e. output height and width n_H^[2] = n_W^[2] = 17 and channel count n_c^[2] = 20. Because the stride is 2, the dimensions shrink quickly, from 37×37 down to 17×17, a reduction of more than half. This completes the construction of the second convolutional layer.
(figure)

Then build the last convolutional layer. Suppose the filter is still 5×5, i.e. filter size f^[3] = 5. Because the input has dimension 17×17×20, this layer's filter dimension is 5×5×20. Suppose the stride is s^[3] = 2, the padding p^[3] = 0, and 40 such filters are used, so the final output has dimension 7×7×40. The 39×39×3 input image has thus been processed into 7×7×40 = 1960 features.
(figure)

Finally these 1960 features are flattened, i.e. unrolled into a long vector of 1960 units. This vector is then fed into a logistic regression or softmax unit, and ŷ denotes the predicted output of the neural network.
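The dimension bookkeeping for this network can be verified with the output-size formula:

```python
def conv_out(n, f, p, s):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

n = 39  # input is 39x39x3
for f, p, s, n_c in [(3, 0, 1, 10), (5, 0, 2, 20), (5, 0, 2, 40)]:
    n = conv_out(n, f, p, s)
    print(f"{n}x{n}x{n_c}")
# 37x37x10, then 17x17x20, then 7x7x40
print(n * n * 40)  # 1960 features fed to the logistic/softmax unit
```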

8. Pooling Layers

In addition to convolutional layers, convolutional networks often use pooling layers to reduce the size of the representation, speed up computation, and make the extracted features more robust.

8.1 Max Pooling

(figure)

Suppose the input is a 4×4 matrix and the max-pooling filter is a 2×2 matrix, i.e. f = 2. To compute the 4 output elements, we take the maximum over each 2×2 region. First compute the maximum over the purple 2×2 region; the result is 9. With stride s = 2, move 2 steps to the right to get the maximum of the blue 2×2 region, which is 2. Then move down 2 steps to get the maximum of the green 2×2 region, 6. Finally move 2 more steps to the right to get the maximum of the red 2×2 region, 3. The final output is a 2×2 matrix.
Each element of this 2×2 output is the maximum element of its corresponding colored region. The filter size f = 2 and stride s = 2 are the hyperparameters of max pooling; once set they are fixed values, because they cannot be learned by gradient descent.

8.2 Average Pooling

(figure)

As the name suggests, this operation takes the average of each region rather than the maximum. Assume filter size f = 2 and stride s = 2; then the average of the purple region is 3.75, the blue region 1.25, the green region 4, and the red region 2.
Currently, max pooling is used much more often than average pooling, with one notable exception: deep in a neural network, average pooling can be used to collapse a representation, e.g. reducing a 7×7×1000 output matrix by averaging over the whole spatial extent to get a 1×1×1000 output matrix.
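Both pooling modes can be sketched in NumPy. The 4×4 input here is hypothetical, chosen so that it reproduces the region maxima (9, 2, 6, 3) and averages (3.75, 1.25, 4, 2) stated above:

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    # Slide an f x f window with stride s; take max or mean of each window
    m = (x.shape[0] - f) // s + 1
    op = np.max if mode == "max" else np.mean
    return np.array([[op(x[i*s:i*s+f, j*s:j*s+f])
                      for j in range(m)] for i in range(m)])

x = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [6., 6., 1., 2.]])   # hypothetical 4x4 input

print(pool2d(x, mode="max"))       # region maxima: 9, 2, 6, 3
print(pool2d(x, mode="average"))   # region averages: 3.75, 1.25, 4.0, 2.0
```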

9. Convolutional Neural Network Example

Suppose there is a 32×32×3 picture. This RGB picture contains a handwritten digit, such as a 7, and you want to identify which of the 10 digits 0-9 it is. Let's build a convolutional neural network to achieve this.
(figure)

First build the first convolutional layer. The input is a 32×32×3 matrix. Suppose the filters used in the first layer have dimension 5×5×3, the stride is 1, the padding is 0, and the number of filters is 6; then the output of this convolutional layer has dimension 28×28×6. Label this layer CONV1.
Then build a pooling layer immediately after the convolutional layer, here using max pooling with parameters f = 2, s = 2, p = 0, i.e. a 2×2 filter with stride 2 and no padding. The output height and width are halved, the number of channels is unchanged, and the output dimension is 14×14×6. Mark the pooling layer as POOL1.
When counting the layers of a neural network, people usually count only the layers that have weights and parameters. The pooling layer has no weights, only some hyperparameters, so here we treat CONV1 and POOL1 together as one layer and mark it as Layer 1.
Note: another convention counts the convolutional layer as one layer and the pooling layer as a separate layer.
(figure)

Next apply another convolutional layer to Layer 1. Suppose the filters have dimension 5×5×6, i.e. f = 5, the stride is 1, the padding is 0, and the number of filters is 16, so the output of convolutional layer CONV2 has dimension 10×10×16.
Continue with a max-pooling step with parameters f = 2, s = 2; the height and width are halved, the number of channels is unchanged, and the output has dimension 5×5×16. Label it POOL2. The convolutional layer CONV2 and the pooling layer POOL2 together form Layer 2, since it has only one set of weights, those of CONV2.
(figure)

The output of Layer 2 is a 5×5×16 volume containing 400 elements. Now flatten POOL2 into a one-dimensional vector of size 400. We can picture the flattened result as a set of neurons and use these 400 units to build the next layer.
The next layer contains 120 units, and the 400 units are densely connected to the 120 units; this is our first fully connected layer. It is equivalent to an ordinary neural network layer, with weight matrix W^[3] of dimension 120×400 and bias parameter b^[3] of dimension 120×1.
Then we add another fully connected layer, assumed to contain 84 units, labeled FC4. Finally these 84 units feed a softmax unit. If we want to recognize the 10 digits 0-9 by handwritten digit recognition, this softmax unit has 10 outputs.
Regarding how to choose these hyperparameters, the usual practice is not to invent them yourself but to look in the literature at what hyperparameters others have adopted, and choose an architecture that has worked well on someone else's task; it may well apply to your application too.
Another common pattern in neural networks is one or more convolutional layers followed by a pooling layer, then one or more convolutional layers followed by another pooling layer, then several fully connected layers, and finally a softmax layer.
Next, let's look at the activation dimensions, activation sizes, and parameter counts of this network.

(figure)

There are the following points to note:

  • The pooling layers have no parameters;
  • Convolutional layers have relatively few parameters; most of the parameters of the network live in the fully connected layers.

It can be seen from the figure above that as the network gets deeper, the activation size gradually becomes smaller. If the activation size decreases too quickly, it will hurt the network's performance. In the example, the activation size is about 6,000 at the first layer, then drops to 1,600, gradually falls to 84, and finally the softmax result is output.
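A quick tally of weight-plus-bias parameter counts for the network of this chapter, illustrating that the convolutional layers carry far fewer parameters than the fully connected ones:

```python
# Parameter counts (weights + biases) for the layers described above
conv1 = (5 * 5 * 3 + 1) * 6      # 6 filters of 5x5x3, one bias each
conv2 = (5 * 5 * 6 + 1) * 16     # 16 filters of 5x5x6, one bias each
fc3 = 400 * 120 + 120            # fully connected 400 -> 120
fc4 = 120 * 84 + 84              # fully connected 120 -> 84
softmax = 84 * 10 + 10           # fully connected 84 -> 10
print(conv1, conv2, fc3, fc4, softmax)  # 456 2416 48120 10164 850
```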

10. Why use convolution? (Why Convolutions?)

The two main advantages of convolutional layers over using only fully connected layers are parameter sharing and sparse connections .
(figure)

Suppose we have a 32×32×3 image and use six 5×5×3 filters, so the output has dimension 28×28×6. Now 32×32×3 = 3,072 and 28×28×6 = 4,704. Suppose we instead built an ordinary neural network with one layer of 3,072 units fully connected to a next layer of 4,704 units, and computed the weight matrix W: its dimension would be 4,704×3,072, roughly 14 million parameters to train. With today's hardware we could train a network with more than 14 million parameters, since this 32×32×3 picture is quite small, but for a 1000×1000 picture the weight matrix W would become enormous.
If instead we build a convolutional layer, each filter has dimension 5×5×3, i.e. 75 parameters, plus a bias parameter b, for 76 parameters per filter. With 6 filters in total, the total number of parameters is 456. The number of parameters involved in the convolution operation is clearly far smaller.
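The arithmetic of this comparison, as a quick check:

```python
fc_params = (32 * 32 * 3) * (28 * 28 * 6)  # fully connected: 3072 x 4704
conv_params = (5 * 5 * 3 + 1) * 6          # six 5x5x3 filters plus biases
print(fc_params)    # 14450688, roughly 14 million
print(conv_params)  # 456
```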

  • Parameter sharing: suppose a 3×3 filter detects vertical edges; then the top-left corner of the picture and every region next to it (the parts marked by the blue boxes in the left matrix) can use the same 3×3 filter. That is, each feature detector can use the same parameters in different regions of the input image in order to extract vertical edges or other features. This applies not only to low-level features such as edges but also to higher-level features, such as detecting eyes, cats, or other objects in faces and scenes.
  • Sparse connections: for example, the 0 in the upper-left corner of the output matrix is computed by a 3×3 convolution, so it depends only on that 3×3 block of input cells. Each element of the output matrix on the right is connected to only 9 of the 36 input features; the other pixel values have no influence on that output.

(figure)

Through these two mechanisms, convolutional neural networks reduce their parameter counts so that we can train them with smaller training sets, thereby preventing overfitting.

Origin: blog.csdn.net/baoli8425/article/details/119378163