Principles of Deep Learning ----- Convolutional Neural Network

Series Article Directory

Principles of Deep Learning ----- Linear Regression + Gradient Descent
Principles of Deep Learning ----- Logistic Regression Algorithm
Principles of Deep Learning ----- Fully Connected Neural Network
Principles of Deep Learning ----- Convolutional Neural Network
Principles of Deep Learning ----- Recurrent Neural Network (RNN, LSTM)
Time Series Forecasting ----- Single-Feature Electricity Load Forecasting Based on BP, LSTM, and CNN-LSTM Neural Network Algorithms
Time Series Forecasting (Multi-Feature) ----- Multi-Feature Electricity Load Forecasting Based on BP, LSTM, and CNN-LSTM Neural Network Algorithms


Series of teaching videos

A Quick Introduction to Deep Learning and Hands-On Practice
[Hands-On Teaching] Single-Feature Electricity Load Forecasting Based on a BP Neural Network
[Hands-On Teaching] Single-Feature Electricity Load Forecasting Based on RNN and LSTM Neural Networks
[Hands-On Teaching] Single-Feature Electricity Load Forecasting Based on a CNN-LSTM Neural Network
[Multi-Feature Forecasting] Multi-Feature Electricity Load Forecasting Based on a BP Neural Network
[Multi-Feature Forecasting] Multi-Feature Electricity Load Forecasting Based on RNN and LSTM
[Multi-Feature Forecasting] Multi-Feature Electricity Load Forecasting Based on a CNN-LSTM Network



Foreword

  Convolutional neural networks are a key topic in deep learning. Deep learning is commonly divided into three major areas: big data and data mining, computer vision, and natural language processing. Almost all deep learning algorithms in computer vision use convolutional neural networks for image feature extraction, so the position of the convolutional neural network in deep learning is unshakable.
  The convolutional neural network was, however, proposed back in 1998. The reason it was not widely used for a long time is that the computers of that era were relatively weak, so its potential was hard to exploit. It was not until 2012, when AlexNet won the classification task of the ImageNet competition with an accuracy far beyond what traditional methods had achieved, that things changed; since then the development of deep learning has been unstoppable.


1. The nature of images

  Since convolutional neural networks are mostly used for image feature extraction, it helps to understand what an image actually is inside a computer before studying the convolutional neural network model. The most common image formats are grayscale and RGB.

1.1. Grayscale image

  A grayscale image is the familiar black-and-white image. The image below is a black-and-white picture of the digit 8. If you observe it carefully, its edges appear to be made up of small squares; in fact the whole image is composed of such squares. Since the image is 24 pixels high and 16 pixels wide, it consists of 24×16 = 384 small squares. [figure: grayscale image of the digit 8]
  The picture contains black, white, and gray areas, and the shades of gray vary. Each small square in the image is a pixel, and each pixel has a pixel value representing its intensity. Pixel values range from 0 to 255, where 0 is black and 255 is white: the darker a region, the closer its pixel values are to 0; the lighter a region, the closer they are to 255. An image is therefore stored in the computer as a matrix of numbers, as shown below: [figure: the pixel-value matrix of the grayscale image]
  A grayscale image is thus represented by a single matrix of numbers, while the color images more common in daily life are represented by three such matrices.
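  To make this concrete, here is a minimal NumPy sketch showing that a grayscale image is nothing more than a 2-D matrix of intensities; the pixel values below are made up for illustration, not taken from the figure:

```python
import numpy as np

# A tiny 5x4 grayscale "image": a single 2-D matrix of pixel intensities,
# where 0 is black and 255 is white.
img = np.array([
    [  0,   0,   0,   0],
    [  0, 255, 255,   0],
    [  0, 255, 255,   0],
    [  0,  90,  90,   0],
    [  0,   0,   0,   0],
], dtype=np.uint8)

print(img.shape)  # (5, 4): height 5, width 4, one channel
print(img[1, 1])  # 255: the pixel in row 1, column 1 is white
```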

1.2. Color image

  You may have noticed in daily life that when water accidentally splashes onto a phone or TV screen, you can see tiny squares through the droplets. These squares come in different colors, but only red, green, and blue. As we learned in middle-school physics, these are the three primary colors, and mixing them in different proportions produces every other color. As shown below: [figure: red, green, and blue subpixels on a screen]
  A computer therefore represents a color image with three matrices of numbers, in the form shown below: [figure: the three channel matrices of an RGB image]
  One matrix represents red. Its values also range from 0 to 255: the closer a value is to 0, the darker the red; the closer it is to 255, the brighter the red.
  One matrix represents green. Its values also range from 0 to 255: the closer a value is to 0, the darker the green; the closer it is to 255, the brighter the green.
  One matrix represents blue. Its values also range from 0 to 255: the closer a value is to 0, the darker the blue; the closer it is to 255, the brighter the blue.
  Each of these pixel values lies between 0 and 255 and represents the intensity of that pixel in its channel. Stacked together, the channel matrices form a three-channel image. When the image is loaded into the computer, the pixel matrix has shape H×W×3, where H is the number of pixels along the height, W is the number of pixels along the width, and 3 is the number of channels.
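  As a quick illustration, here is a sketch of inspecting an RGB image's three channel matrices with NumPy and Pillow; the file name photo.jpg is a hypothetical placeholder, not a file from this article:

```python
import numpy as np
from PIL import Image  # Pillow

# "photo.jpg" is a hypothetical placeholder path.
img = np.array(Image.open("photo.jpg").convert("RGB"))

print(img.shape)              # (H, W, 3): height, width, and 3 channels
red = img[:, :, 0]            # the red matrix, values in 0..255
green = img[:, :, 1]          # the green matrix
blue = img[:, :, 2]           # the blue matrix
chw = img.transpose(2, 0, 1)  # reorder to (3, H, W), the C,H,W layout used later
```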


2. Overall structure

  First, let's look at the overall structure of a convolutional neural network. As shown below, compared with a fully connected neural network, a convolutional neural network adds convolutional layers and pooling layers. The input feature map passes through convolution and pooling operations, which extract the useful features; these are then fed into the fully connected layers, where the data is classified or predicted.
[figure: overall structure of a convolutional neural network]
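  To connect this structure to code, below is a minimal PyTorch sketch of the convolution → pooling → fully connected pipeline just described. All concrete sizes (one input channel, a 28×28 input, 10 output classes) are illustrative assumptions, not values from the figure:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(2),                            # pooling layer
        )
        self.classifier = nn.Linear(8 * 14 * 14, 10)    # fully connected layer

    def forward(self, x):             # x: (N, 1, 28, 28)
        x = self.features(x)          # -> (N, 8, 14, 14): extracted features
        x = x.flatten(1)              # flatten for the fully connected layer
        return self.classifier(x)     # -> (N, 10): class scores

out = SimpleCNN()(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])
```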

3. Convolution layer

  Convolutional neural networks bring some terms of their own, such as padding and stride. As explained earlier, an image inside a computer is a matrix of numbers, and the input of a convolutional neural network is exactly such a matrix. The input data format is C×H×W, where C is the number of channels: taking the earlier examples, C is 1 for a grayscale image and 3 for a color image. (Note that C is not restricted to 1 or 3: the channel count of a feature map changes after a convolution, and the input is not necessarily image data, so a constructed feature map can have any number of channels.) H is the height of the data matrix and W is its width.

3.1. Problems with the fully connected layer

  In a fully connected neural network, the neurons of adjacent layers are all connected to one another, so each layer is a long one-dimensional strip. When the input is data with a 3-D shape, such as an image, a fully connected network handles it by flattening it into one dimension, as shown below: [figure: flattening a 3-D image for a fully connected network]
  Data with a 3-D shape carries important spatial information: spatially adjacent pixels tend to have similar values, the RGB channels are closely correlated, and pixels far apart are only weakly correlated. A fully connected layer ignores this shape and treats every input as an equivalent neuron, so it cannot exploit shape-related information.
  The convolutional layer of a convolutional neural network, by contrast, preserves the shape: when the input is an image, a convolutional layer receives it as 3-D data and passes it to the next layer as 3-D data. A convolutional neural network can therefore make far better sense of spatially shaped data than a fully connected network can.
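  A tiny sketch of the flattening problem, with made-up shapes: once a (C, H, W) array is flattened for a fully connected layer, pixels that were vertical neighbors end up far apart in the vector, and the layer has no way to know they were adjacent:

```python
import numpy as np

# A 3-channel 4x4 "image" in (C, H, W) layout.
x = np.arange(3 * 4 * 4).reshape(3, 4, 4)

flat = x.flatten()
print(flat.shape)  # (48,): vertical neighbors such as x[0, 0, 0] and
                   # x[0, 1, 0] are now 4 positions apart in the vector,
                   # and the spatial layout is gone.
```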

3.2. Convolution operation

  The core of a convolutional neural network is the convolutional layer, which performs the convolution operation. Convolution corresponds to a filtering operation in image processing, which is why a convolution kernel is also called a filter. Let's understand the operation through a concrete example. [figure: convolving a (3,3) input with a (2,2) kernel]
  As shown above, the input data has a spatial shape, and the convolution kernel likewise extends in the height and width directions. Writing shapes as (height, width), the data and the kernel in this example have shapes (3,3) and (2,2) respectively, and the output has shape (2,2). Note that a kernel's height and width are generally equal, although different sizes can also be used.
  Now let's walk through the operation in detail. Convolution slides a window of the same size as the kernel across the input data at a fixed interval; at each position, the values in the window are multiplied element-wise with the kernel and summed. Here the kernel window is 2×2, so we take the 2×2 block at the upper-left corner of the data, multiply each element by the corresponding kernel element, and sum:

$$0 \times 0 + 1 \times 1 + 3 \times 2 + 3 \times 4 = 19$$

  The window then slides one step to the right, and the same multiply-and-sum is performed on the new block. When the window can no longer slide right, it slides down one step, and the left-to-right sweep repeats. The details are shown in the figure: [figure: the sliding-window steps of the convolution]
  A fully connected network has two kinds of parameters, weights and biases. In a convolutional neural network the convolution kernels are the weight parameters, and there are bias parameters as well. As shown in the figure, the bias is usually a single 1×1 value, which is added to every element produced by the convolution: [figure: adding the bias to the convolution output]
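  The whole procedure fits in a few lines of NumPy. In the sketch below, the input and kernel values are chosen only so that the top-left window reproduces the article's 0×0 + 1×1 + 3×2 + 3×4 = 19; otherwise they are arbitrary:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0, stride=1):
    """Slide `kernel` over `x`; at each position, multiply element-wise,
    sum, and add the bias (the cross-correlation CNN layers compute)."""
    fh, fw = kernel.shape
    oh = (x.shape[0] - fh) // stride + 1
    ow = (x.shape[1] - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + fh, j * stride:j * stride + fw]
            out[i, j] = np.sum(window * kernel) + bias
    return out

x = np.array([[0., 1., 2.],
              [3., 3., 0.],
              [1., 2., 4.]])
k = np.array([[0., 1.],
              [2., 4.]])
print(conv2d(x, k))  # (2, 2) output; out[0, 0] == 19.0
```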

3.3. Padding

  Padding is an operation frequently used with convolution: fixed values (usually 0) are filled in around the input data. As shown in the figure: [figure: padding a (3,3) input with a ring of zeros]
  In the example above, a ring of zeros is placed around an input of size (3,3), changing the input size to (5,5), so the padding amplitude is 1. Convolving the padded data with a (2,2) kernel yields an output of size (4,4). The padding amplitude can of course also be set to an integer greater than 1.
  The main purpose of padding is to control the size of the output. For example, with a (3,3) input and a (2,2) kernel, the convolution yields a (2,2) output; with padding of amplitude 1, the input becomes (5,5) and the output becomes (4,4). Without padding, the (3,3) input gives a (2,2) output, one element smaller than the input in each direction. A convolutional neural network usually contains many convolutional layers and therefore performs many convolutions; if each convolution shrinks the spatial size, the output may at some point reach (1,1), after which no further convolution is possible. Padding avoids this situation by keeping the spatial size of the output unchanged, or even enlarging it.
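  In NumPy, this is exactly what np.pad does. A small sketch of the (3,3) → (5,5) example (the input values are arbitrary):

```python
import numpy as np

x = np.arange(9.0).reshape(3, 3)  # a (3, 3) input
padded = np.pad(x, pad_width=1)   # fill a ring of zeros around the data
print(padded.shape)               # (5, 5): padding amplitude 1 on every side
print(padded[0])                  # [0. 0. 0. 0. 0.]: the new zero border
# Convolving this (5, 5) padded input with a (2, 2) kernel gives a (4, 4)
# output, one element larger than the original (3, 3) input in each direction.
```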

3.4. Stride

  The interval at which the kernel slides over the input data is called the stride. The stride in all the examples above was 1, but like the padding it is configurable and can be set to an integer greater than 1. The figure below shows the result of a convolution with a stride of 2: [figure: convolution with stride 2]

  As shown above, the input size is (4,4), the kernel size is (2,2), the padding is 0, and the stride is 2. With a stride of 2, the kernel first convolves the block of its own size at the left edge of the data, then slides two pixels to the right and convolves again, until it can slide no further; it then returns to the left edge, slides down 2 steps, and repeats the sweep until the whole input has been covered. The figure above shows this stride-2 process quite intuitively.
  From the figure it is easy to see that the final output size is (2,2): increasing the stride makes the output smaller, while increasing the padding makes it larger. So is there a formula relating the input size to the output size? There certainly is.
  Suppose the input size is (H, W), the convolution kernel size is (FH, FW), the output data size is (OH, OW), the padding is P, and the stride is S.
  The output size is computed as follows:

$$OH = \frac{H + 2P - FH}{S} + 1, \qquad OW = \frac{W + 2P - FW}{S} + 1$$

  Now let's use these formulas on the padding and stride cases above.
  The padded case works out as:

$$OH = \frac{3 + 2 \times 1 - 2}{1} + 1 = 4, \qquad OW = \frac{3 + 2 \times 1 - 2}{1} + 1 = 4$$

  which is consistent with the earlier result.
  The stride case works out as:

$$OH = \frac{4 + 2 \times 0 - 2}{2} + 1 = 2, \qquad OW = \frac{4 + 2 \times 0 - 2}{2} + 1 = 2$$

  which again matches the result in the figure.
  Note that since the stride and padding are set by hand, the formulas may yield a non-integer result, which would cause an error when the program runs. Such configurations should be avoided where possible, although deep learning frameworks typically round the result down and continue without raising an error.
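  The formula is easy to wrap in a small helper. The sketch below checks both worked examples above and floors non-integer results, mirroring what deep learning frameworks typically do:

```python
def conv_output_size(h, w, fh, fw, padding=0, stride=1):
    """OH = (H + 2P - FH) / S + 1, and likewise for OW."""
    oh = (h + 2 * padding - fh) // stride + 1  # // floors, as frameworks do
    ow = (w + 2 * padding - fw) // stride + 1
    return oh, ow

print(conv_output_size(3, 3, 2, 2, padding=1, stride=1))  # (4, 4)
print(conv_output_size(4, 4, 2, 2, padding=0, stride=2))  # (2, 2)
```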

3.5. Multi-channel data convolution operation

  The examples so far have all used data with a single channel, but images include grayscale images with 1 channel and color images with 3, and a convolution can also change the number of channels so that the output feature map is multi-channel. Multi-channel data must therefore be handled along the channel direction as well as the height and width directions: when the input has multiple feature maps in the channel direction, the input and the kernel are convolved channel by channel and the results are summed to form the output feature map. Let's demonstrate the process with 3-channel data.
[figure: convolution over 3-channel input data]
  As shown above, the input is a feature map with 3 channels and spatial shape (4,4). Since the input has 3 channels, the kernel must have the same number of channels, and every channel of the kernel must have the same spatial shape; so the kernel here has 3 channels, each of shape (3,3). Each channel of the feature map is convolved with the corresponding channel of the kernel, and the results are summed. Compared with single-channel convolution, the only extra step is this summation across channels. As the figure shows, the final output feature map has 1 channel and shape (2,2).
  The process is perhaps easier to picture using cuboids, as in the figure below, which depicts the multi-channel convolution quite vividly. [figure: multi-channel convolution drawn with cuboids]
  Here the data is written as (channel, height, width): the input has shape (C, H, W), the kernel has shape (C, FH, FW), and the output has shape (OH, OW).
  With a single convolution kernel, the output feature map has only one channel. If we want the output to be multi-channel, we need to set up more kernels, that is, more sets of weights.
  As shown in the figure, when the input size is (C, H, W) and there are FN kernels, each of size (C, FH, FW), the output feature map has size (FN, OH, OW). [figure: FN kernels producing an FN-channel output]
  Of course, a convolutional neural network has bias parameters as well as weight parameters (the kernels). With C input channels, the calculation proceeds as shown below: [figure: adding the bias in the multi-channel case]
  When the input size is (C, H, W) and there are FN kernels of size (C, FH, FW), the output feature map after the convolution has size (FN, OH, OW). The number of bias channels must match the number of output channels, so the bias has size (FN, 1, 1). It is added to the output feature map element-wise, leaving the final output size (FN, OH, OW): adding the bias does not change the shape of the output features, only their values.
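  These shapes can be confirmed with PyTorch's nn.Conv2d; the concrete numbers below (C = 3, FN = 5, and so on) are illustrative assumptions:

```python
import torch
import torch.nn as nn

C, H, W, FN, FH, FW = 3, 4, 4, 5, 3, 3

conv = nn.Conv2d(in_channels=C, out_channels=FN, kernel_size=(FH, FW))
print(conv.weight.shape)  # torch.Size([5, 3, 3, 3]): (FN, C, FH, FW)
print(conv.bias.shape)    # torch.Size([5]): one scalar bias per output
                          # channel, broadcast like the article's (FN, 1, 1)

x = torch.randn(1, C, H, W)  # a batch of one (C, H, W) feature map
print(conv(x).shape)         # torch.Size([1, 5, 2, 2]): (FN, OH, OW)
```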


4. Pooling layer

  The pooling layer is an operation that shrinks the spatial size of a feature map. It differs from the convolutional layer in that a convolutional layer convolves the feature map with a kernel and therefore has parameters to learn: the error is determined through forward propagation, and the parameters are then updated through backpropagation. A pooling layer merely extracts the maximum or the average of a target region, so it has no parameters to learn.
  What a pooling layer does is take either the maximum pixel value within a target region or the region's average; the two variants are accordingly called max pooling and average pooling. Let's see how the pooling operation is performed.
  The figure below shows the calculation process of max pooling: [figure: max pooling]
  As shown above, the input is a (4,4) feature map; a (2,2) target region is moved across it from left to right and top to bottom, and the maximum within each region is taken. The stride of a pooling layer is generally set equal to the pooling window size: the window here is (2,2), so the stride is 2. Step by step, the (4,4) feature map is reduced to a (2,2) output. The relationship between a pooling layer's input and output sizes can also be computed with the convolutional layer's input-output formula above, with the kernel size replaced by the pooling window size.
  For this example:

$$OH = \frac{4 + 2 \times 0 - 2}{2} + 1 = 2, \qquad OW = \frac{4 + 2 \times 0 - 2}{2} + 1 = 2$$

  which is consistent with the final result in the figure.
  Besides max pooling there is also average pooling, which, as the name implies, averages the values within the target region.
  The figure below shows the calculation process of average pooling: [figure: average pooling]. Its details need no further description, as the figure shows the computation clearly. In the field of image recognition, however, max pooling is the variant mainly used.
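  Both variants are easy to express in NumPy. A minimal sketch with an arbitrary (4,4) input (the numbers differ from the article's figures):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Take the max (or mean) of each (size x size) window of `x`."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 2., 0., 1.],
              [3., 0., 2., 4.],
              [5., 1., 0., 2.],
              [0., 2., 3., 1.]])
print(pool2d(x, mode="max"))   # [[3. 4.] [5. 3.]]
print(pool2d(x, mode="mean"))  # [[1.5 1.75] [2. 1.5]]
```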
  A few more points about pooling deserve attention:
  Input data has a channel dimension as well as height and width, but the pooling operation does not change the number of channels of the input feature map: pooling is computed independently for each channel.
  Pooling is robust to small positional changes, which makes the model more robust overall: when the input features shift slightly, the output feature map often remains the same, as shown below: [figure: max pooling output unchanged under a small shift]
  The position of the data in the red box has changed, yet the returned result is the same. Pooling evidently attends to local features and is insensitive to subtle changes in the whole. This is a very useful property in image recognition, because we only care about the points in an image that matter for our judgment; we do not need to judge every detail of the global features.
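  A small demonstration of this robustness, with made-up values: shifting every pixel one step to the right leaves the 2×2 max-pooled output unchanged, because each maximum stays inside its pooling window:

```python
import numpy as np

def maxpool2x2(x):
    # Split the (4, 4) array into 2x2 blocks and take each block's maximum.
    return x.reshape(2, 2, 2, 2).max(axis=(1, 3))

x1 = np.array([[9, 0, 8, 0],
               [0, 0, 0, 0],
               [7, 0, 6, 0],
               [0, 0, 0, 0]])
x2 = np.roll(x1, shift=1, axis=1)  # every pixel moved one step to the right

print(maxpool2x2(x1))  # [[9 8] [7 6]]
print(maxpool2x2(x2))  # [[9 8] [7 6]]: the same output despite the shift
```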


Summary

  The convolutional neural network is an extremely important neural network model in deep learning, and the field of computer vision has developed on its foundation. Computer vision now has branches such as image classification, object detection, and image segmentation, with many deployed projects, and most of their algorithms are improvements built on convolutional neural networks. To learn convolutional neural networks well, you need to study them together with diagrams of the various network models; this article likewise relies on a large number of model diagrams to explain the network, to make it easier to understand and learn.

Origin: blog.csdn.net/didiaopao/article/details/126483397