[PaddlePaddle] [Study Notes] [Part 1] Computer Vision (Convolution, Convolution Kernels, Convolution Computation, Padding Computation, BN, Scaling, Shifting, Dropout)

1. The Development of Computer Vision

Computer vision is the discipline of teaching machines how to "see": recognizing the objects in images or videos captured by a camera, detecting where those objects are, tracking them, and then understanding and describing the scene or story in the picture or video, thereby modeling the human visual system. For this reason computer vision is also commonly called machine vision, and its goal is to build artificial systems that can "perceive" information from images or video.

After decades of development, computer vision technology has been applied in many fields, including transportation (license plate recognition, traffic violation capture), security (face-recognition gates, community surveillance), finance (face-scan payment, automatic document recognition at counters), healthcare (medical image diagnosis), and industrial production (automatic product defect inspection), changing both daily life and manufacturing. As the technology continues to evolve, more products and applications will emerge, bringing greater convenience and broader opportunities to our lives.

Insert image description here

PaddlePaddle provides a rich set of APIs for computer vision tasks, with low-level optimization and acceleration to guarantee their performance. It also ships a model zoo covering image classification, detection, segmentation, text recognition, video understanding, and more. Users can build models directly from these APIs or do secondary development on top of the provided models.

Due to space limitations, this chapter focuses on the classic model of computer vision (the convolutional neural network) and two typical tasks (image classification and object detection). It mainly covers the following content:

  • Convolutional neural networks: Convolutional Neural Networks (CNN) are the most classic model structure in computer vision. This tutorial introduces the commonly used building blocks of CNNs, including convolution, pooling, activation functions, batch normalization, and dropout.
  • Image classification: introduces the classic model structures for image classification, including LeNet, AlexNet, VGG, GoogLeNet, and ResNet, and demonstrates their application through an eye-disease screening case.
  • Object detection: introduces the YOLOv3 algorithm for object detection and demonstrates its application through a forestry pest detection case.

The development of computer vision starts with biological vision. The origin of biological vision is still debated: some researchers believe the earliest biological vision formed in jellyfish about 700 million years ago, while others believe it originated in the Cambrian period about 500 million years ago. The cause of the Cambrian explosion remains an unsolved mystery, but it is certain that Cambrian animals had visual abilities: predators could find prey more easily, and prey could detect natural enemies earlier. Vision intensified the game between hunter and prey and gave rise to harsher rules of survival and evolution. The formation of visual systems effectively promoted the evolution of the food chain and accelerated biological evolution, making it an important milestone in the history of life. After hundreds of millions of years of evolution, the human visual system has reached a very high level of complexity and power: the human brain contains on the order of 100 billion neurons connected into networks, and this enormous visual neural network lets us observe the world around us with ease, as shown in the figure below.

Insert image description here

It is very easy for humans to tell cats from dogs, but it is hard for a computer, and even an expert programmer cannot easily write a general-purpose program for it (for example, suppose the program assumes that the larger animal is a dog and the smaller one a cat; because of different shooting angles, a cat may occupy more pixels in an image than a dog). So how can computers understand the world around them the way humans do? Researchers approached this problem from different angles, which gave rise to a series of subtasks, as shown in the figure below.

Insert image description here

  1. Image Classification: identifies the categories of objects in an image (such as bottle, cup, cube).
  2. Object Detection: detects the category of each object in the image and accurately marks its location.
  3. Semantic Segmentation: labels the category that each pixel in the image belongs to; pixels of the same category are marked with the same color.
  4. Instance Segmentation: note that the object detection task in 2 only needs to mark the location of each object, while the instance segmentation task in 4 must mark not only the location but also the outline of each object.

In early image classification tasks, image features were usually extracted by hand first, and then machine learning algorithms were used to classify these features. The classification result depended strongly on the feature extraction method, and often only experienced researchers could do it well, as shown in the figure below.

Insert image description here

Against this background, feature extraction methods based on neural networks emerged. Yann LeCun was the first to apply convolutional neural networks to image recognition. The core idea is to use a convolutional neural network to extract image features and predict the image's category; the network parameters are continuously adjusted on training data, eventually yielding a network that can automatically extract image features and classify them, as shown in the figure below.

Insert image description here

This method achieved great success on handwritten digit recognition, but it did not develop further for quite some time. On the one hand, the datasets of the era were limited, so the approach could only handle simple tasks and was prone to overfitting on larger inputs; on the other hand, hardware was a bottleneck, and when the network became complex, computation was extremely slow.

Today, with the continued advance of Internet technology, the amount of data is growing rapidly and richer datasets keep emerging. Thanks to improvements in hardware, computing power has also become much stronger. Researchers keep applying new models and algorithms to computer vision, which has produced ever richer model structures and ever higher accuracy. At the same time, computer vision is addressing more and more problems, including classification, detection, segmentation, scene description, image generation, and style transfer, and it is no longer limited to 2-D images, extending to video processing and 3-D vision.

2. Convolutional Neural Networks (CNN)

Convolutional neural network is currently the most commonly used model structure in computer vision. This chapter mainly introduces some basic modules of convolutional neural networks, including:

  1. Convolution
  2. Pooling
  3. ReLU activation function
  4. Batch Normalization
  5. Dropout

To review, we previously introduced the handwritten digit recognition task with two models. The first one uses a fully connected network for feature extraction; the code is as follows:

# Fully connected neural network implementation
class MNIST_FC_Model(nn.Layer):
    def __init__(self):
        super(MNIST_FC_Model, self).__init__()

        # Two fully connected hidden layers with 256 and 64 units; the hidden sizes can be adjusted for the task
        self.classifier = nn.Sequential(nn.Linear(in_features=784, out_features=256),
                                        nn.Sigmoid(),
                                        nn.Linear(in_features=256, out_features=64),
                                        nn.Sigmoid())

        # One fully connected output layer with 10 units (one per digit class)
        self.head = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        # x.shape: [batch size, 1, 28, 28]
        x = paddle.flatten(x, start_axis=1)  # [batch size, 784]
        x = self.classifier(x)
        y = self.head(x)
        return y

We can see that in the forward function, the image must first be flattened into a one-dimensional vector before being fed into the feature extraction layers (classifier), which causes the following two problems:

  1. The spatial information of the input data is lost. Spatially adjacent pixels often have similar RGB values, and the data between the various RGB channels are usually closely related, but when converted into a 1-dimensional vector, this information is lost. At the same time, some essential patterns may be hidden in the shape information of the image data, but when it is converted into a 1-dimensional vector and input into a fully connected neural network, these patterns will also be ignored.

  2. Too many model parameters may lead to overfitting. In the case of handwritten digit recognition, each pixel is connected to all output neurons. When the image size becomes larger, the number of input neurons will increase according to the square of the image size, resulting in too many model parameters and prone to overfitting.

In order to solve the above problems, we introduce convolutional neural network (CNN) for feature extraction. The code is as follows:

# Multi-layer convolutional neural network implementation
class MNIST_CNN_Model(nn.Layer):
    def __init__(self, num_classes=10):
        super(MNIST_CNN_Model, self).__init__()

        self.classifier = nn.Sequential(
            nn.Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=2, stride=2),
            nn.Conv2D(in_channels=20, out_channels=20, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=2, stride=2))

        self.head = nn.Linear(in_features=980, out_features=num_classes)

    def forward(self, x):
        # x.shape: [batch size, 1, 28, 28]
        x = self.classifier(x)  # [batch size, 20, 7, 7]
        x = x.flatten(1)        # [batch size, 980]
        x = self.head(x)        # [batch size, num_classes]
        return x

We can see that the CNN does not need to flatten the image first; it extracts features directly from the 2-D image. This not only captures the feature patterns between adjacent pixels but also keeps the number of parameters independent of the image size. The figure below shows a typical convolutional neural network structure: multiple convolution and pooling layers are stacked on the input image, a series of fully connected layers is usually added at the end of the network, the ReLU activation function is usually applied to the outputs of the convolutional or fully connected layers, and Dropout is usually added to the network to prevent overfitting.

Insert image description here

Another point worth explaining is that in a CNN the computation is performed within a spatial neighborhood of each pixel, so the number of convolution kernel parameters is much smaller than in a fully connected layer. The convolution kernel itself is independent of the input image size; it represents the extraction of a certain feature pattern within a spatial neighborhood. For example, some kernels extract edge features and others extract corner features. Different regions of the image share the same kernel, so the same kernel can still be used when the input image size changes.
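
To make the parameter comparison concrete, here is a small back-of-the-envelope calculation using the layer sizes from the MNIST models above (the numbers are purely illustrative):

# Fully connected: flattened 28x28 input to 256 hidden units (weights + biases)
fc_params = 28 * 28 * 256 + 256      # 200,960 parameters

# Convolution: 20 kernels of size 5x5 on a 1-channel input (weights + biases)
conv_params = 20 * 1 * 5 * 5 + 20    # 520 parameters, independent of the image size

print(fc_params, conv_params)        # 200960 520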

3. Convolution

This section will introduce the principle and implementation scheme of the convolution algorithm, and show how to use convolution to operate on images through specific cases, mainly covering the following content:

  1. Convolution calculation
  2. padding
  3. stride
  4. Receptive Field
  5. Multiple input channels, multiple output channels and batch operations
  6. Introduction to the PaddlePaddle convolution API
  7. Examples of application of convolution operators

3.1 Convolution calculation

Convolution is an integral transform in mathematical analysis, and image processing uses its discrete form. It should be noted that what a convolutional layer actually implements is the cross-correlation operation defined in mathematics, which differs from the definition of convolution in mathematical analysis. This is consistent with other frameworks and with most tutorials on convolutional neural networks, which all use cross-correlation as the definition of convolution. The specific calculation process is shown in the figure below.

Cross-correlation is an operation commonly used in signal processing and image processing to measure the similarity between two signals. Mathematically, cross-correlation compares two functions and is often used to find where a pattern from one signal appears in another.
Given two discrete signals $x$ and $y$, their cross-correlation can be computed by the following formula:

$$(x \star y)[n] = \sum_{m=-\infty}^{\infty} x[m] \cdot y[n+m]$$

where $x[m]$ and $y[n+m]$ are the values of the signals $x$ and $y$ at the respective positions, and $n$ is the index of the resulting signal.

The calculation can be understood as sliding one signal $x$ in time, taking the dot product ($\cdot$) with the other signal $y$ and summing ($\sum$), which yields a new signal that indicates how similar the two signals are at different offsets. If the shapes of the two signals are similar at a certain position, the resulting cross-correlation has a larger value there.
In image processing, cross-correlation can be used to find where a pattern from one image matches another image. In deep learning, the convolution operation is in fact a form of cross-correlation used to extract features from images.

Insert image description here

Description:

  1. The convolution kernel is also called a filter. If the height and width of the kernel are $k_h$ and $k_w$, it is called a $k_h \times k_w$ convolution; for example, a $3 \times 5$ convolution means the kernel has height 3 and width 5.
  2. In a convolutional neural network, besides the convolution process described above, a convolution operator also adds a bias term. For example, if the bias is 1, the results of the convolution above become:
$$0 \times 1 + 1 \times 2 + 2 \times 4 + 3 \times 5 + 1 = 26\\ 0 \times 2 + 1 \times 3 + 2 \times 5 + 3 \times 6 + 1 = 32\\ 0 \times 4 + 1 \times 5 + 2 \times 7 + 3 \times 8 + 1 = 44\\ 0 \times 5 + 1 \times 6 + 2 \times 8 + 3 \times 9 + 1 = 50$$
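
As a sanity check, the following NumPy sketch reproduces the four results above. The 3×3 input [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and the 2×2 kernel [[0, 1], [2, 3]] are values inferred from the arithmetic above, not quoted from the figure.

import numpy as np

# Input, kernel and bias assumed from the worked example above
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype="float32")
kernel = np.array([[0, 1],
                   [2, 3]], dtype="float32")
bias = 1.0

k_h, k_w = kernel.shape
out = np.zeros((img.shape[0] - k_h + 1, img.shape[1] - k_w + 1), dtype="float32")
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # cross-correlation: element-wise product of the window with the kernel, summed, plus bias
        out[i, j] = np.sum(img[i:i + k_h, j:j + k_w] * kernel) + bias

print(out)  # [[26. 32.] [44. 50.]]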

3.2 Padding

In the example above, the input image size is $3 \times 3$ and the output image size is $2 \times 2$: after one convolution the image becomes smaller. With a kernel of height $k_h$ and width $k_w$, the size of the output feature map is computed as:

$$H_{\mathrm{out}} = H - k_h + 1\\ W_{\mathrm{out}} = W - k_w + 1$$

If the input size is 4 and the kernel size is 3, the output size is $4 - 3 + 1 = 2$. You can check for yourself that the formula also holds for other input and kernel sizes. Whenever the kernel size is larger than 1, the output feature map is smaller than the input image, and after several convolutions the output keeps shrinking. To prevent the image from shrinking after convolution, padding is usually added around the border of the image, as shown in the figure below.

Insert image description here

As shown in the figure:

  • With padding 1 and padding value 0, the input grows from $4 \times 4$ to $6 \times 6$; with a $3 \times 3$ kernel, the output image is $4 \times 4$.
  • With padding 2 and padding value 0, the input grows from $4 \times 4$ to $8 \times 8$; with a $3 \times 3$ kernel, the output image is $6 \times 6$.

If, along the image height, $p_{h1}$ rows are padded before the first row and $p_{h2}$ rows after the last row, and along the image width $p_{w1}$ columns are padded before the first column and $p_{w2}$ columns after the last column, then the padded image size is $(H + p_{h1} + p_{h2}) \times (W + p_{w1} + p_{w2})$. After applying a $k_h \times k_w$ convolution kernel, the output image size is:

$$H_{\mathrm{out}} = H + p_{h1} + p_{h2} - k_h + 1\\ W_{\mathrm{out}} = W + p_{w1} + p_{w2} - k_w + 1$$

During convolution, equal padding is usually used on both sides of the height and of the width, i.e., $p_{h1} = p_{h2} = p_h$ and $p_{w1} = p_{w2} = p_w$, and the formula becomes:

$$H_{\mathrm{out}} = H + 2p_h - k_h + 1\\ W_{\mathrm{out}} = W + 2p_w - k_w + 1$$

Kernel sizes are usually odd numbers such as $\{1, 3, 5, 7\}$. If the padding is $p_h = (k_h - 1)/2$ and $p_w = (k_w - 1)/2$, the image size stays unchanged after convolution. For example, with a kernel size of 3 and padding of 1 the image size is unchanged; similarly, a kernel size of 5 with padding of 2 also keeps the image size unchanged.

3.3 Stride

In the figure above, the convolution kernel slides one pixel at a time, which corresponds to a stride of 1. The figure below shows a convolution with a stride of 2: each time the kernel moves over the image, it moves by 2 pixels.

Insert image description here

When the strides in the height and width directions are $s_h$ and $s_w$, the output feature map size is:

$$H_{\mathrm{out}} = \frac{H + 2p_h - k_h}{s_h} + 1\\ W_{\mathrm{out}} = \frac{W + 2p_w - k_w}{s_w} + 1$$

Suppose the input image size is $H \times W = 100 \times 100$, the kernel size is $k_h \times k_w = 3 \times 3$, the padding is $p_h = p_w = 1$, and the stride is $s_h = s_w = 2$; then the output feature map size is:

$$H_{\mathrm{out}} = \frac{100 + 2 - 3}{2} + 1 = 50\\ W_{\mathrm{out}} = \frac{100 + 2 - 3}{2} + 1 = 50$$
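
The formulas above can be wrapped in a small helper (a sketch; the floor division mirrors how most frameworks handle sizes that are not evenly divisible by the stride):

def conv_output_size(size, kernel, padding=0, stride=1):
    # output size along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

# the example above: 100x100 input, 3x3 kernel, padding 1, stride 2
print(conv_output_size(100, 3, padding=1, stride=2))             # 50
# "same" padding for an odd kernel keeps the size unchanged when stride is 1
print(conv_output_size(100, 3, padding=(3 - 1) // 2, stride=1))  # 100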

3.4 Receptive Field

The value of each point on the output feature map is obtained by multiplying the elements of a $k_h \times k_w$ region of the input image with the corresponding elements of the convolution kernel and summing them, so changing the value of any element within that $k_h \times k_w$ region of the input changes the value of that output point. We call this region of the input image (Input) the receptive field of the corresponding point on the output feature map.

Remember: the receptive field always refers to a region on the input image. A change to any element inside the receptive field affects the value of the output point. For example, a $3 \times 3$ convolution has a receptive field of size $3 \times 3$, as shown in the figure below.

Insert image description here

After passing through two layers of $3 \times 3$ convolutions, the receptive field grows to $5 \times 5$, as shown in the figure below.

Insert image description here
Therefore, as the depth of the convolutional network increases, the receptive field grows, and a pixel of the output feature map contains more image semantic information.

As the network deepens, the feature maps become smaller and smaller while the receptive field becomes larger and larger. Taking the figure above as an example, the receptive field of output feature map 2 is $5 \times 5$, which means that one point of output feature map 2 aggregates and fuses $5 \times 5$ input pixels and therefore carries some semantic information. As the number of layers grows and the feature maps shrink, the semantic information becomes richer and richer, eventually carrying high-level semantic information.
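
A minimal sketch of how the receptive field grows with stacked convolutions; it uses the standard recursion in which each layer adds (kernel − 1) times the product of the strides of the layers before it (layers are listed as (kernel, stride) pairs from first to last):

def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, from the first layer to the last
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer enlarges the receptive field by (k - 1) * jump
        jump *= s             # jump = product of strides of the layers processed so far
    return rf

print(receptive_field([(3, 1)]))          # 3 -> one 3x3 convolution
print(receptive_field([(3, 1), (3, 1)]))  # 5 -> two stacked 3x3 convolutions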

3.5 Multiple input channels, multiple output channels and batch operations

The convolution calculation process introduced earlier is relatively simple, but in actual application, the problems dealt with are much more complex. For example: a color picture has three RGB channels and needs to handle multiple input channel scenarios. The output feature map often also has multiple channels, and in the calculation of neural networks, a batch of samples is often calculated together, so the convolution operator needs to have the function of batch processing of multi-input and multi-output channel data. The following will introduce the operation methods of these scenarios respectively.

3.5.1 Multiple input channel scenario

In the example above, the input to the convolution layer is a 2-D array, but in practice an image often has three RGB channels. To compute the convolution output, the form of the convolution kernel changes accordingly. Suppose the number of channels of the input image is $C_{in}$; then the input data has shape $C_{in} \times H_{in} \times W_{in}$, and the calculation proceeds as shown in the figure below.

Insert image description here

  1. Design a 2-D array as the convolution kernel for each input channel; the kernel array then has shape $C_{in} \times k_h \times k_w$.
  2. For each channel $c_{in} \in [0, C_{in})$, convolve the corresponding $k_h \times k_w$ kernel slice with the corresponding $H_{in} \times W_{in}$ 2-D array.
  3. Add the results of the $C_{in}$ channels together; the result is a 2-D array of shape $H_{out} \times W_{out}$ (a naive loop implementation of these steps follows this list).
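
The sketch below implements the three steps above with plain NumPy loops (no padding, stride 1); the sizes (3 channels, 5×5 input, 3×3 kernel) are arbitrary illustrative numbers:

import numpy as np

np.random.seed(0)
C_in, H, W = 3, 5, 5
k_h, k_w = 3, 3
x = np.random.rand(C_in, H, W).astype("float32")          # input: [C_in, H, W]
kernel = np.random.rand(C_in, k_h, k_w).astype("float32")  # one multi-channel kernel: [C_in, k_h, k_w]

H_out, W_out = H - k_h + 1, W - k_w + 1
out = np.zeros((H_out, W_out), dtype="float32")
for i in range(H_out):
    for j in range(W_out):
        # convolve each input channel with its own 2-D kernel slice, then sum over channels
        out[i, j] = np.sum(x[:, i:i + k_h, j:j + k_w] * kernel)

print(out.shape)  # (3, 3) -> a single-channel [H_out, W_out] feature map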

3.5.2 Multi-output channel scenario

Generally, the output feature map of a convolution also has multiple channels $C_{out}$. In that case we need $C_{out}$ kernels, each of shape $C_{in} \times k_h \times k_w$, so the full convolution kernel array has shape $C_{out} \times C_{in} \times k_h \times k_w$, as shown below.

Insert image description here

  1. For each output channel $c_{out} \in [0, C_{out})$, convolve the input image with a kernel of shape $C_{in} \times k_h \times k_w$ as described above.
  2. Stack the resulting $C_{out}$ two-dimensional arrays of shape $H_{out} \times W_{out}$ together to form a three-dimensional array of shape $C_{out} \times H_{out} \times W_{out}$.

Note❗️: The number of output channels of a convolution kernel is usually called the number of convolution kernels.

3.5.3 Batch operation (Batch)

In convolutional neural network computation, multiple samples are usually grouped into a mini-batch, so the input data has dimensions $N \times C_{in} \times H_{in} \times W_{in}$. Since the same kernels are used for every image, the kernel array has the same dimensions as in the multi-output-channel case above, namely $C_{out} \times C_{in} \times k_h \times k_w$, and the output feature map has dimensions $N \times C_{out} \times H_{out} \times W_{out}$, as shown below.

Insert image description here
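
The following is a naive NumPy sketch of the full batched, multi-input, multi-output convolution (no padding, stride 1); the shapes (N=2, C_in=3, C_out=4, 3×3 kernels) are arbitrary illustrative numbers:

import numpy as np

def conv2d_naive(x, w, b):
    # x: [N, C_in, H, W], w: [C_out, C_in, k_h, k_w], b: [C_out]
    N, C_in, H, W = x.shape
    C_out, _, k_h, k_w = w.shape
    H_out, W_out = H - k_h + 1, W - k_w + 1
    out = np.zeros((N, C_out, H_out, W_out), dtype=x.dtype)
    for n in range(N):
        for co in range(C_out):
            for i in range(H_out):
                for j in range(W_out):
                    # sum over all input channels and the kernel window, then add the bias
                    out[n, co, i, j] = np.sum(x[n, :, i:i + k_h, j:j + k_w] * w[co]) + b[co]
    return out

x = np.random.rand(2, 3, 5, 5).astype("float32")   # N=2, C_in=3, 5x5 images
w = np.random.rand(4, 3, 3, 3).astype("float32")   # C_out=4 kernels of shape [C_in, 3, 3]
b = np.zeros(4, dtype="float32")
print(conv2d_naive(x, w, b).shape)  # (2, 4, 3, 3) -> [N, C_out, H_out, W_out]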

3.6 PaddlePaddle convolution API introduction

The PaddlePaddle convolution operator is exposed as paddle.nn.Conv2D; users can call the API directly or modify it as a starting point. The "2D" in the name Conv2D indicates that the convolution kernel is two-dimensional, which is mostly used for image data. Similarly, there is also Conv3D, which can be used to process video data (sequences of images).

It should be noted that the API for convolution in PyTorch is: torch.nn.Conv2d, where the dimension D is lowercase.

class paddle.nn.Conv2D(in_channels, out_channels,
                       kernel_size, stride=1, padding=0,
                       dilation=1, groups=1, padding_mode='zeros',
                       weight_attr=None, bias_attr=None, data_format='NCHW')

Commonly used parameters are as follows :

  • in_channels(int): The number of channels of the input image.
  • out_channels(int): the number of convolution kernels, which equals the number of output feature map channels, i.e., $C_{out}$ above.
  • kernel_size(int | list | tuple): convolution kernel size. It can be an integer, such as 3, meaning the kernel height and width are both 3, or a list of two integers, such as [3, 2], meaning the kernel height is 3 and its width is 2.
  • stride(int | list | tuple, optional): step size, which can be an integer. The default value is 1, which means the vertical and horizontal sliding steps are both 1; or a list of two integers, such as [3,2], Indicates that the vertical sliding step is 3 and the horizontal sliding step is 2.
  • padding (int | list | tuple, optional): padding size, which can be an integer, such as 1, indicating that the vertical and horizontal boundary padding sizes are both 1; or a list of two integers, such as [2,1], indicating The vertical border padding size is 2 and the horizontal border padding size is 1.

When a list is used to assign kernel_size, stride, or padding, the first element refers to H and the second to W.

Shape summary:

  • Input data dimensions: $[N, C_{in}, H_{in}, W_{in}]$
  • Output data dimensions: $[N, C_{out}, H_{out}, W_{out}]$
  • Weight parameter $w$ (convolution kernel parameters): $[\mathrm{out\_channels}, C_{in}, \mathrm{filter\_size\_h}, \mathrm{filter\_size\_w}]$
  • Bias parameter $b$: $[\mathrm{out\_channels}]$

Note❗️: even if the input is a single grayscale image $[H_{in}, W_{in}]$, it still needs to be reshaped into a four-dimensional input $[1, 1, H_{in}, W_{in}]$.
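
As a quick check of these shape conventions, here is a small example (the batch size of 8, the 16 output channels, and the 32×32 input are arbitrary illustrative numbers):

import paddle
import paddle.nn as nn

conv = nn.Conv2D(in_channels=3, out_channels=16,
                 kernel_size=3, stride=1, padding=1)

x = paddle.rand([8, 3, 32, 32])   # [N, C_in, H_in, W_in]
y = conv(x)

print(conv.weight.shape)  # [16, 3, 3, 3] -> [out_channels, C_in, filter_size_h, filter_size_w]
print(conv.bias.shape)    # [16]
print(y.shape)            # [8, 16, 32, 32] -> [N, C_out, H_out, W_out]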

3.7 Application examples of the convolution operator paddle.nn.Conv2D

The following three cases apply the convolution operator paddle.nn.Conv2D to images and examine the computed results.

3.7.1 Case 1 - Simple black and white boundary detection

The following uses the Conv2D operator to complete a simple image boundary detection task. The left side of the image is bright and the right side is dark, and we want to detect the boundary between the bright and dark regions.

Set the kernel parameters along the width direction to $[1, 0, -1]$. This kernel subtracts the values of two pixels that are one position apart in the width direction. When the kernel slides over the image, if the pixels it covers lie in a region of the same brightness, the difference between two pixels one position apart is 0; only when some of the covered pixels are in the bright region and some are in the dark region is the difference nonzero. Applying this kernel to the image, only the pixel values on the output feature map that correspond to the bright/dark boundary are nonzero. The code is shown below, and the results follow in the figure.

import matplotlib.pyplot as plt
import numpy as np
import paddle
import paddle.nn as nn
from paddle.nn.initializer import Assign


if __name__ == "__main__":
    # 创建初始化权重参数 w
    w = np.array([1, 0, -1], dtype="float32")
    # 将权重矩阵调整为 卷积核 的样式 -> [C_out, c_in, k_h, k_w]
    w = w.reshape(1, 1, 1, 3)
    
    # 创建卷积算子,设置输出通道数、卷积核大小和初始化权重参数
    # 创建卷积算子的时候,通过参数属性weight_attr指定参数初始化方式
    conv = nn.Conv2D(in_channels=1, out_channels=1, 
                     kernel_size=(1, 3),   # k_h = 1, k_w = 3
                     padding=0, stride=1,
                     weight_attr=paddle.ParamAttr(initializer=Assign(value=w)))
    
    # 创建输入图片,图片左边的像素点取值为1,右边的像素点取值为0
    img = np.ones(shape=[50, 50], dtype="float32")
    img[:, 30:] = 0.0
    
    # 调整图片尺寸以符合Conv2D的输入要求
    x = img.reshape(1, 1, 50, 50)
    
    # 将 ndarray 转换为 tensor
    x = paddle.to_tensor(x)
    
    # 使用卷积对输入图片进行特征提取
    out = conv(x)
    
    # 将 tensor 转换为 ndarray 以方便我们画图
    out = out.numpy()
    
    # 开始画图
    fig, axes = plt.subplots(1, 2, dpi=100)
    axes[0].imshow(img, cmap="gray")
    axes[0].set_title("origin image")
    
    axes[1].imshow(np.squeeze(out), cmap='gray')
    axes[1].set_title("convolved image")
    plt.show()

Insert image description here

3.7.2 Case 2 - Object edge detection in images

Shown above is an artificially constructed simple image on which a convolution detects the bright/dark boundary. For a real photograph, you can also apply an appropriate kernel (a $3 \times 3$ kernel whose center value is 8 and whose surrounding values are all -1) to detect object outlines, then observe the correspondence between the output feature map and the original image, as shown in the code below:

import matplotlib.pyplot as plt
import numpy as np
import paddle
import paddle.nn as nn
from paddle.nn.initializer import Assign
from PIL import Image


if __name__ == "__main__":
    # 创建初始化权重参数 w
    w = np.array([[-1, -1, -1],
                  [-1, 8, -1],
                  [-1, -1, -1]], dtype="float32") / 8
    # 将权重矩阵调整为 卷积核 的样式 -> [C_out, c_in, k_h, k_w]
    w = w.reshape(1, 1, 3, 3)
    
    # 由于输入通道数是3,因此需要调整卷积核的通道数,与输入图片一致
    w = np.repeat(w, repeats=3, axis=1)  # 沿着通道方向重复3次
    
    # 创建卷积算子,设置输出通道数、卷积核大小和初始化权重参数
    # 创建卷积算子的时候,通过参数属性weight_attr指定参数初始化方式
    conv = nn.Conv2D(in_channels=3, out_channels=1, 
                     kernel_size=(3, 3),   # k_h = 3, k_w = 3
                     padding=0, stride=1,
                     weight_attr=paddle.ParamAttr(initializer=Assign(value=w)))
    
    # 读取输入图片
    img = Image.open("Tom_and_Jerry.jpg")
    
    # 转换图片格式
    x = np.array(img, dtype="float32")
    
    # 调整图片形状以符合Conv2D的输入要求
    # [H, W, C] -> [C, H, W]
    x = np.transpose(x, [2, 0, 1])
    
    # 添加Batch维度
    x = x.reshape(1, 3, img.height, img.width)
    
    # 将 ndarray 转换为 tensor
    x = paddle.to_tensor(x)
    
    # 使用卷积对输入图片进行特征提取
    out = conv(x)
    
    # 将 tensor 转换为 ndarray 以方便我们画图
    out = out.numpy()
    
    # 开始画图
    fig, axes = plt.subplots(1, 2, dpi=100)
    axes[0].imshow(img)
    axes[0].set_title("origin image")
    
    axes[1].imshow(np.squeeze(out), cmap="gray")
    axes[1].set_title("convolved image")
    plt.savefig("应用2.png", dpi=300)
    plt.show()

Insert image description here

3.7.3 Case 3 - Image mean blur

Another common convolution kernel ( 5 × 5 5\times 55×5 的卷积核中每个值均为 1)是用当前像素跟它邻域内的像素取平均,这样可以使图像上噪声比较大的点变得更平滑,如下代码所示:

import matplotlib.pyplot as plt
import numpy as np
import paddle
import paddle.nn as nn
from paddle.nn.initializer import Assign
from PIL import Image


if __name__ == "__main__":
    # 创建初始化权重参数 w
    w = np.ones([1, 1, 5, 5], dtype="float32") / 25  # [C_out, c_in, k_h, k_w]
    
    # 创建卷积算子,设置输出通道数、卷积核大小和初始化权重参数
    # 创建卷积算子的时候,通过参数属性weight_attr指定参数初始化方式
    conv = nn.Conv2D(in_channels=1, out_channels=1, 
                     kernel_size=(5, 5),   # k_h = 5, k_w = 5
                     padding=0, stride=1,
                     weight_attr=paddle.ParamAttr(initializer=Assign(value=w)))
    
    # 读取输入图片
    img = Image.open("Tom_and_Jerry.jpg").convert("L")

    # 转换图片格式
    img = np.array(img, dtype="float32")
    
    # 调整图片形状以符合Conv2D的输入要求
    # [H, W] -> [N, C, H, W]
    x = img.reshape(1, 1, img.shape[0], img.shape[1])
    
    # 将 ndarray 转换为 tensor
    x = paddle.to_tensor(x)
    
    # 使用卷积对输入图片进行特征提取
    out = conv(x)
    
    # 将 tensor 转换为 ndarray 以方便我们画图
    out = out.numpy()
    
    # 开始画图
    fig, axes = plt.subplots(1, 2, dpi=100)
    axes[0].imshow(img, cmap="gray")
    axes[0].set_title("origin image")
    
    axes[1].imshow(np.squeeze(out), cmap="gray")
    axes[1].set_title("convolved image")
    plt.savefig("应用3.png", dpi=300)
    plt.show()

Insert image description here

4. Pooling

Pooling replaces the output of the network at a given location with an aggregate statistic of the nearby outputs. Its benefit is that when the input is shifted slightly, most of the pooled outputs stay the same. For example, when deciding whether an image contains a face, we need to know that there is an eye on the left and an eye on the right of the face, but we do not need the eyes' exact positions; pooling the pixels of a region to obtain an aggregate statistic is very useful here. Because the feature map becomes smaller after pooling, if a fully connected layer follows, pooling also effectively reduces the number of neurons, saving storage and improving computational efficiency. As shown in the figure below, a $2 \times 2$ region is pooled into a single pixel. There are two common methods: average pooling and max pooling.

Insert image description here

  • Figure (a): average pooling (Average Pooling). Here the pooling window is $2 \times 2$ and moves with a stride of 2; the pixels covered by the window are averaged to produce the corresponding output pixel.
  • Figure (b): max pooling (Max Pooling). The maximum of the pixels covered by the window becomes the output pixel.

As the pooling window slides over the image, the whole output feature map is produced. The window size is called the pooling size and is written $k_h \times k_w$. The most common choice in convolutional neural networks is a $2 \times 2$ window with a stride of 2.

Similar to a convolution kernel, when the pooling window slides over the image, the step of each move is called the stride; when the steps in the width and height directions differ, they are written $s_h \times s_w$. The image to be pooled can also be padded, in the same way as for convolution. Suppose $p_{h1}$ rows are padded before the first row, $p_{h2}$ rows after the last row, $p_{w1}$ columns before the first column, and $p_{w2}$ columns after the last column; then the output feature map size of the pooling layer is:

$$H_{\mathrm{out}} = \frac{H + p_{h1} + p_{h2} - k_h}{s_h} + 1\\ W_{\mathrm{out}} = \frac{W + p_{w1} + p_{w2} - k_w}{s_w} + 1$$

In convolutional neural networks, a $2 \times 2$ pooling window with stride 2 and padding 0 is usually used, in which case the output feature map size is:

$$H_{\mathrm{out}} = \frac{H}{2}\\ W_{\mathrm{out}} = \frac{W}{2}$$

With pooling in this way, the height and width of the output feature map are halved, but the number of channels does not change .
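
A short example of both pooling operators (the 28×28 input size is an arbitrary illustrative number):

import paddle
import paddle.nn as nn

x = paddle.rand([1, 3, 28, 28])                    # [N, C, H, W]
max_pool = nn.MaxPool2D(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2D(kernel_size=2, stride=2)

print(max_pool(x).shape)  # [1, 3, 14, 14] -> H and W halved, channel count unchanged
print(avg_pool(x).shape)  # [1, 3, 14, 14]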

5. ReLU activation function

5.1 Comparison of Sigmoid and ReLU activation functions

In the network structures introduced earlier, the Sigmoid function was used as the activation function. Sigmoid was used frequently in the early days of neural networks, but the most commonly used activation function today is ReLU, because Sigmoid easily causes the gradient to decay during backpropagation. Let's look more closely at the form of the Sigmoid function to see this problem.

The Sigmoid activation function is defined as follows:

$$y = \frac{1}{1 + e^{-x}}$$

The ReLU activation function is defined as follows:

$$y = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$$

The following program plots the Sigmoid and ReLU functions:

import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))


# 创建数据x
x = np.arange(-10, 10, 0.1)

# 计算Sigmoid函数
s = 1.0 / (1 + np.exp(0. - x))

# 计算ReLU函数
y = np.clip(x, a_min=0., a_max=None)

# 以下部分为画图代码
f = plt.subplot(121)
plt.plot(x, s, color='r')
currentAxis=plt.gca()
plt.text(-9.0, 0.9, r'$y=Sigmoid(x)$', fontsize=13)
currentAxis.xaxis.set_label_text('x', fontsize=15)
currentAxis.yaxis.set_label_text('y', fontsize=15)

f = plt.subplot(122)
plt.plot(x, y, color='g')
plt.text(-3.0, 9, r'$y=ReLU(x)$', fontsize=13)
currentAxis=plt.gca()
currentAxis.xaxis.set_label_text('x', fontsize=15)
currentAxis.yaxis.set_label_text('y', fontsize=15)

plt.savefig("两种激活函数对比.png")
plt.show()

Insert image description here

5.2 The vanishing gradient phenomenon

In neural networks, the phenomenon that the gradient value decays close to zero after backpropagation is called the vanishing gradient phenomenon.

From the curves above we can see that when $x$ is a large positive number, the value of the Sigmoid function is very close to 1, the curve becomes very flat, and the derivative of the Sigmoid function is close to 0 in that region. When $x$ is a very negative number, the Sigmoid value is very close to 0, the curve is also very flat, and the derivative is again close to 0. Only when $x$ is near 0 is the derivative of the Sigmoid function relatively large. Differentiating the Sigmoid function gives:

$$\frac{dy}{dx} = -\frac{1}{(1 + e^{-x})^2} \cdot \frac{d(e^{-x})}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{2 + e^x + e^{-x}}$$

From this formula, the maximum value of the derivative $\frac{dy}{dx}$ of the Sigmoid function is $\frac{1}{4}$. In the forward pass $y = \mathrm{Sigmoid}(x)$; in the backward pass the gradient of $x$ equals the gradient of $y$ multiplied by the derivative of the Sigmoid function:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

so the gradient of $x$ is at most $\frac{1}{4}$ of the gradient of $y$.

Since the parameters of a neural network are randomly initialized, $x$ may well fall in a region of large magnitude where the derivative of the Sigmoid function is close to 0, making the gradient of $x$ close to 0; and even when $x$ is close to 0, by the analysis above the gradient of $x$ after backpropagating through the Sigmoid function is still no more than $\frac{1}{4}$ of the gradient of $y$. If a multi-layer network uses the Sigmoid activation function, the gradients reaching the earlier layers decay to very small values.

The ReLU function is different. Although the derivative of ReLU is 0 where $x < 0$, its derivative is 1 where $x \ge 0$, so the gradient of $y$ passes through to $x$ unchanged and the gradient does not vanish.
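
A quick numerical check of this argument (a sketch; the exact ReLU gradient at x = 0 depends on the framework's convention):

import paddle
import paddle.nn.functional as F

x1 = paddle.to_tensor([-5.0, 0.0, 5.0], stop_gradient=False)
F.sigmoid(x1).sum().backward()
print(x1.grad)  # roughly [0.0066, 0.25, 0.0066] -> never larger than 1/4

x2 = paddle.to_tensor([-5.0, 0.0, 5.0], stop_gradient=False)
F.relu(x2).sum().backward()
print(x2.grad)  # roughly [0., 0., 1.] -> the gradient passes through unchanged where x > 0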

6. Batch Normalization

Batch Normalization (BatchNorm) was proposed by Ioffe and Szegedy in 2015 and has been widely used in deep learning. Its purpose is to normalize the outputs of the intermediate layers of a neural network so that those outputs become more stable.

We usually normalize the input data of a neural network so that the processed samples have a mean $\mu$ of 0 and a variance of 1, because a relatively fixed input distribution benefits the stability and convergence of the algorithm. For a deep neural network, however, since the parameters are updated constantly, the inputs received by the later layers still change drastically even if the original input data have been standardized, which usually makes the numerical values unstable and the model hard to converge. BatchNorm makes the outputs of the intermediate layers of the network more stable and has the following three advantages:

  1. Make learning happen quickly (can use larger learning rates)
  2. Reduce model sensitivity to initial values
  3. Suppress overfitting to a certain extent

The main idea of BatchNorm is to normalize the values of the neurons per mini-batch during training so that the data distribution has a mean $\mu$ of 0 and a variance $\sigma^2$ of 1. The specific calculation process is as follows:

6.1 Step 1: compute the mean $\mu_B$ of the samples in the mini-batch

$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^m x^{(i)}$$

where $x^{(i)}$ denotes the $i$-th sample in the mini-batch.

For example, the input mini-batch contains 3 samples, each sample has 2 features, which are:

$$x^{(1)} = (1, 2), \quad x^{(2)} = (3, 6), \quad x^{(3)} = (5, 10)$$

Calculate the mean of the samples in the mini-batch for each feature:

$$\mu_{B0} = \frac{1+3+5}{3} = 3, \quad \mu_{B1} = \frac{2+6+10}{3} = 6$$

Then the sample mean is:

$$\mu_B = (\mu_{B0}, \mu_{B1}) = (3, 6)$$

The average is calculated according to the feature dimension.

6.2 Step 2: compute the variance $\sigma_B^2$ of the samples in the mini-batch

$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu_B)^2$$

The two formulas above first compute the mean $\mu_B$ and variance $\sigma_B^2$ of the samples in a batch, and then normalize the input data to a distribution with mean 0 and variance 1.

For the input data $x^{(1)}, x^{(2)}, x^{(3)}$ given above, the variance of each feature can be computed:

$$\sigma_{B0}^2 = \frac{1}{3} \cdot \left[(1-3)^2 + (3-3)^2 + (5-3)^2\right] = \frac{8}{3}\\ \sigma_{B1}^2 = \frac{1}{3} \cdot \left[(2-6)^2 + (6-6)^2 + (10-6)^2\right] = \frac{32}{3}$$

Then the sample variance is:

$$\sigma_B^2 = (\sigma_{B0}^2, \sigma_{B1}^2) = \left(\frac{8}{3}, \frac{32}{3}\right)$$

The variance is also calculated according to the feature dimensions.

6.3 Step 3: compute the normalized output $\hat{x}^{(i)}$

$$\hat{x}^{(i)} \leftarrow \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\epsilon$ is a tiny value (for example 1e-7) whose main purpose is to prevent the denominator from being 0.

For the input data $x^{(1)}, x^{(2)}, x^{(3)}$ given above, the normalized output can be computed:

$$\hat{x}^{(1)} = \left(\frac{1 - 3}{\sqrt{8/3}}, \frac{2 - 6}{\sqrt{32/3}}\right) = \left(-\sqrt{\frac{3}{2}}, -\sqrt{\frac{3}{2}}\right)\\ \hat{x}^{(2)} = \left(\frac{3 - 3}{\sqrt{8/3}}, \frac{6 - 6}{\sqrt{32/3}}\right) = (0, 0)\\ \hat{x}^{(3)} = \left(\frac{5 - 3}{\sqrt{8/3}}, \frac{10 - 6}{\sqrt{32/3}}\right) = \left(\sqrt{\frac{3}{2}}, \sqrt{\frac{3}{2}}\right)$$

We can verify whether the normalized data indeed have a mean of 0 and a variance of 1:

import numpy as np


def calc_mean_and_var(data):
    # 计算均值
    mean = np.mean(data, axis=0)
    # 计算方差
    variance = np.var(data, axis=0)

    return mean, variance


def normalization(data, mean, var):
    return (data - mean) / np.sqrt(var)


if __name__ == "__main__":
    # 定义输入数据
    origin_data = np.array([[1, 2], [3, 6], [5, 10]])
    mean, var = calc_mean_and_var(origin_data)
    print("[原始数据] 均值:", mean)
    print("[原始数据] 方差:", var)
    
    # 求归一化的数据
    normalization_data = normalization(origin_data, mean, var)
    print("归一化后的数据:\n", normalization_data)
    
    # 验证过归一化数据是否符合均值为0方差为1
    mean_norm, var_norm = calc_mean_and_var(normalization_data)
    print("[归一化] 均值:", mean_norm)
    print("[归一化] 方差:", var_norm)

result:

[原始数据] 均值: [3. 6.]
[原始数据] 方差: [ 2.66666667 10.66666667]
归一化后的数据:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
[归一化] 均值: [0. 0.]
[归一化] 方差: [1. 1.]

If the output of a layer were forced into a fully standardized distribution, certain feature patterns might be lost, so after normalization BatchNorm additionally scales and shifts the data:

$$y_i \leftarrow \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable parameters that can be initialized to $\gamma = 1, \beta = 0$ and are continually adjusted during training.

Listed above is the calculation logic of the BatchNorm method. Examples of the two types of input data formats are given below. PaddlePaddle supports four dimensions of input data: 2, 3, 4, and 5. Examples of dimension sizes 2 and 4 are given here.

6.4 Example

6.4.1 Example 1: input data of shape $[N, K]$

When the input data shape is $[N, K]$, it generally corresponds to the output of a fully connected layer. In this case, the mean and variance over the $N$ samples are computed for each of the $K$ components; the data and parameters correspond as follows:

  • Input $x$: $[N, C]$
  • Output $y$: $[N, C]$
  • Mean $\mu_B$: $[C]$
  • Variance $\sigma_B^2$: $[C]$
  • Scale parameter $\gamma$: $[C]$
  • Shift parameter $\beta$: $[C]$

Sample code looks like this:

import numpy as np
import paddle
import paddle.nn as nn


def calc_mean_and_var(data):
    # 计算均值
    mean = np.mean(data, axis=0)
    # 计算方差
    variance = np.var(data, axis=0)

    return mean, variance


if __name__ == "__main__":
    # 定义数据
    data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype="float32")  # [N, C]

    # 使用 BN 计算归一化后的输出
    bn = nn.BatchNorm1D(num_features=3)  # 参数为通道数
    x = paddle.to_tensor(data)
    y = bn(x).numpy()  # [N, C]
    print(f"BN 层的输出为:\n{y}\n其shape为: {y.shape}")

    # 验证
    mean, var = calc_mean_and_var(y)
    print(f"BN 后的均值为: {mean}, 其shape为: {mean.shape}")  # [C, ]
    print(f"BN 后的方差为: {var}, 其shape为: {var.shape}")  # [C, ]
    print(f"BN 的缩放参数为: {bn.weight}")  # [C, ]
    print(f"BN 的平移参数为: {bn.bias}")  # [C, ]

result:

BN 层的输出为:
[[-1.2247438 -1.2247438 -1.2247438]
 [ 0.         0.         0.       ]
 [ 1.2247438  1.2247438  1.2247438]]
其shape为: (3, 3)

BN 后的均值为: [0. 0. 0.], 其shape为: (3,)
BN 后的方差为: [0.99999833 0.99999833 0.99999833], 其shape为: (3,)
BN 的缩放参数为: Parameter containing: Tensor(shape=[3], dtype=float32, place=Place(gpu:0), stop_gradient=False, [1., 1., 1.])
BN 的平移参数为: Parameter containing: Tensor(shape=[3], dtype=float32, place=Place(gpu:0), stop_gradient=False, [0., 0., 0.])

6.4.2 Example 2: input data of shape $[N, C, H, W]$

When the input data shape is $[N, C, H, W]$, it generally corresponds to the output of a convolutional layer. In this case the computation is expanded along the $C$ dimension: for each channel, the mean and variance are computed over the $N \times H \times W$ pixels of the $N$ samples. The data and parameters correspond as follows:

  • Input $x$: $[N, C, H, W]$
  • Output $y$: $[N, C, H, W]$
  • Mean $\mu_B$: $[C]$
  • Variance $\sigma_B^2$: $[C]$
  • Scale parameter $\gamma$: $[C]$
  • Shift parameter $\beta$: $[C]$

Someone may ask: "Doesn't BatchNorm also apply an affine transformation to the standardized result? How can the result computed with NumPy match the BatchNorm operator?" This is because the BatchNorm operator automatically sets the initial values $\gamma = 1, \beta = 0$ (as we can see in the output of the code above), and with these values the affine transformation is the identity. During training these two parameters are continually learned, and the affine transformation then takes effect.

Sample code is shown below.

import numpy as np
import paddle
import paddle.nn as nn


def calc_mean_and_var(data):
    # 计算均值
    mean = np.mean(data, axis=0)
    # 计算方差
    variance = np.var(data, axis=0)

    return mean, variance


if __name__ == "__main__":
    paddle.seed(100)
    np.random.seed(100)

    # 定义数据
    data = np.random.random((10, 3, 64, 64)).astype("float32")  # [N, C, H, W]
    print(data.shape)  # (10, 3, 64, 64)

    # 使用 BN 计算归一化后的输出
    bn = nn.BatchNorm2D(num_features=3)  # 参数为通道数
    x = paddle.to_tensor(data)
    y = bn(x).numpy()  # [N, C, H, W]
    print(f"BN 层的输出.shape为: {y.shape}")

    # Verification. Note: calc_mean_and_var reduces over the N axis (axis=0) only, so the shapes below are [C, H, W];
    # to get the per-channel statistics [C, ] that BatchNorm2D actually uses, reduce over axes (0, 2, 3) instead.
    mean, var = calc_mean_and_var(y)
    print(f"BN 后的均值.shape为: {mean.shape}")  # [C, H, W]
    print(f"BN 后的方差.shape为: {var.shape}")  # [C, H, W]
    print(f"BN 的缩放参数.shape: {bn.weight.shape}")  # [C, ]
    print(f"BN 的平移参数为.shape: {bn.bias.shape}")  # [C, ]

result:

BN 层的输出.shape为: (10, 3, 64, 64)
BN 后的均值.shape为: (3, 64, 64)
BN 后的方差.shape为: (3, 64, 64)
BN 的缩放参数.shape: [3]
BN 的平移参数为.shape: [3]

Tip: the output calculated here with numpy differs slightly from the result of the BatchNorm2D operator, because BatchNorm2D adds a relatively small floating-point number epsilon=1e-05 to the denominator to ensure numerical stability.

6.5 Using BatchNorm when predicting

The above describes the method of using BatchNorm to normalize a batch of samples during the training process. However, if the same method is used to normalize a batch of samples that need to be predicted, uncertainty will occur in the prediction results .

For example, if sample A and sample B are used as a batch of samples to calculate the mean and variance, and if sample A, sample C, and sample D are used as a batch of samples to calculate the mean and variance, the results obtained are generally different. Then the prediction result of sample A will become uncertain, which is unreasonable for the prediction process. The solution is to save the mean and variance of a large number of samples during the training process, and use the saved values ​​directly during prediction without recalculating. In fact, in the specific implementation of BatchNorm, the moving average of the mean and variance is calculated during training. In PaddlePaddle, the default calculation method is as follows:

$$\mathrm{saved\_\mu_B} \leftarrow \mathrm{saved\_\mu_B} \times 0.9 + \mu_B \times (1 - 0.9)\\ \mathrm{saved\_\sigma_B^2} \leftarrow \mathrm{saved\_\sigma_B^2} \times 0.9 + \sigma_B^2 \times (1 - 0.9)$$

At the beginning of training, $\mathrm{saved\_\mu_B}$ and $\mathrm{saved\_\sigma_B^2}$ are set to 0. Each time a new batch of samples is fed in, $\mu_B$ and $\sigma_B^2$ are computed and then $\mathrm{saved\_\mu_B}$ and $\mathrm{saved\_\sigma_B^2}$ are updated by the formula above; their values are updated continually during training and saved as parameters of the BatchNorm layer. At prediction time, the saved $\mathrm{saved\_\mu_B}$ and $\mathrm{saved\_\sigma_B^2}$ are loaded and used in place of $\mu_B$ and $\sigma_B^2$.
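
A tiny sketch of this running-statistics update (the per-batch statistics below are made-up numbers; 0.9 matches the default momentum used in the formula above):

momentum = 0.9
saved_mu, saved_var = 0.0, 0.0

# suppose these are the per-batch (mean, variance) pairs seen during training
batch_stats = [(0.5, 1.2), (0.3, 0.9), (0.6, 1.1)]
for mu_b, var_b in batch_stats:
    saved_mu = saved_mu * momentum + mu_b * (1 - momentum)
    saved_var = saved_var * momentum + var_b * (1 - momentum)

print(saved_mu, saved_var)  # these running values replace the batch statistics at inference time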

7. Dropout

Dropout is a commonly used method in deep learning to suppress overfitting. Its idea is to randomly drop a portion of the neurons during training: a subset of neurons is randomly selected and their outputs are set to 0, so these neurons do not transmit signals forward.

The figure below is a schematic diagram of Dropout. On the left is the complete neural network, and on the right is the network structure after applying Dropout. After applying Dropout, the neurons marked with $\times$ are removed from the network so that they do not transmit signals to the subsequent layers. Which neurons are dropped is decided randomly during learning, so the model does not become overly dependent on particular neurons, which can suppress overfitting to a certain extent.

Insert image description here


Q1 : Is the Dropout operation performed on the input feature map?
A1 : Yes, the Dropout operation is performed on the input feature map.


Q2 : If a picture is sent to the network, how will Dropout discard it? Discard pixels?
A2 : Yes, when an image is fed into the neural network as input, the Dropout operation will randomly discard some pixels. Specifically, Dropout independently randomly selects some pixels during each forward pass and sets them to zero, thus "turning off" the information corresponding to these pixels. This process is equivalent to blocking the input image, simulating the loss of some pixel information.

Please note ❗️: Dropout is generally applied to feature maps, specifically setting certain pixels of the feature map to 0.


At prediction time, the signals of all neurons are passed forward, which introduces a new problem: because some neurons were randomly dropped during training, the overall magnitude of the output data becomes smaller. For example, its $L_1$ norm is smaller than without Dropout, whereas no neurons are dropped at prediction time, so the data distributions during training and prediction differ. To solve this problem, PaddlePaddle supports the following two modes:

  1. downscale_in_infer: during training, randomly drop a fraction $r$ of the neurons and do not pass their signals forward; at prediction time, pass the signals of all neurons forward but multiply each neuron's value by $(1 - r)$.
  2. upscale_in_train: during training, randomly drop a fraction $r$ of the neurons and do not pass their signals forward, but divide the values of the retained neurons by $(1 - r)$; at prediction time, pass the signals of all neurons forward without any further processing (a small numerical sketch of the two modes follows this list).
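
The sketch below works through the two modes on a toy vector (the drop probability and numbers are illustrative; the mask is just one random draw):

import numpy as np

np.random.seed(0)
r = 0.5                               # drop probability
x = np.arange(1.0, 7.0)               # toy activations
mask = (np.random.rand(x.size) >= r)  # True = keep, False = drop

# upscale_in_train: scale the kept values by 1/(1-r) during training, do nothing at inference
train_out = x * mask / (1 - r)
infer_out = x

# downscale_in_infer: keep values unscaled during training, multiply by (1-r) at inference
train_out_2 = x * mask
infer_out_2 = x * (1 - r)

print(train_out, infer_out)
print(train_out_2, infer_out_2)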

In PaddlePaddle's Dropout API, the mode parameter specifies how the neurons are handled:

paddle.nn.Dropout(p=0.5, axis=None,
                  mode="upscale_in_train",
                  name=None)

The main parameters are as follows :

  • p (float): the probability of setting an input node to 0, i.e., the drop probability; default: 0.5. The probability applies to each element independently, not to the tensor as a whole: for example, for a matrix with 12 numbers, dropout with probability 0.5 does not necessarily produce exactly 6 zeros.
  • mode (str): the implementation of the dropping method; the two options are 'downscale_in_infer' and 'upscale_in_train', and the default is 'upscale_in_train'.

Different frameworks may have different default processing methods for Dropout. You can check the API for details when using it.

The following program shows the form of output data after Dropout.

import paddle
import paddle.nn as nn
import numpy as np


if __name__ == "__main__":
    np.random.seed(100)
    
    # 创建数据
    data_1 = np.random.rand(1, 3, 2, 2).astype("float32")  # [N, C, H, W]
    data_2 = np.arange(1, 13).reshape([-1, 3]).astype("float32")  # [N, C]
    
    # 使用dropout作用到输入数据上
    x_1 = paddle.to_tensor(data_1)
    x_2 = paddle.to_tensor(data_2)
    
    """方式1:downgrade_in_infer模式下"""
    drop_method_1 = nn.Dropout(p=0.5, mode="downscale_in_infer")
    droped_train_11 = drop_method_1(x_1)
    droped_train_12 = drop_method_1(x_1)
    
    # 切换到eval模式。在动态图模式下,使用eval()切换到求值模式,该模式禁用了dropout
    drop_method_1.eval()
    drop_11_eval_11 = drop_method_1(x_1)
    drop_12_eval_12 = drop_method_1(x_1)
    
    """方式2:upscale_in_train模式下"""
    drop_method_2 = nn.Dropout(p=0.5, mode="upscale_in_train")
    droped_train_21 = drop_method_2(x_2)
    droped_train_22 = drop_method_2(x_2)
    
    # 切换到eval模式。在动态图模式下,使用eval()切换到求值模式,该模式禁用了dropout
    drop_method_2.eval()
    drop_21_eval_21 = drop_method_2(x_2)
    drop_22_eval_22 = drop_method_2(x_2)
    
    # 输出
    print('x1: {}, \n\n droped_train_11: \n\n {}, \n\n drop_11_eval_11: \n {}\n\n'.format(data_1, droped_train_11.numpy(),  drop_11_eval_11.numpy()))
    print('x1: {}, \n\n droped_train_12: \n\n {}, \n\n drop_12_eval_12: \n {}\n\n'.format(data_1, droped_train_12.numpy(),  drop_12_eval_12.numpy()))
    print('x2: {}, \n\n droped_train_21: \n\n {}, \n\n drop_21_eval_21: \n {}\n\n'.format(data_2, droped_train_21.numpy(),  drop_21_eval_21.numpy()))
    print('x2: {}, \n\n droped_train_22: \n\n {}, \n\n drop_22_eval_22: \n {}\n\n'.format(data_2, droped_train_22.numpy(),  drop_22_eval_22.numpy()))

Result :

x1:
[[[[0.54340494 0.2783694 ]
   [0.4245176  0.84477615]]

  [[0.00471886 0.12156912]
   [0.67074907 0.82585275]]

  [[0.13670659 0.5750933 ]
   [0.89132196 0.20920213]]]],

 droped_train_11:

 [[[[0.54340494 0.2783694 ]
   [0.4245176  0.        ]]

  [[0.00471886 0.        ]
   [0.67074907 0.        ]]

  [[0.13670659 0.5750933 ]
   [0.         0.20920213]]]],

 drop_11_eval_11:
 [[[[0.27170247 0.1391847 ]
   [0.2122588  0.42238808]]

  [[0.00235943 0.06078456]
   [0.33537453 0.41292638]]

  [[0.0683533  0.28754666]
   [0.44566098 0.10460106]]]]


x1:
[[[[0.54340494 0.2783694 ]
   [0.4245176  0.84477615]]

  [[0.00471886 0.12156912]
   [0.67074907 0.82585275]]

  [[0.13670659 0.5750933 ]
   [0.89132196 0.20920213]]]],

 droped_train_12:

 [[[[0.         0.        ]
   [0.4245176  0.84477615]]

  [[0.00471886 0.12156912]
   [0.67074907 0.        ]]

  [[0.13670659 0.5750933 ]
   [0.         0.        ]]]],

 drop_12_eval_12:
 [[[[0.27170247 0.1391847 ]
   [0.2122588  0.42238808]]

  [[0.00235943 0.06078456]
   [0.33537453 0.41292638]]

  [[0.0683533  0.28754666]
   [0.44566098 0.10460106]]]]


x2:
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]],

 droped_train_21:

 [[ 0.  0.  0.]
 [ 0. 10.  0.]
 [14.  0. 18.]
 [20.  0.  0.]],

 drop_21_eval_21:
 [[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]


x2:
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]],

 droped_train_22:

 [[ 2.  0.  6.]
 [ 0.  0. 12.]
 [ 0.  0.  0.]
 [20. 22.  0.]],

 drop_22_eval_22:
 [[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]

From the output of the code above we can see that after dropout some elements of the tensor become 0; this is exactly what dropout does. By randomly setting elements of the input data to 0, it weakens the co-adaptation between neuron nodes and enhances the generalization ability of the model.

knowledge source

  1. https://www.paddlepaddle.org.cn/tutorials/projectdetail/4282406

Origin blog.csdn.net/weixin_44878336/article/details/132237306