Explainable Deep Learning: From the Receptive Field to the Three Basic Tasks of Deep Learning (Image Classification, Semantic Segmentation, Object Detection), So You Can Really Understand Deep Learning

Table of contents

 

Foreword

1. A first look at the receptive field

1.1 Guess what it is?

1.2 Receptive field under the human visual system

1.3 Receptive Fields in Deep Neural Networks

1.3.1 The nature of the receptive field

1.3.2 Definition of Receptive Field

1.3.3 An example

1.3.4 Take the VGG network as an example

2. Calculation of receptive field

2.1 What operations can change the receptive field?

2.2 Calculation formula of receptive field

2.3 Calculation of the center position of the receptive field

2.4 Calculation example of receptive field center

3. Effective receptive field

3.1 The concept of effective receptive field

3.2 Calculation of Effective Receptive Field

3.3 Contribution of each position of the receptive field

3.4 Why is the effective receptive field important?

3.5 Is a bigger receptive field always better?

4. Using receptive fields to explain the basic tasks of deep learning

4.1 Classification Network

4.1.1 Development of Classification Networks

4.1.2 How the receptive field affects the classification network (ResNet as an example)

4.1.3 Is a larger receptive field always better?

4.2 Detection network

4.2.1 Development of Detection Network

4.2.2 How the receptive field affects the detection network

4.3 Segmentation network

4.3.1 Development of Segmentation Networks

4.3.2 How to design a segmentation network

 


 

Foreword

Deep learning has long been criticized as a "black box". We can use deep learning to achieve end-to-end training, which is simple and effective, but we do not know what the middle layers of the neural network are doing, or what each convolutional layer is actually paying attention to.

In the previous topic on image processing and deep learning, I mentioned that when we first start with deep learning, we usually just want to accomplish a task. For example, to make unclear images clear, we casually build a three-layer network, train it, and find that it works better than traditional image-processing methods while being simple and effective. Then we casually build a four-layer network and find that it works better than the three-layer one. We may then form an intuitive impression: as the number of layers and parameters increases, the network can learn more features. This conclusion is reasonable to some extent, but not entirely.

In 2015, the ResNet network was born. It showed that simply continuing to stack layers makes the network "degrade". From the formulas we can infer how the residual structure avoids vanishing gradients: even when the gradient is very small, backpropagation through a residual network hardly ever produces a vanishingly small gradient (keep in mind a key point: the driving force behind neural network parameter updates is the gradient; we iterate along the negative gradient direction, and if backpropagation yields no gradient, the network stops updating). So more layers is not always better.

The receptive field tells us: the deeper the network layer, the larger the receptive field (we will talk about it next).

We now have three conclusions: 1. A deeper network is not necessarily better. 2. The deeper the network, the larger the receptive field. 3. We need to balance accuracy and performance. With these three conclusions in hand, we may be able to design our own neural network!

This topic mainly draws on the Bilibili (Station B) uploader Qi Shi Technology, who has done thorough research on the receptive field!

1. A first look at the receptive field

1.1 Guess what it is?

 

[Figure: the same butterfly photo shown in five crops of increasing size, left to right]

From left to right, the pictures get larger and larger, and our eyes take in more and more information. Looking only at the first image, we may not even recognize it as a butterfly. As the field of view grows and more information reaches the eye, we can gradually tell what category it belongs to.

1.2 Receptive field under the human visual system

[Figure: a photograph; the eyes focus on the center of the image]

 

The term "receptive field" originally comes from the human visual system. In the picture above (taken with a camera), when our eyes focus on the center of the image, we find that only the objects in the central area are seen clearly, while objects toward the edges are blurred. The picture as the human eye actually perceives it is shown in the figure below.

 

 

[Figure: the same scene as actually perceived: sharp at the point of fixation, blurred elsewhere]

1. When we stare at a point and keep the eyes still, only the area around that point is "clear", while the surrounding areas are blurred.

2. Only by constantly moving the eyes can we see different areas clearly one after another.

So we can summarize three characteristics of the receptive field of the human visual system:

A. A large field of view

B. Sharp focus at the center

C. Blurred toward the periphery

1.3 Receptive Fields in Deep Neural Networks

1.3.1 The nature of the receptive field

In a deep convolutional neural network, each neuron corresponds to a certain region of the input image, and only the image content within this region can affect the activation of that neuron. This region is called the neuron's receptive field.

 

[Figure: input image (light gray, bottom) and feature map after convolution (dark gray, top); the black region on the input is the receptive field of the red point on the feature map]

As shown in the figure above: the light gray plane is the input image (bottom) and the dark gray plane is the feature map obtained after convolution (top). The receptive field of the red point on the feature map is the black region on the input image. In other words, the red point can only "see" the black region, so this black region is called the receptive field of the red point; it is the black region that determines the output at the red point. This also shows that a node in a neural network pays most attention to the central area of its receptive field, or rather that the network pays a different amount of attention to each position inside the receptive field.

1.3.2 Definition of Receptive Field

A. The area closer to the center of the receptive field is more important

This is consistent with the receptive field of the human eye.

B. Isotropic

At positions symmetric about the center, the effect on the neuron's output is the same.

C. The rate at which importance decays from the center outward can be controlled by the network structure

This point is harder to understand, but it is also the most crucial one: it is what leads to the "effective receptive field". We know that a neuron pays the most attention to the central area of its receptive field (the white area inside the black region); the areas outside the center can also be "seen" by the neuron, but they have little effect on its output. Just as with the receptive field of the human visual system discussed earlier, we can notice the blurred area outside the center, yet it is hard to judge an object's category from that blurred area. The receptive field thus consists of a black area (the less important part) and a white area (the central part), and the transition from white to black is the decay of importance. For the same receptive field size, the larger the white area, the slower the decay; and this decay rate of importance can be controlled by the network structure, that is, we can adjust it by how we design the network!

1.3.3 An example

[Figure: a 1D example: 15 input nodes pass through four convolution layers to give 5 output nodes]

As shown in the figure above, note that what happens here is convolution, not full connection.

We have 15 input nodes; after four convolution layers we get the final 5 output nodes, and the receptive field of the middle (green) output node is 11 (the red nodes). Here I raise a question: the receptive field of the green node covers 11 red nodes, but do those 11 red nodes influence the green node's output to the same degree?

Obviously not. According to the earlier conclusion, the node in the middle plays the decisive role in the neuron's output. So how do we calculate, or describe, how important each of the 11 nodes is to the green node? We can use the number of edges: the number of edges is the number of convolution calculations a node takes part in, and we have good reason to believe that the more convolution calculations a node participates in, the greater its influence on the neuron. So we count the edges from each of the 11 nodes to the green node, one node at a time.

The importance of an input node is defined as: the number of times it affects the subsequent convolution calculations.

As shown above: the receptive field size is 11 (the final green output is determined by 11 red inputs), and the importance of each node (the number of edges connecting it to the green output) is, in turn: 5-13-24-31-36-37 (middle)-36-31-24-13-5.

By calculating this degree of importance, we can further verify several characteristics of the receptive field: the middle node is indeed the most important, and nodes at positions symmetric about the middle indeed have the same importance!
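A sketch of this edge counting in plain Python (the kernel sizes below are a made-up configuration of four stride-1 layers whose receptive field is 11; the article's figure does not specify the exact layers, so the printed numbers will not necessarily match the 5-13-...-37 sequence above):

```python
def edge_importance(kernel_sizes, n_in, target):
    """Importance of each input node for one output node, measured as the number
    of distinct edges that lie on at least one path from that input node to the
    output node `target` of the last layer (stride-1, no-padding 1D convolutions)."""
    sizes = [n_in]
    for k in kernel_sizes:
        sizes.append(sizes[-1] - (k - 1))
    L = len(kernel_sizes)

    # Backward pass: which nodes of each layer can still reach the target output?
    reach = [[False] * s for s in sizes]
    reach[L][target] = True
    for l in range(L, 0, -1):
        k = kernel_sizes[l - 1]
        for j in range(sizes[l]):
            if reach[l][j]:
                for o in range(k):
                    reach[l - 1][j + o] = True

    importance = []
    for i in range(sizes[0]):
        # Forward pass: which nodes are influenced by input node i?
        fwd = [[False] * s for s in sizes]
        fwd[0][i] = True
        for l in range(1, L + 1):
            k = kernel_sizes[l - 1]
            for j in range(sizes[l]):
                fwd[l][j] = any(fwd[l - 1][j + o] for o in range(k))
        # An edge (l-1, j+o) -> (l, j) lies on a path i -> target exactly when its
        # tail is influenced by i and its head can still reach the target.
        importance.append(sum(
            1
            for l in range(1, L + 1)
            for j in range(sizes[l])
            for o in range(kernel_sizes[l - 1])
            if fwd[l - 1][j + o] and reach[l][j]
        ))
    return importance

# Hypothetical configuration: 15 inputs, four stride-1 layers, middle output node.
print(edge_importance([3, 3, 3, 5], n_in=15, target=2))
```

Nodes outside the receptive field get an importance of 0, and the counts peak at the center and fall off symmetrically, exactly the pattern described above.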

1.3.4 Take the VGG network as an example

As shown in the figure below, the size of our input image is 224x224, while the receptive field of VGG13 is 348x348...

[Figure: VGG configuration table (VGG11-VGG19) and receptive field sizes]

A. Taking VGG13 (B) / VGG16 (D) / VGG19 (E) as examples, calculate the receptive field size of the neurons in the layer just before the classification layer.

B. Classification accuracy: VGG16 is significantly better than VGG13, while VGG19 is only a little higher than VGG16.

 

Here I raise two questions:

1. Why is the receptive field of VGG13-19 larger than the input image (224x224)?

The receptive field of a neural network behaves like that of the human eye. As we just saw, when we only see part of an image (as in the figure below), it is hard to judge what it is. The same goes for a neural network: to judge the object in an image accurately, it should be looking at the entire image, because judging objects from local information alone is never reliable. Therefore, the receptive field of the neural network needs to be at least as large as the original image.

[Figure: a small crop of the butterfly image, hard to recognize on its own]

2. Is a larger receptive field always better?

We will answer this question after introducing the effective receptive field.

2. Calculation of receptive field

2.1 What operations can change the receptive field?

A. Convolution


 

B. Deconvolution


 

C. Atrous convolution


 

D. Pooling


 

The operations above are easy to understand, but I think the following two are easily overlooked:

E. Residual structure


F. Concat operation


 

2.2 Calculation formula of receptive field

The receptive field can be computed layer by layer with the standard recursive formula (taking r_0 = 1 for an input pixel):

r_l = r_{l-1} + (k_l - 1) * (s_1 * s_2 * ... * s_{l-1})

where k_l is the kernel size of layer l and s_i is the stride of layer i.

From this formula we can see that the size of the receptive field is driven mainly by the strides of the convolutions!

The growth rate of the receptive field is directly tied to the cumulative product of the strides. If you want the network to reach a certain receptive field size sooner, you can move the convolutions with stride greater than 1 toward the front of the network. Another advantage of doing so is that it greatly increases the inference speed of the network, because the resolution of the feature maps shrinks rapidly.
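A minimal sketch of this calculation (plain Python; the layer list below is a made-up example, and padding is ignored because it does not affect the receptive field size):

```python
def receptive_field_sizes(layers):
    """layers: list of (kernel_size, stride) from the first layer onward.
    Returns the receptive field size on the input after each layer,
    using r_l = r_{l-1} + (k_l - 1) * (product of strides of layers 1..l-1)."""
    r, jump = 1, 1          # r: receptive field size, jump: cumulative stride
    sizes = []
    for k, s in layers:
        r = r + (k - 1) * jump
        jump = jump * s
        sizes.append(r)
    return sizes

# Hypothetical example: three 3x3 convs (stride 1, 2, 1), then 2x2 max pooling (stride 2).
print(receptive_field_sizes([(3, 1), (3, 2), (3, 1), (2, 2)]))  # [3, 5, 9, 11]
```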

2.3 Calculation of the center position of the receptive field

 

[Figure: formula for the center position of the receptive field]

Having just computed the size of the receptive field, we also need to determine its coordinates. That is, once the size is known, we still have to work out where the receptive field sits on the input, for example the coordinate of its upper-left corner or, equivalently, its center.
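Extending the sketch above, the center can be tracked with the usual "receptive field arithmetic" recurrences (these are the widely used conventions, not necessarily the exact notation in the figure above): along with the size r and the cumulative stride ("jump") j, keep the input coordinate of the center of the first output's receptive field, start_out = start_in + ((k - 1)/2 - p) * j_in; the center of output index i is then start + i * j.

```python
def receptive_field_arithmetic(layers):
    """layers: list of (kernel_size, stride, padding).
    Tracks, after each layer: receptive field size r, cumulative stride (jump) j,
    and the input coordinate of the center of the FIRST output's receptive field.
    The center of output index i is then: start + i * j."""
    r, j, start = 1, 1, 0.5   # treat input pixel i as the interval [i, i+1), center 0.5
    out = []
    for k, s, p in layers:
        start = start + ((k - 1) / 2.0 - p) * j
        r = r + (k - 1) * j
        j = j * s
        out.append((r, j, start))
    return out

# Hypothetical example: a 3x3 conv with stride 2 and padding 1, applied twice.
for r, j, start in receptive_field_arithmetic([(3, 2, 1), (3, 2, 1)]):
    print(f"receptive field {r}, jump {j}, center of first output at {start}")
```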

 

2.4 Calculation example of receptive field center

 

[Figure: a worked example of computing the receptive field center]

 

3. Effective receptive field

3.1 The concept of effective receptive field

1. The effective receptive field refers to a phenomenon:

Every location in the receptive field (RF) has an impact on the activation of the corresponding neuron, but not all locations contribute equally. This "discrimination" between locations is the main content of the effective receptive field (ERF).

2. The effective receptive field is an intrinsic property of the neural network: once the network structure is determined, the relevant characteristics of the effective receptive field are determined.

3. Even when the receptive field size is the same, the characteristics of the effective receptive field will differ because the network structures differ.

As shown in the figure below, the effective receptive field is essentially the white area. Comparing the two plots, we find that although the two networks are different, their receptive fields are the same; the main difference is that their effective receptive fields differ, and it is the effective receptive field that determines the neuron's output. In classification, detection, and segmentation tasks, the larger the effective receptive field, the more information the network captures and the higher the recognition accuracy, so the network on the right recognizes better than the one on the left.

[Figure: effective receptive fields (bright central regions) of two different networks with the same receptive field size]

 W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4905–4913.

3.2 Calculation of Effective Receptive Field

1. The effective receptive field is a phenomenon and cannot itself be "calculated" directly; however, the importance of each position in the receptive field can be calculated, and the importances of all positions together reflect the effective receptive field.

2. At present, the more mainstream approach is graph-theoretic: treat the entire CNN computation as a 3D directed graph running from low-level nodes to high-level nodes, and compute the contribution of an input node in the RF to a high-level neuron node as follows:

A. Calculate all paths from input nodes to neuron nodes

B. Count the edges contained in all paths and remove duplicate edges

C. The number of edges is the contribution of the input node

This graph-theoretic method is exactly the edge counting we did in the example in 1.3.3.

3.3 Contribution of each position of the receptive field

Plotting the contribution of each position in the receptive field gives a three-dimensional surface that resembles a two-dimensional (bivariate) Gaussian function.

[Figure: 3D plot of the contribution of each position in the receptive field, resembling a 2D Gaussian]

 Through this figure, we can also verify the three characteristics of the receptive field:

A. The middle contribution is the largest

B. Isotropic

C. It decays gradually toward the edges
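A quick numerical illustration of this bell shape (a sketch using the number of paths from each input position to the central output, a simpler proxy than the edge count defined earlier; four stride-1 layers with 3-wide kernels are assumed):

```python
import numpy as np

# Path counts are obtained by repeatedly convolving all-ones kernels: each layer
# spreads the influence by the kernel width, and by the central-limit effect the
# profile quickly becomes bell-shaped.
profile = np.array([1.0])
for _ in range(4):                     # four layers with 3-wide kernels, stride 1
    profile = np.convolve(profile, np.ones(3))

profile /= profile.max()
print(np.round(profile, 2))            # peaks in the middle, decays symmetrically

# The offsets in x and y are independent, so the 2D contribution map is the
# outer product of the 1D profile with itself: a Gaussian-like bump.
contribution_2d = np.outer(profile, profile)
print(contribution_2d.shape)           # (9, 9): the 9x9 receptive field
```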

3.4 Why is the effective receptive field important?

1. Through the effective receptive field, we can know where the neural network is paying attention, and how much attention it pays!

2. It guides the design of classification, detection, and segmentation networks (how deep to make them) -> balancing performance and accuracy!

3. It is an effective means of further exploring network interpretability.

Neural networks have long been criticized as black boxes: how an input turns into a result is completely opaque. The effective receptive field may be a key to understanding the behavior of neural networks.

3.5 Is a bigger receptive field always better?

This is the question we left open earlier. We mentioned that a neural network recognizes most accurately when it can see the entire image. Even though the receptive fields of VGG13-19 are larger than the original image, their effective receptive fields are not necessarily larger than the image, and we know that it is the effective receptive field that determines a neuron's output. So, in this sense, a larger receptive field does help: we keep enlarging the receptive field in order to enlarge the effective receptive field with it, and when the effective receptive field can cover the whole image, the recognition ability of the neural network is at its strongest!

4. Using receptive fields to explain the basic tasks of deep learning

4.1 Classification Network

4.1.1 Development of Classification Networks

[Figure: development of classification networks]

The essence of the ResNet network is to increase the receptive field of the feature map!

4.1.2 How the receptive field affects the classification network (ResNet as an example)

[Figure: ResNet depth, receptive field size, and recognition error]

 We can find that as the number of network layers increases, the receptive field increases and the recognition error decreases!

4.1.3 Is a larger receptive field always better?

Here is a more systematic answer:


 

A. When the effective receptive field can cover the whole image, the representation ability of the neural network is at its strongest.

B. The size of the receptive field does not by itself determine performance; what matters is the effective receptive field, and the characteristics of the effective receptive field are determined by the network structure.

C. The effective receptive field depends on the network structure: once the structure is fixed, the effective receptive field is fixed. Therefore, improving the structure is more effective than simply enlarging the receptive field.

4.2 Detection network

4.2.1 Development of Detection Network

 

[Figure: development of detection networks]

 4.2.2 How the receptive field affects the detection network

 

[Figure: 3x3 detection outputs (red points) on the feature map and their receptive fields on the original image]

 

The output of a classification network is a single value, while the output of a detection network is multiple values (here 3x3 = 9 values). What does this 3x3 output represent? It corresponds to the receptive fields of 9 positions. As shown in the figure above, the upper part is the feature map (the butterfly) and the lower part is the original image (the gray rectangle); the 9 red points are the 3x3 outputs, and the region of the original image that each red point corresponds to is its receptive field.

Training a detection network can be regarded as a kind of efficient classification training, with the regression of the box as something incidental. Each receptive field corresponds to one "input image", each neuron used for prediction gets a class label, and all of these "input images" share the same classification network.

So where is "efficient" reflected?

The 9 values correspond to the receptive fields of 9 regions, so the categories of 9 regions can be identified in a single pass. In ordinary image classification, by contrast, each image yields only one output value, that is, one category.
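A minimal sketch of this "efficient classification" idea (a toy fully convolutional network in PyTorch; the layer sizes, channel counts, and the 10-class head are invented for illustration and are not the detector discussed above): because there is no flattening or fully connected layer, the same weights produce a single prediction on a small input and a grid of predictions, one per receptive-field position, on a larger input.

```python
import torch
import torch.nn as nn

# Toy fully convolutional "classifier": no flatten / linear layers, so the same
# weights can slide over inputs of any size.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 10, kernel_size=4),  # plays the role of a "fully connected" head
)

# On a 32x32 input the head sees the whole 4x4 feature map: one prediction.
print(net(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10, 1, 1])

# On a 48x48 input the very same network outputs a 3x3 grid of class scores,
# one per position; each score "classifies" that position's receptive field.
print(net(torch.randn(1, 3, 48, 48)).shape)   # torch.Size([1, 10, 3, 3])
```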

4.3 Segmentation network

4.3.1 Development of Segmentation Networks

[Figure: development of segmentation networks]

 

 

4.3.2 How to design a segmentation network

Suppose we want to determine the category of a single pixel: how do we do it?

A single pixel carries no other associated information, so it is hard to make a judgment from it alone. We therefore take the pixel as the center, crop out patches of different sizes, train a separate classifier for each patch size (the patch size affects the design of the final fully connected layer), and finally integrate the results (very time-consuming, inefficient, and low in accuracy).

 

[Figure: classifying a pixel by cropping patches of different sizes around it]

 

Method 1: Train separate classifiers for patches of different sizes, then integrate the classification results of all patches.

Method 1 is the traditional approach mentioned above.

Method 2: Design a network structure that couples the feature representations of patches of different sizes, and then classify directly on the coupled feature map.

Coupling patches of different sizes really means coupling different receptive fields. So, as we asked earlier, what kinds of structure can couple different receptive fields?

A receptive field of a fixed size can be regarded as a patch of one size, so what we need is to couple the features of receptive fields of different sizes.

A: Use atrous convolution (different rates give different receptive field sizes) or use pyramid pooling directly, then fuse the results with skip connections or concatenation.

[Figure: atrous convolutions with different rates / pyramid pooling for fusing different receptive fields]

We can control the rate of atrous convolution to fuse receptive fields of different sizes!
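A sketch of option A (toy channel counts and rates, loosely modeled on ASPP-style modules, not the exact module of any particular paper): parallel atrous convolutions with different dilation rates see different receptive field sizes, and their outputs are concatenated and fused by a 1x1 convolution.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel atrous convolutions with different dilation rates (different
    receptive field sizes), concatenated and fused with a 1x1 convolution."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # padding == dilation for a 3x3 kernel keeps the spatial size unchanged
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(1, 256, 32, 32)
print(MiniASPP()(x).shape)  # torch.Size([1, 256, 32, 32])
```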

B: Use an encoder-decoder (encode-decode) structure, splicing the feature maps of the encode stage onto the feature maps of the decode stage.

[Figure: encoder-decoder structure with skip connections]
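And a sketch of option B (a single U-Net-style skip connection with made-up shapes): the decoder upsamples a coarse feature map with a large receptive field and concatenates it with the matching encoder map, whose receptive field is small, before the next convolution.

```python
import torch
import torch.nn as nn

# One encoder-decoder skip connection: two receptive-field scales are coupled
# by upsampling the coarse decoder map and concatenating the fine encoder map.
enc_feat = torch.randn(1, 64, 56, 56)    # early encoder map, small receptive field
dec_feat = torch.randn(1, 128, 28, 28)   # deeper decoder map, large receptive field

up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)   # 28x28 -> 56x56
fuse = nn.Conv2d(64 + 64, 64, kernel_size=3, padding=1)

merged = torch.cat([up(dec_feat), enc_feat], dim=1)   # (1, 128, 56, 56)
out = fuse(merged)                                    # (1, 64, 56, 56)
print(out.shape)
```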

How to efficiently couple features from receptive fields of more sizes is the central consideration in designing a segmentation network.

 

Origin blog.csdn.net/weixin_43507744/article/details/125967671