04 Deep Learning - Target Detection - Detailed Explanation of Deep Learning Methods - Two-stage Target Detection Algorithms

1. Introduction to deep learning target detection algorithm

       In the second article, we introduced traditional target detection algorithms and methods. In the third article, we briefly compared traditional target detection algorithms with deep learning target detection algorithms. This article records the deep learning target detection algorithms and goes deeper into their principles and effects. Deep learning target detection algorithms fall into two categories:

  • One-stage (the YOLO and SSD series): regress the target position directly.
  • Two-stage (the Faster RCNN series): use an RPN network to propose candidate regions, completing the detection process through a complete convolutional neural network.

       We first introduce the Two-stage approach.

2. Target detection algorithm based on Two-stage

       The two-stage target detection algorithm completes the target detection process mainly through a complete convolutional neural network. The features used in detection are CNN features: a convolutional neural network is used to extract feature descriptions of the candidate region targets.

        The most typical representatives of the Two-stage target detection algorithm are the series of algorithms from R-CNN (2014) through Faster RCNN.

       Two steps are required during training: training the RPN network and training the target region network. Compared with traditional target detection algorithms, there is no need to train a separate classifier or hand-craft feature representations; the entire process is completed end to end by a complete convolutional neural network, and accuracy improves at the same time. However, it is slower than one-stage methods.

        The above description can be summarized in the following points:

  • CNN convolution features
  • R-CNN through Faster RCNN, proposed by R. Girshick et al. starting in 2014
  • End-to-end target detection (RPN network)
  • High accuracy, relatively slow speed compared to one-stage

 3. Two-stage basic process

       Input image → extract deep features from the image (backbone network) → the RPN network takes over the sliding window's job, i.e. it generates candidate regions, classifies them (background vs. target), and gives a preliminary localization of the target → roi_pooling crops the corresponding features ("matting") so that CNN features are not recomputed → fully connected (fc) layers represent each candidate region → classification and regression determine the candidate target's category (its real object class) and refine its position.
 

Detailed process:

       First, an image is input and deep features are extracted from it; that is, the image is passed through a convolutional neural network, usually called the backbone network. Typical backbones are classic convolutional networks such as VGG and ResNet. An RPN network then performs the task that the sliding window performed in traditional target detection, namely generating candidate regions. While extracting candidate regions, it also classifies them into two categories, background and target, and makes a preliminary prediction of the target's position, so the RPN completes both region classification and coarse localization at the same time. After the candidate regions are obtained, the roi_pooling layer, which can be understood as "matting", crops the features corresponding to each candidate target from the feature map. An fc layer then further represents the features of the candidate region, and finally classification and regression determine the candidate target's category and refine its position. The category here differs from the RPN's category: it is the object's real class. The regression produces the target's concrete coordinates, usually represented as a rectangular box, i.e. four values (x, y, w, h).
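To make the data flow concrete, here is a minimal PyTorch-style sketch of the second stage (backbone features → RoI pooling → fc → classification and regression). The class name, layer sizes, and the ResNet-18 trunk are illustrative assumptions, not the article's implementation; real Faster RCNN also predicts 4 regression values per class and obtains the proposals from its RPN (sketched later in this article).

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool

class TwoStageHead(nn.Module):
    """Sketch of the two-stage flow: backbone features -> RoI pooling -> fc -> cls/reg.
    Proposals are assumed to come from an RPN."""
    def __init__(self, num_classes=10):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)             # backbone trunk
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool / fc
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, 1024), nn.ReLU())
        self.cls_head = nn.Linear(1024, num_classes + 1)                # classes + background
        self.reg_head = nn.Linear(1024, 4)                              # refined (x, y, w, h)

    def forward(self, images, proposals):
        # proposals: one (num_boxes, 4) tensor of [x1, y1, x2, y2] per image, from the RPN
        feats = self.backbone(images)                                   # deep CNN features
        scale = feats.shape[-1] / images.shape[-1]                      # feature / image ratio
        rois = roi_pool(feats, proposals, output_size=(7, 7), spatial_scale=scale)  # "matting"
        x = self.fc(rois)                                               # represent each region
        return self.cls_head(x), self.reg_head(x)                       # category + refinement

# e.g. one 224x224 image with two proposals
boxes = [torch.tensor([[10., 10., 100., 120.], [50., 40., 200., 220.]])]
cls_scores, box_deltas = TwoStageHead()(torch.randn(1, 3, 224, 224), boxes)
print(cls_scores.shape, box_deltas.shape)   # torch.Size([2, 11]) torch.Size([2, 4])
```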

4. Two-stage common algorithms

  • RCNN
  • Fast RCNN
  • Faster RCNN
  • Faster RCNN variants

 

5. Two-stage core components 

1. Two core components of Two-stage

       Two-stage has two important core components:

  • CNN network (backbone network)
  • RPN network

2. Backbone CNN network design principles

  • Convolutional neural networks evolved from simple to complex, and then from complex back to simple (lightweight)

        Simple network structures: the classic one is LeNet (one input layer, two convolutional layers, two pooling layers, and three fully connected layers). However, LeNet's expressive and abstraction capabilities are relatively weak on large-scale tasks.

       Complex network structures: after LeNet, complex networks such as AlexNet, VGG, and ResNet emerged. These designs mostly increase the depth of the network, because the deeper the network, the stronger its nonlinear expressive power, the more abstract the learned representations, the less sensitive they are to image changes, and the stronger the robustness and the ability to solve nonlinear tasks. However, greater depth also tends to cause vanishing or dispersing gradients. The most typical deep network is ResNet, whose residual connections allow depths of more than 100 layers; another classic is GoogleNet.
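As a small aside, here is a minimal sketch (in PyTorch, an assumption since the article names no framework) of the residual block idea that lets ResNet-style networks grow past 100 layers without the gradients vanishing:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal ResNet-style block: the skip connection lets gradients bypass the
    convolutions, which is what keeps very deep networks trainable."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # residual (skip) connection
```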

  • Multi-scale feature fusion network
  • More lightweight CNN network

       When designing, consider performance and model size. At this time, you need to use lightweight CNN networks, such as the classic ShuffleNet, MobileNet, etc.

3. RPN network

       Before diving into the RPN network, we first cover some concepts related to region proposal (the Anchor mechanism).

1. Region proposal (Anchor mechanism)

1.1 Introduction of the problem

       The target object may appear at any position in the image, and its size is uncertain. Is there a way to detect all objects? The easiest approach to think of is to crop, centered on each pixel, many small patches of different aspect ratios and sizes, and check whether each patch contains an object. If a patch contains an object, the object's position is the position of that patch, and its category is predicted at the same time. In this way, no object whose aspect ratio and size match one of the patches at the current pixel will be missed; each such cropped patch is an anchor box.

       To detect objects at different positions in the image, the sliding window method scans the image from left to right and top to bottom, taking many patches at every pixel for detection, so that objects at different positions and of different sizes are not missed. Fig. 1 is an example of such a scanning detection.

       This method is easy to understand and does work, but its drawback is just as prominent: the amount of computation is too large. If the image size is 640×640 and 10 boxes of different aspect ratios and sizes are taken at every pixel, the number of boxes to check is 640 × 640 × 10 = 4,096,000, which is far too many (as shown below). How can this be improved?

        In fact, there are two obvious points that can be improved on the above problems:

  • First, the 4,096,000 scanning boxes overlap each other too much.
  • Second, many of these boxes are background and contain no object, so they need not be checked at all.

       Therefore, it is particularly important to cover the entire image while omitting boxes that overlap too much, avoiding background boxes, and finding high-quality candidate boxes that are likely to contain target objects. This reduces the amount of computation and improves detection speed.

       Anchor boxes are a series of candidate boxes determined before detection. By default, every object that appears in the image is expected to be covered by the anchor boxes we set. The quality of the anchor box setting is directly related to two aspects:

  • One is whether the anchors cover the entire image well.
  • The other is whether they can frame every object that may appear in the image.

       Therefore, the setting of anchor boxes is very important: it affects not only accuracy but also speed (the speed concern applies to the scanning method mentioned above).

1.2 Solution: set anchor boxes

        Using a pre-set group of anchor boxes reduces the amount of computation and improves detection speed. How are anchor boxes set? We accomplish this through the following steps:

  • Determination of aspect ratio
  • Determination of scale
  • Determination of the number of anchor boxes

       For example, suppose you want to do target detection on a data set whose image resolution is 256 × 256, and most target objects in the data set are about 40 × 40 or 80 × 40 in size.

Determination of aspect ratio

       Because the vast majority of target objects in the data set are 40 × 40 or 80 × 40, the height-to-width ratio of the ground-truth boxes of most objects is 1:1 or 2:1. Based on this, the aspect ratios of the anchor boxes can be determined: when designing anchor boxes for this data set, the aspect ratios need to include at least 1:1 and 2:1. For simplicity, we use only 1:1 and 2:1 in this example.

Determination of scale

       Scale is the ratio between the height or width of an object and the height or width of the image. For example, if the width of the picture is 256px and the width of the object in the picture is 40px, then the scale of the object is 40/256 = 0.15625, which means the object occupies 15.625% of the width of the picture.

       To select a set of scales that represents the targets in the data set well, we use the minimum and maximum scale values of the target objects as the lower and upper limits. For example, if the minimum and maximum object scales in the data set are 0.15625 and 0.3125 respectively, and we plan to set three scales within this range, we can choose {0.15625, 0.234375, 0.3125}.
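As a small illustration, the intermediate scales can simply be spaced evenly between the data set's minimum and maximum object scales. A sketch using NumPy, with the values from the example above:

```python
import numpy as np

# Object-to-image scale limits taken from the running example above
min_scale, max_scale = 0.15625, 0.3125
num_scales = 3

scales = np.linspace(min_scale, max_scale, num_scales)   # evenly spaced scales
print(scales)   # [0.15625, 0.234375, 0.3125]
```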

Determination of the number of anchor boxes

       With scales {0.15625, 0.234375, 0.3125} and aspect ratios {1:1, 2:1}, the number of anchor boxes in the group at each anchor point is 3 × 2 = 6. As shown in the figure below, there are three scales {0.15625, 0.234375, 0.3125}, and each scale has two aspect ratios {1:1, 2:1}.

       In the method described above, the anchor point is each pixel of the 256 × 256 image; in anchor-based neural network target detection, the anchor point is each point on the network's final output feature map.
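Here is a minimal sketch of how such a group of anchors could be generated, assuming each point of an 8 × 8 output feature map (a 256 × 256 image downsampled by 32) is an anchor point. The convention of applying the scale to the box height and deriving the width from the height:width ratio is an illustrative assumption.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, img_size,
                     scales=(0.15625, 0.234375, 0.3125),
                     aspect_ratios=(1.0, 2.0)):   # height:width ratios 1:1 and 2:1
    """Return (feat_h * feat_w * len(scales) * len(aspect_ratios), 4) anchors
    as (cx, cy, w, h) in image pixels; each feature-map point is one anchor point."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # map the feature-map point back to its position on the input image
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in aspect_ratios:
                    h = s * img_size   # scale applied to the height (assumed convention)
                    w = h / r          # r = h / w, so w = h / r
                    anchors.append([cx, cy, w, h])
    return np.array(anchors)

# 256x256 image, backbone downsampling by 32 -> roughly an 8x8 feature map
boxes = generate_anchors(8, 8, stride=32, img_size=256)
print(boxes.shape)   # (8 * 8 * 6, 4) = (384, 4)
```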

1.3 How is Anchor used in target detection?

       In the network, anchor boxes are used to encode the position of the target object. Target detection generally does not regress the absolute coordinates of the object box directly; instead it regresses the offset relative to an anchor box, such as the offset from the blue anchor box to the green ground-truth box in the figure below. Every object in the data set is encoded as an offset from some anchor box. A picture such as the one introduced in 1.1 may contain multiple objects and many anchor boxes, so how are anchor boxes used to encode the ground truth?

Steps for encoding true-value bounding boxes with anchor boxes

  • a. For each anchor box, find the ground-truth bounding box with which it has the largest intersection-over-union (IoU).
  • b. If that IoU is > 50%, the anchor box is responsible for detecting the object of that ground-truth box, and the offset from the anchor box to the ground-truth box is computed.
  • c. If the IoU is between 40% and 50%, it cannot be determined whether the anchor contains the object; it is a fuzzy box.
  • d. If the IoU is < 40%, the anchor box is considered to be all background and is assigned the background class.
  • Apart from the anchor boxes assigned to objects, anchor boxes that contain only background and fuzzy boxes are assigned an offset of 0 and the background classification (see the sketch after this list).
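A minimal sketch of steps a-d, assuming boxes are given as (x1, y1, x2, y2). The 50%/40% thresholds follow the list above; the offset parameterization (normalized center shift plus log width/height ratio, as used by Faster RCNN) and the label convention (0 = background, -1 = fuzzy/ignored) are assumptions for illustration.

```python
import numpy as np

def iou(box, gt):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box[0], gt[0]), max(box[1], gt[1])
    x2, y2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_b - inter + 1e-9)

def encode_anchor(anchor, gts, gt_labels, pos_thr=0.5, neg_thr=0.4):
    """Assign one anchor to background / fuzzy / a ground-truth box and
    return (label, offset), following steps a-d above."""
    ious = [iou(anchor, gt) for gt in gts]
    best = int(np.argmax(ious))                 # step a: best-matching ground truth
    if ious[best] > pos_thr:                    # step b: positive anchor
        ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
        aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
        gx, gy = (gts[best][0] + gts[best][2]) / 2, (gts[best][1] + gts[best][3]) / 2
        gw, gh = gts[best][2] - gts[best][0], gts[best][3] - gts[best][1]
        offset = [(gx - ax) / aw, (gy - ay) / ah, np.log(gw / aw), np.log(gh / ah)]
        return gt_labels[best], offset
    if ious[best] >= neg_thr:                   # step c: fuzzy box, ignored in training
        return -1, [0.0, 0.0, 0.0, 0.0]
    return 0, [0.0, 0.0, 0.0, 0.0]              # step d: background, zero offset
```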

       After encoding, the regression target of the detection network becomes the encoded offset. The input of the network is an image, and the output is the classification and offset of every anchor box. Each pixel on the network's final output feature map carries a group of anchor boxes (here a group of 6: aspect ratios 2:1 and 1:1, scales 0.15625, 0.234375, 0.3125, as shown in the figure). Assuming the network's final output feature map has resolution 7 × 7, the total number of anchor boxes is 7 × 7 × 6 = 294. The ground truth received by the network is the classification of each of the 294 anchor boxes (background, or the object category if it contains an object) and the offset from each anchor to the target object's bounding box (the offset for fuzzy boxes and background boxes is 0). The network's output is the offset and classification of the 294 boxes.

        For a trained network, anchor boxes that contain only background are classified as background with offset 0, while anchor boxes that contain an object are classified as the object's category, with an offset equal to the offset between the anchor box and the object's real bounding box.

 

Why regress offsets instead of absolute coordinates

       One property of neural networks is translation invariance. For a photo containing a tree, whether the tree is in the upper left corner or the lower right corner of the picture, the network's classification output is "tree"; the classification result does not change with the tree's position in the photo. Consequently, no matter where the tree is in the picture, a regression network tends to output the same position coordinates for it. This translation invariance conflicts with the position coordinates we need, which is clearly unacceptable. If we instead regress offsets, then no matter where the tree is in the image, its offset relative to the anchor box it falls in is basically the same, which is much better suited to regression by a neural network.

What is the relationship between the output feature map and the anchor box?

       Shouldn't the anchor boxes be placed on the input image? Why do we say that each point on the output feature map has a group of anchor boxes?

As shown in the figure, any point on the output feature map (the small 3 × 3 feature map on the far right) can be mapped back to the input image (this is the meaning of the receptive field); that is, according to the network's downsampling, the corresponding position on the input image of any point on the output feature map can be found proportionally. For example, the point (0, 0) on the output feature map corresponds to position (2, 2) on the input image. If the network's output feature dimension is 3 × 3 × 84, where 84 = 6 × 14 = 6 × (4 + 10), then the 84 channel values at point (0, 0) of the output feature map are the offsets and classification scores of the 6 anchor boxes at position (2, 2) of the input image: 6 is the number of anchor boxes, 4 of the 14 values per anchor are the (x, y, w, h) offsets, and 10 of the 14 are the class scores.

Through this implicit mapping relationship, all anchor boxes are effectively placed on the input image.
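A small sketch of this mapping and of decoding the 3 × 3 × 84 output described above. The stride of 4 with a half-cell offset is an assumption chosen so that it reproduces the article's example of (0, 0) mapping to (2, 2); the channel layout of 6 anchors × (4 offsets + 10 class scores) follows the text.

```python
import numpy as np

num_anchors, num_offsets, num_classes = 6, 4, 10
feat = np.random.randn(3, 3, num_anchors * (num_offsets + num_classes))  # a 3x3x84 output

# Map a feature-map point back to the input image. A stride of 4 with a half-cell
# offset is assumed; it reproduces the article's example of (0, 0) -> (2, 2).
stride = 4
i, j = 0, 0
cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
print(cx, cy)   # 2.0 2.0

# Decode the 84 channels at that point: 6 anchors x (4 offsets + 10 class scores)
per_anchor = feat[i, j].reshape(num_anchors, num_offsets + num_classes)
offsets = per_anchor[:, :num_offsets]        # (6, 4) -> (x, y, w, h) offset per anchor
class_scores = per_anchor[:, num_offsets:]   # (6, 10) -> class scores per anchor
print(offsets.shape, class_scores.shape)     # (6, 4) (6, 10)
```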

1.4 The essence of Anchor

          The essence of the Anchor mechanism is the reverse of the SPP (spatial pyramid pooling) idea. SPP resizes inputs of different sizes into an output of the same size; reversing this, an output of fixed size is mapped back to inputs (regions) of different sizes.

         Region proposal (the Anchor mechanism): given a feature tensor of shape n*c*w*h, where n is the number of samples, c the number of channels, and w and h the width and height, each point in the w*h area is used as the center point of candidate regions; each such point is called an Anchor.
         When extracting candidate regions centered on a point, they are usually extracted at several fixed ratios. For example, in Faster RCNN each center point yields 9 candidate regions, so one w*h area yields w*h*9 candidate regions in total.
         These candidate regions are then screened against the ground truth (GT) to obtain positive and negative samples. Positive samples are the regions that contain candidate targets; whether a target is contained is usually judged by IoU, i.e. the overlap between the ground-truth box and the candidate region.
         If the overlap between the ground truth and a candidate region exceeds 70%, it is a positive sample; if it is below 30%, it is a negative sample.

         The 0.7 and 0.3 here are hyperparameters and can be set as needed.
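A minimal sketch of this screening step, assuming an IoU matrix between anchors and ground-truth boxes has already been computed. In the full Faster RCNN assignment the anchor with the highest IoU for each ground truth is also kept as a positive sample; this sketch keeps only the 0.7/0.3 threshold rule.

```python
import numpy as np

def assign_rpn_labels(ious, pos_thr=0.7, neg_thr=0.3):
    """ious: (num_anchors, num_gt) IoU matrix between anchors and ground-truth boxes.
    Returns 1 for positive anchors, 0 for negative (background), -1 for ignored."""
    best_iou = ious.max(axis=1)
    labels = np.full(len(ious), -1, dtype=np.int64)   # ignored by default
    labels[best_iou >= pos_thr] = 1                   # overlap > 0.7 -> positive sample
    labels[best_iou < neg_thr] = 0                    # overlap < 0.3 -> negative sample
    return labels

# Hypothetical IoU matrix: 4 anchors vs. 2 ground-truth boxes
ious = np.array([[0.80, 0.10],
                 [0.20, 0.55],
                 [0.05, 0.02],
                 [0.65, 0.72]])
print(assign_rpn_labels(ious))   # [ 1 -1  0  1]
```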

     The RPN network processes the feature map output by the backbone network (VGG, ResNet, etc.) and generates multiple proposal regions that may contain targets. It consists of two convolutional branches: one branch locates the approximate position of targets in the picture through coordinate regression, and the other finds the foreground regions containing targets through binary classification. The network structure of the RPN is shown in the figure:

        The input to the RPN is the feature map extracted by the backbone network in Figure 1, also called the shared Feature Maps, with shape H (height) × W (width) × C (number of channels). A 3×3 sliding window is slid over this H×W area, yielding H×W 3×3 windows; the center point of each 3×3 window corresponds to the center point of a region in the original image.

       Two fully connected operations are then performed on each feature vector: one branch produces 2 scores (foreground and background confidence) and the other produces 4 coordinates (the offsets of the box relative to the original image coordinates). Since the same fully connected operation is applied to every vector, it is equivalent to applying two 1 × 1 convolutions to the entire feature map, producing feature maps of size 2 × H × W and 4 × H × W. Finally, these are combined with the predefined Anchors in post-processing to obtain the candidate boxes.
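A minimal sketch of this head, assuming a 512-channel shared feature map and k = 9 anchors per location, so the 1 × 1 convolutions output 2k and 4k channels (the article's 2 × H × W and 4 × H × W maps correspond to k = 1). The class name and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head described above: a 3x3 'sliding window' conv over the
    shared feature map, then two 1x1 convs for objectness scores and box offsets."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # 2 scores (foreground/background) and 4 offsets per anchor at every location
        self.cls_logits = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):                   # (N, C, H, W) shared feature map
        x = torch.relu(self.conv(feature_map))
        return self.cls_logits(x), self.bbox_pred(x)  # (N, 2k, H, W), (N, 4k, H, W)

# e.g. a 512-channel, 38x50 feature map from a VGG-style backbone
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)   # torch.Size([1, 18, 38, 50]) torch.Size([1, 36, 38, 50])
```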
 

     Overall, the above can be summarized into the following points:

  • Region proposal (Anchor mechanism)
  • ROI Pooling
  • Classification and regression


Origin blog.csdn.net/qq_41946216/article/details/132801357