Anchor-Free Object Detection Methods

Faster R-CNN anchors: fixed sizes and aspect ratios
YOLO anchor sizes: determined by k-means clustering of the training boxes
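YOLO's clustering step can be sketched as k-means over box widths and heights with a 1 − IoU distance, as in YOLOv2/v3. This is a minimal illustration; `kmeans_anchors` and `iou_wh` are hypothetical helper names, not part of any YOLO codebase:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, assuming boxes share the same top-left corner."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with a 1 - IoU distance, as YOLOv2/v3 do."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the highest IoU (lowest 1 - IoU)
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

The resulting centroids are the anchor (width, height) pairs written into the YOLO config.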

Anchor-free methods

A simple way to understand an anchor: it is a template laid over the feature map, encoding the size and aspect ratio of a candidate detection box.
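To make the "template" view concrete, here is a hypothetical sketch of Faster R-CNN-style anchor generation at a single feature-map location from a set of scales and aspect ratios (the classic 3 × 3 = 9 anchors); `make_anchors` is an illustrative name:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate (x1, y1, x2, y2) anchors centered at (cx, cy).

    For each scale s and ratio r, the anchor area is s*s and
    height/width = r, so w = s / sqrt(r) and h = s * sqrt(r).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```

Sliding this template over every feature-map cell is what produces the very large number of candidate boxes criticized below.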

Summary of anchor-based methods

 

  • Faster R-CNN (upper left)

  • YOLOv3 (upper right)

  • SSD (middle)

  • RetinaNet (bottom)

Although anchor-based methods have achieved great success, they still have some shortcomings:

  • The anchor-box hyperparameters must be tuned manually

  • The number of anchor boxes is very large, which reduces efficiency

  • The anchor samples are imbalanced (far too many negative samples)

How can we bypass the anchor box?

Anchor-free methods

Anchor-free methods can be roughly divided into two types:

  • Methods that use the center point as the reference point (CenterNet)

Advantages: fast

Disadvantages: 1) When the centers of two targets overlap, only one can be detected;

           2) Small targets inside large targets are difficult to detect

  • Methods based on the top-left + bottom-right corner points (CornerNet)

CornerNet

Core idea: to construct the box for a target, we only need to know two key points: its top-left corner and its bottom-right corner.

What problems must be solved to achieve this?

  1. Find two corner points for each target (find points)

  2. Classify each point (top-left or bottom-right)

  3. Match the detected points (matching points), pairing the two points that belong to the same object

overall framework

  1. The image is fed into a ConvNet for feature extraction

  2. The features are split and sent into two branches (classification)

    • The top branch consists of a heatmap (find points) + embeddings (matching), and predicts the top-left point
    • The bottom branch has the same structure and predicts the bottom-right point
  3. Combining the predictions of the two branches yields the detection results

Contents of each module

  • hourglass net (hourglass network)

The hourglass net was originally used for pose estimation; its basic structure is as follows.

 Clearly, this is a multi-scale feature-fusion network.

Since the hourglass net is a composable module, CornerNet stacks two hourglass nets to form the ConvNet part of the model.

  • two outputs

After the hourglass net, the features pass through two different 3×3 convolutions, producing two outputs that are sent to the top-left and bottom-right branches respectively.

tl: top-left     br: bottom-right

  • After the features enter a branch, the following problems arise:
  1. How to detect points?

  2. How to determine whether a detected point is top-left or bottom-right?

  3. How to associate the top-left and bottom-right corners of the same object?

Visualizing a feature map, we expect keypoint locations to have higher responses than other locations.

 That is, the points with the largest responses in the feature map (found with an NMS-like operation) are the key points.
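This max-response extraction can be sketched as a 3×3 max filter acting as NMS, followed by top-k selection; `heatmap_peaks` is a hypothetical helper, not CornerNet's actual implementation:

```python
import numpy as np

def heatmap_peaks(heat, k=5):
    """Keep only local maxima (a 3x3 max filter acts as NMS), then take top-k.

    heat: (H, W) response map. Returns (score, y, x) tuples, best first.
    """
    H, W = heat.shape
    padded = np.pad(heat, 1, constant_values=-np.inf)
    # 3x3 neighborhood maximum at every location
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)]).max(axis=0)
    # suppress every location that is not its own neighborhood maximum
    keep = np.where(heat == neigh, heat, -np.inf)
    idx = np.argsort(keep.ravel())[::-1][:k]
    return [(keep.ravel()[i], i // W, i % W)
            for i in idx if np.isfinite(keep.ravel()[i])]
```

In frameworks like PyTorch, the same effect is usually obtained with a stride-1 max-pooling layer and an equality mask.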

The next question: how do we judge whether a detected key point is a top-left corner or a bottom-right corner?

 If a point is a top-left corner, the target should lie to its lower right;

conversely, the target should lie to the upper left of a bottom-right corner.

Given this, we can design pooling layers that operate in different directions, with the goal of retaining features from the expected region and discarding features from unexpected regions.

corner pooling

Taking the maximum value to the right and the maximum value below makes the top-left corner respond most strongly ==> this is equivalent to shifting the bottom-right content toward the top-left point.

In implementation, the maxima are computed by scanning in the opposite directions (right-to-left and bottom-to-top), and the two results are then added.

The three branches are parallel.
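The opposite-direction scans can be written as cumulative maxima. This is an illustrative NumPy sketch of top-left corner pooling, not the official CUDA implementation; `feat_h`/`feat_v` stand for the two input feature maps:

```python
import numpy as np

def top_left_corner_pool(feat_h, feat_v):
    """Top-left corner pooling on two feature maps (CornerNet style).

    For each position, take the max of feat_h over everything to its right
    (right-to-left cumulative max) and the max of feat_v over everything
    below it (bottom-to-top cumulative max), then sum the two results.
    """
    # right-to-left running maximum along the width axis
    h_pool = np.maximum.accumulate(feat_h[:, ::-1], axis=1)[:, ::-1]
    # bottom-to-top running maximum along the height axis
    v_pool = np.maximum.accumulate(feat_v[::-1, :], axis=0)[::-1, :]
    return h_pool + v_pool
```

Bottom-right corner pooling is the mirror image: left-to-right and top-to-bottom scans.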

How are the top-left and bottom-right points connected?

While locating the two key points, we also extract an embedding feature for each of them.

The closer two embeddings are, the more likely the points belong to the same object.

Therefore, in the two branches, the tl features enter the tl corner pooling module (and likewise for br), and each branch outputs:

- heatmaps (location)
- embeddings (grouping)
- offsets (location refinement)

In summary, the overall structure of CornerNet is shown in the figure.

loss function

  1. Heatmap loss (size: batch_size (bs) × 128 × 128 (feature-map size) × 80 (number of categories) × 2 — some networks output both 1 and 0 channels, some do not)

y_cij: the ground-truth value at position (i, j) of the c-th heatmap, indicating whether that location is a corner of class c

focal loss (cross-entropy loss with an added power term)

Focal Loss is a loss function for addressing class imbalance, proposed by Lin et al. in the paper "Focal Loss for Dense Object Detection". It reduces the contribution of easy-to-classify samples to the loss, thereby increasing the relative weight of hard samples and improving performance under class imbalance.

The formula of Focal Loss is as follows: (loss of heat map)

FL(p_t) = -α_t(1-p_t)^γ * log(p_t)

Here, p_t is the probability predicted by the model, α_t is the class weight, and γ is the factor that balances easy and hard samples. When γ = 0, focal loss reduces to (weighted) cross-entropy; when γ > 0, it down-weights easy samples and up-weights hard ones.
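A direct NumPy transcription of the formula above for the binary case; the function name is illustrative, and the default α = 0.25, γ = 2 follow the common choices from the RetinaNet paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probabilities in (0, 1); y: binary labels {0, 1}.
    With gamma = 0 this reduces to alpha-weighted cross-entropy.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)               # numerical safety for log()
    p_t = np.where(y == 1, p, 1 - p)             # prob of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha) # class weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

Note that CornerNet's actual heatmap loss is a variant of this that additionally down-weights negatives near a ground-truth corner using the Gaussian penalty.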

  2. Embedding loss (pull + push)

pull: pulls the two corner embeddings of the same object as close together as possible

push: pushes the embeddings of different objects as far apart as possible

  3. Offset loss (size: bs × 128 × 128 × 2, i.e., (Δx, Δy))
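The pull and push terms can be sketched as a small associative-embedding loss. `pull_push_loss`, `etl`, and `ebr` are hypothetical names, and the normalization is simplified relative to the paper's exact formulation:

```python
import numpy as np

def pull_push_loss(etl, ebr, delta=1.0):
    """Associative-embedding loss over corner embeddings (CornerNet style).

    etl, ebr: 1-D arrays, the top-left / bottom-right embedding of each object.
    pull: both corners of an object move toward their mean embedding.
    push: means of different objects are pushed at least `delta` apart.
    """
    ek = (etl + ebr) / 2.0  # per-object mean embedding
    pull = np.mean((etl - ek) ** 2 + (ebr - ek) ** 2)
    n = len(ek)
    push = 0.0
    if n > 1:
        diff = np.abs(ek[:, None] - ek[None, :])
        hinge = np.maximum(0.0, delta - diff)
        # subtract the diagonal (diff = 0 gives hinge = delta) and average over pairs
        push = (hinge.sum() - n * delta) / (n * (n - 1))
    return float(pull), float(push)
```

At inference time the same embeddings are compared directly: a tl/br pair is kept only if their embedding distance is small.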

overall loss: the weighted sum of the three terms above

CenterNet

CenterNet surpasses YOLOv3 in both speed and accuracy.

 First, the image is fed into the backbone network for downsampling, producing a feature map.

Second, predictions are made by three branches.

Unlike CornerNet, the feature maps do not require corner pooling.

Branch 1: predicts the center point, giving approximate coordinates on the feature map

Branch 2: estimates an offset to compensate for the position error introduced by operations such as pooling and padding

Branch 3: estimates the box size, yielding the target box
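The three branch outputs can be combined in a decoding step roughly like this sketch (single best detection only; `decode_center` and the exact head layouts are hypothetical):

```python
import numpy as np

def decode_center(heat, offset, size, stride=4):
    """Decode the single best CenterNet-style detection from three heads.

    heat:   (H, W) center heatmap
    offset: (2, H, W) sub-pixel (dx, dy) correction
    size:   (2, H, W) predicted box (w, h) in input pixels
    Returns (x1, y1, x2, y2, score) in input-image coordinates.
    """
    # Branch 1: coarse center location = heatmap peak
    y, x = np.unravel_index(np.argmax(heat), heat.shape)
    # Branch 2: sub-pixel offset correction
    dx, dy = offset[0, y, x], offset[1, y, x]
    # Branch 3: box size
    w, h = size[0, y, x], size[1, y, x]
    cx = (x + dx) * stride  # map feature-map coords back to the image
    cy = (y + dy) * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, float(heat[y, x]))
```

A real decoder repeats this for the top-k peaks per class; the principle is the same.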

 Compared with previous anchor-based methods, the ground truth is defined differently.

Anchor-based methods: compute IoU; a box whose IoU with the ground truth exceeds 0.7 is a positive sample, otherwise it is negative.

Anchor-free methods: the center point follows a Gaussian distribution, with the weight decaying gradually as it spreads outward.
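A minimal sketch of such a Gaussian ground-truth heatmap: the value is 1 at the object center and decays outward, so nearby locations are penalized less than distant ones. The real CenterNet derives σ from the object size, which is omitted here:

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=2.0):
    """Ground-truth heatmap with a Gaussian peak at the object center.

    shape: (H, W) of the feature map; center: (cx, cy) in feature coords.
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]  # per-pixel row/column coordinates
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

With multiple objects, per-object heatmaps are merged with an element-wise maximum so overlapping Gaussians do not sum above 1.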

loss function

 

pose estimation

Furthermore, CenterNet can also be used for pose estimation.

That is, by predicting N key points of the human body and grouping them, multiple body-part locations are obtained, yielding the pose.


Origin blog.csdn.net/qq_54809548/article/details/130927854