NMS - Convolutional Neural Network

1- Traditional NMS

NMS (Non-Maximum Suppression) has important applications in many computer vision problems, especially in object detection.

Taking face detection as an example, the usual pipeline consists of 3 steps:

(1) Generate a large number of candidate windows through sliding windows or other object-proposal methods;

(2) Classify each candidate window with a trained classifier, which can be regarded as a scoring step;

(3) Use NMS to fuse the above detection results (because a target may be detected in multiple windows, and we only want to keep one).

The following figure shows the detection result after step (2):

(figure omitted)

Taking this figure as an example, traditional NMS first selects an IOU threshold, say 0.25. All 4 windows (bounding boxes) are then sorted by score from high to low. The window with the highest score is selected, and the overlap ratio (IOU) between each of the remaining 3 windows and that window is computed; any window whose IOU exceeds the 0.25 threshold is deleted. Then the highest-scoring window among those remaining is selected, and the process repeats until all windows have been processed.
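The procedure just described can be sketched in a few lines of NumPy (a minimal illustration, not the paper's code; boxes are assumed to be in [x1, y1, x2, y2] form):

```python
import numpy as np

def iou(box, boxes):
    # IOU of one box against an array of boxes, all in [x1, y1, x2, y2] form
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.25):
    order = np.argsort(scores)[::-1]          # sort windows by score, high to low
    keep = []
    while order.size > 0:
        best = order[0]                       # highest-scoring remaining window
        keep.append(int(best))
        # delete every remaining window whose IOU with it exceeds the threshold
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```

With a 0.25 threshold, heavily overlapping windows collapse onto the single top-scoring one, while distant windows survive.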

If 0.25 is a good threshold, we get the desired result:

(figure omitted)

If the IOU threshold is set very small, say 0.1, then the windows of the 2 people are merged into one person, producing the following erroneous result:

(figure omitted)

If the IOU threshold is set too large, say 0.6, you may get the following erroneous result:

(figure omitted)

The above shows how important, and also how difficult, choosing a good threshold is for the traditional NMS algorithm. Traditional NMS is a hard decision made by a greedy algorithm, which is why the article calls it GreedyNMS.

2- NMS-ConvNet

To recap: traditional NMS uses only two pieces of information in its decision fusion, the score of each box and the IOU (overlap ratio) between boxes.

The article implements NMS with a neural network that uses these same two pieces of information. The overall flow is shown below:

(figure omitted)

As the red boxes in the network structure diagram show, there are two input data layers: the Score map and the IOU layer. Let's look at how these two layers are built from the original bounding boxes.

2-1 Mapping - making a score map

Assume the original image size is W×H; the score map we build then has size w×h, where w=W/4 and h=H/4, so that each point on the score map corresponds to a 4×4 region of the original image:

(figure omitted)

For each bounding box, we compute its center, determine which region the center falls into, and fill the box's score into the corresponding position of the score map:

(figure omitted)

If the centers of multiple bounding boxes fall into the same area, only the highest score is recorded.

So far, we have obtained the score map of w×h×1.
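This mapping can be sketched as follows (a minimal illustration, assuming boxes in [x1, y1, x2, y2] pixel coordinates):

```python
import numpy as np

def make_score_map(boxes, scores, W, H):
    # score map is (H/4) x (W/4); each cell covers a 4x4 region of the image
    w, h = W // 4, H // 4
    score_map = np.zeros((h, w), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        # map the box centre to its grid cell (clamped to the map border)
        cx = min(int((x1 + x2) / 2) // 4, w - 1)
        cy = min(int((y1 + y2) / 2) // 4, h - 1)
        # several centres in one cell: keep only the highest score
        score_map[cy, cx] = max(score_map[cy, cx], s)
    return score_map
```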

The article notes that traditional NMS requires sorting, which is hard to simulate with linear combinations and nonlinear activations inside a convolutional neural network. Therefore, the article first runs traditional NMS on the bounding boxes and generates a second score map of the same size from the result, denoted S(T), where T is the NMS threshold.

Stacking the two, we finally get a score map of w×h×2, denoted S(1,T).

2-2 Make IOU layer

IOU stands for intersection-over-union; accordingly, the IOU layer describes the overlap between bounding boxes.

The IOU layer has size w×h×121, where 121=11×11; that is, it describes the overlap relationships among the boxes within the 11×11 neighborhood of each point.

In the example below, the left picture is the score map S(1); pink marks positions holding a value, and each pink position corresponds to a bounding box. Draw an 11×11 window centered on the red position; for each position in the window, compute the IOU between its box and the center position's box, and record the value as I, as shown on the right. Likewise, it is easy to see that I(7,8,i) must be 0.

(figure omitted)
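The construction of the IOU layer can be sketched as follows (an illustration under the assumption that each occupied grid cell stores exactly one box; not the paper's code):

```python
import numpy as np

def iou_xyxy(a, b):
    # IOU of two boxes in [x1, y1, x2, y2] form
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def make_iou_layer(h, w, cell_boxes, n=11):
    # cell_boxes: dict mapping each occupied (row, col) cell to its bounding box
    r = n // 2
    iou_layer = np.zeros((h, w, n * n), dtype=np.float32)
    for (cy, cx), box in cell_boxes.items():
        # one channel per position in the n x n neighbourhood of this cell
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                other = cell_boxes.get((cy + dy, cx + dx))
                if other is not None:
                    iou_layer[cy, cx, (dy + r) * n + (dx + r)] = iou_xyxy(box, other)
    return iou_layer
```

Empty cells contribute zeros, which matches the observation that positions with no box in the neighborhood must have I = 0.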

2-3 Network Analysis

Let's look at the network structure again:

(figure omitted)

Note 2 points:

(1): The kernel size of the IOU-layer convolution is 1, and its stride is also 1.
The kernel size of the score-map convolution is 11×11, echoing the 11×11 neighborhood of the IOU layer; its stride is 1 and its pad is 5, so the output has the same spatial size as the input.

(2): Layer 2 concatenates the two previous outputs, and all subsequent convolutions are 1×1. The final output is still a score map of the same size.
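The size-preserving claim in point (1) can be checked with the standard convolution output-size formula:

```python
def conv_out_size(n, kernel, stride=1, pad=0):
    # standard formula for the spatial size of a convolution output
    return (n + 2 * pad - kernel) // stride + 1
```

For any input size n, an 11×11 kernel with pad 5 and stride 1 returns n, as does a 1×1 kernel with stride 1 and no pad.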

2-4 Output and Loss

The ideal output is a score map that is exactly the same size as the input. In this map, each target has only one score and correspondingly only one bounding box.

So the goal of training is to keep one response and suppress the rest, as illustrated below:

(figure omitted)

(1) The score map in Figure a above is our input. It is easy to see from the figure that there are 5 valid scores in total, corresponding to 5 bounding boxes.

(2) Assume the five bounding boxes are all detections of the same target. Our training objective is then to keep the best one and suppress the remaining 4.
To this end, we first assign labels: among the 5 bounding boxes, the box whose IOU with the ground truth exceeds 0.5 and which has the highest score becomes the positive sample; the rest are negative samples, as shown in Figure b above.

(3) Obviously, the numbers of positive and negative samples are seriously unbalanced, so before computing the loss we assign each sample a weight to counter the imbalance. The assignment is simple: as shown in Figure c, the sum of the positive-sample weights equals the sum of the negative-sample weights.

(4) The right-hand picture above shows the ideal output. From all of the above, the Loss Function follows easily (it is similar to pixel-level classification):

(formula omitted)

where p belongs to G, the set of valued points in the score map.
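Since the loss formula itself is only in the figure, here is a sketch of a weighted pixel-wise logistic loss consistent with the description above (the exact form used in the paper may differ):

```python
import numpy as np

def balanced_weights(labels):
    # Figure c: total positive weight equals total negative weight
    labels = np.asarray(labels)
    n_pos = np.sum(labels > 0)
    n_neg = np.sum(labels < 0)
    return np.where(labels > 0, 1.0 / n_pos, 1.0 / n_neg)

def weighted_loss(scores, labels):
    # pixel-wise weighted logistic loss over the valued points p in G;
    # labels are +1 for the kept positive box and -1 for the suppressed ones
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    w = balanced_weights(labels)
    return float(np.sum(w * np.log1p(np.exp(-labels * scores))))
```

With 1 positive and 4 negatives, the positive gets weight 1 and each negative 1/4, so both classes contribute equally to the loss.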

 

The article is transferred from: https://blog.csdn.net/shuzfan/article/details/50371990
