Deep Neural Networks for Object Detection Thinking

This article is purely my own point of view; it is limited by my knowledge and ability, so there are bound to be mistakes. The notes and the reflections will each get a separate article.

Statement of the problem:

Not only classify objects, but also localize them precisely.

The purpose of the paper:

Solve object detection through DNN regression: with a single formulation, detection is cast as regression of binary bounding-box (BB) masks. In addition, a multi-scale inference procedure is proposed that can turn the low-resolution network output into high-resolution object detections at low cost.

Challenges raised by the paper:

Highlights of the paper:

Deep regression to solve the localization problem

Robust detection of multiple objects via multiple masks

Multi-scale refinement of the DNN localizer

Method of the paper:

The overall idea: use the DNN to generate masks from windows, merge these masks into one high-quality full-image mask, and then extract detections from that mask with a simple bounding-box inference step.

1. The DNN network generates an object mask. What is a mask?

The mask is a binary mask of the object's bounding box, the same size as the image: positions covered by a detected object are 1, everything else is 0. A mask may also be obtained from only part of the object box; this comes from the precise-localization method proposed later, where the BB is divided into the full box, the left half, and so on, each part forming its own mask.

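The mask construction described above can be sketched as follows (a minimal NumPy sketch; the function name `bbox_to_mask` and the (x1, y1, x2, y2) box convention are my own illustration, not from the paper):

```python
import numpy as np

def bbox_to_mask(height, width, bbox):
    """Binary mask the size of the image: 1 inside the object's
    bounding box, 0 elsewhere. bbox = (x1, y1, x2, y2) in pixels."""
    mask = np.zeros((height, width), dtype=np.uint8)
    x1, y1, x2, y2 = bbox
    mask[y1:y2, x1:x2] = 1
    return mask

m = bbox_to_mask(8, 8, (2, 3, 6, 7))  # a 4x4 box in an 8x8 image
print(m.sum())  # → 16
```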
2. The output of the DNN network is a mask whose size is very different from the size of the input. How is the mask transformed back to the input image size?

3. How are detections extracted from the mask, what is the process, and what do we get in the end?

Paper details

1. Detection as DNN regression

Modify AlexNet's classification network for localization: replace the final softmax layer with a regression layer that generates a binary mask of the object. Because the mask is produced by the network and the network's output size is fixed, the predicted mask has a fixed size d*d = N, where N is the total number of output pixels; the mask is then resized to the size of the image. The final mask can represent multiple objects: a pixel is 1 if it lies inside the BB of an object of the given class, otherwise 0.

Loss function:
$$\min_{\Theta}\ \sum_{(x,m)\in D} \left\| \left(\mathrm{Diag}(m)+\lambda I\right)^{1/2}\left(\mathrm{DNN}(x;\Theta)-m\right) \right\|_2^2$$
where m is the ground-truth mask and D is the training set of images whose objects are expressed as binary BB masks.

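As a sanity check of the loss above, here is a minimal NumPy sketch on flattened d*d masks (the function name and the sample λ value are illustrative; the Diag(m) + λI weighting gives background pixels, where m = 0, only the small weight λ, so the few foreground pixels are not swamped):

```python
import numpy as np

def mask_regression_loss(pred, m, lam=0.01):
    """||(Diag(m) + lam*I)^{1/2} (DNN(x) - m)||_2^2 on flattened masks.
    Foreground pixels get weight 1 + lam, background pixels only lam."""
    weights = m + lam                 # diagonal of Diag(m) + lam*I
    return float(np.sum(weights * (pred - m) ** 2))

m = np.array([1.0, 1.0, 0.0, 0.0])     # ground-truth mask
pred = np.array([0.9, 0.8, 0.1, 0.2])  # network output
print(round(mask_regression_loss(pred, m), 4))  # → 0.051
```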
2. Accurate target detection through mask

2.1 Put forward three problems to be solved:

  • A single object mask may not be enough to disambiguate adjacent objects
  • Due to the limitation of the output size, the mask we generated is much smaller than the size of the original image.
  • Since we use a complete image as input, small objects will affect very few input neurons, making it difficult to recognize.

2.2 Multiple masks for robust localization

Previously the network produced one mask per object box; now it produces several masks per box, one for the full box and one for each part: top, bottom, left, right.
$$m^h(i,j;bb) = \frac{\mathrm{area}\big(bb(h)\cap T(i,j)\big)}{\mathrm{area}\big(T(i,j)\big)} \qquad (1)$$
where T(i,j) is the image rectangle covered by output cell (i,j) of the network, and bb(h) is part h (full, top, bottom, left, right) of the ground-truth rectangle bb.

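Formula (1) can be checked with a small sketch (the (x1, y1, x2, y2) box convention and the function names are mine):

```python
def area(b):
    """Area of a box (x1, y1, x2, y2)."""
    return (b[2] - b[0]) * (b[3] - b[1])

def partial_mask_value(bb_h, cell):
    """m^h(i, j; bb): the fraction of output cell T(i, j) covered by
    box part bb(h) -- formula (1)."""
    x1, y1 = max(bb_h[0], cell[0]), max(bb_h[1], cell[1])
    x2, y2 = min(bb_h[2], cell[2]), min(bb_h[3], cell[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # intersection area
    return inter / area(cell)

# top half of a 10x10 ground-truth box vs. one 4x4 output cell
print(partial_mask_value((0, 0, 10, 5), (2, 2, 6, 6)))  # → 0.75
```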
2.3 Object localization from the DNN output

The candidate bounding boxes of each image are evaluated by a score formula: one score against the full mask m, and one against each corresponding part mask.

The predicted boxes are then pruned by two filters: 1. drop boxes whose score falls below a threshold; 2. apply a classifier trained on the classes of interest and keep only boxes whose predicted class matches the current detector's class (i.e. delete a box if its content is not a class of interest). Finally, apply NMS.

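The threshold filter plus NMS can be sketched as follows (a generic greedy NMS, since the paper does not spell out the exact variant; the classifier filter is omitted and all names are mine):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def filter_detections(dets, score_thr=0.5, iou_thr=0.5):
    """Step 1: drop (score, box) pairs below the score threshold.
    Then greedy NMS: keep a box only if it does not overlap an
    already-kept, higher-scoring box too much."""
    dets = sorted((d for d in dets if d[0] >= score_thr),
                  key=lambda d: d[0], reverse=True)
    keep = []
    for score, box in dets:
        if all(iou(box, kept_box) < iou_thr for _, kept_box in keep):
            keep.append((score, box))
    return keep

dets = [(0.9, (0, 0, 10, 10)),    # kept
        (0.8, (1, 1, 11, 11)),    # suppressed (IoU ≈ 0.68 with the first)
        (0.7, (20, 20, 30, 30)),  # kept (no overlap)
        (0.3, (5, 5, 15, 15))]    # dropped by the score threshold
print(len(filter_detections(dets)))  # → 2
```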
3. Multi-scale refinement of the DNN localizer

The paper describes two methods to deal with the limited output resolution: 1. apply the DNN localizer over several scales and several large sub-windows; 2. re-apply the DNN localizer to the top inferred boxes. The first process is shown in Algorithm 1.

Here are a few questions about the process:

3.1. Do the multiple scales repeatedly mentioned in the text refer to windows or images? What are they?

In the algorithm both the images and the windows come at multiple scales, with one difference: the image scales are computed, while the windows are generated. This raises a question: how are the windows generated? The masks come from the DNN, but where do the windows come from? My guess is that a window is either a labeled BB or a predicted BB. So "scale" refers to differently resized images, while the window size stays fixed.

3.2. A plain-language understanding of the Algorithm 1 process?

First compute a few suitable image scales, then process the image at each scale to generate windows (this is the doubtful part: the paper does not say how they are generated; I think the windows must already exist, either as labels or as earlier predictions). Each selected window is passed through the trained network to produce a mask, and the masks are merged. Scores are then computed for the candidate boxes, which are filtered and added to the detection set.

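The mask-merging step can be sketched as follows, assuming each window's fixed d*d output is resized back to the window's image region and overlapping predictions are averaged (the averaging and the nearest-neighbour resize are my assumptions; the paper only says the masks are merged):

```python
import numpy as np

def merge_window_masks(image_shape, windows, masks):
    """Paste each window's predicted d*d mask back into image
    coordinates (nearest-neighbour upsampling) and average
    overlapping predictions into one full-image mask."""
    acc = np.zeros(image_shape)
    cnt = np.zeros(image_shape)
    for (x1, y1, x2, y2), mask in zip(windows, masks):
        h, w = y2 - y1, x2 - x1
        ys = np.arange(h) * mask.shape[0] // h   # nearest-neighbour rows
        xs = np.arange(w) * mask.shape[1] // w   # nearest-neighbour cols
        acc[y1:y2, x1:x2] += mask[np.ix_(ys, xs)]
        cnt[y1:y2, x1:x2] += 1
    # average where at least one window contributed, 0 elsewhere
    return np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)

# one full-image window whose 2x2 mask is all foreground
merged = merge_window_masks((4, 4), [(0, 0, 4, 4)], [np.ones((2, 2))])
print(merged.sum())  # → 16.0
```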
3.3. Understanding of the Algorithm 2 process?

This process is the refinement. It traverses the set of detection boxes produced by Algorithm 1, enlarges each BB, crops the enlarged region from the image, runs the DNN on that crop, and puts the highest-scoring resulting BB into the refined detection set.

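The refinement loop just described can be sketched as follows (the enlargement factor, the `localizer` callable, and all names are hypothetical placeholders, not the paper's actual interface):

```python
import numpy as np

def enlarge(bb, factor, img_w, img_h):
    """Enlarge a box about its centre by `factor`, clipped to the image."""
    x1, y1, x2, y2 = bb
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) * factor / 2, (y2 - y1) * factor / 2
    return (max(0, cx - hw), max(0, cy - hh),
            min(img_w, cx + hw), min(img_h, cy + hh))

def refine(detections, image, localizer, factor=1.2):
    """For each (score, box) detection: crop the enlarged box, re-run
    the DNN localizer on the crop, keep its highest-scoring box."""
    h, w = image.shape[:2]
    refined = []
    for _, bb in detections:
        x1, y1, x2, y2 = (int(v) for v in enlarge(bb, factor, w, h))
        candidates = localizer(image[y1:y2, x1:x2])  # hypothetical callable
        refined.append(max(candidates, key=lambda c: c[0]))
    return refined
```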

Origin blog.csdn.net/qq_40940540/article/details/109623803