Vegetarian paddle seventh: the basic concept of target detection

1. What is target detection? 

In the previous articles, we learned how to use convolutional neural networks for image classification, for example recognizing the ten handwritten digits 0 to 9. Image classification identifies a single object per image. Target detection goes further: it can recognize multiple objects in one image, and for each object it determines both its category and its location. For example, consider the following picture:

Target detection tells us not only that this picture contains both a puppy and a kitten, but also that the puppy is in the red box on the left and the kitten is in the red box on the right. In other words, the output of target detection is [target classification + target coordinates].

2. Concepts involved in target detection

1. Bounding box

The detection task has to predict the category and the location of an object at the same time, so some concepts related to location need to be introduced. The location of an object is usually represented by a bounding box (bbox), which is the smallest rectangle that fully encloses the object. The red boxes around the puppy and the kitten in the picture above are two bounding boxes.

2. Ways to express the position of a bounding box

  • xyxy, that is, (x1, y1, x2, y2), where (x1, y1) is the coordinates of the upper left corner of the rectangle and (x2, y2) is the coordinates of the lower right corner.
  • xywh, that is, (x, y, w, h), where (x, y) is the coordinates of the center point of the rectangle, w is its width, and h is its height.
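The two formats carry the same information, so one can be converted into the other. The following is a minimal sketch of that conversion, assuming boxes are plain Python tuples and that xywh uses the center point as defined above; the function names are made up for illustration.

```python
def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    w = x2 - x1
    h = y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def xywh_to_xyxy(box):
    """Convert (center x, center y, width, height) back to corner coordinates."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


if __name__ == "__main__":
    corners = (10, 20, 50, 80)
    center_form = xyxy_to_xywh(corners)
    print(center_form)
    print(xywh_to_xyxy(center_form))
```

Note that some libraries use the name xywh for a box given by its upper left corner plus width and height, so it is worth checking which convention a dataset or framework follows before converting.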

3. Prediction box

To complete a target detection task, we want the model to output a set of predicted bounding boxes for the input picture, together with the category of the object in each box or the probability that it belongs to a certain category. One possible format is [L, P, x1, y1, x2, y2], where L is the category label and P is the probability that the object belongs to that category. A single input image may produce multiple prediction boxes.
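As a rough illustration of the [L, P, x1, y1, x2, y2] format, the sketch below stores each prediction box in a small record. The class name `Prediction`, the helper `keep_confident`, the 0.5 threshold, and all numbers are assumptions invented for this example; they are not the output of any real model.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    """One prediction box in the [L, P, x1, y1, x2, y2] layout."""
    label: str    # L: category label
    prob: float   # P: probability of belonging to that category
    x1: float     # upper left corner
    y1: float
    x2: float     # lower right corner
    y2: float


def keep_confident(preds: List[Prediction], threshold: float = 0.5) -> List[Prediction]:
    """Hypothetical helper: keep only boxes whose probability reaches the threshold."""
    return [p for p in preds if p.prob >= threshold]


# Made-up predictions for a single image containing a dog and a cat.
predictions = [
    Prediction("dog", 0.92, 30, 60, 210, 300),
    Prediction("cat", 0.88, 240, 80, 400, 310),
    Prediction("cat", 0.30, 10, 10, 50, 40),
]

if __name__ == "__main__":
    for p in keep_confident(predictions):
        print(p.label, p.prob, (p.x1, p.y1, p.x2, p.y2))
```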

4. Anchor box

An anchor box is different from an object's bounding box: it is a box that we generate ourselves according to certain rules. We first set the size and shape of the anchor box, and then draw a rectangle of that size centered on a chosen point in the image. In a target detection task, a series of anchor boxes is usually generated over the picture in this way, and these anchor boxes are treated as candidate regions. The model predicts whether each candidate region contains an object and, if it does, which category the object belongs to. More importantly, because the position of an anchor box is fixed, it is unlikely to coincide exactly with the bounding box of the object. The anchor box therefore has to be fine-tuned to form a prediction box that accurately describes the object's position, and the model also needs to predict the magnitude of this fine-tuning. Different models often generate anchor boxes in different ways.
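The two steps described above, generating anchor boxes around a center point and then fine-tuning one of them, can be sketched as follows. This is only one possible scheme under stated assumptions: anchors are defined by a scale (the square root of the box area) and a width-to-height ratio, and the fine-tuning shifts the center by a fraction of the anchor size and rescales width and height exponentially. Real models define their own rules, and the function names here are hypothetical.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Generate xyxy anchor boxes centered on (cx, cy).

    Assumption: each anchor has area scale**2 and width/height equal to ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def apply_offsets(anchor, offsets):
    """Fine-tune an xyxy anchor with predicted offsets (tx, ty, tw, th).

    Assumed parameterization: the center moves by tx * width and ty * height,
    and the size is multiplied by exp(tw) and exp(th).
    """
    x1, y1, x2, y2 = anchor
    tx, ty, tw, th = offsets
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    new_cx = cx + tx * w
    new_cy = cy + ty * h
    new_w = w * math.exp(tw)
    new_h = h * math.exp(th)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


if __name__ == "__main__":
    anchors = make_anchors(100, 100, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
    print(len(anchors), "anchors around (100, 100)")
    # Shift the square 64x64 anchor half a width to the right, keeping its size.
    print(apply_offsets(anchors[4], (0.5, 0.0, 0.0, 0.0)))
```

With all offsets equal to zero the prediction box is identical to the anchor box, which matches the idea that the offsets only describe how far the anchor has to be adjusted.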

5. Intersection over Union (IoU)

In detection tasks, Intersection over Union (IoU) is used as a measure. The concept comes from set theory in mathematics, where it describes the relationship between two sets A and B. It is equal to the number of elements in the intersection of the two sets divided by the number of elements in their union. The calculation formula is:

IoU(A, B) = |A ∩ B| / |A ∪ B|

For example, if A = {1, 2, 3, 4} and B = {3, 4, 5, 6}, the intersection has 2 elements and the union has 6, so the IoU is 2/6 = 1/3.

We use IoU to describe the degree of overlap between two boxes. Each box can be regarded as a set of pixels, so the IoU of two boxes is equal to the area of their overlapping part divided by the area of their union, as shown in the figure below:
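For axis-aligned boxes the areas can be computed directly from the corner coordinates. The sketch below assumes both boxes are in xyxy format with x2 >= x1 and y2 >= y1; the function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two xyxy boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Union = sum of both areas minus the part counted twice.
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
    print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # no overlap
```

The result always lies between 0 (no overlap) and 1 (the two boxes coincide exactly).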

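As we said in the articles on image classification, a neural network needs a loss function to train against. IoU is a good measure of the quality of a predicted box: the closer it is to 1, the better the prediction box matches the true bounding box. It can therefore also serve as the basis of a loss function, for example 1 - IoU.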

 


Origin blog.csdn.net/duzm200542901104/article/details/128296289