(3) Several boxes in object detection [anchor, bbox, prior box, grid cell, ROI, proposal, DenseBox]

bbox (bounding box)

  Bbox is short for bounding box: the rectangular box that represents the position and size of an object in object detection. In the training set, each target object in an image is usually annotated in advance with a corresponding Bbox that records where the object is and how large it is.
  At prediction time, the model outputs Bboxes at multiple positions and scales in the image, and these boxes identify the target objects the image contains.
  Bbox is closely related to the anchor box, because anchor boxes usually serve as predefined candidate boxes that capture areas that may contain a target. The final Bbox is obtained from these candidates through further filtering and adjustment.
  Bbox is therefore often called a detection box, because it is the main way to represent the position and size of a detected object.
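  Since a Bbox encodes position and size, it helps to see how it can be stored. The sketch below assumes two common encodings that the text above does not name explicitly: corner form (x_min, y_min, x_max, y_max) and center form (center_x, center_y, width, height). The function names are hypothetical and chosen only for illustration.

```python
def xyxy_to_cxcywh(box):
    """Convert corner form (x_min, y_min, x_max, y_max) to center form (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)


def cxcywh_to_xyxy(box):
    """Convert center form (cx, cy, w, h) back to corner form."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Illustrative box in pixel coordinates (assumed values).
corner_box = (10, 20, 50, 100)
center_box = xyxy_to_cxcywh(corner_box)      # (30.0, 60.0, 40, 80)
round_trip = cxcywh_to_xyxy(center_box)      # (10.0, 20.0, 50.0, 100.0)
```

  Both forms carry the same information; which one a dataset or model uses is a convention that has to be checked case by case.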

anchor (prior box)

  The final Bbox is obtained by filtering anchors.
  Anchor and Bbox play different roles in object detection.
  Anchors are a set of predefined candidate boxes used to locate positions on the input image that may contain a target object. Specifically, a series of anchor boxes with different sizes and aspect ratios is generated across the image; these are then matched against the target objects and adjusted, which finally yields candidate boxes that contain a target.
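  As a rough illustration of "anchor boxes with different sizes and aspect ratios", the sketch below generates anchors around a single center point. The function name `make_anchors`, the scale values, and the ratio values are assumptions made for this example; real detectors repeat this at many positions and pick their own scales.

```python
import math


def make_anchors(center, scales, ratios):
    """Return anchors as (x_min, y_min, x_max, y_max), all centered at `center`.

    Each anchor has area scale**2 and aspect ratio width / height = ratio.
    """
    cx, cy = center
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Two scales x three aspect ratios = six anchors at one location (assumed values).
anchors = make_anchors(center=(100, 100), scales=[32, 64], ratios=[0.5, 1.0, 2.0])
```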
  More specifically, during training the IoU (intersection over union) between each anchor and the ground-truth box determines which anchors are treated as containing a target, and the model learns how to adjust those anchors to fit the target better. Anchor boxes can therefore be regarded as candidate boxes, since they are used to find regions that may contain objects.
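  The IoU comparison can be sketched as follows. Boxes are assumed to be in corner form (x_min, y_min, x_max, y_max). The labeling rule and its thresholds (0.7 for positive, 0.3 for negative, anything in between ignored) are illustrative assumptions, not values given in this article.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_box, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor: 1 = contains target, 0 = background, -1 = ignored."""
    labels = []
    for anchor in anchors:
        overlap = iou(anchor, gt_box)
        if overlap >= pos_thr:
            labels.append(1)
        elif overlap < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


gt_box = (0, 0, 10, 10)
candidate_anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
labels = label_anchors(candidate_anchors, gt_box)   # [1, -1, 0]
```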

How to determine whether the candidate box contains the target?

  Anchor boxes are usually preset according to information such as the size and shape of the target objects in the training set. Their number and sizes can be determined by clustering (for example, K-means): cluster the bounding boxes of all target objects in the training set to obtain several cluster centers, and use these cluster centers as the anchor boxes. During training, the model then predicts the position and confidence of target objects relative to the anchor boxes, which is how detection is realized.
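  The clustering step can be sketched in plain Python. This is a minimal version under stated assumptions: boxes are reduced to (width, height) pairs, similarity is measured by IoU between boxes aligned at a shared corner (the choice popularized by YOLOv2, rather than Euclidean distance), and the initial centers are picked deterministically so the result is reproducible. The name `kmeans_anchors` and the sample data are hypothetical.

```python
def wh_iou(a, b):
    """IoU of two (w, h) boxes, assuming they share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes_wh, k, iters=50):
    """Cluster (w, h) pairs into k anchor shapes, sorted by area."""
    # Deterministic init: take k boxes evenly spaced after sorting by area.
    ordered = sorted(boxes_wh, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes_wh:
            # Assign the box to the center it overlaps most.
            best = max(range(k), key=lambda j: wh_iou(box, centers[j]))
            clusters[best].append(box)
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(m[0] for m in members) / len(members)
                mean_h = sum(m[1] for m in members) / len(members)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy training-set box sizes: three small objects and three large ones (assumed data).
boxes_wh = [(10, 12), (12, 10), (11, 11), (50, 60), (60, 50), (55, 55)]
anchor_shapes = kmeans_anchors(boxes_wh, k=2)   # [(11.0, 11.0), (55.0, 55.0)]
```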
  During detection, an anchor box is considered to contain a target object if its confidence is high enough, usually above a set threshold.

How to filter Anchor?

  Since the same target object may be detected by multiple anchor boxes, non-maximum suppression (NMS) is applied to remove the duplicate detections and obtain the final detection result.
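  A minimal sketch of greedy NMS, combined with the confidence threshold described earlier, might look like this. Boxes are assumed to be in (x_min, y_min, x_max, y_max) form; the threshold values and the sample detections are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5, score_thr=0.0):
    """Return indices of kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted(
        (i for i in range(len(boxes)) if scores[i] >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = nms(boxes, scores, iou_thr=0.5, score_thr=0.3)   # [0, 2]
```

  Here box 1 is suppressed because it overlaps box 0 with an IoU of about 0.68, and box 3 is dropped by the confidence threshold before NMS runs.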

grid cell

  In object detection, a grid cell is one of the small cells obtained by dividing the input image into a grid.
  The YOLO algorithm performs detection on such a grid. For each grid cell, the model predicts whether the cell contains a target object, together with information such as the location and category of that object.
  For each grid cell, the model predicts the position and size of three bounding boxes, along with the category and confidence of the target object. Each bounding box has five attributes: x, y, w, h, and confidence. Here x and y are the offset of the box center relative to the upper-left corner of the grid cell it belongs to, w and h are the width and height of the box, and the confidence indicates how likely it is that the box contains a target object.
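  The relationship between a box center, its grid cell, and the x, y offsets can be sketched as follows. The image size (448 x 448), the grid size (7 x 7), and the function names are assumptions chosen for illustration; offsets here are expressed as fractions of a cell, which is one common convention.

```python
def encode_center(cx, cy, img_w, img_h, grid_size):
    """Map a box center in pixels to (row, col, x_offset, y_offset).

    Offsets are measured from the upper-left corner of the cell, in units of one cell.
    """
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    col = min(int(cx / cell_w), grid_size - 1)   # clamp centers on the right edge
    row = min(int(cy / cell_h), grid_size - 1)   # clamp centers on the bottom edge
    return row, col, cx / cell_w - col, cy / cell_h - row


def decode_center(row, col, x_off, y_off, img_w, img_h, grid_size):
    """Inverse of encode_center: recover the box center in pixels."""
    return ((col + x_off) * img_w / grid_size, (row + y_off) * img_h / grid_size)


# A 448 x 448 image split into a 7 x 7 grid gives 64 x 64 pixel cells.
encoded = encode_center(200, 100, img_w=448, img_h=448, grid_size=7)
# encoded == (1, 3, 0.125, 0.5625): row 1, column 3, offsets inside that cell
decoded = decode_center(*encoded, img_w=448, img_h=448, grid_size=7)
# decoded == (200.0, 100.0)
```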
  Dividing the input image into grid cells and predicting per cell is therefore one of the key techniques behind object detection in YOLO.
  Dividing the input image into grid cells can improve the model's efficiency and reduce the amount of computation: compared with detecting over the entire image at once, predicting per grid cell is more efficient.
  When detection is performed over the entire image, irrelevant areas are easily picked up, which leads to a high false detection rate. After the image is divided into grid cells, the model only needs to perform detection in the cells where a target object may appear, which can effectively improve detection accuracy.

ROI Box (Region of Interest)

  In object detection, an ROI (region of interest) is a region selected and cropped from the image because it is of interest. Detection algorithms generally use ROI boxes to delimit areas that may contain target objects, and then perform detection and classification within those areas.
  ROI boxes can be generated in various ways, for example from the candidate boxes produced by a Region Proposal Network (RPN), or drawn manually. Once an ROI box is obtained, it is processed as a sub-region of the input image, so that the target object can be detected and recognized more accurately.
  An ROI box usually contains not only the target object itself but also some surrounding context, which helps the algorithm detect and classify the target. The process of using ROI boxes for detection is called "region extraction" or "region pooling": a pooling method aggregates the information inside the ROI box into a fixed-size feature vector, which is then fed into subsequent classifiers or regressors for tasks such as classification or position regression.
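  To make "aggregate the information in the ROI box into a fixed-size feature" concrete, here is a minimal max-pooling sketch over a single-channel feature map. It assumes the ROI is already given in integer feature-map coordinates with an exclusive upper bound, and it is a simplified illustration rather than the implementation of any particular framework; `roi_max_pool` and `ceil_div` are hypothetical names.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the ROI of a 2D feature map into an out_h x out_w grid.

    roi is (x_min, y_min, x_max, y_max) in feature-map cells, upper bound exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor the bin start and ceil the bin end so bins cover the whole ROI.
        r_start = y0 + (i * roi_h) // out_h
        r_end = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + (j * roi_w) // out_w
            c_end = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
full = roi_max_pool(feature_map, (0, 0, 4, 4), out_h=2, out_w=2)   # [[6, 8], [14, 16]]
part = roi_max_pool(feature_map, (1, 1, 4, 4), out_h=2, out_w=2)   # [[11, 12], [15, 16]]
```

  ROIs of different sizes (4 x 4 and 3 x 3 above) both come out as 2 x 2, which is what lets a fixed-size classifier or regressor consume them.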

Origin blog.csdn.net/weixin_44463519/article/details/131269260