The Magic of Bounding Box Regression: Demystifying the Accurate and Efficient MPDIoU Loss Function



Summary

https://arxiv.org/pdf/2307.07662.pdf
Bounding box regression (BBR) is widely used in object detection and instance segmentation and is an important step in object localization. However, most existing bounding box regression loss functions fail to optimize when the predicted box has the same aspect ratio as the ground-truth box but completely different width and height values. To address this issue, we fully exploit the geometric features of horizontal rectangles and propose a novel bounding box similarity comparison metric, MPDIoU, based on the minimum point distance. It incorporates all relevant factors considered in existing loss functions, namely overlapping or non-overlapping area, center-point distance, and width and height deviations, while simplifying the calculation. On this basis, we propose an MPDIoU-based bounding box regression loss function $\mathcal{L}_{MPDIoU}$. Experimental results show that applying the MPDIoU loss to existing state-of-the-art instance segmentation (e.g., YOLACT) and object detection (e.g., YOLOv7) models trained on PASCAL VOC, MS COCO, and IIIT5k outperforms existing loss functions.

Keywords: object detection, instance segmentation, bounding box regression, loss function

1 Introduction

Object detection and instance segmentation are two important problems in computer vision, and they have attracted a lot of research interest in the past few years. Most state-of-the-art object detectors (e.g., the YOLO family [1, 2, 3, 4, 5, 6], Mask R-CNN [7], Dynamic R-CNN [8] and DETR [9]) rely on a bounding box regression (BBR) module to determine the location of objects. Based on this paradigm, a well-designed loss function is crucial to the success of BBR. So far, most existing BBR loss functions fall into two categories: $\ell_n$-norm loss functions and intersection over union (IoU)-based loss functions.

However, most existing loss functions for bounding box regression can take the same value for different prediction results, which reduces the convergence speed and accuracy of bounding box regression. Therefore, considering the advantages and disadvantages of existing bounding box regression loss functions and inspired by the geometric features of horizontal rectangles, we design a novel loss function for bounding box regression based on the minimum point distance, $\mathcal{L}_{MPDIoU}$, and use MPDIoU during bounding box regression as a new measure of the similarity between predicted and ground-truth bounding boxes. We also provide an easy-to-implement solution for computing MPDIoU between two axis-aligned rectangles, making it possible to incorporate MPDIoU as an evaluation metric into state-of-the-art object detection and instance segmentation algorithms. We test it on several mainstream object detection, scene text detection and instance segmentation datasets, namely PASCAL VOC [10], MS COCO [11], IIIT5k [12] and MTHv2 [13], to verify the performance of our proposed MPDIoU.

The contributions of this paper can be summarized as follows:

  • 1. We consider the advantages and disadvantages of the existing IoU-based losses and $\ell_n$-norm losses, and then propose an IoU loss based on the minimum point distance, called $\mathcal{L}_{MPDIoU}$, to address the problems of existing losses and obtain faster convergence and more accurate regression results.
  • 2. Extensive experiments are conducted on object detection, character-level scene text detection, and instance segmentation tasks. The excellent experimental results verify the superiority of the proposed MPDIoU loss. A detailed ablation study shows the effects of different loss functions and parameter values.

2. Related work

2.1. Object Detection and Instance Segmentation

In the past few years, many researchers have proposed a large number of deep learning-based object detection and instance segmentation methods. Overall, bounding box regression has been adopted as a fundamental component by many representative object detection and instance segmentation frameworks [14]. Among deep models for object detection, the R-CNN series [15][16][17] adopts two or three bounding box regression modules to obtain higher localization accuracy, while the YOLO series [2][3][6] and the SSD family [18][19][20] adopt a single one for faster inference. RepPoints [21] predicts several points to define a rectangular box. FCOS [22] localizes objects by predicting the Euclidean distances from sampled points to the top, bottom, left, and right sides of the ground-truth bounding box.

For instance segmentation, PolarMask [23] predicts the lengths of rays from a sampled point to the object boundary in n directions to segment the instance. Other detectors, such as RRPN [24] and R2CNN [25], add rotation-angle regression to detect arbitrarily oriented objects for remote sensing and scene text detection. Mask R-CNN [7] adds an extra instance mask branch on top of Faster R-CNN [15], and the recent state-of-the-art YOLACT [26] does the same on top of RetinaNet [27]. In summary, bounding box regression is one of the key components of state-of-the-art deep models for object detection and instance segmentation.

2.2. Scene Text Recognition

To address text detection and recognition for arbitrarily shaped scene text, ABCNet [28] and its improved version ABCNet v2 [29] use BezierAlign to transform arbitrarily shaped text into regular shapes. These approaches achieve great progress by unifying detection and recognition into an end-to-end trainable system with a rectification module. [30] proposed RoI Masking to extract features for arbitrary-shaped text recognition. Similarly to [30], [31] tries to use faster detectors for scene text detection. AE TextSpotter [32] uses recognition results to guide detection through a language model. Inspired by [33], [34] proposes a Transformer-based scene text detection method that provides instance-level text segmentation results.

2.3. Loss Functions for Bounding Box Regression

[Figure 1]

At first, the $\ell_{n}$-norm loss function was widely used for bounding box regression; it is simple but sensitive to varying scales. In YOLO v1 [35], the square roots of $w$ and $h$ are used to mitigate this effect, while YOLO v3 [2] uses $2 \times wh$. To better measure the difference between the ground-truth and predicted bounding boxes, the IoU loss has been used since UnitBox [36]. To ensure training stability, the Bounded-IoU loss [37] introduces an upper bound on IoU. For training deep models for object detection and instance segmentation, IoU-based metrics are recommended because they are more consistent with the evaluation metric than the $\ell_{n}$-norm [38, 37, 39]. The original IoU is the ratio of the intersection area to the union area of the predicted and ground-truth bounding boxes (as shown in Figure 1(a)), which can be expressed as

$$IoU=\frac{\left|\mathcal{B}_{gt} \cap \mathcal{B}_{prd}\right|}{\left|\mathcal{B}_{gt} \cup \mathcal{B}_{prd}\right|}, \tag{1}$$
where $\mathcal{B}_{gt}$ denotes the ground-truth bounding box and $\mathcal{B}_{prd}$ denotes the predicted bounding box. It can be seen that the original IoU only accounts for the overlapping region of the two bounding boxes and cannot distinguish between different non-overlapping configurations. As shown in Eq. (1), if $\left|\mathcal{B}_{gt} \cap \mathcal{B}_{prd}\right|=0$, then $IoU(\mathcal{B}_{gt}, \mathcal{B}_{prd})=0$. In this case, IoU cannot reflect whether the two boxes are close to each other or far apart. To address this problem, GIoU [39] was proposed, which can be expressed as
$$GIoU=IoU-\frac{\left|C-\mathcal{B}_{gt} \cup \mathcal{B}_{prd}\right|}{|C|}, \tag{2}$$
where $C$ denotes the smallest box covering both $\mathcal{B}_{gt}$ and $\mathcal{B}_{prd}$, and $|C|$ denotes the area of box $C$. Due to the penalty term introduced in the GIoU loss, the predicted box moves toward the target box even when the two boxes do not overlap. The GIoU loss has been applied to train state-of-the-art object detectors, such as YOLO v3 and Faster R-CNN, and achieves better performance than the MSE loss and the IoU loss. However, GIoU loses effectiveness when the predicted bounding box is completely contained within the ground-truth bounding box. To solve this problem, DIoU [40] was proposed, which considers the center-point distance between the predicted and ground-truth bounding boxes. The formula of DIoU can be expressed as
$$DIoU=IoU-\frac{\rho^{2}\left(\mathcal{B}_{gt}, \mathcal{B}_{prd}\right)}{C^{2}}, \tag{3}$$

where $\rho^{2}\left(\mathcal{B}_{gt}, \mathcal{B}_{prd}\right)$ denotes the squared Euclidean distance between the center points of the predicted and ground-truth bounding boxes (shown by the red dashed line in Figure 1(b)), and $C^{2}$ denotes the squared diagonal length of the smallest enclosing rectangle (shown by the black dashed line in Figure 1(b)). We can see that the objective of $\mathcal{L}_{DIoU}$ is to directly minimize the distance between the center points of the predicted and ground-truth bounding boxes. However, when the two center points coincide, DIoU degenerates to the original IoU. To solve this problem, CIoU was proposed, which considers both the center-point distance and the aspect ratio. The formula of CIoU can be written as:

$$CIoU=IoU-\frac{\rho^{2}\left(\mathcal{B}_{gt}, \mathcal{B}_{prd}\right)}{C^{2}}-\alpha V, \tag{4}$$

$$V=\frac{4}{\pi^{2}}\left(\arctan \frac{w^{gt}}{h^{gt}}-\arctan \frac{w^{prd}}{h^{prd}}\right)^{2}, \tag{5}$$

$$\alpha=\frac{V}{1-IoU+V}. \tag{6}$$

However, the definition of aspect ratio in CIoU is a relative value rather than an absolute value. To solve this problem, EIoU [41] is proposed based on DIoU, which is defined as follows:
$$EIoU=DIoU-\frac{\rho^{2}\left(w_{prd}, w_{gt}\right)}{\left(w^{c}\right)^{2}}-\frac{\rho^{2}\left(h_{prd}, h_{gt}\right)}{\left(h^{c}\right)^{2}} \tag{7}$$
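For reference, here is a minimal sketch (not from the paper) of how the IoU, GIoU, and DIoU values of Eqs. (1)–(3) can be computed for axis-aligned boxes, assuming the common (x1, y1, x2, y2) corner format:

```python
# Illustrative sketch: IoU, GIoU and DIoU of Eqs. (1)-(3) for axis-aligned boxes
# given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
def iou_giou_diou(box_gt, box_prd):
    x1g, y1g, x2g, y2g = box_gt
    x1p, y1p, x2p, y2p = box_prd

    # Intersection and union areas (Eq. (1))
    iw = max(0.0, min(x2g, x2p) - max(x1g, x1p))
    ih = max(0.0, min(y2g, y2p) - max(y1g, y1p))
    inter = iw * ih
    union = (x2g - x1g) * (y2g - y1g) + (x2p - x1p) * (y2p - y1p) - inter
    iou = inter / union

    # Smallest enclosing box C (Eq. (2))
    cx1, cy1 = min(x1g, x1p), min(y1g, y1p)
    cx2, cy2 = max(x2g, x2p), max(y2g, y2p)
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area

    # Squared center distance over squared enclosing-box diagonal (Eq. (3))
    rho2 = ((x1g + x2g) / 2 - (x1p + x2p) / 2) ** 2 + \
           ((y1g + y2g) / 2 - (y1p + y2p) / 2) ** 2
    c_diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - rho2 / c_diag2

    return iou, giou, diou
```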
However, as shown in Fig. 2, the above loss functions for bounding box regression lose effectiveness when the predicted and ground-truth bounding boxes have the same aspect ratio but different width and height values, which limits convergence speed and accuracy. Therefore, we design a novel loss function for bounding box regression, called $\mathcal{L}_{MPDIoU}$, which draws on $\mathcal{L}_{GIoU}$ [39], $\mathcal{L}_{DIoU}$ [40], $\mathcal{L}_{CIoU}$ [42], $\mathcal{L}_{EIoU}$ [41] and other loss functions, but offers higher efficiency and accuracy for bounding box regression. Existing loss functions do not take full advantage of the geometric properties of bounding box regression. We therefore propose the MPDIoU loss, which minimizes the distances between the top-left and bottom-right corner points of the predicted and ground-truth bounding boxes, to better train deep models for object detection, character-level scene text detection, and instance segmentation.
[Figure 2]

3. Intersection over Union with Minimum Point Distance

After analyzing the advantages and disadvantages of the above IoU-based loss functions, we started to think about how to improve the accuracy and efficiency of bounding box regression. In general, we use the coordinates of the upper left and lower right points to define a unique rectangle. Inspired by the geometric properties of bounding boxes, we design a novel IoU metric named MPDIoU to minimize the distance between the top-left and bottom-right points between the predicted bounding box and the groundtruth bounding box. The computation of MPDIoU is summarized in Algorithm 1.

[Algorithm 1]

In summary, our proposed MPDIoU simplifies the similarity comparison between two bounding boxes and can be adapted for overlapping or non-overlapping bounding box regression. Therefore, MPDIoU can be an appropriate substitute for IoU for all performance metrics used in 2D/3D computer vision tasks. This paper only focuses on 2D object detection and instance segmentation, we can easily use MPDIoU as a metric and loss. The extension to the axis-aligned 3D case is left for future work.
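A minimal sketch of the MPDIoU computation described above, assuming axis-aligned boxes in (x1, y1, x2, y2) corner format and, following the theorem statement later in this section, normalizing the corner distances by $w^2 + h^2$ of the input image:

```python
# Illustrative sketch of MPDIoU and L_MPDIoU; box format is (x1, y1, x2, y2),
# img_w / img_h are the input image width and height used as the normalizer.
def mpdiou(box_gt, box_prd, img_w, img_h):
    x1g, y1g, x2g, y2g = box_gt
    x1p, y1p, x2p, y2p = box_prd

    # IoU of Eq. (1)
    iw = max(0.0, min(x2g, x2p) - max(x1g, x1p))
    ih = max(0.0, min(y2g, y2p) - max(y1g, y1p))
    inter = iw * ih
    union = (x2g - x1g) * (y2g - y1g) + (x2p - x1p) * (y2p - y1p) - inter
    iou = inter / union

    # Squared distances between the top-left and bottom-right corner pairs
    d1_sq = (x1p - x1g) ** 2 + (y1p - y1g) ** 2
    d2_sq = (x2p - x2g) ** 2 + (y2p - y2g) ** 2
    d_sq = img_w ** 2 + img_h ** 2  # normalizer

    return iou - d1_sq / d_sq - d2_sq / d_sq


def mpdiou_loss(box_gt, box_prd, img_w, img_h):
    # Eq. (9): L_MPDIoU = 1 - MPDIoU
    return 1.0 - mpdiou(box_gt, box_prd, img_w, img_h)
```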

During the training phase, each bounding box $\mathcal{B}_{prd}=\left[x^{prd}, y^{prd}, w^{prd}, h^{prd}\right]^{T}$ predicted by the model is forced to approach its ground-truth bounding box $\mathcal{B}_{gt}=\left[x^{gt}, y^{gt}, w^{gt}, h^{gt}\right]^{T}$ by minimizing the following loss function:
$$\mathcal{L}=\min_{\Theta} \sum_{\mathcal{B}_{gt} \in \mathbb{B}_{gt}} \mathcal{L}\left(\mathcal{B}_{gt}, \mathcal{B}_{prd} \mid \Theta\right) \tag{8}$$

where $\mathbb{B}_{gt}$ is the set of ground-truth bounding boxes and $\Theta$ is the parameter set of the deep model used for regression. A typical form of $\mathcal{L}$ is the $\ell_n$-norm, e.g., the mean squared error (MSE) loss and the Smooth-$\ell_1$ loss [43], which have been widely used in object detection [44], pedestrian detection [45, 46], scene text detection [34, 47], 3D object detection [48, 49], pose estimation [50, 51] and instance segmentation [52, 26]. However, recent research shows that $\ell_n$-norm-based loss functions are inconsistent with the IoU evaluation metric, and IoU-based loss functions have therefore been proposed [53, 37, 39]. Following the definition of MPDIoU in the previous section, we define the MPDIoU-based loss function as follows:
$$\mathcal{L}_{MPDIoU}=1-MPDIoU \tag{9}$$
Therefore, all factors considered by existing bounding box regression loss functions can be determined from the coordinates of the top-left and bottom-right corner points. The conversion formulas are as follows:
$$|C|=\left(\max \left(x_{2}^{gt}, x_{2}^{prd}\right)-\min \left(x_{1}^{gt}, x_{1}^{prd}\right)\right) \times\left(\max \left(y_{2}^{gt}, y_{2}^{prd}\right)-\min \left(y_{1}^{gt}, y_{1}^{prd}\right)\right), \tag{10}$$

$$x_{c}^{gt}=\frac{x_{1}^{gt}+x_{2}^{gt}}{2}, \quad y_{c}^{gt}=\frac{y_{1}^{gt}+y_{2}^{gt}}{2}, \quad x_{c}^{prd}=\frac{x_{1}^{prd}+x_{2}^{prd}}{2}, \quad y_{c}^{prd}=\frac{y_{1}^{prd}+y_{2}^{prd}}{2}, \tag{11}$$

$$w_{gt}=x_{2}^{gt}-x_{1}^{gt}, \quad h_{gt}=y_{2}^{gt}-y_{1}^{gt}, \quad w_{prd}=x_{2}^{prd}-x_{1}^{prd}, \quad h_{prd}=y_{2}^{prd}-y_{1}^{prd}. \tag{12}$$
where $|C|$ is the area of the smallest rectangle enclosing $\mathcal{B}_{gt}$ and $\mathcal{B}_{prd}$; $\left(x_{c}^{gt}, y_{c}^{gt}\right)$ and $\left(x_{c}^{prd}, y_{c}^{prd}\right)$ denote the center-point coordinates of the ground-truth and predicted bounding boxes, respectively; $w_{gt}$ and $h_{gt}$ are the width and height of the ground-truth bounding box; and $w_{prd}$ and $h_{prd}$ are the width and height of the predicted bounding box.

According to formulas (10)-(12), all factors considered in existing loss functions, namely the non-overlapping area, the center-point distance, and the width and height deviations, can be determined from the coordinates of the top-left and bottom-right corner points. This means that our proposed $\mathcal{L}_{MPDIoU}$ not only accounts for these factors but also simplifies the calculation, as sketched below.
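As a small illustration (assuming the same (x1, y1, x2, y2) corner format as above), the quantities used by previous losses can all be recovered from the two corner points alone:

```python
# Illustrative only: enclosing-box area, centers and sizes from corner points,
# following Eqs. (10)-(12).
def box_factors(box_gt, box_prd):
    x1g, y1g, x2g, y2g = box_gt
    x1p, y1p, x2p, y2p = box_prd

    # Eq. (10): area of the smallest enclosing rectangle |C|
    c_area = (max(x2g, x2p) - min(x1g, x1p)) * (max(y2g, y2p) - min(y1g, y1p))

    # Eq. (11): center points
    center_gt = ((x1g + x2g) / 2, (y1g + y2g) / 2)
    center_prd = ((x1p + x2p) / 2, (y1p + y2p) / 2)

    # Eq. (12): widths and heights
    w_gt, h_gt = x2g - x1g, y2g - y1g
    w_prd, h_prd = x2p - x1p, y2p - y1p

    return c_area, center_gt, center_prd, (w_gt, h_gt), (w_prd, h_prd)
```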

According to Theorem 3.1, if the predicted bounding box and the ground-truth bounding box have the same aspect ratio, a predicted bounding box lying inside the ground-truth bounding box has a lower $\mathcal{L}_{MPDIoU}$ value than one lying outside it. This property benefits the accuracy of bounding box regression, since it favors predicted boxes with less redundancy.

Theorem 3.1. Let the ground-truth bounding box be $\mathcal{B}_{gt}$ and the two predicted bounding boxes be $\mathcal{B}_{prd1}$ and $\mathcal{B}_{prd2}$. The width and height of the input image are $w$ and $h$, respectively. Suppose the top-left and bottom-right corner coordinates of $\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$ and $\mathcal{B}_{prd2}$ are $\left(x_{1}^{gt}, y_{1}^{gt}, x_{2}^{gt}, y_{2}^{gt}\right)$, $\left(x_{1}^{prd1}, y_{1}^{prd1}, x_{2}^{prd1}, y_{2}^{prd1}\right)$ and $\left(x_{1}^{prd2}, y_{1}^{prd2}, x_{2}^{prd2}, y_{2}^{prd2}\right)$, respectively. Then their widths and heights can be expressed as $w_{gt}=x_{2}^{gt}-x_{1}^{gt}$, $h_{gt}=y_{2}^{gt}-y_{1}^{gt}$; $w_{prd1}=x_{2}^{prd1}-x_{1}^{prd1}$, $h_{prd1}=y_{2}^{prd1}-y_{1}^{prd1}$; and $w_{prd2}=x_{2}^{prd2}-x_{1}^{prd2}$, $h_{prd2}=y_{2}^{prd2}-y_{1}^{prd2}$. Suppose $w_{prd1}=k \cdot w_{gt}$, $h_{prd1}=k \cdot h_{gt}$, $w_{prd2}=\frac{1}{k} \cdot w_{gt}$ and $h_{prd2}=\frac{1}{k} \cdot h_{gt}$, where $k>1$ and $k \in N^{*}$.

Assume further that the center points of $\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$ and $\mathcal{B}_{prd2}$ all coincide. Then GIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$) = GIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd2}$), DIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$) = DIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd2}$), CIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$) = CIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd2}$) and EIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$) = EIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd2}$), but MPDIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd1}$) < MPDIoU($\mathcal{B}_{gt}$, $\mathcal{B}_{prd2}$).
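As a quick numeric check (illustrative values only: a 10×10 ground-truth box centered in an assumed 100×100 image, with k = 2), the two predictions share the same IoU but differ in MPDIoU, and the inner box is preferred:

```python
# Illustrative numeric check of Theorem 3.1 with assumed values:
# ground truth 10x10 centered at (50, 50) in a 100x100 image, k = 2.
img_w = img_h = 100.0
d_sq = img_w ** 2 + img_h ** 2

gt   = (45.0, 45.0, 55.0, 55.0)   # 10 x 10
prd1 = (40.0, 40.0, 60.0, 60.0)   # 20 x 20 (k * size, encloses gt)
prd2 = (47.5, 47.5, 52.5, 52.5)   #  5 x 5  (size / k, inside gt)

def corner_dist_sq(a, b):
    d1 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2   # top-left corners
    d2 = (a[2] - b[2]) ** 2 + (a[3] - b[3]) ** 2   # bottom-right corners
    return d1, d2

# Both predictions have IoU = 0.25 with the ground truth (100/400 and 25/100).
iou = 0.25
for name, prd in (("prd1", prd1), ("prd2", prd2)):
    d1, d2 = corner_dist_sq(gt, prd)
    print(name, iou - d1 / d_sq - d2 / d_sq)
# prd1: 0.25 - 100/20000 = 0.2450
# prd2: 0.25 -  25/20000 = 0.2488  -> the inner box gets the higher MPDIoU
```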
[Algorithm 2]
Consider a ground-truth bounding box $\mathcal{B}_{gt}$ that is a rectangle with area greater than zero, i.e., $A^{gt}>0$. The conditions in Algorithm 2 (1) and Algorithm 2 (6) ensure that the predicted area $A^{prd}$ and the intersection area $\mathcal{I}$ are non-negative, i.e., $A^{prd} \geq 0$ and $\mathcal{I} \geq 0$, for any predicted bounding box $\mathcal{B}_{prd}=\left(x_{1}^{prd}, y_{1}^{prd}, x_{2}^{prd}, y_{2}^{prd}\right) \in \mathbb{R}^{4}$. Moreover, for any predicted bounding box, the union area $\mathcal{U}$ is always no smaller than the intersection area $\mathcal{I}$, i.e., $\mathcal{U} \geq \mathcal{I}$. Therefore, $\mathcal{L}_{MPDIoU}$ is always bounded, i.e., $0 \leq \mathcal{L}_{MPDIoU}<3$, for any predicted bounding box $\mathcal{B}_{prd} \in \mathbb{R}^{4}$.
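A short sketch of why this bound holds, assuming both boxes lie within the input image so that each squared corner distance $d_1^2$, $d_2^2$ is smaller than $d^2 = w^2 + h^2$:

$$\mathcal{L}_{MPDIoU}=1-IoU+\frac{d_{1}^{2}}{d^{2}}+\frac{d_{2}^{2}}{d^{2}}, \qquad 0 \leq IoU \leq 1, \qquad 0 \leq \frac{d_{i}^{2}}{d^{2}}<1 \;(i=1,2),$$

so $0 \leq \mathcal{L}_{MPDIoU} < 1+1+1 = 3$, and $\mathcal{L}_{MPDIoU}=0$ exactly when the two boxes coincide ($IoU=1$, $d_1=d_2=0$).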

Behavior of MPDIoU when $IoU=0$: for the MPDIoU loss we have $\mathcal{L}_{MPDIoU}=1-MPDIoU=1+\frac{d_{1}^{2}}{d^{2}}+\frac{d_{2}^{2}}{d^{2}}-IoU$. In the case where $\mathcal{B}_{gt}$ and $\mathcal{B}_{prd}$ do not overlap, i.e., $IoU=0$, the MPDIoU loss simplifies to $\mathcal{L}_{MPDIoU}=1+\frac{d_{1}^{2}}{d^{2}}+\frac{d_{2}^{2}}{d^{2}}$. In this case, minimizing $\mathcal{L}_{MPDIoU}$ amounts to minimizing $\frac{d_{1}^{2}}{d^{2}}+\frac{d_{2}^{2}}{d^{2}}$. This term is a normalized measure between 0 and 2, i.e., $0 \leq \frac{d_{1}^{2}}{d^{2}}+\frac{d_{2}^{2}}{d^{2}}<2$.

4. Experimental results

We incorporate the new bounding box regression loss $\mathcal{L}_{MPDIoU}$ into the most popular 2D object detector and instance segmentation model, YOLO v7 [6] and YOLACT [26], to evaluate it. To do so, we replace their default regression losses with $\mathcal{L}_{MPDIoU}$: the Smooth-$\ell_1$ loss in YOLACT [26] and $\mathcal{L}_{CIoU}$ in YOLO v7 [6]. We also compare the baseline losses with $\mathcal{L}_{GIoU}$.
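As an illustration of the kind of change this involves (a hypothetical sketch, not the actual YOLO v7 or YOLACT code; the tensor shapes and the surrounding loss terms are assumptions), replacing the regression loss amounts to swapping one term in the detector's total loss:

```python
import torch

# Hypothetical sketch of the box-regression term; boxes are (N, 4) tensors
# in (x1, y1, x2, y2) format, img_w / img_h are the input image size.
def mpdiou_loss(pred_boxes, gt_boxes, img_w, img_h):
    inter_w = (torch.min(pred_boxes[:, 2], gt_boxes[:, 2]) -
               torch.max(pred_boxes[:, 0], gt_boxes[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred_boxes[:, 3], gt_boxes[:, 3]) -
               torch.max(pred_boxes[:, 1], gt_boxes[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter).clamp(min=1e-7)

    d_sq = img_w ** 2 + img_h ** 2
    d1_sq = (pred_boxes[:, 0] - gt_boxes[:, 0]) ** 2 + (pred_boxes[:, 1] - gt_boxes[:, 1]) ** 2
    d2_sq = (pred_boxes[:, 2] - gt_boxes[:, 2]) ** 2 + (pred_boxes[:, 3] - gt_boxes[:, 3]) ** 2
    return (1.0 - (iou - d1_sq / d_sq - d2_sq / d_sq)).mean()  # Eq. (9)

# In the detector's loss computation, only the regression term changes, e.g.:
#   total_loss = cls_loss + obj_loss + mpdiou_loss(pred_boxes, gt_boxes, w, h)
```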

4.1. Experimental settings

The experimental environment can be summarized as follows: 32 GB of RAM, Windows 11, an Intel i9-12900K CPU, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. For a fair comparison, all experiments are implemented using PyTorch [54].

4.2. Dataset

We train all object detection and instance segmentation baselines and report results on two standard benchmarks: the PASCAL VOC [10] and Microsoft Common Objects in Context (MS COCO 2017) [11] challenges. Their training protocols and evaluation details are explained in their respective sections.

PASCAL VOC 2007 & 2012: The Pascal Visual Object Classes (VOC) [10] benchmark is one of the most widely used datasets for classification, object detection, and semantic segmentation, containing about 9963 images. The training and test sets are split 50% each, and objects from 20 predefined categories are annotated with horizontal bounding boxes. Because the small number of instance segmentation images leads to weak performance, we only provide instance segmentation results for models trained on MS COCO 2017.

MS COCO: MS COCO [11] is a widely used benchmark for image captioning, object detection, and instance segmentation, containing more than 200,000 images, including training, validation, and test datasets, and more than 500,000 annotated object instances, from 80 categories.

IIIT5k: IIIT5k [12] is one of the popular scene text detection benchmarks with character-level annotations, consisting of 5000 cropped word images collected from the Internet. Character classes include English letters and numbers. There are 2000 images for training and 3000 images for testing.

MTHv2: MTHv2 [13] is one of the popular OCR benchmarks with character-level annotations. Character categories include simplified and traditional Chinese characters. It contains more than 3000 images of Chinese historical documents and more than 1 million Chinese characters.

4.3. Evaluation protocol

In this paper, we report all results using the same performance metrics as the MS COCO 2018 challenge [11], including mAP over different class labels at a specific IoU threshold used to determine true and false positives. The main object detection metrics used in our experiments are reported as precision and [email protected]:0.95. We report AP75 values, i.e., with an IoU threshold of 0.75, in the tables. For instance segmentation, the main metrics are AP and AR, i.e., mAP and mAR averaged over the IoU thresholds IoU = {0.5, 0.55, ..., 0.95}.
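For reference, a minimal sketch of how these metrics can be obtained with pycocotools (the annotation and detection file names below are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: a COCO-format annotation file and a detection results file.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections_mpdiou.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")  # use "segm" for instance masks
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

# coco_eval.stats[0] -> AP@[0.50:0.95], stats[1] -> AP50, stats[2] -> AP75
```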

All object detection and instance segmentation baselines are also evaluated using the test sets of MS COCO 2017 and PASCAL VOC 2007 & 2012. The results will be shown in the next section.

4.4. Experimental results of object detection

Training protocol. We use the original Darknet implementation of YOLO v7 released by [6]. For the baseline results (trained with the GIoU loss), we chose DarkNet-608 as the backbone in all experiments and trained it exactly according to its training protocol, using the reported default parameters and number of iterations for each benchmark. To train YOLO v7 with the GIoU, DIoU, CIoU, EIoU and MPDIoU losses, we simply replace the bounding box regression IoU loss with the $\mathcal{L}_{GIoU}$, $\mathcal{L}_{DIoU}$, $\mathcal{L}_{CIoU}$, $\mathcal{L}_{EIoU}$ and $\mathcal{L}_{MPDIoU}$ losses, which are described in Section 2.

Following the training protocol of the original code, we trained YOLOv7 [6] for up to 150 epochs with each loss on the training and validation sets of the dataset. We set the tolerance of the early-stopping mechanism to 5 to reduce training time and save the best-performing model. Performance is evaluated on the test set of PASCAL VOC 2007 & 2012 using the best checkpoint for each loss. The results are reported in Table 1.
[Table 1]

4.5. Experimental results of character-level scene text recognition

Training protocol. We use a training protocol similar to that of the object detection experiments. Following the training protocol of the original code, we trained YOLOv7 [6] for up to 30 epochs with each loss on the training and validation sets of each dataset. Performance is evaluated on the IIIT5K [12] and MTHv2 [55] test sets using the best checkpoint for each loss. The results are reported in Table 2 and Table 3.
[Table 2]

As we can see, the results in Tab. 2 and Tab. 3 show that training YOLO v7 with $\mathcal{L}_{MPDIoU}$ as the regression loss significantly improves its performance compared with the existing regression losses, including $\mathcal{L}_{GIoU}$, $\mathcal{L}_{DIoU}$, $\mathcal{L}_{CIoU}$ and $\mathcal{L}_{EIoU}$. Our proposed $\mathcal{L}_{MPDIoU}$ shows excellent performance on character-level scene text detection.
[Table 3]

4.6. Experimental Results of Instance Segmentation

Training protocol. We use YOLACT [26], with the latest PyTorch implementation released by the University of California. For the baseline results (trained with $\mathcal{L}_{GIoU}$), we chose ResNet-50 as the backbone network architecture of YOLACT in all experiments and trained it according to its training protocol, using the reported default parameters and number of iterations for each benchmark. To train YOLACT with the GIoU, DIoU, CIoU, EIoU and MPDIoU losses, we replace the Smooth-$\ell_1$ loss in its final bounding box refinement stage with the $\mathcal{L}_{GIoU}$, $\mathcal{L}_{DIoU}$, $\mathcal{L}_{CIoU}$, $\mathcal{L}_{EIoU}$ and $\mathcal{L}_{MPDIoU}$ losses, respectively, which are described in Section 2. Similar to the YOLO v7 experiments, we replace the original bounding box regression loss with our proposed $\mathcal{L}_{MPDIoU}$.

As shown in Figure 8(c), using $\mathcal{L}_{GIoU}$, $\mathcal{L}_{DIoU}$, $\mathcal{L}_{CIoU}$ or $\mathcal{L}_{EIoU}$ as the regression loss can slightly improve the performance of YOLACT on MS COCO 2017. However, the improvement is significant when training with $\mathcal{L}_{MPDIoU}$. We also visualize the relationship between mask AP and different IoU threshold values, i.e., $0.5 \leq IoU \leq 0.95$.

Similar to the experiments above, using $\mathcal{L}_{MPDIoU}$ as the regression loss improves detection accuracy beyond the existing loss functions. As shown in Table 4, our proposed $\mathcal{L}_{MPDIoU}$ performs better than existing loss functions on most metrics. However, the magnitude of improvement between different losses is smaller than in previous experiments. This may be due to the following factors. First, the detection anchor boxes in YOLACT [26] are denser than in YOLO v7 [6], resulting in fewer cases (such as non-overlapping bounding boxes) where $\mathcal{L}_{MPDIoU}$ has an advantage over $\mathcal{L}_{IoU}$. Second, existing loss functions for bounding box regression have already improved considerably over the past few years, which means the remaining room for accuracy improvement is limited, although there is still much room for improvement in efficiency.
[Table 4]

We also compare the trends of the bbox loss and the AP value during training when YOLACT uses different regression loss functions. As shown in Figure 8(a) and (b), training with $\mathcal{L}_{MPDIoU}$ outperforms most existing loss functions (i.e., $\mathcal{L}_{GIoU}$ and $\mathcal{L}_{DIoU}$), achieving higher accuracy and faster convergence. Although the bbox loss and AP values fluctuate considerably, our proposed $\mathcal{L}_{MPDIoU}$ performs better at the end of training.
[Figure 8]

To better demonstrate the performance of different bounding box regression loss functions for instance segmentation, we provide some visualizations, as shown in Figures 5 and 9. We can see that, compared with $\mathcal{L}_{GIoU}$, $\mathcal{L}_{DIoU}$, $\mathcal{L}_{CIoU}$ and $\mathcal{L}_{EIoU}$, our $\mathcal{L}_{MPDIoU}$ provides instance segmentation results with less redundancy and higher precision.
[Figures 5 and 9]

5 Conclusion

In this paper, we introduce a new metric, MPDIoU, based on the minimum point distance for comparing any two arbitrary bounding boxes. We demonstrate that this new metric has all the appealing properties of existing IoU-based metrics while simplifying its computation. It will be a better choice in all 2D/3D vision tasks that rely on IoU metrics.

We also propose a loss function for bounding box regression called $\mathcal{L}_{MPDIoU}$. Using a standard performance measure together with our proposed MPDIoU, we apply it to state-of-the-art object detection and instance segmentation algorithms, resulting in improved performance on popular benchmarks such as PASCAL VOC, MS COCO, MTHv2 and IIIT5K. Since the optimal loss for a metric is the metric itself, our MPDIoU loss can be used as the optimal bounding box regression loss in all applications requiring 2D bounding box regression.

For future work, we hope to further experiment on some downstream tasks based on object detection and instance segmentation, including scene text detection, person re-identification, etc. Through the above experiments, we can further verify the generalization ability of our proposed loss function.
