[MOT Study Notes] JDE Loss Function Detailed Explanation

I was recently writing up a paper and, while organising my notes, went through the JDE algorithm again. The loss function part of the original JDE paper is described rather vaguely, so this post spells it out in detail.

(1) Loss function

Unlike YOLO v3, JDE uses a dual-threshold rule to decide whether an anchor is foreground or background: if the IoU between an anchor and a ground-truth box is greater than 0.5 it is treated as a match (foreground), and if the IoU is less than 0.4 it is treated as a mismatch (background); anchors falling between the two thresholds are ignored. Experiments suggest that this rule helps suppress false alarms (FP). The foreground/background classification loss $\mathcal{L}_{\alpha}$ uses cross-entropy, and the bounding-box regression loss $\mathcal{L}_{\beta}$ uses Smooth L1, as shown in formulas (4-1) and (4-2).
$$\mathcal{L}_{\alpha}(x,y) = \frac{1}{N}\sum_{n=1}^{N}\left[-\sum_{c=1}^{C} y_{n,c}\log\frac{e^{x_{n,c}}}{\sum_{i=1}^{C}e^{x_{n,i}}}\right] \tag{4-1}$$
$$\mathcal{L}_{\beta}(x,y) = \frac{1}{N}\sum_{n=1}^{N}\left[\frac{1}{2}(x_n-y_n)^2\,\mathbb{I}(|x_n-y_n|<1) + \left(|x_n-y_n|-0.5\right)\mathbb{I}(|x_n-y_n|\ge 1)\right] \tag{4-2}$$
where $x$ denotes the prediction, $y$ the ground truth, $N$ the batch size, and $e$ the base of the natural logarithm. In formula (4-1), $x_{n,c}$ is the score predicted for sample $x_n$ belonging to category $c$, and $y_{n,c}\in\{0,1\}$ indicates whether label $y_n$ belongs to category $c$. In formula (4-2), $\mathbb{I}(\cdot)$ is the indicator function.
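To make this concrete, here is a minimal PyTorch sketch of the dual-threshold assignment together with the losses of formulas (4-1) and (4-2). This is not the official JDE code: the function names, default thresholds and tensor shapes are illustrative assumptions, and PyTorch's built-in `F.cross_entropy` / `F.smooth_l1_loss` stand in for the cross-entropy and Smooth L1 terms.

```python
import torch
import torch.nn.functional as F

def dual_threshold_assign(iou, fg_thr=0.5, bg_thr=0.4):
    """Dual-threshold rule described above (illustrative helper, not JDE's own).

    iou: (num_anchors,) best IoU of each anchor against the ground-truth boxes.
    Returns labels: 1 = foreground, 0 = background, -1 = ignored (between thresholds).
    """
    labels = torch.full_like(iou, -1, dtype=torch.long)
    labels[iou > fg_thr] = 1
    labels[iou < bg_thr] = 0
    return labels

def detection_losses(cls_logits, box_pred, box_target, labels):
    """Formulas (4-1) and (4-2): cross-entropy for fg/bg, Smooth L1 for boxes.

    cls_logits: (num_anchors, 2) foreground/background scores.
    box_pred, box_target: (num_anchors, 4) regression values.
    labels: output of dual_threshold_assign.
    """
    keep = labels >= 0                          # drop the ignored anchors
    loss_alpha = F.cross_entropy(cls_logits[keep], labels[keep])   # formula (4-1)

    fg = labels == 1                            # regress only on foreground anchors
    loss_beta = F.smooth_l1_loss(box_pred[fg], box_target[fg])     # formula (4-2)
    return loss_alpha, loss_beta
```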

For the appearance-embedding task, the desired effect is that embeddings of different objects are far apart in the metric space, while embeddings of the same object stay close. JDE treats this as a classification problem: if the number of distinct target identities across the whole training video set is $n_{ID}$, the algorithm should classify each target's embedding vector into one of $n_{ID}$ categories.

Suppose an anchor instance in a mini-batch has embedding $f^T$; the positive sample (i.e., the class weight of its true identity) is $f^+$, which corresponds to the anchor $f^T$, and the negative samples (i.e., the weights of all other identity classes) are $f^-$. When computing the loss, all negative classes are taken into account. Let $f^Tf^+$ denote the similarity score between the anchor and the positive class, and $f^Tf_j^-$ its similarity to the $j$-th negative class; the loss then takes a form similar to the cross-entropy function, as in formula (4-3):
$$\mathcal{L}_{\gamma}(x,y) = \frac{1}{N}\sum_{i=1}^{N}\left[-\log\frac{e^{f_i^T f_i^+}}{e^{f_i^T f_i^+} + \sum_j e^{f_i^T f_{i,j}^-}}\right] \tag{4-3}$$
where the subscript $i$ indexes the $i$-th sample in the batch.
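Under this classification view, formula (4-3) reduces to an ordinary softmax cross-entropy whose logits are the inner products between the embedding and the weight vectors of the $n_{ID}$ identity classes. The following PyTorch sketch shows one way to realise it; the class name, `embed_dim` and `n_id` are illustrative assumptions, not the JDE reference implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Appearance-branch sketch: classify each embedding into one of n_ID identities."""

    def __init__(self, embed_dim=512, n_id=1000):    # dimensions are illustrative
        super().__init__()
        # one weight vector per identity; its rows play the roles of f^+ and f_j^-
        self.classifier = nn.Linear(embed_dim, n_id, bias=False)

    def forward(self, embeddings, id_labels):
        # logits[i, c] = f_i^T w_c, so softmax cross-entropy reproduces formula (4-3)
        logits = self.classifier(embeddings)
        return F.cross_entropy(logits, id_labels)
```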

(2) Loss balance

JDE learns three tasks simultaneously: classification, bounding-box regression, and appearance-embedding learning, so how to balance the three tasks is an important issue. Most other algorithms simply take a fixed weighted sum of the individual losses, whereas JDE adopts an automatic multi-task weighting scheme to determine the weight of each loss term. Specifically, following the concept of task-independent uncertainty proposed in [39], the loss weights are learned as network parameters.
Therefore, the total loss function is shown in formula (4-4):
$$\mathcal{L}_{total} = \sum_{i=1}^{M}\sum_{j=\alpha,\beta,\gamma}\frac{1}{2}\left(\frac{1}{e^{s_j^i}}\mathcal{L}_j^i + s_j^i\right) \tag{4-4}$$
where $s_j^i$ is the task-independent uncertainty for task $j$ of prediction head $i$ and is a learnable parameter. Since the inner sum already runs over the three tasks $j=\alpha,\beta,\gamma$ (classification, bounding-box regression, and appearance-embedding learning), the outer index $i$ runs over the $M$ prediction heads; JDE predicts at three scales, so $M=3$.
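Below is a minimal sketch of how the learnable weights in formula (4-4) can be realised in PyTorch; the class name and tensor shapes are assumptions rather than the JDE reference implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Formula (4-4): one learnable s_j^i per (prediction head i, task j) pair."""

    def __init__(self, num_heads=3, num_tasks=3):
        super().__init__()
        # s initialised to 0, so every loss term starts with weight exp(0)/2 = 1/2
        self.s = nn.Parameter(torch.zeros(num_heads, num_tasks))

    def forward(self, losses):
        # losses: (num_heads, num_tasks) tensor stacked from the scalar losses
        # L_alpha, L_beta, L_gamma of each prediction head
        return 0.5 * (torch.exp(-self.s) * losses + self.s).sum()
```

Note that $1/e^{s_j^i}$ is simply $e^{-s_j^i}$, so a large uncertainty $s_j^i$ down-weights its loss term, while the additive $s_j^i$ term keeps the learned weights from collapsing to zero.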


Origin blog.csdn.net/wjpwjpwjp0831/article/details/124538565