【Summary and Analysis of CV Knowledge Points】|Loss Function

【Written in front】

This series of articles is intended for readers who already know Python and have some programming background, as well as those looking for jobs in artificial intelligence, algorithms, or machine learning. The series covers deep learning, machine learning, computer vision, feature engineering, and more. I believe it can help beginners get started with deep learning quickly, and help job seekers get a complete picture of the algorithm knowledge points.

1. Common loss functions and their application scenarios in machine learning?

For classification problems:

0-1 loss function:

$$L_{0-1}(f, y)=1_{fy \leq 0}$$

The 0-1 loss function can intuitively describe the error rate of classification, but because of its non-convex and non-smooth characteristics, it is difficult for the algorithm to directly optimize it

Hinge loss function (SVM)

$$L_{\text{hinge}}(f, y)=\max \{0, 1-fy\}$$

The Hinge loss function is a proxy loss function for the 0-1 loss, and it is also a tight upper bound of it: when $fy \geq 1$, the model is not penalized. The hinge loss is not differentiable at $fy=1$, so it cannot be optimized with plain gradient descent; subgradient descent must be used instead.

Logistic loss function

$$L_{\text{logistic}}(f, y)=\log _{2}(1+\exp (-fy))$$

The Logistic loss function is another proxy loss function for the 0-1 loss; it is also a convex upper bound of the 0-1 loss, and it is smooth everywhere. However, this loss penalizes all sample points, so it is more sensitive to outliers. When the predicted value $f \in [-1, 1]$, another commonly used proxy loss function is the cross-entropy loss function.

Cross-Entropy loss function

$$L_{\text{cross entropy}}(f, y)=-\log _{2}\left(\frac{1+fy}{2}\right)$$

The cross-entropy loss function is also a smooth convex upper bound of the 0-1 loss function

Exponential loss function (AdaBoost)

$$L_{\text{exponential}}(f, y)=e^{-fy}$$

The exponential loss function is the loss function used in AdaBoost. Similarly, it is sensitive to outliers and not robust enough

Logistic loss function (LR)

$$L_{\text{logloss}}(y, p(y \mid x))=-\log p(y \mid x)$$

The expression for $p(y \mid x)$ in logistic regression is as follows:

$$P\left(Y=y^{(i)} \mid x^{(i)}, \theta\right)=\begin{cases}h_{\theta}\left(x^{(i)}\right)=\dfrac{1}{1+e^{-\theta^{T} x^{(i)}}}, & y^{(i)}=1 \\ 1-h_{\theta}\left(x^{(i)}\right)=\dfrac{e^{-\theta^{T} x^{(i)}}}{1+e^{-\theta^{T} x^{(i)}}}, & y^{(i)}=0\end{cases}$$

Combining the above two formulas, the probability that the i-th sample is correctly predicted can be obtained:

$$P\left(Y=y^{(i)} \mid x^{(i)}, \theta\right)=\left(h_{\theta}\left(x^{(i)}\right)\right)^{y^{(i)}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)^{1-y^{(i)}}$$

Since all samples are generated independently, for the whole data set we have

$$P(Y \mid X, \theta)=\prod_{i=1}^{N}\left(h_{\theta}\left(x^{(i)}\right)\right)^{y^{(i)}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)^{1-y^{(i)}}$$

The final loss function is as follows:

$$\mathcal{J}(\theta)=-\sum_{i=1}^{N}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$$
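A minimal numerical sketch of this negative log-likelihood (the tensor shapes are illustrative; `F.binary_cross_entropy` on the sigmoid outputs computes the same quantity):

```python
import torch
import torch.nn.functional as F

theta = torch.randn(5)                      # model parameters
X = torch.randn(8, 5)                       # 8 samples, 5 features
y = torch.randint(0, 2, (8,)).float()       # binary labels

h = torch.sigmoid(X @ theta)                # h_theta(x) = P(y=1 | x)
nll = -(y * torch.log(h) + (1 - y) * torch.log(1 - h)).sum()

print(nll)
print(F.binary_cross_entropy(h, y, reduction='sum'))   # same value
```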

For regression problems:

For regression problems, $Y=\mathbb{R}$, and we want $f\left(x^{(i)}, \theta\right)=y^{(i)}$.

Square loss function (least squares method)

$$L_{\text{square}}(f, y)=(f-y)^{2}$$

The square loss function is smooth and can be solved by the gradient descent method. However, when the difference between the predicted value and the real value is large, its penalty is relatively large, so it is more sensitive to outliers.

Absolute loss function

$$L_{\text{absolute}}(f, y)=|f-y|$$

The absolute loss function is less sensitive to outliers and more robust than the square loss, but it is not differentiable at $f=y$.

Huber loss function

$$L_{Huber}(f, y)=\begin{cases}\frac{1}{2}(f-y)^{2}, & |f-y| \leq \delta \\ \delta|f-y|-\frac{1}{2} \delta^{2}, & |f-y|>\delta\end{cases}$$

The Huber loss function is a squared loss when $|f-y|$ is small and a linear loss when $|f-y|$ is large; it is differentiable everywhere and more robust to outliers.

Log-cosh loss function

$$L_{\text{log-cosh}}(f, y)=\log (\cosh (f-y))$$

where $\cosh (x)=\left(e^{x}+e^{-x}\right) / 2$. The log-cosh loss function is smoother than the mean square loss, has all the advantages of the Huber loss, and is twice differentiable, so Newton's method can be used for optimization. However, when the error is large, the first-order gradient approaches a constant and the second derivative approaches zero, which causes Newton's method to fail.

Quantile loss function

$$L_{\gamma}(f, y)=\sum_{i: y_{i}<f_{i}}(1-\gamma)\left|y_{i}-f_{i}\right|+\sum_{i: y_{i} \geq f_{i}} \gamma\left|y_{i}-f_{i}\right|$$

The quantile loss predicts a range for the target rather than a single value. $\gamma$ is the desired quantile, with values between 0 and 1; $\gamma = 0.5$ is equivalent to MAE. By training models with several values of $\gamma$ and plotting them together, one obtains prediction intervals and their corresponding probabilities (the difference between two $\gamma$ values).
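A minimal sketch of the quantile (pinball) loss following the formula above; the function name `quantile_loss` is illustrative:

```python
import numpy as np

def quantile_loss(y_true, y_pred, gamma=0.5):
    """Penalizes under- and over-prediction asymmetrically according to gamma."""
    diff = y_true - y_pred
    # gamma * |diff| where y >= f (under-prediction), (1 - gamma) * |diff| where y < f
    return np.mean(np.where(diff >= 0, gamma * diff, (gamma - 1) * diff))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])
print(quantile_loss(y_true, y_pred, gamma=0.5))   # proportional to MAE
print(quantile_loss(y_true, y_pred, gamma=0.9))   # penalizes under-prediction more
```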

For retrieval problems:

Triplet loss

$$\sum_{i}^{N}\left[\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{p}\right)\right\|_{2}^{2}-\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{n}\right)\right\|_{2}^{2}+\alpha\right]_{+}$$

$[\cdot]_{+}$ is equivalent to a ReLU function. A triplet is built as follows: randomly select a sample from the training set, called the Anchor; then randomly select one sample of the same class as the Anchor and one sample of a different class. These two samples are called the Positive and the Negative, and together they form a triplet.

Through learning, the distance between the feature representations of the anchor and the positive sample is made as small as possible, while the distance between the anchor and the negative sample is made as large as possible, with at least a minimum margin between the two distances.
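A minimal PyTorch sketch of this squared-L2 triplet loss (tensor shapes are illustrative; `torch.nn.TripletMarginLoss` implements the same idea, but with a non-squared distance by default):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Squared-L2 triplet loss, clamped at zero (the [.]_+ in the formula above)."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # ||f(a) - f(p)||^2
    d_an = (anchor - negative).pow(2).sum(dim=1)   # ||f(a) - f(n)||^2
    return F.relu(d_ap - d_an + margin).mean()

a = torch.randn(8, 128)   # anchor embeddings
p = torch.randn(8, 128)   # positive embeddings
n = torch.randn(8, 128)   # negative embeddings
print(triplet_loss(a, p, n))
```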

Sum Hinge Loss & Max Hinge Loss

The input of triplet loss is a triplet (a, p, n). The usual practice is to sample $b$ pairs $\left(a_{i}, p_{i}\right), i \in[0, b]$, and then rotate the positives to obtain $\left(p_{1}, p_{2}, \ldots, p_{b}, p_{0}\right)$ as the list of negatives, which finally yields a one-dimensional loss vector $\left(l_{1}, l_{2}, \ldots, l_{b}\right)$.

Triplet loss computed this way only uses part of the similarity matrix formed by a and p. In fact, we can compute the loss over all off-diagonal negatives of the similarity matrix between a and p, thereby making full use of the batch; following this idea we obtain the Sum Hinge Loss below. Triplet loss uses the L2 distance, while here it is replaced with cosine similarity, so the earlier ap − an + margin becomes an − ap + margin: the goal is to make the anchor–negative similarity smaller and the anchor–positive similarity larger.

Sum Hinge Loss:

$$\ell_{SH}(i, c)=\sum_{\hat{c}}\left[\alpha-s(i, c)+s(i, \hat{c})\right]_{+}+\sum_{\hat{i}}\left[\alpha-s(i, c)+s(\hat{i}, c)\right]_{+}$$

Max Hinge Loss:

VSE++ proposes a new loss function, max hinge loss, which argues that hard negatives deserve more attention during ranking: hard negatives are negative samples that lie close to the anchor. The experiments also show that max hinge loss performs much better than the previously common ranking loss, sum hinge loss:

$$\ell_{MH}(i, c)=\max_{c^{\prime}}\left[\alpha+s\left(i, c^{\prime}\right)-s(i, c)\right]_{+}+\max_{i^{\prime}}\left[\alpha+s\left(i^{\prime}, c\right)-s(i, c)\right]_{+}$$

The PyTorch code for Max Hinge Loss is as follows:

import torch
import torch.nn as nn

def cosine_sim(im, s):
    """Cosine similarity between all the image and sentence pairs
    (assumes the embeddings are already L2-normalized).
    """
    return im.mm(s.t())

class MaxHingeLoss(nn.Module):

    def __init__(self, margin=0.2, max_violation=True):
        super(MaxHingeLoss, self).__init__()
        self.margin = margin
        self.sim = cosine_sim
        self.max_violation = max_violation

    def forward(self, im, s):
        # similarity matrix: scores[i, j] = s(im_i, s_j); the diagonal holds the
        # positive pairs (ap), all off-diagonal entries are negatives (an)
        scores = self.sim(im, s)
        diagonal = scores.diag().view(im.size(0), 1)
        ap1 = diagonal.expand_as(scores)
        ap2 = diagonal.t().expand_as(scores)

        # query2doc retrieval
        cost_s = (self.margin + scores - ap1).clamp(min=0)
        # doc2query retrieval
        cost_im = (self.margin + scores - ap2).clamp(min=0)

        # clear diagonals (positive pairs contribute no cost)
        I = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_s = cost_s.masked_fill_(I, 0)
        cost_im = cost_im.masked_fill_(I, 0)
        # keep the maximum violating negative for each query
        if self.max_violation:
            cost_s = cost_s.max(1)[0]
            cost_im = cost_im.max(0)[0]
        return cost_s.mean() + cost_im.mean()
        # or # return cost_s.sum() + cost_im.sum()

InfoNCE

InfoNCE loss is a simple variant of NCE. It argues that treating the problem as plain binary classification, with only data samples and noise samples, may not be friendly to model learning, because many of the noise samples may not belong to a single class at all; it is therefore more reasonable to treat it as a multi-class problem (where the number of classes k refers to the number of negatives after negative sampling). This gives the InfoNCE loss function:

$$L_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{k} \exp \left(q \cdot k_{i} / \tau\right)}$$

where $q \cdot k$ plays the role of the logits and $\tau$ is a temperature coefficient; overall the form is very similar to cross entropy.
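A minimal InfoNCE sketch, assuming L2-normalized embeddings and that the positive key occupies index 0 of each row of logits; the reduction to `F.cross_entropy` mirrors the formula above:

```python
import torch
import torch.nn.functional as F

def info_nce(query, pos_key, neg_keys, tau=0.07):
    """query: (B, D), pos_key: (B, D), neg_keys: (B, K, D); all L2-normalized."""
    l_pos = (query * pos_key).sum(dim=1, keepdim=True)           # (B, 1) positive logits
    l_neg = torch.bmm(neg_keys, query.unsqueeze(2)).squeeze(2)    # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau               # (B, 1 + K)
    labels = torch.zeros(query.size(0), dtype=torch.long)         # the positive is class 0
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(4, 128), dim=1)
k_pos = F.normalize(torch.randn(4, 128), dim=1)
k_neg = F.normalize(torch.randn(4, 16, 128), dim=2)
print(info_nce(q, k_pos, k_neg))
```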

For detection problems:

Classification loss

Softmax+cross entropy

For binary classification, the form of the cross-entropy loss function is as follows:

$$L=-y \log y^{\prime}-(1-y) \log \left(1-y^{\prime}\right)=\begin{cases}-\log y^{\prime}, & y=1 \\ -\log \left(1-y^{\prime}\right), & y=0\end{cases}$$

The cross-entropy loss function makes predictions more reliable by continuously reducing the difference between the two distributions.

Focal Loss

Focal loss comes from the paper Focal Loss for Dense Object Detection. It mainly addresses the severe imbalance between positive and negative samples in one-stage object detectors by reducing the weight of the large number of easy negatives during training, and can be understood as a form of hard example mining. Focal loss is a modification of the cross-entropy loss function. The specific changes are:

$$L_{fl}=\begin{cases}-\alpha\left(1-y^{\prime}\right)^{\gamma} \log y^{\prime}, & y=1 \\ -(1-\alpha) y^{\prime \gamma} \log \left(1-y^{\prime}\right), & y=0\end{cases}$$

Here γ > 0 (2 in the paper) reduces the loss of easy-to-classify samples so that training focuses on hard, misclassified samples. For example, with γ = 2, a positive sample predicted at 0.95 is clearly easy, so (1 − 0.95)^γ is very small and its loss shrinks accordingly, whereas a positive sample predicted at 0.3 keeps a relatively large loss. Likewise, for negative samples a prediction of 0.1 yields a much smaller loss than a prediction of 0.7. For a prediction of 0.5 the loss is only reduced by a factor of 0.25, so these hard-to-distinguish samples receive relatively more attention. In this way the influence of easy samples is reduced, and the accumulated effect of the many samples with very small losses no longer dominates training.
In addition, a balance factor α is added to counter the unequal proportion of positive and negative samples. The paper uses α = 0.25, i.e. the positive class is given a smaller weight than the negative class, because negative samples are easier to distinguish.
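A minimal sketch of the binary focal loss above (the function name and tensors are illustrative; `y_pred` is assumed to be the probability output of a sigmoid):

```python
import torch

def binary_focal_loss(y_pred, y_true, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss, following the piecewise formula above."""
    y_pred = y_pred.clamp(eps, 1.0 - eps)
    loss_pos = -alpha * (1 - y_pred) ** gamma * torch.log(y_pred)        # y = 1
    loss_neg = -(1 - alpha) * y_pred ** gamma * torch.log(1 - y_pred)    # y = 0
    return torch.where(y_true == 1, loss_pos, loss_neg).mean()

y_pred = torch.tensor([0.95, 0.30, 0.10, 0.70])   # easy pos, hard pos, easy neg, hard neg
y_true = torch.tensor([1, 1, 0, 0])
print(binary_focal_loss(y_pred, y_true))
```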

Localization loss

L1 (MAE), L2 (MSE), smooth L1 loss function

The 4 coordinate values are regressed with the L1, L2, or smooth L1 loss function. The smooth L1 loss was proposed in Fast R-CNN. The three loss functions are as follows:

$$L1=|x| \qquad L2=x^{2} \qquad smooth_{L1}(x)=\begin{cases}0.5 x^{2} & \text{if } |x|<1 \\ |x|-0.5 & \text{otherwise}\end{cases}$$

From the derivatives of these losses with respect to x: the derivative of the L1 loss is a constant, so there is no gradient explosion, but it is not differentiable at 0, and even when the loss is small the gradient stays relatively large, which can make the model oscillate and hinder convergence. The L2 loss is differentiable everywhere, but because of the squaring, the error is amplified whenever the difference between the predicted and true values exceeds 1; when the input is far from the true value, the gradient used by gradient descent becomes very large and may explode. Moreover, when there are several outliers, they can dominate the loss, and many useful samples are sacrificed to compensate for them, so the L2 loss is strongly affected by outliers. Smooth L1 avoids the shortcomings of both the L1 and L2 losses (a numerical sketch of all three follows the list below):

  • Between [-1,1] is the L2 loss, solving the problem that L1 has a turning point at 0

  • Outside the [-1, 1] interval is the L1 loss, which solves the problem of gradient explosion of outliers

  • When the error between the predicted value and the real value is too large, the gradient value will not be too large

  • When the error between the predicted value and the real value is small, the gradient value is small enough
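The following sketch compares the three losses numerically; `torch.nn.functional.smooth_l1_loss` with `beta=1.0` matches the piecewise definition above:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])   # prediction error f - y
zeros = torch.zeros_like(x)

l1 = x.abs()                                                       # L1 = |x|
l2 = x ** 2                                                        # L2 = x^2
smooth_l1 = torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)  # piecewise definition

print(l1, l2, smooth_l1)
# same result via the built-in (beta=1.0 reproduces the |x| < 1 threshold)
print(F.smooth_l1_loss(x, zeros, reduction='none', beta=1.0))
```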

The above three loss functions have the following deficiencies:

  • When computing the bounding-box regression loss, the three losses above compute the loss of the 4 coordinates independently and then add them up. This assumes the four points are independent of each other, whereas in fact they are correlated.

  • The actual metric for evaluating detection quality is IoU, and the two are not equivalent: several detection boxes may have the same loss but very different IoU values.

IoU Loss

The definition of IoU loss is as follows:

$$L_{I}=1-\frac{P \cap G}{P \cup G}$$

where P denotes the predicted box and G denotes the ground-truth box.

GIoU loss

IoU reflects the degree of overlap of two boxes. When the two boxes do not overlap, the IoU is identically 0 and the IoU loss is always equal to 1, which is clearly inappropriate for bounding-box regression in object detection. GIoU loss therefore extends the IoU loss to account for the case where the two boxes do not overlap. It is defined as follows:

$$L_{G}=1-IoU+R(P, G)=1-IoU+\frac{|C-P \cup G|}{|C|}$$

where C denotes the minimum enclosing rectangle of the two boxes and R(P, G) is the penalty term. From the formula, when the two boxes do not overlap the IoU is 0 but R still produces a loss; in the limit, when the distance between the two boxes goes to infinity, R→1.

DIoU Loss

Both IoU loss and GIoU loss only consider the degree of overlap of the two boxes, but for the same degree of overlap we also want the two boxes to be as close as possible, i.e. their centers should be as close as possible. DIoU therefore adds the distance between the center points of the two boxes on top of the IoU loss. It is defined as follows:

$$L_{D}=1-IoU+R(P, G)=1-IoU+\frac{\rho^{2}(p, g)}{c^{2}}$$

where ρ denotes the distance between the centers of the predicted box and the ground-truth box, p and g are the center points of the two boxes, and c denotes the diagonal length of the minimum enclosing rectangle of the two boxes. When the distance between the two boxes goes to infinity, the center distance approaches the diagonal length of the enclosing rectangle, so R→1.

CIoU Loss

DIoU loss considers the distance between the center points of two frames, while CIoU loss makes more detailed measurements based on DIoU loss, including:

  • Overlap area

  • center point distance

  • aspect ratio

The specific definition is as follows:

$$L_{C}=1-IoU+R(P, G)=1-IoU+\frac{\rho^{2}(p, g)}{c^{2}}+\alpha v$$

$$v=\frac{4}{\pi^{2}}\left(\arctan \frac{w^{g}}{h^{g}}-\arctan \frac{w^{p}}{h^{p}}\right)^{2}, \qquad \alpha=\frac{v}{(1-IoU)+v}$$

For segmentation problems:

Dice Loss

Dice loss is derived from the dice coefficient and is a measurement function used to measure the similarity of a set. It is usually used to calculate the similarity between two samples. The formula is as follows:

$$\text{Dice}=\frac{2|X \cap Y|}{|X|+|Y|}$$

Then the corresponding dice loss formula is as follows:

$$\text{Dice loss}=1-\frac{2|X \cap Y|}{|X|+|Y|}$$

From the definition, dice loss is a region-based loss: the loss and gradient at a pixel depend not only on the label and prediction of that pixel but also on the labels and predictions of the other pixels. Dice loss can be used when the classes are extremely imbalanced, but in general it tends to make backpropagation ill-behaved and training unstable.
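A minimal soft dice loss sketch for binary segmentation; the `smooth` term is a common stabilizing assumption rather than part of the formula above:

```python
import torch

def dice_loss(pred, target, smooth=1.0):
    """pred: predicted probabilities (N, H, W); target: binary ground truth (N, H, W)."""
    pred = pred.contiguous().view(pred.size(0), -1)
    target = target.contiguous().view(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1.0 - dice.mean()

pred = torch.rand(2, 64, 64)                    # e.g. sigmoid outputs
target = (torch.rand(2, 64, 64) > 0.5).float()  # binary masks
print(dice_loss(pred, target))
```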

cross entropy

The most commonly used loss function in semantic segmentation tasks is cross entropy, which obtains the loss value by comparing each pixel one by one.

The loss function corresponding to each pixel is:

$$\text{pixel loss}=-\sum_{\text{classes}}\left(y_{c} \log \left(p_{c}\right)\right)$$

$y_{c}$ is a one-hot vector whose entries are 0 or 1, and $p_{c}$ is the probability obtained by passing the network output through a softmax or sigmoid.

The loss of the entire image is the average of the loss of each pixel. The cross-entropy loss function is suitable for most semantic segmentation scenarios, but when the number of foreground pixels is much smaller than the number of background pixels, the loss of the background dominates at this time, resulting in poor network performance.

Cross entropy with weights

For the problem of class imbalance, it is alleviated by adding a weight coefficient to each class. The formula after adding the weight is as follows:

$$\text{pixel loss}=-\sum_{c}^{\text{classes}}\left(w_{c} y_{c} \log \left(p_{c}\right)\right)$$

where $w_{c}=\frac{N-N_{c}}{N}$, $N$ is the total number of pixels, and $N_{c}$ is the number of pixels whose ground-truth class is c.
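A minimal sketch of per-class weighting with PyTorch's built-in `nn.CrossEntropyLoss`; the pixel counts here are made-up numbers for illustration:

```python
import torch
import torch.nn as nn

pixel_counts = torch.tensor([9000.0, 1000.0])   # assumed counts for background / foreground
N = pixel_counts.sum()
weights = (N - pixel_counts) / N                # w_c = (N - N_c) / N

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(2, 2, 32, 32)              # (batch, classes, H, W)
target = torch.randint(0, 2, (2, 32, 32))       # per-pixel class indices
print(criterion(logits, target))
```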

2. Why can't the recognition accuracy be used as an indicator?

Taking the number recognition task as an example, what we want to obtain are parameters that can improve the recognition accuracy. Isn’t it a bit of repetitive work to introduce a loss function on purpose? In other words, since our goal is to obtain a neural network that makes the recognition accuracy as high as possible, shouldn't the recognition accuracy be used as an indicator?

When training a neural network, the recognition accuracy cannot be used as the objective, because the derivative of the accuracy with respect to the parameters is 0 almost everywhere, and a derivative of 0 stops the weight updates.

3. Why is LogSoftmax better than Softmax?

log_softmax can solve function overflow and underflow, speed up calculation and improve data stability.

Because softmax will perform exponential operations, when the output of the previous layer, that is, the input of softmax, is relatively large, overflow may occur. Similarly, when the input is negative and the absolute value is also large, the numerator and denominator will become extremely small, and may be rounded to 0, resulting in underflow.

Although mathematically it is simply the logarithm of the softmax, in practice it is computed as:

$$\log \left[f\left(x_{i}\right)\right]=\log \left(\frac{e^{x_{i}}}{e^{x_{1}}+e^{x_{2}}+\ldots+e^{x_{n}}}\right)=\log \left(\frac{\frac{e^{x_{i}}}{e^{M}}}{\frac{e^{x_{1}}}{e^{M}}+\frac{e^{x_{2}}}{e^{M}}+\ldots+\frac{e^{x_{n}}}{e^{M}}}\right)=\log \left(\frac{e^{\left(x_{i}-M\right)}}{\sum_{j}^{n} e^{\left(x_{j}-M\right)}}\right)=\left(x_{i}-M\right)-\log \left(\sum_{j}^{n} e^{\left(x_{j}-M\right)}\right)$$

where $M=\max \left(x_{i}\right), i=1,2, \cdots, n$, i.e. M is the largest of all $x_{i}$. This solves the overflow and underflow problems, keeping the computation numerically stable while also speeding it up.
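A small numerical sketch of the stability issue: a naive log(softmax(x)) overflows for large logits, while the shifted computation stays finite (`torch.log_softmax` uses the same max-shift trick internally):

```python
import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])   # large logits

naive = torch.log(torch.exp(x) / torch.exp(x).sum())              # exp overflows -> nan
shifted = (x - x.max()) - torch.log(torch.exp(x - x.max()).sum())

print(naive)                     # tensor([nan, nan, nan])
print(shifted)                   # finite values
print(torch.log_softmax(x, 0))   # matches the shifted computation
```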

4. What is label smoothing? Why use label smoothing?

Label smoothing is a regularization method, the full name is Label Smoothing Regularization (LSR), that is, label smoothing regularization.

In traditional classification tasks, the ground-truth label is converted into a one-hot vector and the loss is computed with cross entropy. Label smoothing instead softens the one-hot label into a soft label of probability values, in which the probability at the true label is the largest and the probabilities at the other positions are small.

Label smoothing has a parameter epsilon that controls the degree of softening: the larger the value, the smaller the probability at the true label and the smoother the label; conversely, the smaller the value, the closer the label is to a hard label. In ImageNet-1k training this value is usually set to 0.1.
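A minimal sketch of turning integer labels into smoothed soft labels (since PyTorch 1.10, `nn.CrossEntropyLoss(label_smoothing=0.1)` applies the same idea directly):

```python
import torch

def smooth_one_hot(labels, num_classes, epsilon=0.1):
    """Convert integer labels into smoothed soft labels."""
    # each class receives epsilon / num_classes; the true class gets the remaining mass
    soft = torch.full((labels.size(0), num_classes), epsilon / num_classes)
    soft.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon + epsilon / num_classes)
    return soft

labels = torch.tensor([2, 0])
print(smooth_one_hot(labels, num_classes=4, epsilon=0.1))
# tensor([[0.0250, 0.0250, 0.9250, 0.0250],
#         [0.9250, 0.0250, 0.0250, 0.0250]])
```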

5. When using sigmoid or softmax as the activation function, why use the cross entropy loss function instead of the mean square error loss function?

1. Because the cross-entropy loss function can perfectly solve the problem of slow weight update of the square loss function, it has the good property of "when the error is large, the weight update is fast; when the error is small, the weight update is slow".
2. When sigmoid is used as the activation function, the mean square error loss leads to a non-convex optimization problem that is hard to solve, whereas the cross-entropy loss keeps it a convex optimization problem, which is easier to optimize.

For details on formula derivation, see: https://blog.csdn.net/weixin_41888257/article/details/104894141

6. What is the difference between the cross-entropy loss function (Cross-entropy) and the square loss (MSE)?

1. Conceptual difference

**Mean square error loss function (MSE):** In simple terms, the mean square error (MSE) is the average of the squared differences between the n outputs of the n samples in a batch and their expected outputs.

Cross-entropy (cross-entropy loss function) : Cross-entropy is used to evaluate the difference between the current training probability distribution and the real distribution. It describes the distance between the actual output (probability) and the expected output (probability), that is, the smaller the value of cross entropy, the closer the two probability distributions are.

2. Application scenarios

MSE is more suitable for regression problems, and the cross-entropy loss function is more suitable for classification problems.

7. What is KL divergence?

KL divergence definition

KL (Kullback-Leibler) divergence is mostly used in probability theory and information theory, where it is also called relative entropy. It describes the difference between two probability distributions P and Q.

KL is asymmetric, that is, D(P||Q) ≠ D(Q||P).
In information theory, D(P||Q) represents the information loss generated when the probability distribution Q is used to fit the real distribution P, where P represents the real distribution, and Q represents the fitted distribution of P

KL divergence formula definition

For discrete random variables there are:

$$\mathrm{D}(\mathrm{P} \| \mathrm{Q})=\sum_{i \in X} P(i)\left[\log \left(\frac{P(i)}{Q(i)}\right)\right]$$

For continuous random variables there are:

$$\mathrm{D}(\mathrm{P} \| \mathrm{Q})=\int_{x} P(x)\left[\log \left(\frac{P(x)}{Q(x)}\right)\right] d x$$

Physical Definition of KL Divergence

In information theory, it is used to measure the number of extra bits required to encode samples from the P distribution on average using a code based on the Q distribution .
In the field of machine learning, it is used to measure the similarity or closeness of two functions .
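A small sketch computing D(P||Q) for two discrete distributions and checking the asymmetry (plain PyTorch; note that `F.kl_div` expects log-probabilities as its first argument):

```python
import torch

def kl_divergence(p, q):
    """D(P || Q) for discrete distributions given as probability vectors."""
    return (p * (p / q).log()).sum()

p = torch.tensor([0.4, 0.4, 0.2])
q = torch.tensor([0.3, 0.3, 0.4])

print(kl_divergence(p, q))   # D(P || Q)
print(kl_divergence(q, p))   # D(Q || P) -- generally different: KL is asymmetric
```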

8. Handwritten code for IoU, GIoU, DIoU, CIoU

The IoU code is shown below:

import cv2
import numpy as np
def IOU_score(box1,box2):
        """
        Compute the IoU of two boxes.

        para: box1 corner coordinates of box 1: x1,y1,x2,y2
        para: box2 corner coordinates of box 2: x1,y1,x2,y2
        """
        # intersection of the two boxes
        iou_x1 = max(box1[0], box2[0])
        iou_y1 = max(box1[1], box2[1])
        iou_x2 = min(box1[2], box2[2]) 
        iou_y2 = min(box1[3], box2[3])

        # the coordinates above are the two corners of the intersection
        area_inter = max(0,(iou_x2 - iou_x1)) * max(0 , (iou_y2 - iou_y1))

        # union of the two boxes
        area_all = ((box1[2] - box1[0]) * (box1[3] - box1[1])) + ((box2[2] - box2[0]) * (box2[3] - box2[1])) - area_inter
        
        center_x = int((iou_x1 + iou_x2) / 2)
        center_y = int((iou_y2 + iou_y1) / 2)
        return float(area_inter / area_all) , (center_x,center_y)

def main():

    img = np.zeros((512,512,3), np.uint8)   
    img.fill(255)

    box1 = [50,50,300,300]
    box2 = [51,51,301,301]

    cv2.rectangle(img, (box1[0],box1[1]), (box1[2],box1[3]), (0, 0, 255), 2)
    cv2.rectangle(img, (box2[0],box2[1]), (box2[2],box2[3]), (255, 0, 0), 2)

    IOU , center = IOU_score(box1,box2)
    font = cv2.FONT_HERSHEY_SIMPLEX
    cv2.putText(img,"IOU = %.2f"%IOU,center,font,0.8,(0,0,0),2)

    cv2.imshow("image",img)
    cv2.waitKey()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()

The GIoU code is shown below:

import cv2
import numpy as np
def GIOU_score(box1,box2):
        """
        Compute the GIoU of two boxes.

        para: box1 corner coordinates of box 1: x1,y1,x2,y2
        para: box2 corner coordinates of box 2: x1,y1,x2,y2
        """
        # intersection of the two boxes
        iou_x1 = max(box1[0], box2[0])
        iou_y1 = max(box1[1], box2[1])
        iou_x2 = min(box1[2], box2[2]) 
        iou_y2 = min(box1[3], box2[3])

        g_iou_x1 = min(box1[0], box2[0]) 
        g_iou_y1 = min(box1[1], box2[1])
        g_iou_x2 = max(box1[2], box2[2]) 
        g_iou_y2 = max(box1[3], box2[3])

        # the coordinates above are the two corners of the intersection
        area_inter = max(0,(iou_x2 - iou_x1)) * max(0 , (iou_y2 - iou_y1))

        # union of the two boxes
        area_union = max(0,((box1[2] - box1[0]) * (box1[3] - box1[1])) + ((box2[2] - box2[0]) * (box2[3] - box2[1])) - area_inter)

        # minimum enclosing rectangle of the two boxes
        area_all = max(0,(g_iou_x2 - g_iou_x1) * (g_iou_y2 - g_iou_y1))

        g_iou = max(0,area_inter/area_union) - max(0,area_all - area_union) / area_all 

        return float(g_iou) , (iou_x1,iou_y1,iou_x2,iou_y2) , (g_iou_x1,g_iou_y1,g_iou_x2,g_iou_y2)

def main():

    img = np.zeros((512,512,3), np.uint8)   
    img.fill(255)

    box1 = [50,50,300,300]
    box2 = [100,100,400,400]

    IOU , area_inter , area_all = GIOU_score(box1,box2)

    font = cv2.FONT_HERSHEY_SIMPLEX
    cv2.putText(img,"GIOU = %.2f"%IOU,(area_inter[0]+30,area_inter[1]+30),font,0.8,(0,0,0),2)

    cv2.rectangle(img, (box1[0],box1[1]), (box1[2],box1[3]), (255, 0, 0),  thickness = 3)
    cv2.rectangle(img, (box2[0],box2[1]), (box2[2],box2[3]), (0, 255, 0), thickness = 3)
    cv2.rectangle(img, (area_all[0],area_all[1]), (area_all[2],area_all[3]), (0, 0, 255), thickness = 3)
    
    cv2.imshow("image",img)
    cv2.waitKey()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()

The DIoU code is shown below:

import cv2
import numpy as np
def DIOU_score(box1,box2):
        """
        Compute the DIoU of two boxes.

        para: box1 corner coordinates of box 1: x1,y1,x2,y2
        para: box2 corner coordinates of box 2: x1,y1,x2,y2
        """
        # intersection of the two boxes
        iou_x1 = max(box1[0], box2[0])
        iou_y1 = max(box1[1], box2[1])
        iou_x2 = min(box1[2], box2[2]) 
        iou_y2 = min(box1[3], box2[3])

        d_x1 = max(0, (box1[2] + box1[0])/2)
        d_y1 = max(0, (box1[3] + box1[1])/2)
        d_x2 = max(0, (box2[2] + box2[0])/2)
        d_y2 = max(0, (box2[3] + box2[1])/2)

        c_x1 = min(box1[0], box2[0]) 
        c_y1 = min(box1[1], box2[1])
        c_x2 = max(box1[2], box2[2]) 
        c_y2 = max(box1[3], box2[3])

        # the coordinates above are the two corners of the intersection
        area_inter = max(0,(iou_x2 - iou_x1)) * max(0 , (iou_y2 - iou_y1))

        # union of the two boxes
        area_union = max(0,((box1[2] - box1[0]) * (box1[3] - box1[1])) + ((box2[2] - box2[0]) * (box2[3] - box2[1])) - area_inter)

        # squared diagonal of the minimum enclosing rectangle and squared center distance
        c_2 = max(0,(c_x2 - c_x1))**2 + max(0,(c_y2 - c_y1))**2
        d_2 =  max(0,(d_x2 - d_x1))**2 + max(0,(d_y2 - d_y1))**2

        d_iou = max(0,area_inter/area_union) - d_2/c_2

        return float(d_iou) , (iou_x1,iou_y1,iou_x2,iou_y2) , (c_x1,c_y1,c_x2,c_y2), (int(d_x1),int(d_y1),int(d_x2),int(d_y2))

def main():
    img = np.zeros((512,512,3), np.uint8)   
    img.fill(255)

    box1 = [50,50,300,300]
    box2 = [250,80,400,350]

    IOU , area_inter , area_all , short_line = DIOU_score(box1,box2)

    font = cv2.FONT_HERSHEY_SIMPLEX
    cv2.putText(img,"DIOU = %.2f"%IOU,(area_inter[0]+30,area_inter[1]+30),font,0.8,(0,0,0),2)

    cv2.rectangle(img, (box1[0],box1[1]), (box1[2],box1[3]), (255, 0, 0),  thickness = 3)
    cv2.rectangle(img, (box2[0],box2[1]), (box2[2],box2[3]), (0, 255, 0), thickness = 3)
    cv2.rectangle(img, (area_all[0],area_all[1]), (area_all[2],area_all[3]), (0, 0, 255), thickness = 3)
    cv2.line(img, (short_line[0],short_line[1]), (short_line[2],short_line[3]), (200,45,45),5)
    cv2.line(img, (area_all[0],area_all[1]), (area_all[2],area_all[3]), (64,78,0),5)
    
    cv2.imshow("image",img)
    cv2.waitKey()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()

The CIoU code is shown below:

import cv2
import numpy as np
from math import pi,atan
def CIOU_score(box1,box2):
        """
        Compute the CIoU of two boxes.

        para: box1 corner coordinates of box 1: x1,y1,x2,y2
        para: box2 corner coordinates of box 2: x1,y1,x2,y2
        """
        # intersection of the two boxes
        iou_x1 = max(box1[0], box2[0])
        iou_y1 = max(box1[1], box2[1])
        iou_x2 = min(box1[2], box2[2]) 
        iou_y2 = min(box1[3], box2[3])

        d_x1 = max(0, (box1[2] + box1[0])/2)
        d_y1 = max(0, (box1[3] + box1[1])/2)
        d_x2 = max(0, (box2[2] + box2[0])/2)
        d_y2 = max(0, (box2[3] + box2[1])/2)

        c_x1 = min(box1[0], box2[0]) 
        c_y1 = min(box1[1], box2[1])
        c_x2 = max(box1[2], box2[2]) 
        c_y2 = max(box1[3], box2[3])

        w_gt = max(0,box2[2] - box2[0])
        h_gt = max(0,box2[3] - box2[1])

        w    = max(0,box1[2] - box1[0])
        h    = max(0,box1[3] - box1[1])

        # the coordinates above are the two corners of the intersection
        area_inter = max(0,(iou_x2 - iou_x1)) * max(0 , (iou_y2 - iou_y1))

        # union of the two boxes
        area_union = max(0,((box1[2] - box1[0]) * (box1[3] - box1[1])) + ((box2[2] - box2[0]) * (box2[3] - box2[1])) - area_inter)

        iou = max(0,area_inter/area_union)

        c_2 = max(0,(c_x2 - c_x1))**2 + max(0,(c_y2 - c_y1))**2
        d_2 =  max(0,(d_x2 - d_x1))**2 + max(0,(d_y2 - d_y1))**2

        v = 4/pi**2 * (atan(w_gt/h_gt) - atan(w/h))**2

        alpha = v / (1-iou + v)

        c_iou = iou - d_2/c_2 - alpha * v

        return float(c_iou) , (iou_x1,iou_y1,iou_x2,iou_y2) , (c_x1,c_y1,c_x2,c_y2), (int(d_x1),int(d_y1),int(d_x2),int(d_y2))

def main():
    img = np.zeros((512,512,3), np.uint8)   
    img.fill(255)

    box1 = [50,50,300,300]
    box2 = [100,80,200,260]

    IOU , area_inter , area_all , short_line = CIOU_score(box1,box2)

    font = cv2.FONT_HERSHEY_SIMPLEX
    cv2.putText(img,"CIOU = %.2f"%IOU,(area_inter[0]+30,area_inter[1]+30),font,0.8,(0,0,0),2)

    cv2.rectangle(img, (box1[0],box1[1]), (box1[2],box1[3]), (255, 0, 0),  thickness = 3)
    cv2.rectangle(img, (box2[0],box2[1]), (box2[2],box2[3]), (0, 255, 0), thickness = 3)
    cv2.rectangle(img, (area_all[0],area_all[1]), (area_all[2],area_all[3]), (0, 0, 255), thickness = 3)
    cv2.line(img, (short_line[0],short_line[1]), (short_line[2],short_line[3]), (200,45,45),5)
    cv2.line(img, (area_all[0],area_all[1]), (area_all[2],area_all[3]), (64,78,0),5)
    
    cv2.imshow("image",img)
    cv2.waitKey()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()


9. What is the difference between cost function, loss function, objective function and risk function?

  • The loss function (Loss Function) is defined on a single sample and calculates the error of a sample.

  • The cost function (Cost Function) is defined on the entire training set and is the average of all sample errors, that is, the average of the loss function.

  • The objective function (Objective Function) is the function that is ultimately optimized; it equals the empirical risk plus the structural risk (that is, the cost function plus a regularization term), so minimizing it reduces both the empirical risk and the regularization term.

  • The risk function is the expectation of the loss function. Because the input and output (X, Y) follow a joint distribution that is unknown, the risk cannot be computed directly. What we do have is historical data, i.e. the training set, and the average loss of f(x) over the training set is called the empirical risk; our goal is to minimize it, which is called empirical risk minimization.

10. What is the difference between noise, bias and variance?

  • Noise: describes the lower bound of the expected generalization error that any learning algorithm can achieve on the current task, i.e. the difficulty of the learning problem itself. In plain terms, some labels in the data are not the true labels but contain a certain amount of noise.

  • Bias: the difference between the predicted results and the true values, excluding the influence of noise. Bias describes the error of the outputs of a single model; it arises when the model cannot accurately express the data relationship, for example when a linear model is used to fit a nonlinear relationship. A model with large bias is simply the wrong model.

  • Variance: instead of judging the outputs of a single model, variance refers to the spread between the outputs of multiple models (or of the same model trained multiple times). Large variance is usually caused by insufficient training data: on the one hand the quantity is not enough, so over-training on a limited data set yields an overly complex model; on the other hand the quality is not good, e.g. the test-set distribution is not covered by the training set, so each time the model is trained on a new sample the parameters differ and the outputs cannot reliably predict the correct result.

11. How to solve the problem of poor robustness of MSE to abnormal samples?

  1. If the abnormal samples are meaningless, the outliers can be smoothed or deleted directly.

  2. If the abnormal samples are meaningful and the model needs to take these meaningful abnormalities into account, consider using a model with stronger expressive ability or composite model or grouping modeling from the model side;

  3. Choose a more robust loss function at the loss level.

12. Knowledge points about entropy?

  • The amount of information measures the degree of uncertainty of an event. The higher the uncertainty, the greater the amount of information. Generally, the uncertainty is defined by the probability of event occurrence. The amount of information is based on the log operation of the probability density function, using the following formula definition:

$$I(x)=-\log p(x)$$

  • Information entropy measures the degree of uncertainty of an event set, which is the expectation of the uncertainty of all events in the event set. The formula is defined as follows:

$$H(X)=-\sum_{x \in X}[p(x) \log p(x)]$$

  • Relative entropy (KL divergence): from a general point of view, KL divergence is an asymmetric measure of the difference between two probability distributions; from the information-theory point of view it is called relative entropy, and it describes the difference between the information entropies of two probability distributions:

$$KL(P \| Q)=\sum P(x) \log \frac{P(x)}{Q(x)}$$

The KL divergence, like the cosine distance, does not satisfy the strict definition of a distance: it is non-negative but not symmetric.

  • The js divergence formula is as follows:

$$JS(P \| Q)=\frac{1}{2} KL\left(P(x) \,\Big\|\, \frac{P(x)+Q(x)}{2}\right)+\frac{1}{2} KL\left(Q(x) \,\Big\|\, \frac{P(x)+Q(x)}{2}\right)$$

The range of the JS divergence is [0, 1]: 0 for identical distributions and 1 for completely different ones. Compared with KL it measures similarity more precisely, and the JS divergence is symmetric: JS(P||Q)=JS(Q||P).

  • The cross entropy formula is as follows:

$$H(P, Q)=-\sum p \log q=H(P)+D_{kl}(P \| Q)$$

It can be seen that the cross entropy is the sum of the information entropy of the true distribution and the KL divergence. The entropy of the true distribution is fixed and independent of the model parameters θ, so when deriving gradient descent, optimizing the cross entropy is equivalent to optimizing the KL divergence (relative entropy);

  • The joint entropy formula is as follows:

$$H(X, Y)=-\sum_{x, y} p(x, y) \log p(x, y)$$

Joint entropy actually measures the information entropy of a new large event set formed by combining two event sets;

  • The conditional entropy formula is as follows:

$$H(Y \mid X)=H(X, Y)-H(X)$$

The conditional entropy of event set Y equals the joint entropy minus the information entropy of event set X; it measures how much uncertainty remains in Y once event set X is known;
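A small numerical sketch verifying two of the identities above, H(P, Q) = H(P) + KL(P||Q) for cross entropy and H(Y|X) = H(X, Y) − H(X) for conditional entropy (plain PyTorch, natural logarithm; the distributions are made up for illustration):

```python
import torch

p = torch.tensor([0.5, 0.3, 0.2])   # true distribution P
q = torch.tensor([0.4, 0.4, 0.2])   # model distribution Q

entropy_p = -(p * p.log()).sum()            # H(P)
kl_pq = (p * (p / q).log()).sum()           # KL(P || Q)
cross_entropy = -(p * q.log()).sum()        # H(P, Q)
print(cross_entropy, entropy_p + kl_pq)     # identical values

# joint distribution of (X, Y) over a 2x2 table
joint = torch.tensor([[0.3, 0.2],
                      [0.1, 0.4]])
h_xy = -(joint * joint.log()).sum()         # H(X, Y)
p_x = joint.sum(dim=1)                      # marginal of X
h_x = -(p_x * p_x.log()).sum()              # H(X)
print(h_xy - h_x)                           # H(Y | X)
```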

13. How to measure the difference between two distributions?

Use KL divergence or JS divergence

14. What are the characteristics of Huber Loss?

Huber loss combines the MSE and MAE losses. When the error is close to 0, MSE is used, so the loss is differentiable and the gradient is more stable; when the error is large, MAE is used, which reduces the influence of outliers and makes training more robust to them. The drawback is that an additional hyperparameter needs to be set.

15. How to understand Hinge Loss?

From the shape of the hinge loss, when x is greater than a certain threshold the loss is 0, and when x is below that threshold a loss is incurred: the model penalizes samples below the threshold, and the further below, the heavier the penalty, while samples above the threshold are not penalized at all. In general, the hinge loss looks for a decision boundary: it does not penalize confident samples, and it penalizes unreliable samples or samples that cross the decision boundary.

【Project recommendation】

Core code library of top conference papers for beginners: https://github.com/xmu-xiaoma666/External-Attention-pytorch

YOLO object detection library for beginners: https://github.com/iscyy/yoloair

Top-conference and top-journal paper analyses for beginners: https://github.com/xmu-xiaoma666/FightingCV-Paper-Reading

References:

https://blog.csdn.net/zuolixiangfisher/article/details/88649110

https://zhuanlan.zhihu.com/p/514859125

https://blog.csdn.net/weixin_43750248/article/details/116656242

https://blog.csdn.net/hello_dear_you/article/details/121078919

https://blog.csdn.net/weixin_41888257/article/details/104894141

https://blog.csdn.net/weixin_37763870/article/details/103026505

https://blog.csdn.net/to_be_little/article/details/124674924

https://zhuanlan.zhihu.com/p/548782472
