Hinge Loss and Zero-One Loss

Wikipedia: https://en.wikipedia.org/wiki/Hinge_loss

Chart description:

  • The vertical axis shows the values of Hinge Loss (blue) and Zero-One Loss (green) for a fixed target $t = 1$; the horizontal axis shows the predicted value $y$.
  • The figure shows that Hinge Loss penalizes predicted values $y < 1$, which corresponds to the concept of margin in support vector machines.

Hinge Loss

Hinge Loss is a commonly used loss function in machine learning, typically applied to classification problems in support vector machine (SVM) models. It is defined as follows:
$$L(y, f(x)) = \max(0,\, 1 - y_i \cdot f(x)), \qquad f(x) = w^{\mathrm{T}} x_i + b \tag{1}$$
Here, $y_i$ is the true label of the sample and $f(x)$ is the model's predicted value. The loss is a non-negative real number: the larger the gap between the prediction and the true label, the larger the loss.

When the sample is correctly classified with a sufficient margin, that is, when $y_i \cdot f(x) \geq 1$, the Hinge Loss is 0, indicating that the model's prediction is confidently correct and incurs no penalty.

When $y_i \cdot f(x) < 1$, that is, when the sample is misclassified or falls inside the margin, the Hinge Loss is $1 - y_i \cdot f(x)$: the larger the violation, the larger the loss.

The goal of Hinge Loss is to minimize classification error while encouraging the model to produce a large margin (i.e., a large distance between correctly classified samples and the separating hyperplane).
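
A minimal NumPy sketch of this piecewise behavior (the helper name `hinge_loss` and the sample values are illustrative assumptions):

```python
import numpy as np

def hinge_loss(y, scores):
    """Element-wise hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * scores)

# Illustrative values: a confident correct prediction, a correct prediction
# inside the margin, and a misclassified sample.
y = np.array([+1.0, +1.0, -1.0])
scores = np.array([2.0, 0.5, 1.5])   # f(x) = w^T x + b for each sample

print(hinge_loss(y, scores))         # [0.  0.5 2.5]
```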

In support vector machines, the goal is to find the hyperplane that separates the samples with the largest margin, so Hinge Loss can be related to the margin. For a sample point $(x_i, y_i)$, its signed distance to the hyperplane is:
$$\frac{y_i \left( w^{\mathrm{T}} x_i + b \right)}{\|w\|} \tag{2}$$
where $w$ and $b$ are the weights and bias of the SVM model. Denoting this distance by $\gamma_i$, Hinge Loss can be rewritten as:
$$L(y_i, f(x_i)) = \max(0,\, 1 - \gamma_i \|w\|) \tag{3}$$
Therefore, Hinge Loss not only captures the classification error but also pushes the model toward a larger margin, thereby improving its generalization ability.
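
A one-step check that (3) is just (1) rewritten, using the definition of $\gamma_i$ in (2):

$$\gamma_i \|w\| = y_i \left( w^{\mathrm{T}} x_i + b \right) = y_i \cdot f(x_i) \;\Longrightarrow\; \max(0,\, 1 - \gamma_i \|w\|) = \max(0,\, 1 - y_i \cdot f(x_i))$$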

Zero-One Loss

Zero-One Loss is a common classification loss function in machine learning. For a binary classification problem, suppose $y \in \{-1, 1\}$ is the true label and $f(x)$ is the model's prediction for a sample $x$. Zero-One Loss is defined as:
$$L(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases} \tag{4}$$
That is, when the model's prediction agrees with the true label, the Zero-One Loss is 0; otherwise it is 1. As the expression shows, Zero-One Loss penalizes every prediction error equally harshly: no matter how close a wrong prediction is to being correct, the loss is counted as 1. Compared with other loss functions, Zero-One Loss is therefore often regarded as a very strict evaluation criterion.
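
A small sketch contrasting the two losses on raw scores, where the predicted label is taken as the sign of the score (the sample values are illustrative assumptions):

```python
import numpy as np

def zero_one_loss(y, scores):
    """1 if the sign of the score disagrees with the label, else 0."""
    return (np.sign(scores) != y).astype(float)

def hinge_loss(y, scores):
    return np.maximum(0.0, 1.0 - y * scores)

y = np.array([+1.0, +1.0, -1.0, -1.0])
scores = np.array([2.0, -0.01, -3.0, 0.4])

print(zero_one_loss(y, scores))  # [0. 1. 0. 1.]   -- every error costs exactly 1
print(hinge_loss(y, scores))     # [0.   1.01 0.   1.4 ] -- penalty grows with the violation
```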

However, since Zero-One Loss is not differentiable, differentiable surrogate functions such as Hinge Loss or Cross-Entropy Loss are usually used when training the model. Compared with Zero-One Loss, these loss functions are smoother, which helps the model converge faster and more stably.
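
A minimal sketch of training a linear classifier with the hinge surrogate via subgradient descent (the toy data, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {-1, +1} (illustrative only).
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

w, b = np.zeros(2), 0.0
lr = 0.1

for _ in range(200):
    scores = X @ w + b
    violated = y * scores < 1                 # samples with nonzero hinge loss
    # Subgradient of the mean hinge loss: -y_i * x_i (and -y_i for b) on violated samples.
    grad_w = -(y[violated][:, None] * X[violated]).sum(axis=0) / len(X)
    grad_b = -y[violated].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

train_error = np.mean(np.sign(X @ w + b) != y)  # zero-one loss on the training set
print(w, b, train_error)
```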

It should be noted that although Zero-One Loss is a very strict way to evaluate model performance, it is often not the best choice in practice. In particular, when the labels in the dataset contain noise, using Zero-One Loss may cause the model to overfit the training set and fail to generalize to the test set. Therefore, in practical applications we usually use a smoother loss function, combined with common regularization techniques such as L1/L2 regularization, to control the model's complexity and improve its generalization ability.
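
For example, combining the hinge surrogate with L2 regularization gives the familiar soft-margin SVM objective, where $\lambda$ controls the regularization strength over $n$ training samples:

$$\min_{w,\, b} \;\; \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i (w^{\mathrm{T}} x_i + b)\bigr) \; + \; \lambda \|w\|^2$$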

Original post: https://blog.csdn.net/m0_70885101/article/details/129079418