Comparison of Common Regression and Classification Loss Functions


The loss function is generally written as \(L(y,f(x))\). It measures the degree of inconsistency between the true value \(y\) and the predicted value \(f(x)\), and smaller values are generally better. To make different loss functions easier to compare, each is often expressed as a function of a single quantity: \(y-f(x)\) for regression problems and \(yf(x)\) for classification problems. The two cases are discussed separately below.



Loss functions for regression problems

In regression problems, both \(y\) and \(f(x)\) are real numbers \(\in R\), so the residual \(y-f(x)\) is used to measure the degree of inconsistency between them. The larger the residual (in absolute value), the larger the loss function and the worse the learned model (regularization is not considered here).


Common regression loss functions are:

  • Squared loss: \((y-f(x))^2\)
  • Absolute loss: \(|y-f(x)|\)
  • Huber loss: \(\left\{\begin{matrix}\frac12[y-f(x)]^2 & \qquad |y-f(x)| \leq \delta \\ \delta|y-f(x)| - \frac12\delta^2 & \qquad |y-f(x)| > \delta\end{matrix}\right.\)



The most commonly used is the squared loss, but its disadvantage is that it imposes a large penalty on outliers, so it is not robust enough. When there are many outliers, the absolute loss performs better, but its disadvantage is that it is not differentiable at \(y-f(x)=0\), so it is not easy to optimize.


Huber loss is a combination of the two: when \(|y-f(x)|\) is smaller than a pre-specified value \(\delta\) it behaves like the squared loss, and when it is larger than \(\delta\) it behaves like the absolute loss, so it is also a relatively robust loss function. A graphical comparison of the three is as follows:
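To make the shapes concrete, here is a minimal sketch of the three regression losses (assuming NumPy; the function names and sample residuals are ours, not from any library):

```python
import numpy as np

def squared_loss(y, f):
    """Squared loss: (y - f(x))^2."""
    return (y - f) ** 2

def absolute_loss(y, f):
    """Absolute loss: |y - f(x)|."""
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    """Huber loss: quadratic for |y - f(x)| <= delta, linear beyond that."""
    r = np.abs(y - f)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

# Evaluate the three losses over a range of residuals y - f(x)
residuals = np.linspace(-3, 3, 7)
for loss in (squared_loss, absolute_loss, huber_loss):
    print(loss.__name__, loss(residuals, 0.0))
```

Plotting these values against the residual gives a comparison of the kind described above.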





Loss functions for classification problems

For binary classification problems, \(y\in \left\{-1,+1 \right\}\), and the loss function is often expressed as a monotonically decreasing function of \(yf(x)\), as shown below:



\(yf(x)\) is called the margin, and it plays a role similar to the residual \(y-f(x)\) in regression problems.


The classification rule in a binary classification problem is usually \(sign(f(x)) = \left\{\begin{matrix} +1 \qquad if\;\;f(x) \geq 0 \\ -1 \qquad if\;\;f(x) < 0\end{matrix}\right.\)

It can be seen that if \(yf(x) > 0\) the sample is classified correctly, while if \(yf(x) < 0\) it is classified incorrectly; the corresponding decision boundary is \(f(x) = 0\). Therefore, minimizing the loss function can also be regarded as a process of maximizing the margin, and any reasonable classification loss function should impose a larger penalty on samples with \(yf(x) < 0\).
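A tiny sketch of the sign rule and the margin (assuming NumPy; the scores and labels are made-up numbers):

```python
import numpy as np

# Hypothetical scores f(x) and true labels y in {-1, +1}
f_x = np.array([2.3, -0.7, 0.1, -1.5])
y = np.array([1, 1, -1, -1])

pred = np.where(f_x >= 0, 1, -1)   # classification rule sign(f(x))
margin = y * f_x                   # margin y*f(x)

print(pred)          # [ 1 -1  1 -1]
print(margin)        # [ 2.3 -0.7 -0.1  1.5]
print(margin > 0)    # positive margin <=> correctly classified
```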



1. 0-1 loss (zero-one loss)

\[L(y,f(x)) = \left\{\begin{matrix} 0 \qquad if \;\; yf(x)\geq0 \\ 1 \qquad if \;\; yf(x) < 0\end{matrix}\right.\]

The 0-1 loss imposes the same penalty on every misclassified point, so points that are "badly wrong" (i.e. \(yf(x) \rightarrow -\infty\)) receive no extra attention, which does not fit intuition. In addition, the 0-1 loss is discontinuous and non-convex, which makes it difficult to optimize, so other surrogate loss functions are often optimized instead.



2. Logistic loss

\[L(y,f(x)) = log(1+e^{-yf(x)})\]


Logistic Loss is the loss function used in Logistic Regression. Here is a simple proof:


The Sigmoid function is used in Logistic Regression to represent the predicted probability: \[g(f(x)) = P(y=1|x) = \frac{1}{1+e^{-f(x)}}\]

\[P(y=-1|x) = 1-P(y=1|x) = 1-\frac{1}{1+e^{-f(x)}} = \frac{1}{1+e^{f(x)}} = g(-f(x))\]

Therefore, for \(y\in\left\{-1,+1\right\}\), the two cases can be written compactly as \(P(y|x) = \frac{1}{1+e^{-yf(x)}}\). This is a probabilistic model, so we apply the idea of maximum likelihood:

\[max(\prod P(y|x)) = max(\prod \frac{1}{1+e^{-yf(x)}})\]


Take the logarithm of the likelihood, and since we want a loss function, turn the maximization into a minimization:

\[max\Big(\sum log P(y|x)\Big) \;\Rightarrow\; min\Big(-\sum log\frac{1}{1+e^{-yf(x)}}\Big) = min\Big(\sum log\big(1+e^{-yf(x)}\big)\Big)\]

In this way, the logistic loss is obtained.
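A quick numerical check of this derivation (assuming NumPy; the margin values are made up): the negative log-likelihood \(-\log P(y|x)\) coincides with \(\log(1+e^{-yf(x)})\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

yf = np.array([2.0, 0.5, -1.0, -3.0])        # hypothetical margins y*f(x)

neg_log_likelihood = -np.log(sigmoid(yf))    # -log P(y|x) with P(y|x) = 1/(1+e^{-yf(x)})
logistic_loss = np.log1p(np.exp(-yf))        # log(1 + e^{-yf(x)})

print(np.allclose(neg_log_likelihood, logistic_loss))  # True
```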



If we define \(t = \frac{y+1}2 \in \left\{0,1\right\}\), the maximum likelihood objective can be written as:

\[\prod \big(P(y=1|x)\big)^{t}\big(1-P(y=1|x)\big)^{1-t}\]

Take the logarithm and minimize to get:

\[\sum [-t\log P(y=1|x) - (1-t)\log (1-P(y=1|x))]\]

The formula above is called the cross-entropy loss. It can be seen that logistic loss and cross-entropy loss are equivalent for binary classification; the only difference between the two is how the label \(y\) is defined.
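The equivalence can also be checked numerically. A short sketch (assuming NumPy; the scores and labels are invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_x = np.array([2.0, 0.5, -1.0, -3.0])   # hypothetical scores f(x)
y = np.array([1, -1, 1, -1])             # labels in {-1, +1}
t = (y + 1) / 2                          # the same labels relabelled into {0, 1}

p1 = sigmoid(f_x)                        # P(y=1|x)
cross_entropy = -t * np.log(p1) - (1 - t) * np.log(1 - p1)
logistic_loss = np.log1p(np.exp(-y * f_x))

print(np.allclose(cross_entropy, logistic_loss))  # True
```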



3. Hinge loss

\[L(y,f(x)) = max(0,1-yf(x))\]


Hinge loss is the loss function used in the SVM. It makes the loss of every sample with \(yf(x)>1\) equal to 0, which yields a sparse solution, so that the SVM can determine the final hyperplane using only a small number of support vectors.
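A small sketch of this sparsity effect (assuming NumPy; the margins are made up):

```python
import numpy as np

def hinge_loss(margin):
    """Hinge loss as a function of the margin y*f(x): max(0, 1 - y*f(x))."""
    return np.maximum(0.0, 1.0 - margin)

# Samples whose margin exceeds 1 contribute zero loss (and zero gradient),
# so only the remaining points (the support vectors) shape the hyperplane.
margins = np.array([3.0, 1.2, 1.0, 0.4, -2.0])
print(hinge_loss(margins))   # [0.  0.  0.  0.6 3. ]
```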



4. Exponential loss

\[L(y,f(x)) = e^{-yf(x)}\]


Exponential loss is the loss function used in AdaBoost. With the exponential loss, the AdaBoost algorithm can be derived conveniently from the additive model. However, like the squared loss, it is sensitive to outliers and not robust enough.



5. Modified Huber loss

\[L(y,f(x)) = \left\{\begin{matrix} max(0,1-yf(x))^2 & \qquad if \;\; yf(x)\geq -1 \\ -4yf(x) & \qquad if\;\; yf(x)<-1\end{matrix}\right.\]


Modified Huber loss combines the advantages of hinge loss and logistic loss: like hinge loss it produces sparse solutions when \(yf(x) > 1\), which improves training efficiency, and like logistic loss it allows probability estimation. In addition, its penalty for samples with \(yf(x) < -1\) grows only linearly, which means less interference from outliers and makes it relatively robust. SGDClassifier in scikit-learn also implements the modified Huber loss.
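For example, a minimal scikit-learn sketch (the toy dataset is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Linear model trained with the modified Huber loss; this loss also
# supports probability estimates via predict_proba.
clf = SGDClassifier(loss="modified_huber", random_state=0)
clf.fit(X, y)

print(clf.predict(X[:5]))
print(clf.predict_proba(X[:5]))
```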



Finally, a plot of all of these loss functions together:

It can be seen from the figure above that the loss functions introduced here can all be regarded as monotone, continuous approximations of the 0-1 loss, and because they are usually convex continuous functions, they are often used in place of the 0-1 loss for optimization. What they have in common is that the penalty grows as \(yf(x) \rightarrow -\infty\); where they differ is that logistic loss and hinge loss grow linearly, while exponential loss grows exponentially.

It is worth noting that in the figure above the modified Huber loss looks similar to the exponential loss, and its robustness is not apparent. In fact, this is much like comparing the time complexity of algorithms: the huge difference only shows up once the magnitude becomes large:
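A quick numerical comparison makes this concrete (assuming NumPy; the margin values are made up): at moderately negative margins the two losses look similar, but as the margin becomes very negative the exponential loss explodes while the modified Huber loss grows only linearly.

```python
import numpy as np

def exponential_loss(margin):
    """Exponential loss e^{-y*f(x)} as a function of the margin."""
    return np.exp(-margin)

def modified_huber_loss(margin):
    """Modified Huber loss: squared hinge for margin >= -1, linear below."""
    return np.where(margin >= -1,
                    np.maximum(0.0, 1.0 - margin) ** 2,
                    -4.0 * margin)

for m in (-1.0, -5.0, -10.0):
    print(m, exponential_loss(m), modified_huber_loss(m))
# at margin -10: e^{10} is roughly 22026, versus 40 for the modified Huber loss
```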




