[Machine Learning] Loss function and optimization process

In the framework of statistical learning, we usually try to find a function or concept that explains or fits the given data well. This is typically achieved by minimizing some kind of risk or loss. To find a good model, we search a predefined hypothesis space $H$ for a function with the smallest empirical risk on the training data. However, our real goal is to find a function whose expected risk (over the whole data distribution) is minimized. This often requires balancing the complexity of the model against its performance on the data.

Suppose there is an optimal function or concept $c$ that best fits the data, i.e. $c = \argmin\limits_{h} R(h)$, where $R(h)$ is the risk of the function $h$.

In practice, we cannot search over all possible functions, so we restrict the search to a predefined hypothesis class $H$. Given a training sample $S = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$, we can compute the empirical risk $R_S(h)$ of a function $h$:

$$R_S(h) = \frac{1}{n} \sum\limits_{i=1}^n \ell(X_i,Y_i,h)$$

The expected risk is the average loss over the true data distribution, defined as:

$$R(h) = \mathbb{E}[R_S(h)] = \mathbb{E}[\ell(X,Y,h)]$$

The best hypothesis (target concept) in the entire function space is:

$$c = \argmin\limits_{h} R(h)$$

The optimal hypothesis in the predefined hypothesis class $H$ is:

$$h^* = \argmin\limits_{h \in H} R(h)$$

Estimating $h$ from the training data via the empirical risk, we get:

$$h_S = \argmin\limits_{h \in H} R_S(h) = \argmin\limits_{h \in H} \frac{1}{n} \sum\limits_{i=1}^n \ell(X_i,Y_i,h)$$
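
As a concrete illustration, here is a minimal sketch of empirical risk minimization over a toy hypothesis class of one-dimensional threshold classifiers; the data generator, the threshold grid, and the noise level are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: the label is +1 when X > 0.3, with 10% label noise.
X = rng.uniform(-1, 1, size=200)
Y = np.where(X > 0.3, 1, -1)
Y[rng.random(200) < 0.1] *= -1

# Hypothesis class H: threshold classifiers h_t(x) = sign(x - t).
thresholds = np.linspace(-1, 1, 201)

# ERM: pick the hypothesis with the smallest empirical risk R_S(h)
# under the 0-1 loss.
risks = [np.mean(Y != np.sign(X - t)) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]
print(f"h_S: threshold = {t_hat:.2f}, empirical risk = {min(risks):.3f}")
```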

Therefore, choosing an appropriate hypothesis space $H$ and loss function $\ell$ is crucial.

The hypothesis class $H$ is a set of functions, each attempting to map input features to output labels: $H = \{ h_1, h_2, \dots \}$. Usually, $H$ is defined by a specific algorithm or model structure, such as linear regression or decision trees.

First, the 0-1 loss function is the most direct measure of classification error. For a given classifier $h$, it simply counts the misclassified data points. Mathematically, we seek $\argmin\limits_{h} \mathbb{E}[1_{Y \neq \text{sign}(h(X))}]$. But the problems we usually encounter are:

  1. The true data distribution $P(X,Y)$ is unknown, so we cannot directly compute the above expectation.
  2. The 0-1 loss is hard to optimize because it is discontinuous and non-convex.

The law of large numbers describes the relationship between the sample mean of a random variable and its population mean: it guarantees that as the sample size approaches infinity, the sample mean approaches the population mean. More formally, consider a random variable $X$ with expected value $\mathbb{E}[X]$. For $n$ independent and identically distributed samples $X_1, X_2, \dots, X_n$ of $X$, the sample mean is defined as $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. As $n \rightarrow \infty$, $\bar{X}_n \rightarrow \mathbb{E}[X]$.

Through the law of large numbers, we can use samples to estimate distribution-dependent quantities such as the expected loss. Suppose our goal is to estimate the expected loss $\mathbb{E}[1_{Y \neq \text{sign}(h(X))}]$ incurred by a hypothesis $h$. We can use samples $\mathcal{D}$ drawn from the true distribution to estimate this expectation:

$$\frac{1}{n} \sum_{i=1}^{n} 1_{Y_i \neq \text{sign}(h(X_i))}$$

As the number of samples $n$ increases, this estimate approaches the true expected loss, as the following sketch illustrates.
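
Here is a minimal Monte Carlo sketch; the classifier, the synthetic distribution, and the 20% label-flip probability are assumptions made so that the true expected 0-1 loss is known exactly (it equals the flip probability, 0.2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed classifier h(x) = sign(x); data: X ~ N(0, 1), Y = sign(X)
# flipped with probability 0.2, so the true expected 0-1 loss is 0.2.
def sample(n):
    X = rng.standard_normal(n)
    Y = np.sign(X)
    Y[rng.random(n) < 0.2] *= -1
    return X, Y

for n in [10, 100, 10_000, 1_000_000]:
    X, Y = sample(n)
    print(f"n = {n:>9}: empirical 0-1 loss = {np.mean(Y != np.sign(X)):.4f}")
```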

To make the problem solvable in practice, we use so-called surrogate loss functions, which are more tractable to optimize but still aim to approximate the 0-1 loss.

  • Hinge loss: the loss function used in support vector machines.
    $\ell(X,Y,h) = \max \{ 0, 1 - Yh(X) \}$

  • Logistic loss: This is used in logistic regression. It is more robust to outliers and provides good estimates of probabilities.

  • Least squares loss: mainly used in regression problems.

  • Exponential loss: the loss function used in the AdaBoost algorithm. (All four surrogates are compared with the 0-1 loss in the sketch below.)
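
All four surrogates can be written as functions of the margin $z = Yh(X)$; the following is a minimal sketch, with the least squares loss expressed in its margin form $(1-z)^2$ so that the losses are comparable.

```python
import numpy as np

# Surrogate losses as functions of the margin z = Y * h(X).
def zero_one(z):    return (z <= 0).astype(float)
def hinge(z):       return np.maximum(0.0, 1.0 - z)   # SVM
def logistic(z):    return np.log1p(np.exp(-z))       # logistic regression
def squared(z):     return (1.0 - z) ** 2             # least squares, margin form
def exponential(z): return np.exp(-z)                 # AdaBoost

z = np.linspace(-2, 2, 9)
for name, f in [("0-1", zero_one), ("hinge", hinge), ("logistic", logistic),
                ("squared", squared), ("exponential", exponential)]:
    print(f"{name:>11}: {np.round(f(z), 2)}")
```

Note how every surrogate is continuous and penalizes negative margins, in contrast to the discontinuous 0-1 step; this is exactly what makes them optimizable.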

Most popular surrogate loss functions are designed to mimic the behavior of the 0-1 loss in the large-sample limit. These are called classification-calibrated surrogate losses: if the training data were infinite, a classifier trained with such a loss would agree with the true best classifier under the 0-1 loss.

Given a surrogate loss $\ell$, let $\phi$ be the corresponding function such that $\phi(Yh(X)) = \ell(X, Y, h)$. Here $Y \in \{-1, +1\}$ is the label and $h(X)$ is the classifier's predicted score for input $X$. To check whether $\ell$ is classification calibrated, we usually check the following conditions:

  1. $\phi$ is convex.
  2. $\phi$ is differentiable at $0$, and $\phi'(0) < 0$.

Satisfying the above conditions means that, in most cases, when a classifier $h$ minimizes the surrogate loss on a given data point, it also minimizes the 0-1 loss.

For example, consider the hinge loss $\ell_{\text{hinge}}(X,Y,h) = \max \{ 0, 1-Yh(X) \}$.

The corresponding $\phi$ function is $\phi(z) = \max \{ 0, 1-z \}$.

This function is not differentiable at $z = 1$, but it is differentiable at $z = 0$, where its derivative is $-1 < 0$; hence the hinge loss is classification calibrated.
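
Both conditions can be spot-checked numerically: estimate $\phi'(0)$ with a central finite difference and test convexity via the midpoint inequality on a grid. A rough sketch (the grid and tolerances are arbitrary choices):

```python
import numpy as np

def hinge_phi(z):    return np.maximum(0.0, 1.0 - z)
def logistic_phi(z): return np.log1p(np.exp(-z))

def check_calibration(phi, name, eps=1e-6):
    # phi'(0) via central difference; calibration needs phi'(0) < 0.
    dphi0 = (phi(eps) - phi(-eps)) / (2 * eps)
    # Convexity spot-check: phi(midpoint) <= average of endpoint values.
    z = np.linspace(-3, 3, 101)
    convex = np.all(phi((z[:-1] + z[1:]) / 2)
                    <= (phi(z[:-1]) + phi(z[1:])) / 2 + 1e-12)
    print(f"{name:>8}: phi'(0) ~ {dphi0:.3f}, convex on grid: {convex}")

check_calibration(hinge_phi, "hinge")       # phi'(0) ~ -1.0
check_calibration(logistic_phi, "logistic") # phi'(0) ~ -0.5
```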

Now consider the following two classifier definitions:

  • $h_s$ is the optimal classifier based on limited training data and a surrogate loss function.
  • $h_c$ is the optimal classifier based on the entire data distribution and the 0-1 loss function.

Using a surrogate loss function and the training data, we can find $h_s$:

$$h_s = \argmin\limits_{h} \frac{1}{n} \sum\limits_{i=1}^n \ell(X_i,Y_i,h)$$

At the same time, if we knew the entire data distribution, we could find $h_c$:

$$h_c = \argmin\limits_{h} \mathbb{E}[1_{Y \neq \text{sign}(h(X))}]$$

As the training data grows without bound, the classifier $h_s$ obtained with the surrogate loss function gets closer and closer to the classifier $h_c$ obtained with the 0-1 loss function. This can be expressed by the following formula:

$$\mathbb{E}[1_{Y \neq \text{sign}(h_s(X))}] \overset{n \rightarrow \infty}{\longrightarrow} \mathbb{E}[1_{Y \neq \text{sign}(h_c(X))}]$$

This means that when we optimize a surrogate loss on a finite sample, we are actually optimizing the empirical loss on that dataset. The law of large numbers guarantees that as the number of samples increases, this empirical loss approaches the true expected loss. Moreover, if the surrogate loss is classification calibrated, then optimizing it implicitly optimizes the 0-1 loss: as the size of the training data approaches infinity, the expected 0-1 loss of the classifier obtained by minimizing the surrogate loss approaches the optimal 0-1 loss.

When the surrogate loss function is convex and smooth, we can use a range of optimization algorithms, such as gradient descent or Newton's method, to solve the following problem:

$$h = \argmin\limits_{h \in H} \frac{1}{n} \sum\limits_{i=1}^n \ell(X_i,Y_i,h)$$

Assuming the function $f(h)$ is differentiable at the point $h_k$, we can use a Taylor series to approximate the function near $h_k$:

$$f(h + \Delta h) \approx f(h) + \nabla f(h)^T \Delta h + \frac{1}{2} \Delta h^T \nabla^2 f(h) \Delta h + \dots$$

If we keep only the first two terms of the above expansion (the linear approximation), we get:

$$f(h + \Delta h) \approx f(h) + \nabla f(h)^T \Delta h$$

Now consider the gradient descent update step $h_{k+1} = h_k + \eta d_k$, where $d_k$ is a descent direction and $\Delta h = \eta d_k$. We want to find a $\Delta h$ that makes $f(h + \Delta h)$ as small as possible. From the linear approximation above, to make the increment $\nabla f(h)^T \Delta h$ as small as possible, $\Delta h$ should be aligned opposite to the gradient $\nabla f(h)$. Therefore, the negative gradient direction $-\nabla f(h)$ is usually chosen as the descent direction.
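
Below is a minimal gradient-descent sketch for the logistic surrogate loss on synthetic linear data; the data generator, the fixed step size $\eta = 0.5$, and the iteration count are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data with a roughly linear decision boundary.
n, d = 500, 2
X = rng.standard_normal((n, d))
Y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.standard_normal(n))

def loss_and_grad(w):
    z = Y * (X @ w)                        # margins
    loss = np.mean(np.logaddexp(0.0, -z))  # stable log(1 + e^{-z})
    s = np.exp(-np.logaddexp(0.0, z))      # 1 / (1 + e^{z})
    return loss, -(X.T @ (Y * s)) / n      # gradient of the mean loss

w, eta = np.zeros(d), 0.5
for k in range(200):
    loss, g = loss_and_grad(w)
    w -= eta * g                           # step along the negative gradient
print(f"final loss = {loss:.4f}, w = {np.round(w, 2)}")
```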

However, if we want to use curvature information of the function at a point, or to speed up the optimization, we can introduce a matrix $D^k$, often called a preconditioning matrix or scaling matrix.

Consider the general update rule $h_{k+1} = h_k - \eta D^k \nabla f(h_k)$, where $D^k$ is a symmetric positive definite matrix. Positive definiteness ensures that the update direction is a descent direction, i.e. $\nabla f(h_k)^T D^k \nabla f(h_k) > 0$.

  • Steepest Descent

    $D^k = I$

    Here the matrix $D^k$ is the identity, which means we simply move in the negative gradient direction.

  • Newton's Method

    $D^k = [\nabla^2 f(h_k)]^{-1}$

    Here the matrix $D^k$ is the inverse of the Hessian, which encodes the curvature of the function. Using the inverse Hessian can help the algorithm converge faster, but it may be computationally expensive, especially in high dimensions. (Both choices are contrasted in the sketch below.)
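
Here is a sketch contrasting the two choices of $D^k$ on a simple quadratic $f(h) = \frac{1}{2} h^T A h - b^T h$ (so the Hessian is $A$); the matrix, vector, step size, and iteration count are invented for the example. For a quadratic, Newton's method is exact in one step.

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])  # ill-conditioned Hessian
b = np.array([1.0, 1.0])
grad = lambda h: A @ h - b
h_opt = np.linalg.solve(A, b)

# Steepest descent: D^k = I.
h = np.zeros(2)
for _ in range(50):
    h -= 0.1 * grad(h)
print("steepest descent error:", np.linalg.norm(h - h_opt))

# Newton: D^k = [Hessian]^{-1}; one exact step for a quadratic.
h = np.zeros(2)
h -= np.linalg.solve(A, grad(h))
print("Newton error:          ", np.linalg.norm(h - h_opt))
```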

Suppose we have chosen the descent direction $d_k = -\nabla f(h_k)$. Exact line search specifies: $\eta = \argmin\limits_{\eta} f(h_k - \eta \nabla f(h_k))$

Simply put, it searches for the step size $\eta$ such that, starting from the current position $h_k$ and moving $\eta$ along the descent direction, the function $f$ is minimized. Although this method finds the optimal step size, in practice it can be expensive, because each iteration requires solving a new optimization problem.
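
One practical way to do this is to solve the one-dimensional subproblem numerically at each step; a sketch using `scipy.optimize.minimize_scalar` on the quadratic from the previous example:

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
f = lambda h: 0.5 * h @ A @ h - b @ h
grad = lambda h: A @ h - b

h = np.zeros(2)
for k in range(10):
    d = -grad(h)                                  # descent direction
    # Exact line search: minimize eta -> f(h + eta * d) in one dimension.
    eta = minimize_scalar(lambda t: f(h + t * d)).x
    h = h + eta * d
print("after 10 steps:", np.round(h, 4), "optimum:", np.linalg.solve(A, b))
```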

L-Lipschitz continuity describes a function whose rate of change is bounded: between any two points in the domain, the function's rate of change is constrained by a fixed constant.

Given a function $f$, if there exists a positive constant $L$ such that for any two points $x_1$ and $x_2$ in the domain,

$$|f(x_1) - f(x_2)| \leq L \|x_1 - x_2\|,$$

then we call the function $f$ L-Lipschitz continuous.

If the gradient is known to be L-Lipschitz continuous, then the learning rate can be chosen as $\frac{1}{L}$, giving the update:

$$h_{k+1} = h_k - \frac{1}{L} \nabla f(h_k)$$
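
For least squares, $L$ can be computed in closed form: the gradient of $f(w) = \frac{1}{2n}\|Xw - y\|^2$ has Lipschitz constant $L = \lambda_{\max}(X^T X)/n$, the largest eigenvalue of the Hessian. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Gradient of f(w) = ||Xw - y||^2 / (2n) is Lipschitz with
# L = largest eigenvalue of the Hessian X^T X / n.
L = np.linalg.eigvalsh(X.T @ X).max() / n

w = np.zeros(d)
for k in range(500):
    w -= (1.0 / L) * (X.T @ (X @ w - y) / n)   # step size 1/L
print("distance to least-squares solution:",
      np.linalg.norm(w - np.linalg.lstsq(X, y, rcond=None)[0]))
```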

  • Gradient descent (convex case):

    • Assumptions:
      • Lipschitz gradient: the change in the gradient (or derivative) is bounded above.
      • Convex function: the function is convex over its domain.
    • Convergence rate:
      $O\left(\frac{1}{k}\right)$
      This means that the error of the algorithm decreases inversely with the number of iterations $k$.
  • Gradient descent (strongly convex case):

    • Assumptions:
      • Lipschitz gradient
      • Strong convexity: stricter than ordinary convexity; the function is bounded below by a quadratic with curvature $\mu$.
    • Convergence rate:
      $O\left(\left(1-\frac{\mu}{L}\right)^k\right)$
      where $\mu$ is the strong convexity parameter and $L$ is the Lipschitz constant. This linear rate is faster than in the merely convex case (see the sketch after this list).
  • Newton's method:

    • Assumptions:
      • Lipschitz gradient
      • Strong convexity
    • Convergence rate:
      the error after $k$ steps shrinks by a factor of roughly $\prod\limits_{i=1}^{k}\rho_i$, where $\rho_k \rightarrow 0$. This means that Newton's method converges very quickly, especially when approaching the optimal solution.
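
The linear rate in the strongly convex case can be observed directly; this sketch runs gradient descent with step $1/L$ on a quadratic whose Hessian eigenvalues give $\mu = 1$ and $L = 10$, and compares the iterate error with the theoretical factor $(1 - \mu/L)^k$:

```python
import numpy as np

A = np.diag([1.0, 4.0, 10.0])      # Hessian eigenvalues; mu = 1, L = 10
mu, L = 1.0, 10.0
h = np.array([1.0, 1.0, 1.0])      # minimizer is the origin
h0_norm = np.linalg.norm(h)
for k in range(1, 31):
    h -= (1.0 / L) * (A @ h)       # gradient step with eta = 1/L
    if k % 10 == 0:
        print(f"k={k:2d}: error = {np.linalg.norm(h):.2e}, "
              f"bound = {(1 - mu / L) ** k * h0_norm:.2e}")
```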

Newton's method and its variants:

Newton's method is an iterative optimization algorithm for finding zeros of a function over the real or complex numbers. In machine learning and optimization, it is often used to find the minimum of the loss function. The core idea is to use the second-order Taylor expansion of the function to iteratively approximate its zero or its minimum.

The advantage of Newton's method is that if the function is quadratic, the minimum can be found in one iteration.

  • In practice, computing the Hessian matrix and its inverse can be difficult, especially when the dimensionality is high. Therefore, several practical variants of Newton's method have been proposed.

  • Modify the Hessian to ensure that it is positive definite:
    To ensure that the Hessian matrix is invertible and positive definite, we can modify it slightly, for example by adding a regularization term to it.

  • Calculate the Hessian every $m$ iterations:
    Since computing the Hessian can be expensive, one strategy is to recompute it only every $m$ iterations and reuse the most recently computed Hessian in the intervening iterations.

  • Using only the diagonal of the Hessian:
    By considering only the diagonal elements of the Hessian matrix, the computational complexity can be greatly reduced. This method is called the diagonal Hessian Newton method.

  • Quasi-Newton methods:
    Quasi-Newton methods approximate the Hessian matrix without computing it directly. Among them, BFGS (the Broyden–Fletcher–Goldfarb–Shanno algorithm) and L-BFGS (limited-memory BFGS) are the best known. L-BFGS is a memory-efficient version of BFGS that stores only the most recent updates of the Hessian approximation (see the sketch below).
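
In practice quasi-Newton methods are available off the shelf; here is a sketch using SciPy's L-BFGS implementation to minimize the logistic loss from the earlier gradient-descent example (same synthetic data assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 2))
Y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.standard_normal(500))

def f(w):
    z = Y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z))        # logistic loss

def grad(w):
    z = Y * (X @ w)
    s = np.exp(-np.logaddexp(0.0, z))            # 1 / (1 + e^z)
    return -(X.T @ (Y * s)) / len(Y)

res = minimize(f, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print("converged:", res.success, "| w =", np.round(res.x, 2),
      "| loss =", round(res.fun, 4))
```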

In machine learning, we are often concerned with two main sources of error: approximation error and estimation error. The approximation error is caused by the restriction to the predefined hypothesis space $H$; it is the difference between the optimal hypothesis $h^*$ and the target concept $c$. The estimation error is caused by the finite sample size; it is the difference between the learned hypothesis $h_S$ and the optimal hypothesis $h^*$.
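
Written out, the excess risk of the learned hypothesis decomposes into exactly these two terms:

$$R(h_S) - R(c) = \underbrace{R(h_S) - R(h^*)}_{\text{estimation error}} + \underbrace{R(h^*) - R(c)}_{\text{approximation error}}$$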

If the target concept $c$ lies in the predefined hypothesis space $H$, the approximation error is zero. However, while choosing a large and complex hypothesis space can reduce the approximation error, it increases the estimation error, because the model may overfit the data.

In practical applications, we want to understand how well knowledge learned from finite training samples transfers to new, unseen data. The PAC learning framework gives an upper bound on the number of training samples required to learn an approximately optimal hypothesis from a predefined hypothesis class. This framework provides a theoretical foundation for the generalization performance of learning algorithms: it relates the size of the training data set, the complexity of the hypothesis class, and the generalization error of the learned model.

A hypothesis class $H$ is PAC learnable if there is an algorithm that can select a hypothesis from the class whose error on new data is within $\epsilon$ of the error of the best hypothesis in the class, and this happens with probability at least $1-\delta$.

Formally, $H$ is PAC learnable if there exist a learning algorithm $\mathcal{A}$ and a polynomial function $poly(\cdot,\cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, and for every distribution $D$ over $X \times Y$, whenever the sample size $n \geq poly(\frac{1}{\delta}, \frac{1}{\epsilon})$, the hypothesis $h_S$ learned by algorithm $\mathcal{A}$ satisfies:

$$P\left\{ R(h_S) - \min\limits_{h \in H} R(h) \leq \epsilon \right\} \geq 1 - \delta$$

Here $\epsilon$ is the largest error we are willing to accept, $1-\delta$ is the desired confidence, $R(h_S)$ is the expected risk of the learned hypothesis, and $\min\limits_{h \in H} R(h)$ is the smallest expected risk achievable in $H$.

If a hypothesis class $H$ is too complex, we may need a number of training samples exponential in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$ to ensure that the above inequality holds. In that case, the hypothesis class $H$ is not PAC learnable.
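
For the special case of a finite hypothesis class, a concrete polynomial bound follows from Hoeffding's inequality plus a union bound (assuming a loss in $[0,1]$): $n \geq \frac{1}{2\epsilon^2}\ln\frac{2|H|}{\delta}$ samples suffice for every $h \in H$ to have $|R_S(h) - R(h)| \leq \epsilon$ with probability at least $1-\delta$. A sketch:

```python
import numpy as np

def pac_sample_size(H_size, eps, delta):
    """Smallest n such that, w.p. >= 1 - delta, every h in a finite class H
    satisfies |R_S(h) - R(h)| <= eps (Hoeffding + union bound, loss in [0, 1])."""
    return int(np.ceil(np.log(2 * H_size / delta) / (2 * eps ** 2)))

for eps, delta in [(0.1, 0.05), (0.05, 0.05), (0.01, 0.01)]:
    n = pac_sample_size(1000, eps, delta)
    print(f"|H| = 1000, eps = {eps}, delta = {delta}: n >= {n}")
```

Note how $n$ grows polynomially in $\frac{1}{\epsilon}$ but only logarithmically in $|H|$ and $\frac{1}{\delta}$, matching the PAC definition above.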

We use the empirical risk minimization (ERM) algorithm to verify whether a hypothesis class is PAC learnable.

Empirical risk minimization (ERM) is a basic strategy in machine learning whose goal is to find a hypothesis $h \in H$ that minimizes the risk (or loss) on the training data. Formally, given a training set $S$, the ERM strategy chooses the hypothesis $h$ that minimizes the empirical risk $R_S(h)$:

$$h_S = \arg\min\limits_{h \in H} R_S(h)$$

where $R_S(h)$ is the empirical risk, i.e. the average loss on the training data.

If the training data is large enough, the performance of the ERM-selected hypothesis $h_S$ on the training data (its empirical risk) will be close to its performance on the entire distribution (its true risk). However, even when the empirical risk is low, the true risk can still be high. This phenomenon is called overfitting.

The more complex the hypothesis space, the larger the gap between the selected hypothesis $h_S$ and the best hypothesis $h^*$ may be. The PAC framework provides a way to measure this gap and to determine how much training data is needed to ensure that the difference between empirical risk and true risk stays below a given threshold.

Therefore, to verify whether a hypothesis class is PAC learnable, we need to ensure that:

  • For a sufficiently large training data set, the difference between the true risk and the empirical risk of the hypothesis produced by ERM on that class is less than a given threshold $\epsilon$.
  • To establish this, we need to estimate or bound the difference between the true risk and the empirical risk and ensure that it is less than $\epsilon$. This is usually done through complexity regularization and/or tools such as the VC dimension and Rademacher complexity.

The VC dimension (Vapnik-Chervonenkis dimension) is a tool for measuring the complexity of a hypothesis space (or model class). It gives us a theoretical handle on how many samples a given hypothesis class needs in order to be learned with good generalization performance.

Given a hypothesis class $H$, its VC dimension is the size of the largest set of points that $H$ can shatter. In other words, if there exists a data set of size $d$ such that for every possible labeling of those $d$ points there is a hypothesis $h \in H$ that separates them perfectly, but no set of size $d+1$ can be shattered in this way, then the VC dimension of $H$ is $d$.

For a more visual picture, consider a linear classifier in the two-dimensional plane. A linear classifier can realize every possible labeling of any three points that do not lie on a single line. However, for four points, we cannot guarantee a linear classifier that perfectly realizes all $2^4 = 16$ possible labelings. Therefore, the VC dimension of linear classifiers in the plane is 3.
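
This can be verified by brute force: run the perceptron (which converges exactly when the labeling is linearly separable) over every labeling of a point set. A sketch, with the point coordinates and iteration cap chosen arbitrarily:

```python
import numpy as np
from itertools import product

def perceptron_separable(X, y, max_iter=10_000):
    """Check linear separability (with a bias term) via the perceptron."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        errs = [(x, t) for x, t in zip(Xb, y) if t * (w @ x) <= 0]
        if not errs:
            return True
        x, t = errs[0]
        w += t * x                              # perceptron update
    return False

# Three non-collinear points: all 2^3 = 8 labelings are separable (shattered).
P3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(perceptron_separable(P3, np.array(lab))
          for lab in product([-1, 1], repeat=3)))               # True

# Four points in XOR position: the alternating labeling is not separable.
P4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(perceptron_separable(P4, np.array([1, 1, -1, -1])))       # False
```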

A hypothesis class with a high VC dimension is more complex than one with a low VC dimension. This means that hypothesis classes with high VC dimension carry a higher risk of overfitting.

The VC inequality relates the true risk (the risk over the entire distribution) to the empirical risk (the risk on the training data). Given the VC dimension of a hypothesis class and the size of the training data set, it bounds the probability that the difference between the true and empirical risks exceeds a given value:

$$\Pr\left( \sup\limits_{h \in H} \left| R(h) - R_S(h) \right| > \epsilon \right) \leq 4\, m_H(2n) \exp\left(-\frac{n\epsilon^2}{8}\right)$$

where $m_H(n)$ is the growth function of the hypothesis space $H$ (the maximum number of distinct labelings $H$ can realize on $n$ points), $R(h)$ is the true risk, and $R_S(h)$ is the empirical risk.

From the above inequality, we can estimate how many training samples are needed to achieve a given generalization error, which is useful for determining whether a given hypothesis class is PAC learnable.
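
Plugging in the Sauer's lemma bound $m_H(n) \leq (en/d)^d$ (valid for $n \geq d$) for a class of VC dimension $d$, one can numerically find the sample size at which the right-hand side drops below a target $\delta$; a rough sketch:

```python
import numpy as np

def vc_bound(n, d, eps):
    """RHS of the VC inequality with the Sauer bound m_H(2n) <= (2en/d)^d."""
    return 4 * (np.e * 2 * n / d) ** d * np.exp(-n * eps ** 2 / 8)

d, eps, delta = 3, 0.1, 0.05
n = d
while vc_bound(n, d, eps) > delta:
    n *= 2        # doubling search for the order of magnitude
print(f"VC dim {d}, eps = {eps}: bound <= {delta} once n ~ {n}")
```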

In PAC learning, to guarantee that the learned hypothesis generalizes well, we need enough samples to control the difference between the true risk and the empirical risk. The VC dimension provides a way to estimate the required sample size, allowing us to verify whether the hypothesis class is PAC learnable.

Origin blog.csdn.net/weixin_45427144/article/details/132410764