The overfitting problem and regularization methods

      After reading a lot of material, I originally wanted to present the formal definition of regularization here, but I really don't dare to, because I'm afraid it would scare a bunch of people away. So let's set it aside for now.

      First of all, we know that regularization is used to solve the problem of overfitting. Simply put, overfitting (also called high variance) means that the model performs well on the training samples but relatively poorly on the test set. Put more formally, the generalization ability of the model is too poor.

      Generalization ability: the ability of a hypothesis (model) to apply well to new samples.

       To solve overfitting, we can:

     (1) Discard some features that do not help us predict correctly. We can select which features to keep manually, or use an algorithm such as PCA (a minimal PCA sketch follows this list).

     (2) Apply regularization: keep all the features, but reduce the magnitude of the parameters.
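
As a rough illustration of option (1), here is a minimal scikit-learn sketch that reduces the feature dimension with PCA. The random data, the 10 original features, and the choice of two components are assumptions for demonstration only.

```python
# Minimal sketch: shrink the feature space with PCA before fitting a model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 samples, 10 original features (toy data)

pca = PCA(n_components=2)           # keep only 2 principal components
X_reduced = pca.fit_transform(X)    # shape: (100, 2)
print(X_reduced.shape, pca.explained_variance_ratio_)
```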


      There are many regularization methods; common ones include data augmentation, L1 regularization, L2 regularization, early stopping, Dropout, etc.

      Regularized cost function = empirical cost function + regularization parameter × regularization term

      Here, the empirical cost function is what we usually call the loss function; minimizing it reduces the error so that the model fits the training set better.
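
As a small sketch of this decomposition, the function below computes "empirical cost + λ × regularization term" for a linear model. The use of mean squared error as the empirical cost and the names w, lam, and regularized_cost are assumptions for illustration.

```python
# Minimal sketch: regularized cost = empirical cost + lambda * penalty.
import numpy as np

def regularized_cost(w, X, y, lam, penalty="l2"):
    residual = X @ w - y
    empirical = np.mean(residual ** 2)     # empirical cost: mean squared error
    if penalty == "l2":
        reg = np.sum(w ** 2)               # L2 term: sum of squared weights
    else:
        reg = np.sum(np.abs(w))            # L1 term: sum of absolute weights
    return empirical + lam * reg           # lam is the regularization parameter
```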


      The concept of norm: for a vector w, the L0 norm counts the number of nonzero entries, the L1 norm is the sum of the absolute values of the entries, and the L2 norm is the square root of the sum of the squared entries (the Euclidean length).

 From a probabilistic point of view, many norm constraints are equivalent to placing a prior distribution on the parameters. Specifically, the L2 norm corresponds to assuming that the parameters follow a Gaussian prior distribution, while the L1 norm corresponds to a Laplace prior. From the Bayesian perspective, regularization adds prior knowledge to the estimation of the model parameters, and this prior knowledge guides the iterative minimization of the loss function toward the constrained region.
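
A one-line sketch of why a prior becomes a penalty term, under the usual MAP (maximum a posteriori) assumptions; the symbols w (parameters), D (data), σ and b (prior scales) are notation introduced here for illustration.

```latex
% MAP estimation turns a prior on w into a penalty term:
\hat{w}_{\mathrm{MAP}} = \arg\max_{w} p(w \mid D)
                       = \arg\min_{w} \bigl[ -\log p(D \mid w) - \log p(w) \bigr]
% Gaussian prior: -\log p(w) = \tfrac{1}{2\sigma^{2}} \lVert w \rVert_2^2 + \text{const} \;\Rightarrow\; \text{L2 penalty}
% Laplace prior:  -\log p(w) = \tfrac{1}{b} \lVert w \rVert_1 + \text{const} \;\Rightarrow\; \text{L1 penalty}
```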

 According to the relevant literature:

 Both the L0 norm and the L1 norm can be used to induce sparsity.

 The L0 problem is an NP-hard combinatorial problem and cannot be solved directly for larger-scale data;

 There are two algorithms for directly solving the L0 problem:

(1) Greedy algorithm

(2) Threshold algorithm

 Problems:

(1) The greedy algorithm has a high time cost and cannot guarantee convergence to the global optimum.

(2) The threshold algorithm has a low time cost, but it is very sensitive to noise in the data; its solution is not continuous, and the global optimum cannot be guaranteed either. (A minimal hard-thresholding sketch follows this list.)
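
Here is a minimal sketch of the thresholding idea: keep only the coefficients whose magnitude exceeds a threshold and zero out the rest. The threshold value, the example vector, and the function name hard_threshold are assumptions for illustration.

```python
# Minimal sketch: hard thresholding as a crude route to sparsity.
import numpy as np

def hard_threshold(w, t):
    """Zero out every entry whose absolute value is at most t."""
    w = np.asarray(w, dtype=float).copy()
    w[np.abs(w) <= t] = 0.0
    return w

# The two small entries are removed; the large ones are kept unchanged.
print(hard_threshold([0.05, -1.2, 0.3, -0.02, 2.0], t=0.1))
```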

  L0 application scenarios: compressed sensing, sparse coding

  From L0 to L1: the combinatorial optimization problem is relaxed into a convex optimization problem that can actually be solved; the L1 norm is the optimal convex approximation of the L0 norm.

  The solid ellipses represent the contours of the unregularized objective, and the dashed contours represent those of the L1 regularization term.

  There is a lot of mathematical reasoning involved. In short, L1 regularization can produce a sparse weight vector (the weights of useless features are driven to 0), which is conducive to feature selection.
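
To make the sparsity claim concrete, here is a minimal scikit-learn sketch using Lasso (L1-regularized linear regression); the synthetic data, the two truly useful features, and the alpha value are assumptions for illustration.

```python
# Minimal sketch: L1 regularization drives most coefficients exactly to 0.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 useful features
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # most entries end up exactly 0 -> automatic feature selection
```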

  Further reading: https://wenku.baidu.com/view/00613bc4f78a6529657d536c.html?from=search

  


  L2 regularization (Tikhonov regularization), also known as weight decay

  The goal is to push the weights closer to the origin by adding a regularization term to the objective function.

  The solid ellipses represent the contours of the unregularized objective, and the dashed circles represent the contours of the L2 regularization term.
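
Here is a minimal sketch of why L2 regularization is called weight decay: in gradient descent the extra term shrinks the weights a little on every step. The toy data, learning rate, and lambda value are assumptions for illustration.

```python
# Minimal sketch: gradient descent on an L2-regularized least-squares objective.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
lr, lam = 0.1, 0.5
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the empirical (MSE) cost
    w -= lr * (grad + 2 * lam * w)          # "+ 2*lam*w" decays the weights toward 0
print(w)                                    # estimates are shrunk toward the origin
```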

  


  In ridge regression, the main problem we are solving is that the number of features is greater than the number of samples, which leads to a singular matrix.

  Singular matrix: if there is exact linear dependence among the columns of X, that is, one or more of its columns are linear combinations of other columns, this is called collinearity (or multicollinearity). Collinearity in X inevitably makes XᵀX singular, i.e., its determinant is 0, so its inverse cannot be computed.

  In regression analysis there are also situations similar to, but not quite, a singular matrix, namely when the determinant is approximately 0. Such matrices are usually called ill-conditioned (or nearly singular) matrices. The L2 norm helps with solving these ill-conditioned problems.
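
A minimal sketch of the point above: with more features than samples, XᵀX is singular, and adding λI (the ridge/L2 term) makes it invertible. The matrix shapes and the lambda value are assumptions for illustration.

```python
# Minimal sketch: ridge regression fixes a singular X^T X by adding lam * I.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))      # 20 samples, 50 features -> X^T X (50x50) is singular
y = rng.normal(size=20)
lam = 1.0

# Ordinary least squares needs the inverse of X^T X, which does not exist here.
# The ridge solution adds lam * I, which makes the matrix positive definite:
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge.shape)               # (50,) - a unique, well-defined solution
```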

  

Original post: blog.csdn.net/qq_28409193/article/details/81018093