Machine Learning, Overfitting and Underfitting, Regularization and Cross-Validation

Table of contents

Machine learning

Overfitting and Underfitting

Regularization and cross-validation

Regularization

Cross-validation


Machine learning

The goal is to learn a model that has good predictive ability not only on known data but also on unknown data.

Different machine learning methods yield different models. Once a loss function is chosen, the model's training error and test error under that loss function naturally become the criteria for evaluating a learning method.

Note that the loss function a method optimizes during training is not necessarily the loss function used in the evaluation, although ideally the two are consistent.

The size of the training error indicates whether a given problem is easy to learn, but it is not essential in itself. The test error reflects the learning method's predictive ability on the unknown test dataset.

Given two learning methods, the one with the smaller test error has better predictive ability and is the more effective method. Generally speaking, a learning method's predictive ability on unknown data is called its generalization ability.
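To make the distinction concrete, here is a minimal sketch of measuring training error and test error for one learning method. The scikit-learn utilities, the synthetic dataset, and squared loss are illustrative assumptions, not choices made in the text:

```python
# Minimal sketch: training error vs. test error for one learning method.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Hold out unseen data so the test error can approximate generalization ability.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```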

Overfitting and Underfitting

For machine learning and deep learning models, we want not only a good fit to the training dataset but also a good fit (generalization) to unknown data (the test set). The generalization ability of a learning method refers to the predictive ability of the model it learns on unknown data, and it is an essential property of the method. In practice, the most widely used approach is to evaluate generalization ability by the test error.

Assessing generalization ability involves the notions of underfitting and overfitting.

  • Overfitting means the model performs well on the training dataset but poorly on unknown data.
  • Underfitting means the model has not learned the characteristics of the data well: it fits the data poorly and performs badly on both the training data and unknown data.

*Figure: underfitting, normal fitting, and overfitting*

The figure below depicts how training error and test error vary with model complexity. As model complexity increases, the training error gradually decreases and approaches 0, while the test error first decreases, reaches a minimum, and then increases. Overfitting occurs when the selected model is too complex.

Therefore, learning must guard against overfitting and carry out model selection, that is, choose a model of appropriate complexity so as to minimize the test error.

*Figure: training error and test error as a function of model complexity*
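The U-shaped test error curve in the figure can be reproduced with a small experiment. The sketch below, assuming scikit-learn, synthetic data, and polynomial degree as the complexity knob, fits increasingly complex models and prints both errors:

```python
# Sketch: training error falls as complexity grows, while test error
# eventually turns back up (overfitting). Degrees and data are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 3, 9, 15):  # model complexity grows with the degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    te = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train={tr:.3f}  test={te:.3f}")
```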

The causes of overfitting include:

  • Too many parameters, making the model complexity too high;

  • The modeling samples were poorly chosen, so the selected sample data are not representative enough of the intended classification rules;

  • Too much noise in the samples, causing the machine to treat part of the noise as features and disrupt the intended classification rules;

  • The assumed model cannot reasonably hold, or the conditions under which the assumptions hold are not actually met.

The reasons for underfitting are:

  • Too few features;

  • Model complexity is too low.

How to solve overfitting?

  • Acquire and use more data (dataset augmentation): the fundamental way to address overfitting;

  • Feature dimensionality reduction: manually select which features to keep so as to reduce the feature dimensionality;

  • Add regularization to control the complexity of the model;

  • Dropout (random deactivation): dropout traverses the nodes of each layer of the neural network and assigns that layer a keep_prob (node retention probability), meaning each node in the layer is retained with probability keep_prob, where keep_prob lies between 0 and 1. Because any node may be dropped, the network cannot become biased towards a particular node, so no node's weights grow too large. The effect is somewhat similar to L2 regularization and reduces overfitting (see the sketch after this list);

  • Early stopping;

  • Cross-validation.
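To illustrate the dropout item above, here is a minimal numpy sketch of inverted dropout. Real frameworks provide this as a layer (for example torch.nn.Dropout, which takes the drop probability rather than keep_prob), so this only shows the mechanism:

```python
# Sketch of inverted dropout: keep each unit with probability keep_prob
# during training and rescale the survivors so the expected activation
# is unchanged; at test time all units are kept.
import numpy as np

def dropout_forward(activations, keep_prob, rng, training=True):
    """Apply (inverted) dropout to one layer's activations."""
    if not training or keep_prob >= 1.0:
        return activations  # at test time every node is kept
    mask = rng.random(activations.shape) < keep_prob  # keep each unit w.p. keep_prob
    return activations * mask / keep_prob  # rescale to preserve the expectation

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))  # activations of one hidden layer (4 samples, 8 units)
h_train = dropout_forward(h, keep_prob=0.8, rng=rng)                  # training pass
h_test = dropout_forward(h, keep_prob=0.8, rng=rng, training=False)   # test pass
```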

How to solve underfitting?

  • Add new features: consider adding feature combinations and higher-order features to enlarge the hypothesis space;

  • Add polynomial features, a common technique in machine learning, such as adding quadratic or cubic terms to a linear model to increase its expressive power (see the sketch after this list);

  • Reduce the regularization coefficient. Regularization is meant to prevent overfitting, so when the model underfits, the regularization coefficient should be reduced;

  • Use nonlinear models, such as kernel SVMs, decision trees, or deep learning models;

  • Adjust the model's capacity; informally, a model's capacity is its ability to fit a wide variety of functions;

  • A model with low capacity may struggle to fit the training set.
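To illustrate the polynomial-feature remedy from the list above, the sketch below (the synthetic cubic data and scikit-learn pipeline are assumptions for illustration) shows a plain linear model underfitting and a cubic model fitting well:

```python
# Sketch: adding polynomial features to fix underfitting.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 1))
y = X.ravel() ** 3 + rng.normal(scale=0.2, size=100)  # cubic relationship

linear = LinearRegression().fit(X, y)  # underfits: a line cannot follow a cubic
cubic = make_pipeline(PolynomialFeatures(3), LinearRegression()).fit(X, y)
print("linear R^2:", round(linear.score(X, y), 3))  # low: underfitting
print("cubic  R^2:", round(cubic.score(X, y), 3))   # close to 1
```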

Regularization and cross-validation

Regularization

A typical method for model selection is regularization. Regularization implements the structural risk minimization strategy: it adds a regularizer (penalty term) to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity: the more complex the model, the larger the term. For example, the regularization term can be a norm of the model's parameter vector.

Regularization generally has the following form:

$$
\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f)
$$

Here the first term is the empirical risk, the second term $\lambda J(f)$ is the regularization term, and $\lambda \geq 0$ is the coefficient that adjusts the trade-off between the two.

A model with a small empirical risk (the first term) may be relatively complex (many non-zero parameters), in which case the regularization term (the second term) is large. The role of regularization is to select a model for which both the empirical risk and the model complexity are small.
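A small numerical sketch of this trade-off, assuming squared loss and an L2 penalty $J(f) = \lVert w \rVert^2$ (one common choice of regularization term, not the only one):

```python
# Sketch: the regularized objective = empirical risk + lambda * J(f).
# As lambda grows, a smaller-norm (simpler) parameter vector wins.
import numpy as np

def regularized_risk(w, X, y, lam):
    empirical_risk = np.mean((X @ w - y) ** 2)  # first term: average loss
    penalty = np.sum(w ** 2)                    # J(f): squared L2 norm of parameters
    return empirical_risk + lam * penalty

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizes empirical risk alone
w_shrunk = 0.5 * w_ls                        # a simpler (smaller-norm) candidate
for lam in (0.0, 0.5, 5.0):
    print(f"lambda={lam}: risk(w_ls)={regularized_risk(w_ls, X, y, lam):.2f}, "
          f"risk(w_shrunk)={regularized_risk(w_shrunk, X, y, lam):.2f}")
```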

From a Bayesian estimation perspective, the regularization term corresponds to a prior probability over models:

  • Complex models can be assumed to have smaller prior probabilities
  • Simple models have larger prior probabilities

Cross-validation

Another commonly used model selection method is cross-validation.

If the given sample data are sufficient, a simple approach to model selection is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to train the models, the validation set for model selection, and the test set for the final evaluation of the learning method. Among the learned models of different complexity, the one with the smallest prediction error on the validation set is selected. Since the validation set contains enough data, using it for model selection is effective.
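A minimal sketch of this three-way split; the 60/20/20 proportions and the scikit-learn utilities are illustrative assumptions:

```python
# Sketch: carve the data into training, validation, and test sets
# with two successive splits (60% / 20% / 20%).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=0.5, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# Train candidate models on X_train, pick the one with the smallest error
# on X_val, and report the final estimate of generalization error on X_test.
```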

However, in many practical applications data are insufficient. To select a good model in that case, cross-validation can be used. The basic idea of cross-validation is to reuse the data: split the given data, combine the splits into training and test sets, and on that basis repeatedly train, test, and select models.

  1. Simple cross-validation

Simple cross-validation works as follows: first, randomly divide the given data into two parts, one used as the training set and the other as the test set (for example, 70% of the data for training and 30% for testing); then train models under various conditions (for example, different numbers of parameters) on the training set, obtaining different models; finally, evaluate each model's test error on the test set and select the model with the smallest test error.

  2. S-fold cross-validation

The most widely used form is S-fold cross-validation. It works as follows: first, randomly partition the given data into S disjoint subsets of the same size; then train the model on the data of S-1 subsets and test it on the remaining subset; repeat this for each of the S possible choices of held-out subset; finally, select the model with the smallest average test error over the S evaluations (a code sketch follows this list).

  3. Leave-one-out cross-validation

The special case S = N is leave-one-out cross-validation (LOOCV), which is often used when data are scarce. Here N is the size of the given dataset.
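The S-fold procedure and its S = N special case can be sketched with scikit-learn; the candidate models below (ridge regression with different regularization strengths) are an illustrative assumption:

```python
# Sketch: S-fold cross-validation for model selection, plus LOOCV (S = N).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=0.5, random_state=0)

S = 5
cv = KFold(n_splits=S, shuffle=True, random_state=0)
for alpha in (0.01, 1.0, 100.0):  # candidate models of different complexity
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    print(f"alpha={alpha}: mean test MSE over {S} folds = {-scores.mean():.3f}")

# Leave-one-out: S equals the dataset size N, one sample held out per round.
loo = cross_val_score(Ridge(alpha=1.0), X, y,
                      scoring="neg_mean_squared_error", cv=LeaveOneOut())
print(f"LOOCV mean MSE: {-loo.mean():.3f}")
```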
