Regularization techniques

foreword

The core problem in machine learning: design a model that not only performs well on the training data, but also generalizes well to new inputs;
regularization strategy: reduce the test error, possibly at the expense of a higher training error (if only the training error is kept small, overfitting may occur);
the best-fitting model (in the sense of minimizing generalization error) is a large model that has been properly regularized.

The role of regularization is to prevent the model from overfitting and to improve its ability to generalize. The basic idea is to add a term to the loss function that measures the complexity of the model.

Regularization is a very common technique in machine learning, and it appears under different names in different models or scenarios. Take L2 regularization as an example: applied to linear regression it corresponds to ridge regression; applied to neural networks it corresponds to weight decay.

norm constraints

The capacity of the model can be limited by penalizing its parameters. The most common approach is to add a norm penalty on top of the loss function.

In deep learning, penalties are usually applied only to the affine (weight) parameters, and no penalty is imposed on the bias terms. The main reason is that the bias terms generally need much less data to fit accurately, and penalizing them often leads to underfitting.

L1 and L2 regularization are implemented by adding a penalty on the weights to the loss function: the L1 penalty is the sum of the absolute values of the weights, and the L2 penalty is the sum of the squared weights.
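Written out explicitly (with $\alpha$ denoting the penalty strength, a hyperparameter not given in the text above), the two regularized objectives are:

    \tilde{J}_{L1}(w) = J(w) + \alpha \sum_i |w_i|
    \tilde{J}_{L2}(w) = J(w) + \alpha \sum_i w_i^2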

L1 regularization tends to produce sparse parameter vectors, driving many parameters toward 0, so it is often used in feature selection settings. The most commonly used regularization method in machine learning is the L2 norm penalty on the weights.

From a probabilistic perspective, many norm penalties are equivalent to placing a prior distribution on the parameters: the L2 norm corresponds to a Gaussian prior on the parameters, and the L1 norm corresponds to a Laplace prior.
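A minimal PyTorch sketch of the penalties described above (the model, data, and penalty strengths are illustrative, not taken from the text): the L1 and L2 terms are computed from the weight matrix only, leaving the bias unpenalized.

    import torch
    import torch.nn as nn

    # Hypothetical small model and batch, just to illustrate the penalty terms.
    model = nn.Linear(10, 1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    mse = nn.MSELoss()
    l1_lambda, l2_lambda = 1e-4, 1e-3      # penalty strengths (illustrative values)

    data_loss = mse(model(x), y)

    # Penalize only the weight matrix, not the bias (as discussed above).
    l1_penalty = model.weight.abs().sum()  # sum of absolute values of the weights
    l2_penalty = (model.weight ** 2).sum() # sum of squared weights

    loss = data_loss + l1_lambda * l1_penalty + l2_lambda * l2_penalty
    loss.backward()

In practice the L2 penalty is often applied through the optimizer instead, e.g. the weight_decay argument of torch.optim.SGD, which is exactly the weight decay mentioned above.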

Dropout

Dropout is a general-purpose and computationally cheap regularization method that has been widely used since it was proposed in 2014. Dropout randomly sets the outputs of hidden-layer nodes to 0 during training: those nodes are temporarily treated as not being part of the network, but their weights are retained (just not updated in that step). Regularization is achieved by choosing a keep probability (keep_prob) so that a random subset of neurons is inactive during training.

 (Dropout is typically applied to fully connected layers, and only during training.)
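A minimal sketch using PyTorch's nn.Dropout (layer sizes and keep_prob are illustrative). Note that nn.Dropout takes the drop probability, i.e. 1 - keep_prob, and is active only in training mode.

    import torch
    import torch.nn as nn

    keep_prob = 0.8                      # probability that a unit is kept (illustrative)

    net = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=1 - keep_prob),     # zero each activation with probability 1 - keep_prob
        nn.Linear(256, 10),
    )

    x = torch.randn(4, 784)

    net.train()                          # dropout active: random units are zeroed
    train_out = net(x)

    net.eval()                           # dropout disabled at inference time
    eval_out = net(x)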

Drop Connect

 Drop Connect is another regularization strategy for reducing overfitting and is a generalization of Dropout. In Drop Connect, a randomly selected subset of the network's weights is set to zero, instead of setting a randomly selected subset of activations to zero in each layer as in Dropout. Both Drop Connect and Dropout help generalization, since each unit receives input from only a random subset of the units in the previous layer. Drop Connect is similar to Dropout in that it introduces sparsity into the model, except that the sparsity is in the weights rather than in the output vectors of the layers.

The difference between DropConnect and Dropout is that during training, DropConnect does not randomly set the outputs of hidden-layer nodes to 0; instead, each input weight connected to a node is set to 0 with probability 1-p. (One zeroes outputs, the other zeroes inputs.)
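PyTorch has no built-in DropConnect layer, so the following is only an illustrative sketch of the idea, using the "inverted" rescaling convention: a Bernoulli mask is sampled over the weight matrix rather than over the activations. (The original DropConnect paper uses a more elaborate approximation at inference time; here the full weights are simply used.)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DropConnectLinear(nn.Module):
        # Illustrative linear layer that keeps each weight with probability p
        # and zeroes it otherwise (sparsity in the weights, not the activations).
        def __init__(self, in_features, out_features, p=0.5):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            self.p = p

        def forward(self, x):
            if self.training:
                # Bernoulli mask over the weight matrix: each weight kept with probability p.
                mask = torch.bernoulli(torch.full_like(self.linear.weight, self.p))
                weight = self.linear.weight * mask / self.p  # rescale to keep the expectation
                return F.linear(x, weight, self.linear.bias)
            return self.linear(x)                            # full weights at inference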

early stopping

In order to obtain a well-performing neural network, many decisions about hyperparameters must be made during training. One of these hyperparameters is the number of training epochs: that is, how many full passes over the dataset should be made (one pass per epoch)? If the number of epochs is too small, the network is likely to underfit (the training data is not sufficiently learned); if it is too large, the network may overfit (it fits the "noise" in the training data rather than the signal).
Early stopping removes the need to set the number of epochs manually. It can also be viewed as a regularization method that prevents the network from overfitting (analogous to L1/L2 weight penalties and Dropout). The principle behind early stopping is straightforward:

    • Divide the data into training and validation sets
    • After every epoch (or after every N epochs):
      evaluate the network's performance on the validation set;
      if it performs better than the previous best model, save a copy of the network at the current epoch
    • Use the model with the best validation performance as the final network model

As shown in the figure below, the optimal model is the one saved at the time indicated by the vertical dotted line, i.e., the model with the highest accuracy on the validation set.

In a nutshell, early stopping regularizes a neural network by stopping its training at the appropriate time, as judged by the loss on the held-out validation set.
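The procedure above can be written as a training loop like the following sketch. Here train_step and evaluate are assumed callbacks that run one training epoch and return the validation loss, respectively, and the patience parameter (not mentioned above) is a common refinement that stops once the validation loss has not improved for several epochs.

    import copy

    def train_with_early_stopping(model, train_step, evaluate, max_epochs=100, patience=10):
        best_loss = float("inf")
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0

        for epoch in range(max_epochs):
            train_step(model)                  # one pass over the training set
            val_loss = evaluate(model)         # performance on the validation set

            if val_loss < best_loss:           # better than the previous best model
                best_loss = val_loss
                best_state = copy.deepcopy(model.state_dict())
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                      # stop early: no recent improvement

        model.load_state_dict(best_state)      # restore the best model seen
        return model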

data augmentation

The most effective way to prevent overfitting is to enlarge the training set: the larger the training set, the smaller the probability of overfitting. Common approaches include:
1. In object recognition, commonly used methods are rotation, scaling, etc. The premise is that the transformation must not change the class of the image (for example, in handwritten digit recognition, a 6 and a 9 are easily swapped by rotation); see the sketch after this list.
2. In speech recognition, random noise can be added to the input data;
3. In NLP, a common idea is to replace words with synonyms;
4. Noise injection, where noise is added to the input, to the hidden layers, or to the output layer.
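An illustrative sketch of the first and last points, using torchvision transforms for label-preserving image augmentation and a small helper for Gaussian noise injection (image size, rotation range, and noise level are placeholder values):

    import torch
    from torchvision import transforms

    # Image augmentation (applied to PIL images): small, label-preserving transformations.
    image_augment = transforms.Compose([
        transforms.RandomRotation(degrees=10),            # keep rotations small (a 6 could become a 9)
        transforms.RandomResizedCrop(28, scale=(0.9, 1.0)),
        transforms.ToTensor(),
    ])

    # Noise injection: add Gaussian noise directly to an input (or hidden) tensor.
    def add_input_noise(x, std=0.1):
        return x + std * torch.randn_like(x)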
