Huashu Reading Notes (6)-Regularization in Deep Learning

Summary of all notes: "Deep Learning" Flower Book-Summary of Reading Notes


1. Parameter norm penalty

Many regularization methods limit the learning capacity of a model (such as a neural network, linear regression, or logistic regression) by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. We denote the regularized objective function by $\hat J$:

$$\hat J(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$$

where $\alpha \in [0, \infty)$ is a hyperparameter that weights the contribution of the norm penalty term $\Omega$ relative to the standard objective function $J(\theta; X, y)$. Setting $\alpha$ to 0 means no regularization; the larger $\alpha$ is, the stronger the regularization penalty.

  1. $L^2$ regularization
  2. $L^1$ regularization
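As a rough illustration of how the two penalties enter the regularized objective and a gradient step, here is a minimal sketch. The function names are hypothetical, and the factor of 1/2 in the $L^2$ penalty is a common convention assumed here rather than something these notes state.

```python
def l2_penalty(w):
    """L2 penalty, assuming the convention Omega(w) = 1/2 * ||w||_2^2."""
    return 0.5 * sum(wi * wi for wi in w)


def l1_penalty(w):
    """L1 penalty: Omega(w) = sum of absolute values of the weights."""
    return sum(abs(wi) for wi in w)


def regularized_objective(loss, w, alpha, penalty):
    """J_hat = J + alpha * Omega(w), where `loss` is the unregularized J."""
    return loss + alpha * penalty(w)


def sign(x):
    return (x > 0) - (x < 0)


def penalty_gradient(w, kind):
    """Gradient (or subgradient, for L1) of the penalty with respect to w."""
    if kind == "l2":
        return list(w)                       # d/dw of 1/2 * w^2 is w
    if kind == "l1":
        return [float(sign(wi)) for wi in w]  # subgradient of |w|
    raise ValueError("kind must be 'l1' or 'l2'")


def sgd_step(w, grad_loss, alpha, lr, kind):
    """One gradient step on the regularized objective."""
    grad_pen = penalty_gradient(w, kind)
    return [wi - lr * (gi + alpha * pi)
            for wi, gi, pi in zip(w, grad_loss, grad_pen)]


w = [1.0, -2.0]
print(regularized_objective(1.0, w, 0.1, l2_penalty))
print(sgd_step(w, [0.0, 0.0], alpha=0.1, lr=1.0, kind="l2"))
print(sgd_step(w, [0.0, 0.0], alpha=0.1, lr=1.0, kind="l1"))
```

With a zero loss gradient, the $L^2$ step shrinks each weight in proportion to its size, while the $L^1$ step moves each weight toward zero by the same fixed amount.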

2. Norm penalty as a constraint

To minimize a function subject to a constraint, we can construct a generalized Lagrange function. If we want to constrain $\Omega(\theta)$ to be less than some constant $k$, the function is

$$L(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha (\Omega(\theta) - k)$$

The solution to the constrained problem is given by

$$\theta^* = \arg\min_\theta \max_{\alpha, \alpha \ge 0} L(\theta, \alpha)$$

If we fix $\alpha$ at its optimal value $\alpha^*$, the problem becomes a function of $\theta$ only:

$$\theta^* = \arg\min_\theta L(\theta, \alpha^*) = \arg\min_\theta J(\theta; X, y) + \alpha^* \Omega(\theta)$$

3. Regularization and under-constrained problems

Most forms of regularization can ensure the convergence of iterative methods applied to underdetermined problems.

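The Moore-Penrose pseudoinverse can be used to solve underdetermined linear equations. Recall the definition of the pseudoinverse given earlier: $X^+ = \lim_{\alpha \searrow 0} (X^\top X + \alpha I)^{-1} X^\top$. This is linear regression with weight decay, taken in the limit as the regularization coefficient shrinks to zero.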
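A small numeric sketch may help here. It uses a toy system with one equation and two unknowns, $w_1 + 2 w_2 = 5$, which has infinitely many solutions. The helper names are hypothetical, and the ridge objective $(x \cdot w - y)^2 + \alpha \lVert w \rVert^2$ is an assumption chosen to match the usual weight decay setup.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ w = b using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - b[0] * a[1][0]) / det,
    ]


def ridge_solution(x_row, y, alpha):
    """Minimizer of (x.w - y)^2 + alpha * ||w||^2 for a single equation.

    Solves the normal equations (X^T X + alpha * I) w = X^T y.
    With alpha = 0 the matrix X^T X is singular, which is exactly the
    underdetermined case; any alpha > 0 makes it invertible.
    """
    gram = [
        [x_row[0] * x_row[0] + alpha, x_row[0] * x_row[1]],
        [x_row[1] * x_row[0], x_row[1] * x_row[1] + alpha],
    ]
    rhs = [x_row[0] * y, x_row[1] * y]
    return solve_2x2(gram, rhs)


x_row, y = [1.0, 2.0], 5.0
for alpha in (1.0, 1e-2, 1e-4):
    print(alpha, ridge_solution(x_row, y, alpha))
```

As `alpha` shrinks, the regularized solution approaches `[1.0, 2.0]`, the minimum-norm solution of the equation, which is what the pseudoinverse returns for this system.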

4. Dataset augmentation

The best way to make a machine learning model generalize better is to use more data for training.

Dataset augmentation is particularly effective for one specific classification problem: object recognition. Images are high-dimensional and include an enormous range of factors of variation, many of which can be easily simulated.
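One easily simulated factor of variation is translation. The sketch below generates shifted copies of an image stored as a nested list; the function names, the zero fill value, and the shift range are illustrative assumptions, not a prescribed augmentation scheme.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2D image (list of rows) by dx columns and dy rows, padding with `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                out[nr][nc] = image[r][c]
    return out


def augment(image, label, n_copies, max_shift=1, rng=None):
    """Create randomly translated copies of one example; the label is unchanged."""
    rng = rng or random.Random(0)
    samples = []
    for _ in range(n_copies):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        samples.append((translate(image, dx, dy), label))
    return samples


image = [[1, 2],
         [3, 4]]
print(translate(image, 1, 0))   # [[0, 1], [0, 3]]
print(len(augment(image, label=7, n_copies=5)))
```

The transformation must not change the correct class, so whichever shifts or other operations are used should be chosen with the task in mind.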

Dataset augmentation is also effective for speech recognition tasks.

When comparing machine learning benchmark results, it is important to take into account the dataset augmentation each method uses. Hand-designed dataset augmentation schemes can often greatly reduce the generalization error of a machine learning technique.

5. Noise robustness

Robustness refers to the ability of a system or organization to resist or overcome adverse conditions.

Another way to use noise to regularize a model is to add it to the weights. This technique is mainly used with recurrent neural networks.

6. Semi-supervised learning

In the semi-supervised learning framework, both unlabeled samples drawn from $P(x)$ and labeled samples drawn from $P(x, y)$ are used to estimate $P(y \mid x)$, or to predict $y$ from $x$.

7. Multi-task learning

Multi-task learning is a way to improve generalization by pooling the examples arising from several tasks (which can be seen as soft constraints imposed on the parameters). In the same way that additional training examples put more pressure on the parameters of the model toward values that generalize well, when part of a model is shared across tasks, that part is more constrained toward good values (assuming the sharing is justified), which often yields better generalization. The parameters of such a model fall into two groups:

  1. Task-specific parameters (which only benefit from the examples of their own task to achieve good generalization).
  2. Generic parameters shared across all tasks (which benefit from the pooled data of all tasks).
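The split between the two kinds of parameters can be sketched as a tiny model with one shared layer and one output head per task. The layer sizes, the `tanh` nonlinearity, the weights, and the task names `"A"` and `"B"` are all made up for illustration.

```python
import math


def shared_features(x, w_shared):
    """Lower layer shared by every task: h = tanh(W x)."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for row in w_shared]


def task_output(h, w_task):
    """Task-specific linear head applied to the shared features."""
    return sum(wi * hi for wi, hi in zip(w_task, h))


# Hypothetical parameter layout: one shared block, one head per task.
params = {
    "shared": [[0.5, -0.5], [0.25, 0.75]],               # updated by data from all tasks
    "task_heads": {"A": [1.0, 0.0], "B": [0.0, 1.0]},    # each updated by its own task only
}


def predict(x, task, params):
    h = shared_features(x, params["shared"])
    return task_output(h, params["task_heads"][task])


x = [1.0, 1.0]
print(predict(x, "A", params), predict(x, "B", params))
```

During training, a gradient from either task would flow into `params["shared"]`, while each entry of `params["task_heads"]` only ever receives gradients from its own task.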

8. Early stopping

When training a large model with enough representational capacity to overfit the task, we often observe that training error decreases steadily over time while validation set error begins to rise again. Early stopping returns the parameters from the point with the lowest validation error instead of the latest ones. We can think of early stopping as a very efficient hyperparameter selection algorithm, where the number of training steps is the hyperparameter.

Early stopping is a very unobtrusive form of regularization: it requires almost no change to the underlying training procedure, the objective function, or the set of allowable parameter values. This means it can be used easily without damaging the learning dynamics.
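The procedure can be sketched as a loop that remembers the best parameters seen so far and gives up after the validation error has failed to improve a fixed number of times in a row. The names `train_step`, `validate`, and `patience` are hypothetical, and the toy validation curve is invented to show the U shape described above.

```python
def early_stopping(train_step, validate, patience, max_steps):
    """Train until validation error has not improved for `patience` evaluations.

    `train_step()` runs one unit of training and returns a copy of the parameters;
    `validate(params)` returns the validation error for those parameters.
    """
    best_error = float("inf")
    best_step = 0
    best_params = None
    bad_evals = 0
    for step in range(1, max_steps + 1):
        params = train_step()
        error = validate(params)
        if error < best_error:
            best_error, best_step, best_params = error, step, params
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_params, best_step, best_error


def make_toy_run():
    """Stand-in for a real model: the 'parameters' are just the step count."""
    state = {"step": 0}

    def train_step():
        state["step"] += 1
        return state["step"]

    def validate(params):
        # U-shaped validation error with its minimum at step 5.
        return (params - 5) ** 2 + 1.0

    return train_step, validate


train_step, validate = make_toy_run()
best_params, best_step, best_error = early_stopping(
    train_step, validate, patience=3, max_steps=100)
print(best_step, best_error)   # 5 1.0
```

Only the outer loop is new; the training step and the objective are untouched, which is what makes the method so unobtrusive.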

9. Parameter tying and parameter sharing

A common type of dependency we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task (with the same set of classes) but with slightly different input distributions.

This approach has been used to regularize the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures are constructed so that many of the parameters in the classifier can be paired with corresponding parameters in the unsupervised model.

A parameter norm penalty is one way to regularize parameters to be close to one another, but the more popular way is to use constraints: forcing sets of parameters to be equal. Because we interpret the various models or model components as sharing a single set of parameters, this method of regularization is often referred to as parameter sharing.

Natural images have many statistical properties that are invariant to translation. A CNN takes this property into account by sharing parameters across multiple image locations: the same feature (a hidden unit with the same weights) is computed at different positions of the input. Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network size without requiring a corresponding increase in training data.
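To make the parameter saving concrete, here is a minimal one-dimensional sketch in which a single two-weight kernel is reused at every input position. The signal, the kernel values, and the comparison with a fully connected layer of the same output size are illustrative assumptions.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


signal = [0, 0, 1, 1, 0, 0, 1, 1]
kernel = [-1, 1]            # a simple edge detector, reused at every position

out = conv1d_valid(signal, kernel)
print(out)                  # [0, 1, 0, -1, 0, 1, 0]

# Parameter counts for producing the same number of outputs.
shared_params = len(kernel)                 # 2 weights, regardless of input length
dense_params = len(signal) * len(out)       # one weight per input-output pair
print(shared_params, dense_params)          # 2 56
```

The same rising edge is detected at positions 1 and 5 with the same two weights, whereas an unshared layer would have to learn that detector separately for each position.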

10. Sparse representations

An $L^1$ penalty on the representation is of course only one way to make a representation sparse. Others include penalties derived from a Student-t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008a), which are especially useful for representations whose elements are constrained to lie on the unit interval.

Essentially any model that has hidden units can be made sparse.

11. Bagging and other ensemble methods

Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models. The idea is to train several different models separately, then have all of the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.

The reason model averaging works is that different models will usually not make all the same errors on the test set.
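A short worked calculation shows why differing errors help. As an assumption for illustration, take $k$ regression models whose errors $\epsilon_i$ have zero mean, variance $E[\epsilon_i^2] = v$, and covariance $E[\epsilon_i \epsilon_j] = c$ for $i \ne j$. The error of the averaged prediction is $\frac{1}{k}\sum_i \epsilon_i$, and its expected square is

$$E\left[\left(\frac{1}{k}\sum_i \epsilon_i\right)^2\right] = \frac{1}{k^2} E\left[\sum_i \left(\epsilon_i^2 + \sum_{j \ne i} \epsilon_i \epsilon_j\right)\right] = \frac{1}{k} v + \frac{k-1}{k} c$$

If the errors are perfectly correlated ($c = v$), the expected squared error stays at $v$ and averaging does not help. If they are uncorrelated ($c = 0$), it drops to $v / k$, so the ensemble improves as more members are added.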

Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely different kind of model using a different algorithm or objective function.

Model averaging is an extremely powerful and reliable method for reducing generalization error.

12. Dropout

Dropout provides a computationally inexpensive but powerful method of regularizing a broad family of models. To a first approximation, Dropout can be thought of as a practical method of bagging an ensemble of very many large neural networks.

  • In the case of Bagging, all models are independent.
  • In the case of Dropout, all models share parameters, where each model inherits a different subset of the parent neural network parameters.
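The sub-network sampling can be sketched as a random binary mask applied to a layer's activations. This sketch uses the "inverted dropout" variant, which rescales the kept units during training so that nothing changes at test time; that choice, the function name, and the `keep_prob` parameter are assumptions made for illustration.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Apply inverted dropout to a list of activations.

    Each unit is kept with probability `keep_prob` and zeroed otherwise.
    Kept units are divided by `keep_prob` so the expected value is unchanged.
    At test time the activations pass through untouched.
    """
    if not training:
        return list(activations)
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    return [a * m / keep_prob for a, m in zip(activations, mask)]


rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]

# Every call samples a different sub-network from the same shared parameters.
print(dropout(hidden, keep_prob=0.5, rng=rng))
print(dropout(hidden, keep_prob=0.5, rng=rng))
print(dropout(hidden, keep_prob=0.5, rng=rng, training=False))
```

Each mask picks out one member of the implicit ensemble, and because all members read from the same weights, a training step on one sub-network also updates the others.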

13. Adversarial training

With adversarial training, that is, training the network on adversarially perturbed examples from the training set, we can reduce the error rate on the original independent and identically distributed test set.
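One common way to build such perturbed examples is the fast gradient sign method, which moves the input a small step in the direction that increases the loss. The notes do not say which method is meant, so this choice is an assumption, as are the logistic regression model, the weights, and the value of `epsilon` below.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def fgsm_example(w, b, x, y, epsilon):
    """Perturb x by epsilon in the sign direction of the input gradient."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0
x, y = [1.0, 1.0], 1

x_adv = fgsm_example(w, b, x, y, epsilon=0.5)
print(x_adv)                                              # [0.5, 1.5]
print(predict_proba(w, b, x), predict_proba(w, b, x_adv))
```

Adversarial training would then add pairs such as `(x_adv, y)` to the training data, keeping the original label so the model learns to give the same answer near each training point.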

14. Tangent distance, tangent propagation and manifold tangent classifier

The tangent distance algorithm is a non-parametric nearest neighbor algorithm in which the metric used is not the generic Euclidean distance but one derived from knowledge of the manifolds near which probability concentrates.

The tangent propagation algorithm trains a neural network classifier with an extra penalty that makes each output $f(x)$ of the network locally invariant to known factors of variation.
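For reference, the penalty is usually written in terms of the manifold tangent vectors $v^{(i)}$ at $x$, which encode the known factors of variation. Local invariance means the directional derivative of $f$ along each $v^{(i)}$ should be small, so the regularizer takes the form

$$\Omega(f) = \sum_i \left( (\nabla_x f(x))^\top v^{(i)} \right)^2$$

The symbol $v^{(i)}$ is notation introduced here for illustration; the penalty is scaled by a hyperparameter in the same way as the norm penalties in section 1.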

Tangent propagation is closely related to dataset augmentation. It is also related to double backpropagation and adversarial training. Double backpropagation regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on them as on the original inputs.

The manifold tangent classifier does not require the tangent vectors to be known a priori. An autoencoder can estimate the manifold tangent vectors, and the classifier uses this technique to avoid the need for user-specified tangent vectors.

Next chapter: Huashu Reading Notes (7)-Optimization in Deep Models


Origin blog.csdn.net/qq_41485273/article/details/112851363