The concepts of regularization and normalization are often confused; this article is intended to clarify the distinction between them.
1. Normalization
The function of normalization is to remove the dimensions (units) of the data, converting the values to the same order of magnitude or limiting them to a fixed range.
1.1 Min-max normalization
That is, the data $x$ is normalized using the maximum and minimum values of the set that contains it:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the set (row/column) containing $x$. After normalization, $x' \in [0, 1]$.
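A minimal NumPy sketch of this formula (the function name is my own, not from the article):

```python
import numpy as np

def min_max_normalize(x):
    """Scale each value of x into [0, 1] using the min and max of x."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

data = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_normalize(data))  # smallest value maps to 0, largest to 1
```

Note that this scheme is sensitive to outliers: a single extreme value stretches the denominator and compresses every other value toward 0.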
1.2 Mean-variance normalization (standardization)
Transform the data $x$ to have mean 0 and variance 1:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the data set, respectively.
After such normalization, the contours of the loss function become more uniform (closer to circular), so gradient descent converges more quickly.
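The standardization formula can likewise be sketched in a few lines of NumPy (function name is my own):

```python
import numpy as np

def standardize(x):
    """Transform x to zero mean and unit variance (z-score)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = standardize(data)
print(z.mean(), z.std())  # approximately 0.0 and 1.0
```

Unlike min-max normalization, the result is not bounded to a fixed interval, but it is much less sensitive to outliers.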
2. Regularization
Regularization is mainly used to avoid overfitting and reduce generalization error. A regularized loss function takes the form:
$$L=\sum_{n}\left(\hat{y}^{n}-\left(b+\sum_{i} w_{i} x_{i}\right)\right)^{2}+\lambda \sum_{i} w_{i}^{2}$$
Note 1: The formula comes from Professor Li Hongyi's 2020 machine learning course slides.
Note 2: This is the commonly used L2 regularization.
where $\hat{y}^{n}$ is the true value of the $n$-th data point and $x_i$ is the $i$-th input feature. Compared with the ordinary loss function, regularization adds the term $\lambda \sum_{i} w_{i}^{2}$, where $\lambda \geq 0$ adjusts the strength of the regularization.
Note: The bias coefficient $w_0$ is usually omitted from the regularization term, because regularizing $w_0$ would make the result depend on the choice of origin for the target variable.
This penalty term added to the error function keeps the coefficients $w_i$ from reaching large values.
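The regularized loss above maps directly to code. A minimal sketch (names and the example data are my own):

```python
import numpy as np

def l2_regularized_loss(X, y, w, b, lam):
    """Sum-of-squares error plus an L2 penalty lam * sum(w_i^2).

    X: (n_samples, n_features), y: (n_samples,), w: (n_features,), b: scalar.
    The bias b is deliberately excluded from the penalty, as noted above.
    """
    residuals = y - (b + X @ w)
    return np.sum(residuals**2) + lam * np.sum(w**2)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
print(l2_regularized_loss(X, y, w, b=0.1, lam=0.0))  # plain squared error
print(l2_regularized_loss(X, y, w, b=0.1, lam=1.0))  # larger, due to the penalty
```

With $\lambda = 0$ the expression reduces to the ordinary sum-of-squares error; any $\lambda > 0$ adds a cost proportional to the squared weights.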
The following table shows the effect of $\lambda$ on the coefficients:
$$\begin{array}{r|rrr}
& \ln \lambda=-\infty & \ln \lambda=-18 & \ln \lambda=0 \\
\hline
w_{0}^{*} & 0.35 & 0.35 & 0.13 \\
w_{1}^{*} & 232.37 & 4.74 & -0.05 \\
w_{2}^{*} & -5321.83 & -0.77 & -0.06 \\
w_{3}^{*} & 48568.31 & -31.97 & -0.05 \\
w_{4}^{*} & -231639.30 & -3.89 & -0.03 \\
w_{5}^{*} & 640042.26 & 55.28 & -0.02 \\
w_{6}^{*} & -1061800.52 & 41.32 & -0.01 \\
w_{7}^{*} & 1042400.18 & -45.95 & -0.00 \\
w_{8}^{*} & -557682.99 & -91.53 & 0.00 \\
w_{9}^{*} & 125201.43 & 72.68 & 0.01
\end{array}$$
As the table shows, when $\lambda$ is small the model parameters become very large, which easily leads to overfitting. As $\lambda$ gradually increases, the coefficients become very small, which hurts the model's ability to fit the data (underfitting). Thus $\lambda$ controls the complexity of the model and determines the degree of overfitting.
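The shrinkage effect in the table can be reproduced with the closed-form ridge solution $w = (\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top} y$. The sketch below fits a degree-9 polynomial to noisy samples of $\sin(2\pi x)$; this setup and all names are illustrative assumptions, not the exact experiment behind the table:

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy samples of sin(2*pi*x), fit with a degree-9 polynomial.
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
Phi = np.vander(x, 10, increasing=True)  # columns: 1, x, x^2, ..., x^9

def ridge_weights(Phi, y, lam):
    """Closed-form ridge solution: (Phi^T Phi + lam*I)^-1 Phi^T y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

for lam in (0.0, np.exp(-18), 1.0):
    w = ridge_weights(Phi, y, lam)
    print(f"lambda={lam:.2e}  max|w_i|={np.max(np.abs(w)):.2f}")
```

Running this shows the same qualitative trend as the table: with $\lambda = 0$ the coefficients are huge (the polynomial interpolates the noise), while increasing $\lambda$ drives them toward zero.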