Regularization and normalization

The concepts of regularization and normalization are often confused, so this article explains the difference between them.

1. Normalization

The purpose of normalization is to make the data dimensionless, i.e. to bring the values onto the same order of magnitude or to restrict them to a fixed range.

1.1 Max-min normalization

That is, $x$ is normalized using the maximum and minimum values of the data set it belongs to:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the set (row/column) that $x$ belongs to. After normalization, the value lies in $x' \in [0, 1]$.
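As a quick illustration, here is a minimal NumPy sketch of max-min normalization (the sample data is made up):

```python
import numpy as np

def min_max_normalize(x):
    """Scale each column of x to [0, 1] using that column's min and max."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

data = np.array([[1.0,  200.0],
                 [3.0,  600.0],
                 [5.0, 1000.0]])
print(min_max_normalize(data))
# each column is now in [0, 1], e.g. the first column becomes [0.0, 0.5, 1.0]
```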

1.2 Normalization with mean and variance (standardization)

Transform the data $x$ so that it has mean 0 and variance 1:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the data set, respectively.
After such normalization, the contours of the corresponding loss function become much more uniform, so gradient descent converges more quickly.
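A corresponding NumPy sketch of standardization (again with made-up data):

```python
import numpy as np

def standardize(x):
    """Transform each column of x to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / sigma

data = np.array([[1.0,  200.0],
                 [3.0,  600.0],
                 [5.0, 1000.0]])
z = standardize(data)
print(z.mean(axis=0))  # approximately 0 for each column
print(z.std(axis=0))   # approximately 1 for each column
```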

2. Regularization

Regularization is mainly used to avoid overfitting and reduce the model's error. The regularized loss is:

$$L = \sum_{n}\left(\hat{y}^{n} - \left(b + \sum_{i} w_{i} x_{i}\right)\right)^{2} + \lambda \sum_{i} w_{i}^{2}$$

Note 1: The formula comes from Professor Li Hongyi's 2020 machine learning course slides.
Note 2: This is the commonly used L2 regularization.

Here $\hat{y}^{n}$ is the true value of the $n$-th data point and $x_i$ is the $i$-th input feature. Compared with the ordinary loss function, regularization adds the term $\lambda \sum_i w_i^2$, where $\lambda \geq 0$ controls the strength of the regularization.

Note: The coefficient $w_0$ is usually omitted from the regularization term, because including it would make the result depend on the choice of origin for the target variable.

This adds a penalty term to the error function so that the coefficients $w_i$ do not grow to large values.
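A minimal sketch of computing this regularized loss directly from the formula (the data, weights, and $\lambda$ below are made up for illustration; the bias $b$ is not penalized):

```python
import numpy as np

def l2_regularized_loss(X, y, w, b, lam):
    """Sum-of-squares error plus lam * sum(w_i ** 2); the bias b is not penalized."""
    residuals = y - (b + X @ w)
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0]])
y = np.array([5.0, 4.0, 11.0])
w = np.array([1.0, 2.0])
b = 0.0
print(l2_regularized_loss(X, y, w, b, lam=0.0))  # plain sum-of-squares error
print(l2_regularized_loss(X, y, w, b, lam=0.1))  # error plus the L2 penalty
```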
The following table shows the effect of $\lambda$ on the coefficients:
$$
\begin{array}{r|rrr}
 & \ln\lambda=-\infty & \ln\lambda=-18 & \ln\lambda=0 \\
\hline
w_0^* & 0.35 & 0.35 & 0.13 \\
w_1^* & 232.37 & 4.74 & -0.05 \\
w_2^* & -5321.83 & -0.77 & -0.06 \\
w_3^* & 48568.31 & -31.97 & -0.05 \\
w_4^* & -231639.30 & -3.89 & -0.03 \\
w_5^* & 640042.26 & 55.28 & -0.02 \\
w_6^* & -1061800.52 & 41.32 & -0.01 \\
w_7^* & 1042400.18 & -45.95 & -0.00 \\
w_8^* & -557682.99 & -91.53 & 0.00 \\
w_9^* & 125201.43 & 72.68 & 0.01 \\
\end{array}
$$
It can be seen that when $\lambda$ is small, the model's parameters become very large, which easily leads to overfitting. As $\lambda$ gradually increases, the coefficients shrink toward zero, which eventually hurts the model's ability to fit the data. In other words, $\lambda$ controls the complexity of the model and therefore the degree of overfitting.
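A rough sketch of this effect, loosely following the polynomial-fitting setup behind the table (a degree-9 polynomial fit to a noisy sine curve with the closed-form ridge solution; the exact numbers will not match the table):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# degree-9 polynomial features: [1, x, x^2, ..., x^9]
X = np.vander(x, 10, increasing=True)

for ln_lam in (-np.inf, -18.0, 0.0):
    lam = 0.0 if np.isinf(ln_lam) else np.exp(ln_lam)
    # closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(f"ln(lambda) = {ln_lam:>5}: max |w_i| = {np.abs(w).max():.2f}")
```

With $\lambda = 0$ the coefficients blow up to very large magnitudes, while larger $\lambda$ values shrink them toward zero, consistent with the table above.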

Source: blog.csdn.net/weixin_43335465/article/details/120636836