Machine learning - using regularization to solve the overfitting problem

1. Overfitting problem

The figure below is an example of a regression problem.

The first model is linear and underfits: it cannot adapt well to our training set. The third model is a fourth-degree polynomial that places too much emphasis on fitting the original data and loses the essence of the algorithm, which is predicting new data: given a new value to predict, it performs very poorly. This is overfitting: although the model adapts to the training set very well, it may not predict new inputs effectively. The middle model seems to fit best.

There is also such a problem in classification.

In polynomial terms: the higher the degree of $x$, the better the fit to the training data, but the corresponding prediction ability may become worse.

How do we deal with overfitting?
1. Discard some features that do not help us predict correctly. We can select which features to keep manually, or use a model selection algorithm to help (such as PCA).
2. Regularization: keep all the features, but reduce the magnitude of the parameters.
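The effect described above can be reproduced with a small numpy sketch (the sine data and the polynomial degrees are my assumptions, not from the article): a degree-1 fit underfits, while a degree-9 fit drives the training error to nearly zero yet typically generalizes worse on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

With 10 training points, the degree-9 polynomial can interpolate them almost exactly, which is precisely the "places too much emphasis on fitting the original data" case.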

2. Regularization

In the above regression problem, if our model is:
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2^2+\theta_3x_3^3+\theta_4x_4^4$$

We can see from the previous examples that it is those higher-order terms that cause overfitting, so if we can make the coefficients of these higher-order terms close to 0, we can get a good fit.

What we do, then, is shrink these parameter values $\theta$ to some extent; this is the basic method of regularization. Suppose we decide to reduce the size of $\theta_3$ and $\theta_4$. We modify the cost function to place a penalty on $\theta_3$ and $\theta_4$; then, when minimizing the cost, we must also account for this penalty, which ultimately leads to choosing smaller $\theta_3$ and $\theta_4$.
The smaller the parameter values, the smoother and simpler the resulting function, so it is less prone to overfitting. The $\theta_3$ and $\theta_4$ selected through such a cost function affect the prediction much less than before.

For example, the modified cost function could be:

$$\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+1000\,\theta_3^2+10000\,\theta_4^2\right]$$

If we have many features and do not know in advance which of them to penalize, we penalize all of them and let the optimizer choose the degree of shrinkage. The result is a simpler hypothesis that helps prevent overfitting:

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$$
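A minimal sketch of this cost function in code (the shapes and toy data are my assumptions: `X` carries a leading column of ones and `theta[0]` is the intercept):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on theta_1..theta_n."""
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)    # j runs from 1 to n: theta_0 is exempt
    return (residual @ residual + penalty) / (2 * m)

# toy check: theta = [0, 1] fits y = x exactly, so the lam = 0 cost is zero
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
theta = np.array([0.0, 1.0])
print(regularized_cost(theta, X, y, lam=0.0))   # 0.0
```

With `lam > 0` the same perfectly fitting `theta` now incurs a penalty, which is exactly the pressure toward smaller parameters described above.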

Here $\lambda$ is called the regularization parameter (Regularization Parameter).

The role of $\lambda$ is to control the trade-off between two different goals. The first goal, tied to the first term of the cost function, is to fit the training data well. The second goal, tied to the second term, is to keep the parameters as small as possible, so that the hypothesis stays simple and avoids overfitting.

Note: by convention, we do not penalize $\theta_0$. A comparison between the regularized model and the original model is shown in the figure below.

If the chosen regularization parameter $\lambda$ is too large, all the parameters are driven toward zero, causing the model to degenerate to $h_\theta(x)=\theta_0$ (the red straight line in the figure), which is underfitting.
Why does the added term $\lambda\sum_{j=1}^{n}\theta_j^2$ make the values of $\theta$ decrease?
Because if $\lambda$ is large, then to keep the cost function as small as possible, every $\theta_j$ (excluding $\theta_0$) must be reduced to some extent.
But if $\lambda$ is too large, every $\theta_j$ (excluding $\theta_0$) tends to 0, and all we get is a straight line parallel to the $x$-axis.

So to apply regularization well, we need to choose a reasonable value of $\lambda$.
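To make this concrete, here is a hedged sketch (synthetic data; a closed-form ridge-style solution stands in for any minimizer of the regularized cost): $\lambda=0$ leaves the parameters large, while a huge $\lambda$ shrinks every $\theta_j$ ($j\ge1$) toward zero, the underfitting case described above.

```python
import numpy as np

def fit_reg(X, y, lam):
    """Closed-form minimizer of the regularized squared-error cost."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                          # theta_0 is not penalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 20)
X = np.vander(x, 6, increasing=True)       # polynomial features, intercept first
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)

for lam in (0.0, 0.1, 1e6):
    theta = fit_reg(X, y, lam)
    print(f"lambda={lam:g}: sum of theta_j^2 (j>=1) = {np.sum(theta[1:] ** 2):.6f}")
```

A reasonable $\lambda$ is usually picked between these extremes by comparing error on held-out validation data.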

3. Regularization of linear regression

For solving linear regression, we previously derived two learning algorithms: one based on gradient descent (see Machine Learning - Multivariate Gradient Descent) and one based on the normal equation.

1. Gradient descent

The cost function of regularized linear regression is:

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$$

If we want to use gradient descent to minimize this cost function, then because $\theta_0$ is not regularized, the algorithm splits into two cases:

Repeat until convergence {

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \qquad \text{for } j=1,2,\dots,n$$

}

For the above algorithm, rearranging the update formula for $j=1,2,\dots,n$ gives:

$$\theta_j := \theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

From this form we can see that the first factor, $\left(1-\alpha\frac{\lambda}{m}\right)$, is a number slightly smaller than 1, because $\alpha$ is usually very small and $m$ is usually large. So the first term shrinks $\theta_j$ a little on every iteration, while the second term is identical to the gradient descent update without the regularization term. In other words, regularized linear regression changes gradient descent only by shrinking each $\theta_j$ by a small extra factor before applying the original update rule.
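The rearranged update can be sketched as runnable code (the toy data, step size, and iteration count are my assumptions, not from the article): $\theta_0$ is updated without regularization, and every other $\theta_j$ is first shrunk by $\left(1-\alpha\frac{\lambda}{m}\right)$ and then moved by the usual gradient step.

```python
import numpy as np

def gradient_descent_reg(X, y, alpha=0.1, lam=1.0, iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m    # unregularized gradient, all j
        shrink = np.ones(n)
        shrink[1:] = 1 - alpha * lam / m    # shrink factor, skipping theta_0
        theta = theta * shrink - alpha * grad
    return theta

# toy data: y = 1 + 2*x, with an intercept column x_0 = 1
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1 + 2 * np.arange(5.0)
print(gradient_descent_reg(X, y, lam=0.0))   # approaches [1, 2] when lambda = 0
```

With `lam > 0` the recovered slope ends up slightly below 2, which is the shrinkage effect described above.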

2. Normal equation

We can also use the normal equation to solve the regularized linear regression model:

$$\theta=\left(X^TX+\lambda\begin{bmatrix}0&&&\\&1&&\\&&\ddots&\\&&&1\end{bmatrix}\right)^{-1}X^Ty$$

The matrix added to $X^TX$ has size $(n+1)\times(n+1)$: it is the identity matrix with its first diagonal entry set to 0, so that $\theta_0$ is not penalized.
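A minimal numpy sketch of this closed-form solution (the toy data is my assumption):

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                            # exempt the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# toy data: y = 1 + 2*x is exactly linear, so lam = 0 recovers [1, 2]
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1 + 2 * np.arange(5.0)
print(normal_equation_reg(X, y, lam=0.0))    # ≈ [1. 2.]
```

A side benefit: for $\lambda > 0$ the matrix $X^TX+\lambda L$ is invertible even when $X^TX$ alone is singular (for example with redundant features).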

4. Regularization of logistic regression

For logistic regression (see Machine Learning - Logistic Regression Algorithm), we use gradient descent to optimize the cost function $J(\theta)$.

Similarly, for logistic regression we add a regularization term to the cost function, obtaining:

$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
To minimize this cost function, taking the derivatives yields the gradient descent algorithm:

Repeat until convergence {

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \qquad \text{for } j=1,2,\dots,n$$

}

Note: this looks the same as linear regression, but since here $h_\theta(x)=g(\theta^TX)$, it is in fact a different algorithm.
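As a sketch (the toy data, step size, and $\lambda$ are my assumptions), the same update with $h_\theta$ swapped for the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_reg(X, y, alpha=0.5, lam=0.1, iters=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # h_theta is now the sigmoid
        grad[1:] += (lam / m) * theta[1:]           # regularize all but theta_0
        theta -= alpha * grad
    return theta

# linearly separable toy data with an intercept column x_0 = 1
X = np.column_stack([np.ones(6), np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = logistic_gd_reg(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)   # classifies the toy set correctly: [0. 0. 0. 1. 1. 1.]
```

On separable data like this, the regularization term also keeps $\theta$ from growing without bound, which unregularized logistic regression would otherwise do.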

That's all about regularization. This article contains notes I took while studying Andrew Ng's machine learning course. If you have any questions, feel free to ask!

Origin blog.csdn.net/Luo_LA/article/details/127640971