Machine Learning Notes: overfitting

Overfitting

If we have too many features, the hypothesis we learn may fit the training set very well (the cost function may be almost zero) yet fail to generalize to new data.
Below is an example from a regression problem:


The first model is linear; it underfits and does not capture our training set well. The third model is a fourth-order polynomial; it overfits: although it fits the training set very well, it is likely to predict poorly on new inputs. The middle model seems the most appropriate.
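The effect can be reproduced numerically. Below is a minimal sketch (my own illustration, not from the original notes; the data and polynomial degrees are made up) that fits polynomials of degree 1, 2 and 4 to a handful of noisy points and compares the training error with the error on held-out points:

```python
import numpy as np

# Tiny synthetic data set: noisy samples of a smooth function.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 6)
y_train = np.sqrt(x_train) + rng.normal(0, 0.1, size=x_train.shape)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sqrt(x_test)

for degree in (1, 2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)          # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# Typically the degree-4 fit drives the training error toward zero while its
# held-out error is worse than the degree-2 fit: it is chasing the noise.
```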
The same problem also arises in classification:


The question is: if we find that our model is overfitting, what should we do?

1. Reduce the number of features: discard the features that do not help us predict correctly.
You can select which features to keep manually,
or use an algorithm to help with model selection (e.g., PCA).
2. Regularization:
keep all the features, but reduce the magnitude of the parameters.

 

Regularized cost function
Taking the regression example above, suppose our model is:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$
We decide to reduce the influence of θ3 and θ4. To do so we modify the cost function and attach a penalty to θ3 and θ4, for example by putting large coefficients such as 1000 and 10000 in front of them. When minimizing the cost, the optimizer must also take these penalty terms into account and will therefore choose very small values for θ3 and θ4. The modified cost function is:

$$\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 10000\,\theta_4^2\right]$$
With a cost function chosen in this way, θ3 and θ4 have much less influence on the predictions than before.
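As a minimal sketch of this idea (my own illustration; the function and variable names are not from the notes), the penalized cost is just the ordinary squared error plus the two heavy penalty terms:

```python
import numpy as np

def penalized_cost(theta, X, y):
    """theta: (5,) parameters; X: (m, 5) matrix whose columns are [1, x, x^2, x^3, x^4]."""
    m = len(y)
    errors = X @ theta - y
    penalty = 1000 * theta[3] ** 2 + 10000 * theta[4] ** 2   # heavy penalties on theta_3, theta_4
    return (np.sum(errors ** 2) + penalty) / (2 * m)

# Minimizing this cost forces theta[3] and theta[4] toward zero, so the learned
# hypothesis behaves almost like a quadratic.
```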
If we have many features, we usually do not know in advance which of them should be penalized, so we penalize all of them and let the optimization of the cost function decide how strongly each parameter is shrunk. The result is a simpler hypothesis that is less prone to overfitting:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
Here λ is called the regularization parameter (Regularization Parameter).
Note: by convention, we do not penalize θ0.
A comparison between the regularized model and the original model might look as shown below:

If the regularization parameter λ is chosen too large, all parameters are shrunk toward zero and the model degenerates to hθ(x) = θ0 (the red line in the figure above), which underfits.
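A minimal numerical sketch of this cost (variable names are mine, not the course's), assuming a design matrix X whose first column is all ones:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """X: (m, n+1) design matrix whose first column is all ones; theta: (n+1,) parameters."""
    m = len(y)
    errors = X @ theta - y
    reg = lam * np.sum(theta[1:] ** 2)        # theta_0 (theta[0]) is not penalized
    return (np.sum(errors ** 2) + reg) / (2 * m)

# A very large lam shrinks theta[1:] toward zero, leaving roughly h(x) = theta_0,
# i.e. the underfitting case described above.
```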

 

Regularized linear regression
The cost function for regularized linear regression is:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
If we want to minimize this cost function with gradient descent, then because θ0 is not regularized, the update rule splits into two cases:

Repeat until convergence {
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, \ldots, n)$$
}

Rearranging the update for j = 1, 2, ..., n gives:

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
As can be seen, the only change from unregularized gradient descent is that on every update the value of θj is first shrunk by an additional factor of (1 − αλ/m).
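A minimal sketch of these updates (my own variable names; not the course's code), again assuming a design matrix with a leading column of ones:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, lam=1.0, iters=1000):
    """Regularized linear regression via gradient descent.
    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / m                   # unregularized gradient
        theta_new = theta - alpha * grad
        theta_new[1:] -= alpha * (lam / m) * theta[1:]       # extra shrinkage, not applied to theta_0
        theta = theta_new
    return theta
```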
We can also solve the regularized linear regression model with the normal equation, as follows:

$$\theta = \left(X^T X + \lambda\begin{bmatrix}0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1\end{bmatrix}\right)^{-1} X^T y$$

The matrix in this formula has size (n+1) × (n+1).

We state without proof that as long as λ > 0, the matrix being inverted is guaranteed to be invertible.
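A minimal sketch of the regularized normal equation (my own illustration; names are assumed):

```python
import numpy as np

def normal_equation(X, y, lam=1.0):
    """Closed-form solution theta = (X^T X + lam * L)^{-1} X^T y.
    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)        # (n+1) x (n+1) identity ...
    L[0, 0] = 0.0               # ... with the top-left entry zeroed so theta_0 is not penalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```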

 

Regularized logistic regression

Similarly, for logistic regression we add a regularization term to the cost function and obtain:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
To minimize this cost function, we take the derivatives and obtain the gradient descent algorithm:

Repeat until convergence {
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, \ldots, n)$$
}
Note: this looks identical to the linear regression update, but here hθ(x) = g(θᵀx) (the sigmoid function), so it is in fact a different algorithm.
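A minimal sketch of the regularized logistic regression cost and gradient (my own variable names; not the course's code), where the only change from linear regression is the sigmoid hypothesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,) of 0/1 labels."""
    m = len(y)
    h = sigmoid(X @ theta)                                    # h_theta(x) = g(theta^T x)
    cost = (-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
            + (lam / (2 * m)) * np.sum(theta[1:] ** 2))       # theta_0 not penalized
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]                         # regularization term for j >= 1
    return cost, grad
```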
 

 

