Overfitting
What is overfitting?
Linear regression overfitting
In a linear regression problem, we can fit the data set with a linear equation, a quadratic equation, or a higher-order polynomial, as shown:
The linear equation clearly does not fit the data set well. It underfits, and the error is high.
The quadratic model is a good fit.
The high-order equation passes through every data sample, but the curve is too tortuous to be a good model. This is called overfitting. Another term for the problem is high variance.
High variance: when we fit a function to the data samples, the function may fit the training set very well, matching almost all of the training data. But such a function may be too large, with too many variables, and if we do not have enough data to constrain a model with that many variables, the result is overfitting.
Generally speaking, overfitting occurs when there are too many variables. The trained equation then always fits the training data well, and the cost function may be very close to zero, but the equation does not generalize to new data samples. For example, it cannot predict the price of a new sample well.
Generalization refers to the ability of a hypothesis to apply to new samples.
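To make the three fits concrete, here is a minimal sketch using NumPy. The data (house sizes and prices) and the helper name `fit_polynomial` are made up for illustration and are not from the original example. It fits polynomials of degree 1, 2, and 5 to six samples and reports the training error of each, along with the prediction for one unseen size.

```python
import numpy as np

# Hypothetical toy data: house size (in 1000 sq ft) and price (in $1000s).
size = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
price = np.array([155.0, 250.0, 340.0, 365.0, 405.0, 408.0])


def fit_polynomial(degree):
    """Least-squares polynomial fit; returns (coefficients, training MSE)."""
    coeffs = np.polyfit(size, price, degree)
    predictions = np.polyval(coeffs, size)
    mse = float(np.mean((predictions - price) ** 2))
    return coeffs, mse


new_size = 3.5  # a sample that was not in the training set
train_errors = {}
for degree in (1, 2, 5):
    coeffs, mse = fit_polynomial(degree)
    train_errors[degree] = mse
    predicted = float(np.polyval(coeffs, new_size))
    print(f"degree {degree}: training MSE = {mse:.4f}, "
          f"predicted price at size {new_size} = {predicted:.1f}")
```

With six samples, the degree-5 polynomial has enough parameters to pass through every point, so its training error is essentially zero. That says nothing about how sensible its prediction for the unseen size is, which is the point of the generalization discussion above.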
Logistic regression overfitting
Consider the following example set of data samples:
Logistic regression can also underfit: when a linear function is used as the hypothesis, the model has high bias.
In the second figure, adding quadratic terms gives a good fit to the data set.
After adding many more higher-order terms, the model overfits: the curve becomes distorted and is not a good predictor for new samples. That is, it does not generalize to new samples.
How to solve the problem of overfitting:
You can choose an appropriate polynomial order by plotting the hypothesis function. But when there are many variables, plotting the function is not a good method.
The first approach is to reduce the number of selected variables. Specifically, we can manually inspect the list of variables, determine which are more important, and then decide which to keep and which to discard.
The second approach is regularization. Regularization keeps all of the feature variables but reduces the magnitude of the parameter values. This is a good approach, because we still make use of every variable.
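The effect of regularization on the parameter values can be sketched as follows. The text above does not say which form of regularization is meant, so this example assumes an L2 (ridge) penalty solved in closed form. The data, the helper name `ridge_fit`, and the penalty strengths are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: house size (in 1000 sq ft) and price (in $1000s).
size = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
price = np.array([155.0, 250.0, 340.0, 365.0, 405.0, 408.0])

# Degree-4 polynomial features; columns are 1, x, x^2, x^3, x^4.
X = np.vander(size, N=5, increasing=True)


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * P) theta = X^T y, where P is the identity
    matrix with a zero in the intercept position (assumption: the
    intercept is left unpenalized)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


param_norms = {}
for lam in (0.0, 0.1, 10.0):
    theta = ridge_fit(X, price, lam)
    # Size of the penalized parameters (everything except the intercept).
    param_norms[lam] = float(np.linalg.norm(theta[1:]))
    print(f"lambda = {lam}: norm of non-intercept parameters = "
          f"{param_norms[lam]:.3f}")
```

All five features stay in the model for every value of `lam`; only the magnitude of their parameters shrinks as the penalty grows.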
--------------------