[Machine learning notes] "Overfitting" and "underfitting" in machine learning

Machine Learning "over-fitting (Overfitting)" and "less fit (Underfitting)"

In machine learning, when discussing how well a model learns and generalizes, two terms come up constantly: overfitting and underfitting. Overfitting and underfitting are the two main reasons why machine learning algorithms perform poorly.


What are overfitting and underfitting?

Overfitting: during parameter fitting, the training data inevitably contain sampling error (noise), and an overly complex model will fit this sampling error as well as the underlying signal. The typical symptom is a model that performs well on the training set but poorly on the test set, i.e. weak generalization.

A fitted model is generally used to predict unknown results (data outside the training set). Although an overfitted model does well on the training set, it gives poor results in actual use (on the test set). Moreover, for many problems we cannot enumerate all possible states, so the training set can never contain every case. Overfitting is therefore a problem we must solve.

Underfitting: the model cannot achieve a sufficiently low error even on the training set.

Simply put, when a learner fits the training samples "too well", it is likely to take peculiarities of the individual training samples as general properties that all samples possess, which degrades generalization performance; this is overfitting. Its opposite, underfitting, means that the general properties of the training samples have not yet been learned. (Machine Learning, Zhou Zhihua, Tsinghua University Press, p. 23.)

In the process of training a neural network, underfitting mainly shows up as high bias in the output, while overfitting mainly shows up as high variance in the output.


A simple criterion for judging overfitting and underfitting

Performance on the training set | Performance on the test set | Judgment
not good                        | not good                    | Underfitting
good                            | not good                    | Overfitting
good                            | good                        | Good fit
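
The table above can be expressed directly as a small decision rule. The sketch below is not from the original article; the "good" accuracy threshold and the train/test gap are illustrative assumptions.

```python
# A minimal sketch of the judgment table above. The 0.8 "good" threshold and
# the 0.1 train/test gap are illustrative assumptions, not values from the text.
def diagnose_fit(train_score: float, test_score: float,
                 good: float = 0.8, gap: float = 0.1) -> str:
    if train_score < good:
        return "underfitting"      # not good on the training set
    if test_score < good or train_score - test_score > gap:
        return "overfitting"       # good on training, not good on test
    return "good fit"              # good on both

print(diagnose_fit(0.65, 0.60))    # -> underfitting
print(diagnose_fit(0.99, 0.70))    # -> overfitting
print(diagnose_fit(0.92, 0.90))    # -> good fit
```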

Causes of overfitting and solutions

Common causes

  • Errors in selecting the modeling samples, such as too few samples, a flawed sampling method, or wrong sample labels, so that the selected sample data are not representative enough to cover the intended classification rules;
  • Too much noise in the samples, so that the machine treats part of the noise as features and thereby disrupts the intended classification rules;
  • The assumed model does not reasonably exist, or the conditions under which the assumptions hold are not actually satisfied;
  • Too many parameters, so the model complexity is too high;
  • For decision tree models, if we do not reasonably limit their growth, their free growth may produce nodes that contain only a single event (or non-event) data point. Such a tree can fit the training data perfectly, but it cannot adapt to other data sets.
  • For neural network models:

1) the decision surface that classifies the sample data may not be unique; as the weights are learned, the BP algorithm may converge to an overly complicated decision surface;

2) the weights are trained for too many iterations (overtraining), so the model fits the noise and the non-representative features in the training examples.

Solutions

  • In neural network models, weight decay can be used, i.e. reducing each weight value by a small factor during every iteration;
  • Choose an appropriate stopping criterion for training, so that the machine is trained to an appropriate level;
  • Keep a held-out validation data set and use it to validate the training results;
  • Obtain additional data for cross-validation;
  • Regularization, i.e. adding a regularization term to the objective function or cost function during optimization; the common choices are L1 and L2 regularization (a minimal sketch follows this list).
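
To illustrate the weight-decay and L2-regularization points above (for plain gradient descent the two amount to the same mechanism), here is a minimal sketch assuming only NumPy; names such as `lambda_reg` and `lr`, and all numeric settings, are illustrative choices rather than values from the article.

```python
# L2-regularized linear regression by gradient descent: the lambda_reg term
# shrinks every weight a little at each step, i.e. "weight decay".
import numpy as np

def fit_ridge_gd(X, y, lambda_reg=1.0, lr=0.01, n_iters=1000):
    """Minimise 1/(2m) * ||Xw - y||^2 + lambda_reg/(2m) * ||w||^2."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        grad = (X.T @ (X @ w - y) + lambda_reg * w) / m
        w -= lr * grad
    return w

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)
print(fit_ridge_gd(X, y, lambda_reg=1.0))
```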

Causes of underfitting and solutions

Common causes

  • The model complexity is too low;
  • There are too few features.

Solutions

Underfitting is relatively easy to overcome. Common solutions are:

  • Add new features: taking feature combinations and higher-order features into account enlarges the hypothesis space;

  • Add polynomial features. This is widely used in machine learning algorithms; for example, adding quadratic or cubic terms to a linear model makes it generalize better (a minimal sketch follows this list);

  • Reduce the regularization parameter. The purpose of regularization is to prevent overfitting, so if the model underfits, the regularization parameter should be reduced;

  • Use nonlinear models, such as kernel SVMs, decision trees, or deep learning models;

  • Adjust the model's capacity; colloquially, capacity is the model's ability to fit a wide variety of functions;

  • A model with low capacity may struggle to fit the training set. Ensemble learning methods, such as Bagging over multiple weak learners, can also help.
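
To illustrate the polynomial-feature point above, here is a minimal sketch assuming scikit-learn is available; the sine-shaped synthetic data and the degree-3 choice are illustrative assumptions.

```python
# Fighting underfitting by adding polynomial features to a linear model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# A plain linear model underfits the nonlinear relationship ...
linear = LinearRegression().fit(X, y)
# ... while cubic features enlarge the hypothesis space enough to capture it.
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)

print("linear   R^2:", round(linear.score(X, y), 3))
print("degree-3 R^2:", round(cubic.score(X, y), 3))
```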


Glossary: generalization

Generalization ability refers to a machine learning algorithm's ability to adapt to fresh, previously unseen samples.

The purpose of learning is to learn the regularities hidden behind the data, so that for data outside the learning set that follow the same regularities, the trained network can still give appropriate outputs; this ability is called generalization. We generally hope that a network trained on the training samples has strong generalization ability, that is, the ability to give reasonable outputs for new inputs. It should be pointed out that more training does not necessarily yield a more correct input-output mapping. Network performance is measured primarily by its generalization ability.


Understanding bias and variance in machine learning


In machine learning, bias describes the gap between the model's output on the samples and the true results; the loss function that the model back-propagates is based on the size of this bias. Reducing bias requires a more complex model with more parameters, but that easily leads to overfitting. Variance describes how the trained model performs on the test set; reducing variance means continuing to simplify the model and reduce its parameters, but that easily leads to underfitting. The fundamental reason for this trade-off is that we always try to estimate the truly unlimited data with a limited set of training samples. If we could obtain every possible data point and find the model that minimizes the loss over all of them, that model would be the "true model". In practical applications, however, we can never obtain and train on all possible data, so the true model certainly exists but can never be reached.
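
The trade-off described above can be made concrete by fitting models of increasing complexity and comparing training and test errors. This is a minimal sketch assuming scikit-learn; the synthetic data, noise level, and polynomial degrees are illustrative assumptions.

```python
# Bias/variance trade-off: a degree-1 model underfits (high bias), a degree-15
# model overfits (high variance), and a moderate degree balances the two.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```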


A supplementary description of overfitting and underfitting in English:

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance. Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

The following is adapted from http://www.analyticbridge.com/profiles/blogs/underfitting-overfitting-problem-in-m-c-learning:

Underfitting: 

If our algorithm works badly even on the points in our own data set, then it is underfitting the data set. This can be checked easily through the cost function. The cost function in linear regression is half the mean squared error: if the mean squared error is C, the cost is 0.5C. If the cost ends up high even after many iterations, then chances are we have an underfitting problem, and we can say the learning algorithm is not a good fit for the problem. Underfitting is also known as high bias (a strong bias towards its hypothesis). In other words, the hypothesis space the learning algorithm explores is too small to properly represent the data.
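
As a minimal sketch of the check described above, assuming only NumPy, the snippet below computes the half-mean-squared-error cost while running gradient descent; the learning rate, iteration count, and synthetic data are illustrative assumptions.

```python
# Half-MSE cost J(w) = 1/(2m) * sum((X @ w - y)^2); a cost that stays high
# after many iterations is a sign of underfitting (high bias).
import numpy as np

def half_mse_cost(X, y, w):
    m = len(y)
    residual = X @ w - y
    return (residual @ residual) / (2 * m)

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]          # bias column + one feature
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

w = np.zeros(2)
for i in range(201):
    w -= 0.1 * (X.T @ (X @ w - y) / len(y))            # gradient-descent step
    if i % 50 == 0:
        print(f"iteration {i:3d}  cost = {half_mse_cost(X, y, w):.4f}")
```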

How to avoid underfitting:
More data will generally not help; in fact, it will likely increase the training error. Instead, we should add more features, because that expands the hypothesis space. This includes making new features from existing features. In the same way, more parameters may also expand the hypothesis space.

 

Overfitting:

If our algorithm works well on the points in our data set but not on new points, then it is overfitting the data set. Overfitting can be checked easily by splitting the data so that 90% is in the training set and 10% in a cross-validation set: train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In other words, the hypothesis space is too large, and perhaps some features are fooling the learning algorithm.
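
Here is a minimal sketch of that 90% / 10% check, assuming scikit-learn; the deliberately over-flexible degree-15 model and the synthetic data are illustrative assumptions.

```python
# Split 90% / 10%, train on the larger part, and compare costs: a much higher
# cross-validation cost than training cost suggests overfitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 1))
y = 3.0 * X.ravel() ** 2 + rng.normal(scale=0.1, size=50)

X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.1, random_state=1)

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_cost = mean_squared_error(y_train, model.predict(X_train))
cv_cost = mean_squared_error(y_cv, model.predict(X_cv))
print(f"training cost = {train_cost:.4f}   cross-validation cost = {cv_cost:.4f}")
```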

How to avoid overfitting:
To avoid overfitting, add regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller (shrinking the hypothesis space). To do this, add a new term to the cost function, as sketched below.
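
A minimal sketch of that extra cost-function term, assuming NumPy; the name `lam` for the regularization strength and the convention of leaving the bias term unpenalized are illustrative choices, not from the source.

```python
# Half-MSE data term plus an L2 penalty lam/(2m) * sum(w[1:]^2); the penalty
# shrinks the parameter magnitudes and thereby the hypothesis space.
import numpy as np

def regularized_cost(X, y, w, lam):
    m = len(y)
    residual = X @ w - y
    data_term = (residual @ residual) / (2 * m)
    penalty = lam * (w[1:] @ w[1:]) / (2 * m)   # skip the bias/intercept w[0]
    return data_term + penalty

# Usage: for the same data fit, the cost grows with the parameter magnitudes.
X = np.c_[np.ones(4), np.arange(4.0)]
y = np.array([0.0, 1.0, 2.0, 3.0])
print(regularized_cost(X, y, np.array([0.0, 1.0]), lam=10.0))   # small penalty
print(regularized_cost(X, y, np.array([0.0, 5.0]), lam=10.0))   # larger cost
```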

 
