[机器学习笔记] 机器学习中的“过拟合（Overfitting）”和“欠拟合（Underfitting）”

机器学习中的“过拟合（Overfitting）”和“欠拟合（Underfitting）”

在机器学习领域中，当讨论一个机器学习模型学习和泛化的好坏时，通常使用术语是：过拟合（Overfitting）和欠拟合（Underfitting）。过拟合和欠拟合是机器学习算法表现差的两大原因。

什么是过拟合和欠拟合？

过拟合（overfitting）：是指在模型参数拟合过程中的问题，由于训练数据包含抽样误差，训练时，复杂的模型将抽样也考虑在内，将抽样误差也进行了很好的拟合。具体表现就是最终模型在训练集上效果好；在测试集上效果差。模型泛化能力弱。

拟合的模型一般是用来预测未知的结果（不在训练集内），过拟合虽然在训练集上效果好，但是在实际使用时（测试集）效果差。同时，在很多问题上，我们无法穷尽所有状态，不可能将所有情况都包含在训练集上。所以，必须要解决过拟合问题。

欠拟合（Underfitting）：是指模型不能在训练集上获得足够低的误差。

简单来说，当学习器把训练样本学得“太好了”的时候，很可能已经把训练样本自身的一些特点当作了所有潜在的样本都会具有的性质，这样就导致泛化性能下降，这就是“过拟合（Overfitting）”；与之相对的是“欠拟合（Underfitting）”，这是指对训练样本的一般性质尚未学好。《机器学习》（周志华，清华大学出版社，P23.）

在神经网络训练的过程中，欠拟合主要表现为输出结果的高偏差，而过拟合主要表现为输出结果的高方差。

过拟合和欠拟合的简单判断标准

训练集上的表现	测试集上的表现	判定结果
不好	不好	欠拟合（欠配）
好	不好	过拟合（过配）
好	好	适度拟合

过拟合的产生原因和解决方法

常见原因

建模样本选取有误，如样本数量太少，选样方法错误，样本标签错误等，导致选取的样本数据不足以代表预定的分类规则
样本噪音干扰过大，使得机器将部分噪音认为是特征从而扰乱了预设的分类规则；
假设的模型无法合理存在，或者说是假设成立的条件实际并不成立；
参数太多，模型复杂度过高；
对于决策树模型，如果我们对于其生长没有合理的限制，其自由生长有可能使节点只包含单纯的事件数据(event)或非事件数据(no event)，使其虽然可以完美匹配（拟合）训练数据，但是无法适应其他数据集。
对于神经网络模型：

1）对样本数据可能存在分类决策面不唯一，随着学习的进行,，BP算法使权值可能收敛过于复杂的决策面；

2）权值学习迭代次数足够多(Overtraining)，拟合了训练数据中的噪声和训练样例中没有代表性的特征。

解决方法

在神经网络模型中，可使用权值衰减的方法，即每次迭代过程中以某个小因子降低每个权值。
选取合适的停止训练标准，使对机器的训练在合适的程度；
保留验证数据集，对训练成果进行验证；
获取额外数据进行交叉验证；
正则化，即在进行目标函数或代价函数优化时，在目标函数或代价函数后面加上一个正则项，一般有L1正则与L2正则等。

欠拟合的产生原因和解决方法

常见原因

模型复杂度过低
特征量过少

解决方法

欠拟合的情况比较容易克服，常见解决方法有：

增加新特征，可以考虑加入进特征组合、高次特征，来增大假设空间；
添加多项式特征，这个在机器学习算法里面用的很普遍，例如将线性模型通过添加二次项或者三次项使模型泛化能力更强；
减少正则化参数，正则化的目的是用来防止过拟合的，但是模型出现了欠拟合，则需要减少正则化参数；
使用非线性模型，比如核SVM 、决策树、深度学习等模型；
调整模型的容量(capacity)，通俗地，模型的容量是指其拟合各种函数的能力；
容量低的模型可能很难拟合训练集；使用集成学习方法，如Bagging ,将多个弱学习器Bagging。

名词解释：泛化能力

泛化能力（Generalization Ability）是指机器学习算法对新鲜样本的适应能力。

学习的目的是学到隐含在数据对背后的规律，对具有同一规律的学习集以外的数据，经过训练的网络也能给出合适的输出，该能力称为泛化能力。通常期望经训练样本训练的网络具有较强的泛化能力，也就是对新输入给出合理响应的能力。应当指出并非训练的次数越多越能得到正确的输入输出映射关系。网络的性能主要用它的泛化能力来衡量。

机器学习中的偏差和方差的理解

在机器学习中，偏差描述的是根据样本拟合出的模型输出结果与真实结果的差距，损失函数就是依据模型偏差的大小进行反向传播的。降低偏差，就需要复杂化模型，增加模型参数，但容易造成过拟合。方差描述的是样本上训练出的模型在测试集上的表现，降低方差，继续要简化模型，减少模型的参数，但容易造成欠拟合。根本原因是，我们总是希望用有限的训练样本去估计无限的真实数据。假定我们可以获得所有可能的数据集合，并在这个数据集上将损失函数最小化，则这样的模型称之为“真实模型”。但实际应用中，并不能获得且训练所有可能的数据，所以真实模型一定存在，但无法获得

补充一段Overfitting和Underfitting的英文的说明：

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias buthigh variance. Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias. Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

以下内容参考：

refer from http://www.analyticbridge.com/profiles/blogs/underfitting-overfitting-problem-in-m-c-learning

Underfitting :

If our algorithm works badly with points in our data set, then the algorithm underfitting the data set. It can be check easily throug the cost function measures. Cost function in linear regression is half the mean squared error ex. if mean squared error is c the cost fucntion is 0.5C 2. If in an experiment cost ends up high even after many iterations, then chances are we have an underfitting problem. We can say that learning algorithm is not good for the problem. Underfitting is also known as high bias( strong bias towards its hypothesis). In an another words we can say that hypothesis space the learning algorithm explores is too small to properly represent the data.

How to avoid underfitting :
More data will not generally help. It will, in fact, likely increase the training error. Therefore we should increase more features. Because that expands the hypothesis space. This includes making new features from existing features. Same way more parameteres may also expand the hypothesis space.

Overfitting :

If our algorithm works well with points in our data set, but not on new points, then the algorithm overfitting the data set. Overfitting check easily through by spliting the data set so that 90% of data in our training set and 10% in a cross-validation set. Train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In another words we can say that hypothesis space is too large, and perhaps some features are faking the learning algorithm out.

How to avoid overfitting :
To avoid overfitting add the regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller(shrinking the hypothesis space). For this add a new term to the cost function

梅森上校

发布了619 篇原创文章 · 获赞 185 · 访问量 66万+

他的留言板关注