Evaluating a Learning Algorithm

This post mainly covers the relationship between the polynomial degree d and overfitting/underfitting.

Figure 1

As shown in the figure above, we can draw the following conclusions:

1. High bias (underfitting): both the training error and the validation error are high.

2. High variance (overfitting): the training error is small, and the validation error is much larger than the training error.

From the learning-curve figure for the high-bias case, we can draw the following conclusions:

1. Small training set: the training error is small and the validation error is large.

2. Large training set: both the validation error and the training error are high, and they are roughly equal.

3. If a learning algorithm is suffering from high bias, getting more training data will not help much.

For the high-variance case, the conclusions are:

1. Small training set: the training error is small and the validation error is large.

2. Large training set: the training error slowly increases while the validation error keeps decreasing up to some point; the training error stays below the validation error, and the gap between them remains noticeable.

3. If a learning algorithm is suffering from high variance, getting more training data is likely to help.
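The learning-curve behavior described above can be sketched with synthetic data. This is a minimal illustration, not code from the course: the quadratic data, the noise level, and the split sizes are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data: the true relation is quadratic plus noise.
x = rng.uniform(-3, 3, 200)
y = x**2 + rng.normal(0, 0.5, 200)

x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]

def half_mse(theta, xs, ys):
    # Squared-error cost (without regularization), as used for J_train and J_cv.
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# Fit on progressively larger training subsets and record both errors.
train_err, cv_err = [], []
sizes = list(range(5, 121, 5))
for m in sizes:
    theta = np.polyfit(x_train[:m], y_train[:m], deg=2)
    train_err.append(half_mse(theta, x_train[:m], y_train[:m]))
    cv_err.append(half_mse(theta, x_cv, y_cv))
```

Plotting `train_err` and `cv_err` against `sizes` reproduces the curves discussed above: with a well-matched model the two curves converge as the training set grows.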

To choose the regularization parameter λ:

  1. Create a list of λs (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24});
  2. Create a set of models with different degrees or any other variants.
  3. Iterate through the λs, and for each λ go through all the models to learn some Θ.
  4. Compute the cross-validation error J_cv(Θ) using the learned Θ (computed with λ), without regularization (i.e. with λ = 0).
  5. Select the best combo that produces the lowest error on the cross validation set.
  6. Using the best combo Θ and λ, apply it on J_test(Θ) to see if it generalizes well to the problem.
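The λ-selection loop above can be sketched as follows. This is an illustrative implementation under assumptions not in the original: ridge regression via the regularized normal equation, a degree-8 polynomial design matrix, and synthetic sine data are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a degree-8 polynomial design matrix over noisy sine data.
x = rng.uniform(-1, 1, 60)
y = np.sin(2 * x) + rng.normal(0, 0.2, 60)
X = np.vander(x, 9)                      # columns x^8 ... x^0
X_train, y_train = X[:40], y[:40]
X_cv, y_cv = X[40:], y[40:]

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

def fit_ridge(X, y, lam):
    # Regularized normal equation: Θ = (XᵀX + λI)⁻¹ Xᵀy.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def j(theta, X, y):
    # Unregularized squared-error cost, as specified for J_cv.
    return np.mean((X @ theta - y) ** 2) / 2

# Pick the λ whose learned Θ gives the lowest unregularized CV error.
best_lam = min(lambdas,
               key=lambda lam: j(fit_ridge(X_train, y_train, lam), X_cv, y_cv))
```

The key detail matching step 4 above: Θ is learned *with* λ, but J_cv is evaluated *without* the regularization term.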

Summary:

We can separate our dataset into three different sets:
1.  Training set: 60%  
2.  Cross validation set: 20%
3.  Test set: 20%
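A minimal sketch of the 60/20/20 split (the shuffling step and the dataset size of 100 are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
idx = rng.permutation(n)  # shuffle indices before splitting

# 60% training, 20% cross-validation, 20% test.
train_idx = idx[: int(0.6 * n)]
cv_idx = idx[int(0.6 * n): int(0.8 * n)]
test_idx = idx[int(0.8 * n):]
```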


1.  Optimize the parameters in Θ using the training set for each polynomial degree.

2.  Find the polynomial degree d with the least error using the cross validation set.

3.  Estimate the generalization error using the test set with J_test(Θ), where d is the degree of the polynomial that produced the lowest cross-validation error.
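The three-step procedure above can be sketched with synthetic data. The cubic ground truth, the candidate degree range 1–8, and the concrete split sizes are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 150)
y = x**3 - x + rng.normal(0, 0.3, 150)

x_tr, y_tr = x[:90], y[:90]          # 60% training
x_cv, y_cv = x[90:120], y[90:120]    # 20% cross-validation
x_te, y_te = x[120:], y[120:]        # 20% test

def err(theta, xs, ys):
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# 1. Optimize Θ on the training set for each candidate degree d.
thetas = {d: np.polyfit(x_tr, y_tr, d) for d in range(1, 9)}

# 2. Pick the degree d with the lowest cross-validation error.
best_d = min(thetas, key=lambda d: err(thetas[d], x_cv, y_cv))

# 3. Estimate the generalization error on the held-out test set.
test_err = err(thetas[best_d], x_te, y_te)
```

Because d was chosen on the cross-validation set, only the test-set error is an unbiased estimate of generalization; reporting the CV error itself would be optimistic.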


High variance and high bias

High variance: the training error is small, and the gap between test error and training error is large. The gap can be reduced by enlarging the training set; as the amount of data grows, the test error decreases.

Overfitting the data causes a high-variance problem, e.g. using 5 features to fit only 4 data points.
How to solve:
1. Get more training data.
2. Use fewer features.


High bias: the training error is large, and the gap between training error and test error is small; as the training set grows, the training error increases.

Underfitting the data tends to cause a high-bias problem.
How to solve:
1. Find better (more representative) features.
2. Use more features (increase the dimension of the input vector).

A simple way to judge whether a model suffers from high variance or high bias is the gap between the training error and the test error: a large gap (with low training error) indicates high variance, while a small gap (with both errors high) indicates high bias.
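The rule of thumb above can be written down as a small helper. The thresholds here are arbitrary assumptions for illustration; in practice the judgment also depends on the scale of the errors and the desired target error.

```python
def diagnose(train_err, cv_err, high_err=1.0):
    """Hypothetical rule of thumb for the diagnosis described above:
    both errors high and close together  -> high bias (underfitting);
    low training error, much larger CV error -> high variance.
    The thresholds (high_err, the 0.5x and 2x factors) are assumptions."""
    if train_err >= high_err and cv_err - train_err < 0.5 * train_err:
        return "high bias"
    if cv_err > 2 * train_err:
        return "high variance"
    return "looks OK"
```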

References:

Machine Learning, Andrew Ng's course


Reposted from blog.csdn.net/schuffel/article/details/82218460