[Introduction to Artificial Intelligence] Notes Part 4: Model Evaluation and Selection

1、Empirical Error and Overfitting

Terminology:

Empirical Error: also called the training error, i.e., the error of the model on the training set.

Overfitting: defined in detail below.

Generalization Error: the error of the model on new samples (estimated on a test set) is called the "generalization error".

 

Underfitting:

Underfitting: A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data; it performs poorly on the training data as well as on new testing data.

In class, our teacher described underfitting as "learning too little".

Reasons for Underfitting:
•High bias and low variance
•The size of the training dataset used is not enough.
•The model is too simple.
•Training data is not cleaned and contains noise.

Overfitting: A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model fits its training set too closely, it starts learning from the noise and inaccurate data entries in the data set, and testing on new data then shows high variance. The model fails to categorize the data correctly because it has taken in too many details and too much noise. Non-parametric and non-linear methods are common causes of overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore end up building unrealistic models.

Using the teacher's phrasing again: overfitting is "learning too finely", taking unnecessary details into the model.

Reasons for Overfitting are as follows:
•High variance and low bias
•The model is too complex
•The training dataset is too small or too noisy

The usual illustration (figure omitted here) captures the characteristics of overfitting and underfitting well:

Underfitting learns too little, so the resulting decision boundary is too coarse and simple.
Overfitting learns too much detail, taking in trivial minutiae, so the model becomes overly complex.

Techniques to reduce underfitting:
•Increase model complexity
•Increase the number of features by performing feature engineering.
•Remove noise from the data.
•Increase the number of epochs or increase the duration of training to get better results.

Techniques to reduce overfitting:
•Increase training data.
•Reduce model complexity.
•Early stopping during the training phase (monitor the validation loss during training and stop as soon as it begins to rise).
•Ridge regularization and Lasso regularization (see the sketch after this list).
•Use dropout for neural networks to tackle overfitting.
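A minimal sketch of the regularization idea mentioned above, using scikit-learn's Ridge and Lasso; the synthetic dataset (many features, only one of which is informative) is an assumption made purely for illustration:

```python
# Minimal sketch: Ridge (L2) and Lasso (L1) regularization vs. plain least squares.
# The synthetic data is made up for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                        # many features -> easy to overfit
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)   # only feature 0 actually matters

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    # A large gap between train and test R^2 indicates overfitting;
    # the regularized models should show a smaller gap.
    print(f"{name:5s}  train R2 = {model.score(X_train, y_train):.3f}"
          f"  test R2 = {model.score(X_test, y_test):.3f}")
```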

2、Evaluation Methods

Hold-out Method

  • Hold-out method: split the dataset D directly into two mutually exclusive sets, using one as the training set S and the other as the test set T; this is called the "hold-out" method.
  • Stratified sampling: when splitting the dataset, a sampling scheme that preserves the class proportions is called "stratified sampling". For a dataset D with 500 positive and 500 negative examples, a stratified 70% training set S should contain 350 positive and 350 negative examples, and the 30% test set T should contain 150 positive and 150 negative examples (see the sketch below).
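A small sketch of a stratified 70/30 hold-out split using scikit-learn's train_test_split; the 500-positive / 500-negative dataset is simulated here just to match the numbers above:

```python
# Stratified hold-out split: 70% training / 30% test, preserving class proportions.
import numpy as np
from sklearn.model_selection import train_test_split

# Simulated dataset D: 500 positive (1) and 500 negative (0) examples.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 500 + [0] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print("train:", (y_train == 1).sum(), "positive /", (y_train == 0).sum(), "negative")  # 350 / 350
print("test :", (y_test == 1).sum(), "positive /", (y_test == 0).sum(), "negative")    # 150 / 150
```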

Cross-Validation Method
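In k-fold cross-validation, D is split into k mutually exclusive subsets of roughly equal size; each round uses k-1 subsets for training and the remaining one for testing, and the k results are averaged. A minimal sketch with scikit-learn; the iris dataset and logistic regression model are placeholders chosen only to make the example runnable:

```python
# 10-fold cross-validation sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)            # placeholder dataset
clf = LogisticRegression(max_iter=1000)      # placeholder learner

scores = cross_val_score(clf, X, y, cv=10)   # accuracy on each of the 10 folds
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```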

 

Bootstrapping
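The bootstrap builds a training set D' by drawing m samples from D with replacement; samples that are never drawn (about 1/e ≈ 36.8% of D for large m) serve as the test set (the "out-of-bag" samples). A small NumPy sketch of the sampling step:

```python
# Bootstrap sampling sketch: draw m samples with replacement,
# and use the never-drawn ("out-of-bag") samples as the test set.
import numpy as np

rng = np.random.default_rng(0)
m = 1000
indices = np.arange(m)                              # dataset D represented by its indices

boot = rng.choice(indices, size=m, replace=True)    # training set D' (contains duplicates)
oob = np.setdiff1d(indices, boot)                   # out-of-bag samples -> test set

print("distinct samples in D':", np.unique(boot).size)
print("out-of-bag fraction   :", oob.size / m)      # roughly 0.368 for large m
```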

 

Performance Measure

A performance measure is an evaluation criterion for measuring a model's generalization ability.

Whether a model is "good" depends not only on the algorithm and the data D, but also on the task requirements; what a performance measure reflects is exactly those task requirements.

For regression tasks: Mean Squared Error (MSE).

For classification tasks: error rate and accuracy (formulas below).
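For reference, the standard definitions over a dataset D of m samples with learner f (these formulas are not in the original note but follow the usual textbook definitions):

```latex
% Mean squared error (regression)
E(f;D) = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x_i)-y_i\bigr)^2

% Error rate and accuracy (classification); \mathbb{I}(\cdot) is the indicator function
E(f;D) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\bigl(f(x_i)\neq y_i\bigr), \qquad
\mathrm{acc}(f;D) = 1 - E(f;D)
```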

Confusion Matrix

For binary classification tasks:

TP: predicted Positive, and the prediction is correct (True).

TN: predicted Negative, and the prediction is correct (True).

FN: predicted Negative, and the prediction is wrong (False).

FP: predicted Positive, and the prediction is wrong (False).

Precision is also known as the precision rate, and Recall as the recall rate: Precision = TP / (TP + FP), Recall = TP / (TP + FN) (see the sketch below).
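A minimal sketch computing the confusion matrix, precision, and recall with scikit-learn; the labels and predictions below are made up for illustration:

```python
# Confusion matrix, precision and recall for a binary task (made-up labels).
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # classifier predictions

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
print("Precision = TP/(TP+FP) =", precision_score(y_true, y_pred))
print("Recall    = TP/(TP+FN) =", recall_score(y_true, y_pred))
```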

P-R Curve: Precision-Recall

Precision and recall are a pair of conflicting measures. In general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low. Usually only in some simple tasks can both precision and recall be high.

The following example is adapted from https://blog.csdn.net/raelum/article/details/125041310:

Assume the threshold is 0.5; the scores our binary classifier assigns to the six samples are as follows:


We sort these six watermelons by their scores from high to low:

 

Now we go through the list from top to bottom. For the sample in the first row, take its score of 0.88 as the threshold: samples scoring at or above the threshold are predicted positive, and samples scoring below it are predicted negative. The corresponding result is:


The resulting precision and recall are P = 1 and R = 0.33.

For the sample in the second row, take its score of 0.76 as the threshold: samples scoring at or above the threshold are predicted positive, and samples scoring below it are predicted negative. The corresponding result is:


The resulting precision and recall are P = 1 and R = 0.67.

Continuing in this way, we eventually obtain six (R, P) pairs and can plot the P-R curve:

The Break-Even Point (BEP) is one such summary measure: it is the value of precision (equal to recall) at the point where P = R.
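A sketch of the same threshold-sweeping procedure using scikit-learn's precision_recall_curve. The first two scores (0.88 and 0.76) come from the example above; the remaining labels and scores are assumptions made up so the snippet runs end to end:

```python
# Precision-Recall curve: sweep the decision threshold over the ranked scores.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical labels/scores for six samples (only 0.88 and 0.76 appear in the text above).
y_true  = [1, 1, 0, 1, 0, 0]
y_score = [0.88, 0.76, 0.65, 0.55, 0.42, 0.30]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("thresholds:", thresholds)
print("precision :", precision)
print("recall    :", recall)

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("P-R curve")
plt.show()
```

With these assumed labels, a threshold of 0.88 gives P = 1, R ≈ 0.33 and a threshold of 0.76 gives P = 1, R ≈ 0.67, matching the worked example above.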

ROC Curve (Receiver Operating Characteristic)

As mentioned earlier, sorting the scores of the m samples from high to low yields an ordered list. In different application tasks we can set different thresholds (cut-off points) according to the task requirements: if precision matters more, cut the list at an earlier position; if recall matters more, cut it at a later position.

Example:

Thresholds: the cut-off scores swept along the ranked list.

When comparing learners, as with the P-R curve, if one learner's ROC curve is completely enclosed by another learner's curve, we can conclude that the latter outperforms the former. If the two ROC curves cross, we compare the areas under them, i.e., the AUC (Area Under the ROC Curve).
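The ROC curve plots the true positive rate TPR = TP / (TP + FN) against the false positive rate FPR = FP / (TN + FP) as the threshold is swept. A minimal sketch with scikit-learn, reusing the hypothetical labels and scores from the P-R sketch above:

```python
# ROC curve (TPR vs. FPR at each threshold) and the area under it (AUC).
from sklearn.metrics import roc_curve, roc_auc_score

# Same hypothetical labels/scores as in the P-R sketch above.
y_true  = [1, 1, 0, 1, 0, 0]
y_score = [0.88, 0.76, 0.65, 0.55, 0.42, 0.30]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("thresholds:", thresholds)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))
```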

 


Reposted from blog.csdn.net/qq_51533157/article/details/127291069