Watermelon Book Study Notes: Model Evaluation and Selection

2.1 Empirical error and over-fitting

2.1.1 Some concepts
error rate: the proportion of misclassified samples among all samples
accuracy: 1 - error rate
error: the difference between the learner's predicted output and the sample's true output
training error / empirical error: the learner's error on the training set
generalization error: the learner's error on new samples

2.1.2 Over-fitting and under-fitting
We want a learner whose generalization error is small.
In practice, however, all we can do is minimize the empirical error, and it is usually easy to obtain a learner with a very small empirical error, i.e. one that performs very well on the training set. For example, a learner may even reach 100% accuracy on the training data, yet in most cases such a learner is not good, because what we really want is a learner that performs well on new samples, i.e. one with the small generalization error mentioned above.
So what we want the learner to capture from the training data is a "general law", the properties common to all samples, rather than peculiarities of the particular training samples. Learning those peculiarities as well causes the generalization performance to drop, a phenomenon called over-fitting.
The counterpart of over-fitting is under-fitting, which means the learner has not even captured the general properties of the training set.
Under-fitting: caused by insufficient learning ability. It is relatively easy to overcome compared with over-fitting (e.g. expand more branches in a decision tree, or increase the number of training epochs of a neural network).
Over-fitting: caused by learning ability that is too strong. It is much harder to deal with; over-fitting is the key obstacle faced by machine learning. All kinds of machine learning algorithms come with measures against over-fitting, but it cannot be avoided completely; we can only alleviate it.

2.2 Evaluation methods for the learner's generalization error

This section is mainly about estimating the generalization error of a learner.
Typically, we use a test set to test the learner's ability to classify new samples, and take the test error on the test set as an approximation of the generalization error.
We generally assume that the test samples are drawn i.i.d. from the true underlying distribution, and that the test set is mutually exclusive with the training set; otherwise we would obtain a misleadingly low estimate of the generalization error.
What we focus on here is how to divide the data set at hand into a training set and a test set. The main approaches are the following:

2.2.1 Hold-out method
The hold-out method divides the data set into a training set and a test set, with the following requirements:
How to divide: keep the data distribution consistent, so that the dividing process itself does not introduce extra bias that affects the final result.
Taking classification tasks as an example, we need to keep the class proportions of the training set and the test set similar. From a sampling point of view, sampling that preserves the class proportions is called stratified sampling. If the class proportions of the training set and the test set differ too much, the estimate will be biased by the difference between their data distributions.
How many times to divide: repeat the hold-out experiment with several random divisions and report the average as the evaluation result.
Even under requirement 1 there are still many ways to divide the data set into different training/test sets, and different training/test sets lead to different evaluation results, so a single hold-out estimate is not stable or reliable enough. Therefore the hold-out method is used with several random divisions, the experiment is repeated, and the average is reported. For example, with 100 random divisions, each producing a training/test pair for one evaluation, we obtain 100 results, and the hold-out method returns the average of these 100 results.
How much to divide: about 2/3 to 4/5 of the samples are used for training and the rest for testing.
This is the dilemma of the hold-out method when deciding the sizes of the training and test sets: if the training set is too small, the evaluation result has a large bias; if the test set is too small, the evaluation result has a large variance (in general, the test set should contain at least 30 samples).
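As a rough illustration, here is a minimal sketch of a repeated, stratified hold-out split, assuming scikit-learn is available; the synthetic data set and the decision tree stand in for an arbitrary task and learner.

```python
# A sketch of the hold-out method with stratified sampling, repeated over
# several random splits and averaged (toy data, arbitrary learner).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

scores = []
for seed in range(100):                      # repeat the random split 100 times
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y,     # stratified sampling keeps class ratios similar
        random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))   # accuracy on the held-out test set

print("hold-out estimate:", np.mean(scores)) # average over the repeated splits
```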

2.2.2 Cross-validation
The steps of cross-validation are:
First divide the data set D into k mutually exclusive subsets of similar size by stratified sampling (again taking care to keep the data distribution consistent).
Each time, use the union of k-1 subsets as the training set and the remaining subset as the test set. This obviously yields k different training set / test set combinations, so k rounds of training and testing can be performed, and the mean of the k test results is returned.
As with the hold-out method, there are many different ways to divide D into k subsets. To reduce the differences introduced by different divisions, k-fold cross-validation is usually repeated p times with different random divisions, and the final result is the average of the p runs of k-fold cross-validation (10 times 10-fold cross-validation is common).
The stability and fidelity of the cross-validation estimate depend largely on the value of k, so cross-validation is usually called "k-fold cross-validation". The most commonly used value of k is 10 (5 and 20 are also used), in which case it is called 10-fold cross-validation.
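A minimal sketch of p-times repeated k-fold cross-validation (here 5 times 10-fold), again assuming scikit-learn, with a synthetic data set and an arbitrary learner:

```python
# p-times repeated, stratified k-fold cross-validation (k=10, p=5 here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("mean over k*p folds:", scores.mean())   # average over 10 * 5 = 50 folds
```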

A special case of cross-validation: leave-one-out
Suppose the data set D contains m samples. If we set k = m, we obtain a special case of cross-validation: leave-one-out (LOO). Leave-one-out is special in that it is not affected by the randomness of the division: there is only one way to divide m samples into m subsets, one sample per subset (so, unlike other forms of cross-validation, there is no need to repeat the experiment with p different random divisions).
Since each leave-one-out training set differs from the whole data set by only one sample, the leave-one-out estimate is often considered quite accurate (though not necessarily so). It also has serious drawbacks:
As can be seen, leave-one-out needs to train m models. When the sample size m is large, the computational cost becomes unbearable (with one million samples, one million models must be trained, before even considering hyper-parameter tuning; the computational cost is horrendous).
The leave-one-out estimate is not necessarily always more accurate than other evaluation methods (the "no free lunch" principle applies here as well).
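A minimal leave-one-out sketch, assuming scikit-learn; the iris data and the k-NN learner are arbitrary stand-ins:

```python
# Leave-one-out as the k = m special case; note the m model fits,
# which is why LOO becomes expensive for large m.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("LOO estimate:", scores.mean())   # one model per sample: m = 150 fits here
```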

2.2.3 Bootstrapping (for small data sets)
Background:
We want to evaluate a model trained on the whole data set D, but with the hold-out method and cross-validation above we must reserve part of D as the test set, which introduces estimation bias caused by the changed training-set size. Leave-one-out suffers less from this size effect (its training set is only one sample smaller), but its computational cost is too high.
Given this background, we would like an evaluation method that reduces the effect of the changed training-set size while remaining computationally efficient. Bootstrapping is a good solution.
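In outline, the standard bootstrap draws m samples from D with replacement to form the training set D'; the probability that a particular sample is never drawn is (1 - 1/m)^m, which tends to 1/e ≈ 36.8%, so roughly a third of D never appears in D' and can serve as the test set ("out-of-bag" estimation). A minimal sketch, assuming scikit-learn for the toy data and a decision tree as a stand-in learner:

```python
# Bootstrapping: draw m samples from D with replacement as the training set;
# the roughly 36.8% of samples never drawn form the test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
m = len(X)

rng = np.random.default_rng(0)
idx = rng.integers(0, m, size=m)                 # sample m indices with replacement
oob = np.setdiff1d(np.arange(m), idx)            # "out-of-bag" samples never drawn

model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
print("out-of-bag fraction:", len(oob) / m)      # close to 1/e ~ 0.368
print("out-of-bag accuracy:", model.score(X[oob], y[oob]))
```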

2.2.4 Parameter tuning and the final model
During model evaluation and selection, besides choosing the learning algorithm itself, the parameters of the algorithm also need to be set; this is parameter tuning.
Machine learning involves two types of parameters:

  1. Parameters of the algorithm, i.e. hyper-parameters, usually fewer than 10 of them
  2. Parameters of the model, which can be very numerous

Many algorithm parameters are real-valued within some range, so training a model for every possible parameter configuration is not feasible. The common practice is to pick candidate values within a range with a certain step size, giving a small number of candidates. Obviously a candidate chosen this way is not the theoretically optimal parameter; it is just a trade-off between computational cost and the quality of the performance estimate. Even so, the computational cost is still large.
Powerful learning algorithms tend to have many parameters to set, which leads to a huge amount of tuning work, so in many application tasks how well the parameters are tuned often has a decisive impact on the performance of the final model.
After model selection is complete, i.e. the learning algorithm and its parameters have been chosen, the model should be retrained on the whole data set D. This model, trained with all the samples, is the model we finally submit.
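A minimal sketch of stepping through candidate hyper-parameter values and then refitting the chosen configuration on the whole data set, assuming scikit-learn; the grid values are arbitrary examples:

```python
# Tune a hyper-parameter over a stepped candidate range, then retrain the
# chosen configuration on the full data set D.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Candidate values taken with a fixed step: a trade-off between cost and coverage.
param_grid = {"max_depth": list(range(1, 11)),        # step 1 over [1, 10]
              "min_samples_leaf": [1, 5, 10, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5).fit(X, y)

print("chosen parameters:", search.best_params_)
# GridSearchCV refits the best configuration on all of X, y (refit=True by default);
# that refitted model plays the role of the finally submitted model.
final_model = search.best_estimator_
```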

Training set, test set, validation set

For dividing the data set, we previously talked about splitting it into a training set and a test set, and we take the error on the test set as an approximation of the model's error when facing new samples. From the training set, we further set aside part of the data as a validation set, and perform model selection and parameter tuning based on performance on the validation set.

2.3 Performance measures (measuring the model's generalization ability)

The previous section covered methods for estimating the generalization error; this section covers measures of a model's generalization ability. Both parts are necessary for evaluating the generalization performance of a learner.
When comparing the performance of different models, different performance measures often lead to different conclusions, which means that the quality of a model is relative: which performance measure to use depends on the task at hand.
In other words, evaluating a model depends not only on the algorithm and the data, but also on the task requirements.
For prediction tasks, evaluating the learner's performance requires comparing the predicted results with the ground-truth labels. For regression tasks the most commonly used performance measure is the mean squared error.
The following are common performance measures for classification tasks:
2.3.1 Error rate and accuracy
For a data set D with m samples:
Error rate: E = (number of misclassified samples) / m
Accuracy: acc = (number of correctly classified samples) / m = 1 - E
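As a tiny illustration with made-up predictions:

```python
# Error rate and accuracy on a toy set of predictions.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

error_rate = np.mean(y_pred != y_true)   # fraction of misclassified samples
accuracy = 1.0 - error_rate
print(error_rate, accuracy)              # 0.333..., 0.666...
```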

2.3.2 Precision, recall and F1
Although error rate and accuracy are commonly used, they cannot satisfy the needs of every task. For tasks such as "what proportion of the picked watermelons are good melons", "what proportion of all good melons were picked out", "what proportion of the retrieved information is of interest to the user", or "how much of the information the user is interested in was retrieved", the error rate is clearly not enough, and other performance measures are needed. Recall and precision are performance measures better suited to such needs.
For a binary classification problem, after the samples have been classified there are four types of outcomes:
true positives (TP)
false positives (FP)
true negatives (TN)
false negatives (FN)
Together they form the confusion matrix:

  ground truth \ prediction | positive | negative
  positive                  | TP       | FN
  negative                  | FP       | TN

Precision P and recall R are then defined as:
P = TP / (TP + FP)
R = TP / (TP + FN)
Precision P: among all samples predicted as positive, the proportion that are true positives.
Recall R: among all samples whose true label is positive (TP + FN), the proportion that are predicted as positive.
In general, when recall is high, precision tends to be low; when precision is high, recall tends to be low.
One way to think about it:
When pursuing high recall, the number of samples predicted as positive grows; in the extreme case every sample is predicted as positive, and recall reaches 100%.
When pursuing high precision, only the samples we are confident about are predicted as positive, so some positive samples are missed and recall drops; in the extreme case only the single most confident sample is predicted as positive, and precision reaches 100%.
Usually only in some simple tasks can both recall and precision be very high.
Precision-recall curve (P-R curve)
Sort the samples by how likely they are to be positive; then, going down this ranking, treat the top samples as positive at each cut-off, compute the corresponding recall and precision, and plot them as the P-R curve.
(Figure: P-R curve plotted from the ranked predictions)
The main difference between the figure above and the one in the book lies in the two endpoints of the curve: only when positive examples make up an extremely small proportion of the samples can the precision at the far right tend to 0, as in the book's figure:
(Figure: P-R curves as drawn in the book)
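A minimal sketch of computing the P-R curve points from a ranking, using made-up labels and scores:

```python
# Sort samples by predicted score, then treat the top-i samples as positive
# for each cut-off i and compute precision and recall.
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])           # toy labels
scores = np.array([.9, .8, .7, .6, .55, .5, .4, .3])  # toy "probability of positive"

order = np.argsort(-scores)                        # most likely positive first
y_sorted = y_true[order]

precision, recall = [], []
n_pos = y_sorted.sum()
for i in range(1, len(y_sorted) + 1):
    tp = y_sorted[:i].sum()                        # true positives among the top-i predictions
    precision.append(tp / i)                       # P = TP / (TP + FP)
    recall.append(tp / n_pos)                      # R = TP / (TP + FN)

print(list(zip(recall, precision)))                # points of the P-R curve
```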

Each curve in the figure above corresponds to one learner, i.e. each learner's P-R curve is drawn from its own classification results. Learners can be compared in two ways:

  • If curve A completely encloses curve B, then learner A performs better than learner B.
  • If curves A and B cross, compare the areas under the two curves: the larger area corresponds to the better learner. However, this area is not easy to estimate, so instead we use a performance measure that considers precision and recall together: the break-even point (BEP), the value at which precision = recall; the learner with the larger BEP is better. (In the figure above, learner A is better than learner B.)

But BEP is still too simplistic; the F1 measure is more commonly used.
F1 is defined as the harmonic mean of precision and recall:
1/F1 = 1/2 * (1/P + 1/R), i.e. F1 = 2 * P * R / (P + R)

In some applications, however, we value precision and recall differently, and F1 cannot express such a preference. The more general form is the weighted harmonic mean F_beta:
F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
where beta > 1 gives more weight to recall, beta < 1 gives more weight to precision, and beta = 1 reduces to F1.
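A tiny illustration with a made-up confusion matrix (the counts and the beta value are arbitrary):

```python
# F1 and the weighted F_beta on a toy confusion matrix; beta > 1 weights
# recall more heavily, beta < 1 weights precision more heavily.
TP, FP, FN = 40, 10, 20

P = TP / (TP + FP)                                   # precision
R = TP / (TP + FN)                                   # recall
F1 = 2 * P * R / (P + R)                             # harmonic mean of P and R
beta = 2.0
F_beta = (1 + beta**2) * P * R / (beta**2 * P + R)   # weighted harmonic mean
print(P, R, F1, F_beta)
```
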
The discussion above assumes a single binary confusion matrix (one classification run on one data set). In real tasks, there are several situations in which we want to examine precision and recall jointly across multiple binary confusion matrices:

  • training/testing is carried out multiple times
  • training/testing is carried out on multiple data sets and we want to estimate the algorithm's overall performance
  • multi-class tasks, where every pairwise combination of classes gives a confusion matrix

Faced with multiple binary confusion matrices, there are two main approaches:
Macro averaging: compute P and R on each confusion matrix separately, average them to obtain macro-P and macro-R, and combine these into macro-F1.
Micro averaging: first average the entries TP, FP, TN, FN over the matrices, then compute micro-P, micro-R and micro-F1 from the averaged entries.
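A minimal sketch of the two averaging strategies on made-up confusion matrices:

```python
# Macro vs. micro averaging over several binary confusion matrices,
# each given as (TP, FP, TN, FN).
import numpy as np

matrices = [(40, 10, 45, 5), (30, 20, 40, 10), (25, 5, 60, 10)]

# Macro: compute P and R per matrix, average them, then combine into macro-F1.
Ps = [tp / (tp + fp) for tp, fp, tn, fn in matrices]
Rs = [tp / (tp + fn) for tp, fp, tn, fn in matrices]
macro_P, macro_R = np.mean(Ps), np.mean(Rs)
macro_F1 = 2 * macro_P * macro_R / (macro_P + macro_R)

# Micro: average the confusion-matrix entries first, then compute P, R, F1.
TP, FP, TN, FN = np.mean(matrices, axis=0)
micro_P, micro_R = TP / (TP + FP), TP / (TP + FN)
micro_F1 = 2 * micro_P * micro_R / (micro_P + micro_R)

print(macro_F1, micro_F1)
```
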
2.3.3 ROC and AUC
From the learner we can obtain a real-valued or probability prediction for each sample; this prediction is compared with a classification threshold: above the threshold the sample is classified as positive, below it as negative.
According to the predicted value or probability we can rank the samples, so that the samples more likely to be positive come earlier. Classification then amounts to choosing a cut point (i.e. a threshold) in this ranking that splits the samples into two parts: the first part is judged positive, the second part negative.
Depending on the task we may prefer precision or recall, and we choose the cut point accordingly:

  • preferring precision: cut earlier in the ranking
  • preferring recall: cut later in the ranking

ROC stands for the "Receiver Operating Characteristic" curve. Similarly to the P-R curve, we sort the samples by the predicted value and, going down this ranking, treat the samples as positive one by one; at each step we compute the true positive rate (TPR) and the false positive rate (FPR) and use them as coordinates to plot the curve.
TPR = TP / (TP + FN)    (true positive rate, the vertical axis of the ROC plot)
FPR = FP / (FP + TN)    (false positive rate, the horizontal axis)
(Figure: ROC curve; the area under it is the AUC)
The criterion for comparing learners with ROC curves is similar to that for P-R curves:

  • if curve A encloses curve B, learner A is better than learner B
  • if the two curves cross, compare the areas under the ROC curves, i.e. the AUC

If the ROC curve is formed by connecting the points {(x_1, y_1), ..., (x_m, y_m)} in order (x_i being FPR and y_i being TPR), the AUC can be estimated as
AUC = 1/2 * sum_{i=1}^{m-1} (x_{i+1} - x_i) * (y_i + y_{i+1})
Note: this formula simply sums the areas of the small trapezoids under the curve, which gives the size of the AUC.
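A minimal sketch that builds the ROC points from a ranking of made-up scores and evaluates the trapezoid formula above:

```python
# ROC points and AUC by the trapezoid sum
# AUC = 1/2 * sum (x_{i+1} - x_i) * (y_i + y_{i+1}), on toy scores.
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])
scores = np.array([.9, .8, .75, .6, .55, .5, .3, .2])

order = np.argsort(-scores)
y_sorted = y_true[order]
n_pos, n_neg = y_sorted.sum(), len(y_sorted) - y_sorted.sum()

# Move the cut-off one sample at a time; each positive raises TPR, each negative FPR.
tpr = np.concatenate(([0.0], np.cumsum(y_sorted) / n_pos))
fpr = np.concatenate(([0.0], np.cumsum(1 - y_sorted) / n_neg))

auc = 0.5 * np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]))  # trapezoid sum
print("AUC:", auc)
```
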
2.3.4 Cost-sensitive error rate and cost curves
The performance measures introduced so far mostly assume, implicitly, equal costs: for example, the error rate simply counts the number of errors without considering the different consequences caused by different kinds of errors.

To account for the different losses caused by different types of errors, errors can be assigned "unequal costs". Under unequal costs we no longer simply minimize the number of errors, but instead minimize the total cost.
For binary classification, a cost matrix assigns a cost cost_ij to predicting a sample of class i as class j, with zero cost on the diagonal:

  ground truth \ prediction | class 0  | class 1
  class 0                   | 0        | cost_01
  class 1                   | cost_10  | 0

The cost-sensitive error rate weights each misclassification by its cost instead of simply counting it:
E(f; D; cost) = 1/m * [ sum over class-0 samples of I(f(x) != y) * cost_01 + sum over class-1 samples of I(f(x) != y) * cost_10 ]
Under unequal costs the ROC curve no longer directly reflects the expected total cost of a learner; the cost curve serves that purpose.
Note on Eq. (2.25): writing p for the probability that a sample is positive, the positive probability cost is P(+)cost = p * cost_01 / (p * cost_01 + (1 - p) * cost_10), and the normalized cost can be rearranged as cost_norm = FNR * P(+)cost + FPR * (1 - P(+)cost), which is exactly the line segment from (0, FPR) to (1, FNR) used below.
Drawing the cost curve:
Key point: every point on the ROC curve corresponds to one line segment in the cost-curve plot.

  • Let a point on the ROC curve be (FPR, TPR); then FNR = 1 - TPR can be computed, and the corresponding segment runs from (0, FPR) to (1, FNR). The area under the segment represents the expected total cost under that condition.
  • Following step 1, draw the corresponding segment for every point on the ROC curve.
  • The area enclosed under the lower bound (lower envelope) of all the segments is the expected total cost under all conditions (a numerical sketch follows the figure below).
    (Figure: the cost curve and the expected total cost)
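A numerical sketch of this construction, assuming the unequal costs are already folded into the horizontal axis as in the normalization above; the ROC points are made up:

```python
# Cost curve sketch: each ROC point (FPR, TPR) gives the line
# y = (1 - x) * FPR + x * FNR over x in [0, 1]; the lower envelope of all
# such lines is the cost curve, and the area under it is the expected total cost.
import numpy as np

roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]  # (FPR, TPR)

x = np.linspace(0, 1, 201)                     # horizontal axis: positive probability cost
lines = np.array([(1 - x) * fpr + x * (1 - tpr) for fpr, tpr in roc_points])
lower_envelope = lines.min(axis=0)             # best operating point at each x

expected_total_cost = np.trapz(lower_envelope, x)   # area under the envelope
print(expected_total_cost)
```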

2.4 Comparison tests

The preceding sections covered experimental evaluation methods and performance measures, but these alone are not enough for comparing the performance of learners, because:
What we want to evaluate is the learner's generalization ability, whereas the experimental evaluation methods give us performance on a test set, and the two comparisons may not agree.
Performance on a test set depends heavily on which test set is chosen; different test sets give different results.
Many learners are themselves stochastic, so even multiple runs on the same test set can give different results.
Here we can use statistical hypothesis testing to back up our performance comparisons.
For example, if we observe on a test set that learner A performs better than learner B, then based on a hypothesis test we can infer whether A's generalization performance is better than B's in a statistical sense, and how confident we can be in that conclusion.
Below are the two most basic hypothesis-testing methods (the default performance measure is the error rate):

2.4.1 Hypothesis test
The hypothesis in a hypothesis test is some judgment or conjecture about the distribution of the learner's generalization error rate. Here we can only obtain the test error rate rather than the generalization error rate, but the two are unlikely to differ by much and likely to be close (a line of reasoning well worth remembering), so we can use the test error rate to infer the distribution of the generalization error rate.
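A minimal sketch of this idea as a one-sided binomial test, assuming SciPy is available; the counts and thresholds are made-up examples:

```python
# Given the test error rate on m test samples, check whether the hypothesis
# "generalization error <= eps0" can be rejected at significance level alpha.
from scipy.stats import binom

m, errors = 50, 7          # 50 test samples, 7 misclassified (test error rate 0.14)
eps0, alpha = 0.1, 0.05    # hypothesised error bound and significance level

# Probability of observing at least this many errors if the true error rate were eps0.
p_value = 1.0 - binom.cdf(errors - 1, m, eps0)
print("p-value:", p_value)
print("reject H0 (error <= eps0)?", p_value < alpha)
```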

2.4.2 Cross-validation t-test
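The heading above refers to the paired t-test over cross-validation folds; a minimal sketch assuming SciPy, with made-up per-fold error rates for two learners A and B:

```python
# Paired t-test on the k fold-wise error differences of two learners.
import numpy as np
from scipy.stats import ttest_rel

err_A = np.array([0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12, 0.11, 0.09, 0.10])
err_B = np.array([0.14, 0.13, 0.12, 0.15, 0.14, 0.12, 0.13, 0.16, 0.12, 0.13])

t_stat, p_value = ttest_rel(err_A, err_B)    # paired test on the fold-wise differences
print(t_stat, p_value)                       # a small p-value suggests a significant difference
```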

2.5 Bias and variance

We can estimate a learning algorithm's generalization performance by experiment, but we would also like to understand why the algorithm has that performance.
The bias-variance decomposition is an important tool for explaining the generalization performance of a learning algorithm. Concretely, it decomposes the algorithm's expected generalization error.
The derivation proceeds from the following definitions:
For a test sample x, let y_D denote the label of x recorded in the data set (possibly corrupted by noise), y its true label, and f(x; D) the prediction on x of the model trained on training set D. Define
the expected prediction f_bar(x) = E_D[ f(x; D) ]                      (2.37)
the variance var(x) = E_D[ ( f(x; D) - f_bar(x) )^2 ]                  (2.38)
the noise eps^2 = E_D[ ( y_D - y )^2 ]                                 (2.39)
the squared bias bias^2(x) = ( f_bar(x) - y )^2                        (2.40)
Note: ignoring noise, a large bias can be regarded as caused by under-fitting, and a large variance as caused by over-fitting.
Decomposing the expected error of the algorithm (squared loss, regression setting):
E(f; D) = E_D[ ( f(x; D) - y_D )^2 ] = bias^2(x) + var(x) + eps^2      (2.41)
The derivation of Eq. (2.41) is not difficult; the main steps are the two places where "the last term equals 0": one cross term vanishes because E_D[ f(x; D) - f_bar(x) ] = 0 by the definition of f_bar(x), and the other because the noise is assumed to have zero expectation, E_D[ y_D - y ] = 0.
In other words, the generalization error can be decomposed into the sum of variance, bias and noise.
Recall what bias, variance and noise mean:

  • Bias (2.40) measures how far the learning algorithm's expected prediction deviates from the true result, i.e. it characterizes the fitting ability of the learning algorithm itself.
  • Variance (2.38) measures how much the learned performance changes when a training set of the same size changes, i.e. it characterizes the effect of data perturbations.
  • Noise (2.39) expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task, i.e. it characterizes the difficulty of the learning problem itself.
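A numerical sketch of the decomposition under simple assumptions (a fixed sine target, Gaussian label noise, a polynomial learner, many independently drawn training sets): the empirically measured generalization error should come out close to bias^2 + variance + noise.

```python
# Estimate bias^2, variance and noise by training on many training sets.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 3, 50)
f_star = np.sin(x_test)                      # ground-truth function (noise-free)

degree, n_train, n_runs, noise_sd = 3, 30, 200, 0.3
preds = np.empty((n_runs, x_test.size))
for r in range(n_runs):                      # train on many independent data sets
    x_tr = rng.uniform(0, 3, n_train)
    y_tr = np.sin(x_tr) + rng.normal(0, noise_sd, n_train)
    coef = np.polyfit(x_tr, y_tr, degree)    # a simple polynomial learner
    preds[r] = np.polyval(coef, x_test)

f_bar = preds.mean(axis=0)                   # expected prediction E_D[f(x; D)]
bias2 = np.mean((f_bar - f_star) ** 2)       # (2.40) squared bias
variance = np.mean(preds.var(axis=0))        # (2.38) variance over training sets
noise = noise_sd ** 2                        # (2.39) irreducible noise

y_test = f_star + rng.normal(0, noise_sd, (n_runs, x_test.size))  # noisy test labels
empirical_error = np.mean((preds - y_test) ** 2)
print("bias^2 + var + noise:", bias2 + variance + noise)          # (2.41)
print("empirical generalization error:", empirical_error)         # approximately equal
```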

Key conclusions of this section

The generalization error can be decomposed into the sum of bias, variance and noise.
The bias-variance decomposition shows that generalization performance is jointly determined by the ability of the learning algorithm, the adequacy of the data, and the difficulty of the learning task itself.
For a given learning task, obtaining good generalization performance requires both a small variance (so that data perturbations have little effect) and a small bias (so that the algorithm fits the data sufficiently well).
From the decomposition it might seem that we only need to make bias and variance both as small as possible to obtain excellent generalization performance. In general, however, bias and variance are in conflict (ignoring noise, a large bias can be attributed to under-fitting and a large variance to over-fitting); this is known as the bias-variance dilemma.
(Figure: generalization error, bias and variance as functions of the degree of training; the topmost curve is the generalization error)
What the figure means: suppose we can control the degree to which the algorithm is trained. When training is insufficient (under-fitting), bias dominates the generalization error rate; as training deepens, the learner's fitting ability gradually increases, perturbations in the training data begin to be learned, and variance gradually comes to dominate the generalization error rate.
