Machine Learning Chapter 2: Model Evaluation and Selection




1. Empirical error and overfitting

Error rate: the proportion of misclassified samples among all samples.
Accuracy: 1 − error rate.
Error: the difference between the learner's actual predicted output and the sample's true value.
Training error (also called empirical error): the learner's error on the training set.
Generalization error: the learner's error on new samples.

We hope to obtain a learner that can perform well on new samples, that is, a learner with strong generalization ability.
Overfitting : The learner "overlearns" on the training samples, and it is likely to take some characteristics of the training samples themselves as the general properties of all potential samples, resulting in a decline in generalization ability.
Underfitting : The learner has not yet learned the general properties of the training samples.

There are many possible causes of overfitting; the most common is that the learning ability of the learner is too strong, so that peculiarities of the training samples themselves are learned as if they were general properties.
Underfitting is usually caused by the learning ability being too low.

Overfitting vs. underfitting:
  • Concept — overfitting: the training samples are over-learned; underfitting: the general properties of the training samples have not yet been learned.
  • Common cause — overfitting: the learning ability is too strong; underfitting: the learning ability is too low.
  • How to deal with it — overfitting: can only be alleviated, not completely avoided; underfitting: e.g., expand branches in decision tree learning, increase the number of training epochs in neural network learning, etc.

  In real tasks we usually have several learning algorithms to choose from, and even for the same algorithm, different parameter configurations produce different models. This raises the problem of model selection.
  The ideal solution would be to evaluate the generalization error of each candidate model and select the one with the smallest generalization error. However, the generalization error cannot be obtained directly, and the training error is unsuitable as a criterion because of overfitting. The evaluation and selection methods below address this.

2. Evaluation method

  Typically, we make the choice by estimating the learner's generalization error through experimental testing. For this, a test set is needed to test the learner's ability on new samples, and the test error on the test set is then used as an approximation of the generalization error.
  Test set: a collection of test samples, assumed (like the training set) to be sampled independently from the true underlying distribution. The test set should be mutually exclusive with the training set as far as possible, i.e., test samples should not have appeared in, or been used for, training.
  For a data set D containing m samples, both training and testing need to be properly divided to generate training set S and test set T. The following are common division methods.

1. Hold-out method

  Directly divide the data set D into two mutually exclusive sets, one is the training set S and the other is the test set T. That is, D=S∪T, S∩T=∅. After the model is trained on the training set, the test set is used to evaluate its test error as an estimate of the generalization error. (For example, assuming that the data set D has 1000 samples, it can be divided into 700 training sets and 300 test sets.)
Note:

  • The training/test split should keep the data distribution as consistent as possible, to avoid introducing extra bias during the split that would affect the final result.
  • Even after the training/test ratio is fixed, there are still many ways to split the initial data set D. When using the hold-out method, it is therefore common to repeat the random split several times, evaluate each time, and report the average as the hold-out estimate. (For example, with a 7:3 split of 1000 samples, 700 samples are chosen at random for training and the remaining 300 form the test set.)

  We would ideally train a model on the whole data set D, but setting aside a test set creates a dilemma: if the training set S is large, the trained model is closer to the one that would be trained on D, but the small test set T makes the evaluation less reliable; if T is large, S differs more from D, so the evaluated model differs more from the one trained on D.
  A common practice is to use about 2/3 to 4/5 of the samples for training and the rest for testing.
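A minimal hold-out sketch, assuming scikit-learn is available; the data set, the 7:3 ratio, and the random seeds are illustrative choices, not part of the original text:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Repeat several random 70/30 stratified splits (to keep the class distribution
# consistent between S and T) and average the results, as described above.
accs = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    accs.append(model.score(X_test, y_test))

print("mean hold-out accuracy over 10 splits:", np.mean(accs))
```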

2. Cross-validation method

  Divide the data set D into k mutually exclusive subsets of similar size, D = D1 ∪ D2 ∪ ... ∪ Dk, with the intersection of any two subsets empty, and with each subset Di keeping the data distribution as consistent with D as possible. Then use k−1 subsets as the training set and the remaining one as the test set. This yields k training/test splits, so k rounds of training and testing can be performed, and the mean of the k test results is returned.
  k-fold cross-validation: the stability and fidelity of the evaluation depend largely on the value of k, so the method is usually called "k-fold cross-validation". The most common value is k = 10, giving 10-fold cross-validation.
  p times k-fold cross-validation: there are many ways to split D into k subsets. To reduce the difference introduced by different splits, k-fold cross-validation is usually repeated p times with different random partitions, and the final result is the mean of the p runs of k-fold cross-validation; this is called p times k-fold cross-validation. For example, 10 times 10-fold cross-validation performs 100 training/testing rounds in total.
  Leave-one-out: a special case of k-fold cross-validation in which every sample is its own subset, i.e., for a data set D with m samples, k = m, so each held-out subset contains exactly one sample. Note: leave-one-out is not affected by the randomness of the split (there is only one way to partition m samples into m singleton subsets), and in most cases its training sets differ from D by only one sample, so its estimates are often considered quite accurate. Drawback: when the data set is large, the training cost is prohibitive.
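A brief sketch of p times k-fold cross-validation, assuming scikit-learn; the estimator, k = 10, and p = 3 are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 10-fold cross-validation repeated p = 3 times with different random splits,
# i.e. "p times k-fold cross-validation" (3 x 10 = 30 training/testing rounds).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("mean accuracy over 30 folds:", scores.mean())
```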

3. Bootstrapping

  Based on bootstrap sampling (sampling with replacement). For a data set D containing m samples, sampling produces a data set D'.

method:

  • Create an empty set D';
  • Each time, randomly pick one sample from D and copy it into D' (D itself is unchanged, i.e., sampling is with replacement);
  • Repeat m times to obtain a data set D' containing m samples.
  • The probability that a given sample is not picked in one draw is 1 − 1/m, so the probability that it is never picked in m draws is (1 − 1/m)^m, whose limit as m → ∞ is 1/e ≈ 0.368. That is, with bootstrap sampling, about 36.8% of the samples in the initial data set D do not appear in D'. Use D' as the training set and D \ D' (the samples in D that never appear in D') as the test set; a small sketch is given below.
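A minimal numpy sketch of bootstrap sampling under synthetic data; for large m the out-of-bag fraction should land near 1/e ≈ 0.368:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
D = np.arange(m)                       # indices stand in for the m samples of D

# Draw m samples from D with replacement to form D'.
boot_idx = rng.integers(0, m, size=m)  # indices of D' (duplicates allowed)
D_prime = D[boot_idx]                  # the bootstrap training set D'

# Out-of-bag samples: in D but never drawn into D'; they form the test set D \ D'.
oob_mask = ~np.isin(D, boot_idx)
print("fraction never sampled:", oob_mask.mean())   # close to 1/e ≈ 0.368
```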

advantage:

  • It is useful when the data set is small and it is difficult to split it effectively into training and test sets.
  • It can generate multiple different training sets from the initial data set, which is very useful for methods such as ensemble learning.

shortcoming:

  • The dataset generated by the bootstrap method changes the distribution of the initial dataset and introduces estimation bias.

  When the amount of initial data is sufficient, the hold-out method and the cross-validation method are more commonly used.

4. Tuning parameters and final model

  Parameter tuning: when evaluating and selecting a model, besides choosing the learning algorithm, the algorithm's parameters must also be set. (These parameters remain fixed while the program runs; they are not obtained by learning.)

  Learning-algorithm parameters typically take real values, so the number of possible configurations is infinite, and training one model per configuration is infeasible. A common practice is to choose a range and a step size for each parameter and evaluate only the candidate values produced by that step, as sketched below. Many powerful learning algorithms have many parameters to set, so the tuning workload is huge; in many application tasks, how well the parameters are tuned often has a decisive effect on the final model's performance.
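A hedged sketch of the "range plus step" idea using a plain grid search with scikit-learn's GridSearchCV; the estimator, parameter names, and ranges are illustrative assumptions, not prescribed by the text:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Candidate values produced by choosing a range and a step for each parameter.
param_grid = {
    "C": np.logspace(-2, 2, 5),        # 0.01, 0.1, 1, 10, 100
    "gamma": np.logspace(-4, 0, 5),    # 1e-4 ... 1
}

# Each configuration is evaluated by 5-fold cross-validation on the training data.
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy of the refit best model:", search.score(X_test, y_test))
```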

  Final Model : As the name suggests, the final submitted model.
  For a data set D containing m samples, only part of the data is actually used for training because a test set must be set aside. Therefore, once model selection is finished and the learning algorithm and parameter configuration have been chosen, the model should be retrained on the whole data set D; this model, trained on all m samples, is the complete, final model that is actually submitted.

  Also, we often split the data into a training set and a test set, but to distinguish it from the test data the model will encounter in actual use, the data set used for evaluation and tuning during model selection is called the validation set.

3. Performance measurement

  Performance metrics : Evaluation criteria to measure the generalization performance of the model.
  Performance metrics reflect task requirements. When comparing the capabilities of different models, using different performance metrics often leads to different evaluation results; this means that the quality of a model is relative, and what kind of model is good depends not only on algorithms and data, but also on task requirements.
  In a prediction task, we are given a sample set D = {(x1, y1), (x2, y2), ..., (xm, ym)}, where yi is the true label of example xi. To evaluate the performance of a learner f, its prediction f(x) is compared with the true label y.

  The most commonly used performance measure for regression tasks is the mean squared error,
  which can be described as: for a given data set D and a learned model f, take the difference between each sample's predicted value and its true value, square it, sum the m squares, and divide by m.
The formula is: E(f; D) = (1/m) Σ_{i=1}^{m} (f(x_i) − y_i)²
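A tiny numpy illustration of the mean squared error; the arrays are made-up example values:

```python
import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 5.5])   # true values y_i (made-up)
y_pred = np.array([2.8, 2.7, 3.6, 5.9])   # learner predictions f(x_i)

mse = np.mean((y_pred - y_true) ** 2)     # (1/m) * sum of squared differences
print("MSE:", mse)
```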

1. Error rate and accuracy

  Error rate and accuracy are the two most commonly used performance measures in classification tasks.
  Error rate: the proportion of misclassified samples, E(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) ≠ y_i).
  Accuracy: the proportion of correctly classified samples, acc(f; D) = 1 − E(f; D).
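A correspondingly small sketch of error rate and accuracy on made-up labels:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # true labels (made-up)
y_pred = np.array([1, 0, 0, 1, 1, 1])   # learner predictions

error_rate = np.mean(y_pred != y_true)  # fraction of misclassified samples
accuracy = 1.0 - error_rate
print(error_rate, accuracy)             # 2/6 ≈ 0.333 and 0.667
```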

2. Precision, recall and F1

  For the binary classification problem, samples can be divided, according to the combination of their true class and the class predicted by the learner, into true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
  True positive (TP): both the predicted value and the true value are positive. (For example, a melon is a good melon and the model also judges it to be a good melon.)
  False positive (FP): predicted to be positive, but actually negative. (For example, a melon is predicted to be a good melon, but it is actually a bad melon.)
  False negative (FN): predicted to be negative, but actually positive. (For example, a melon is predicted to be a bad melon, but it is actually a good melon.)
  True negative (TN): both the predicted value and the true value are negative.
  TP + FP + FN + TN = total number of samples.
The confusion matrix of the classification results:
  • truly positive and predicted positive → TP; truly positive but predicted negative → FN;
  • truly negative but predicted positive → FP; truly negative and predicted negative → TN.

Definition of precision and recall:

  • Precision, P = TP / (TP + FP): the proportion of samples predicted as positive that are truly positive. The higher the precision, the smaller FP is, i.e., the larger the share of predicted positives that are correct.
  • Recall (also called the recall rate), R = TP / (TP + FN): the proportion of truly positive samples that are predicted as positive. The higher the recall, the smaller FN is, i.e., the larger the share of positive samples that have been "found".
      Precision and recall are conflicting measures. Generally, when precision is high, recall tends to be low; when recall is high, precision tends to be low.
      Sort the samples by the learner's prediction scores, with the samples the learner considers most likely to be positive ranked first. Going down this ranking and predicting samples as positive one by one, precision gradually decreases while recall gradually increases. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, or "P-R curve" for short.
      The P-R curve visually displays the learner's precision and recall over the whole sample set.
      How to compare the performance of two learners?
  • Break-Even Point (BEP): the value at which precision = recall.
  • F1 measure: the harmonic mean of precision and recall, F1 = 2 · P · R / (P + R).
      In some applications, precision and recall are given different weights. The general form of the F1 measure, Fβ, can express different preferences for precision and recall:
      Fβ = (1 + β²) · P · R / (β² · P + R),
      where β > 0 measures the relative importance of recall with respect to precision: β = 1 reduces to the standard F1; β > 1 means recall has the greater impact; β < 1 means precision has the greater impact.
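A compact sketch computing precision, recall, F1, and Fβ from a confusion matrix; scikit-learn is assumed available and the labels are made-up:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])   # made-up true labels (1 = positive)
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])   # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# F_beta: beta > 1 weights recall more, beta < 1 weights precision more.
print(fbeta_score(y_true, y_pred, beta=2.0))
```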

3. ROC and AUC

  ROC: "Receiver Operating Characteristic" curve. In different task requirements, it will be applied to sort the data according to the learner's discrimination. The ROC curve is to study the generalization performance of the learner from the perspective of the quality of the sorting.
  AUC: the area under the ROC curve. According to the ROC curve, the performance of the learner is judged.

  ROC curve:
  Similar to the P-R curve, sort the samples by the learner's prediction scores and, going down this ranking, predict the samples as positive one by one. Each time, compute two key quantities and use them as the horizontal and vertical coordinates to draw the ROC curve. The vertical axis of the ROC curve is the true positive rate (TPR) and the horizontal axis is the false positive rate (FPR), defined as:
  TPR = TP / (TP + FN),  FPR = FP / (TN + FP).
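A short sketch that obtains the ROC curve points and the AUC from a learner's predicted scores with scikit-learn's roc_curve and roc_auc_score; the model and data are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scores used to rank the test samples from "most likely positive" downwards.
scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))       # area under that curve
```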

4. Cost-sensitive error rate and cost curve

  Cost-sensitive error rate : The error rate under unequal cost.
  In order to weigh the different losses caused by different types of errors, errors can be assigned "unequal costs".
  Taking the binary classification task as an example, we can set a "cost matrix" according to domain knowledge of the task, where cost_ij denotes the cost of misclassifying a sample of class i as class j.
  Generally, cost_ii = 0; if misclassifying class 0 as class 1 causes the greater loss, then cost_01 > cost_10; the larger the difference in loss, the larger the difference between cost_01 and cost_10.
  The performance measures discussed so far implicitly assume equal costs, i.e., cost_01 = cost_10, and do not consider the different consequences of different kinds of errors. Under unequal costs, we no longer simply minimize the number of errors; instead, we want to minimize the "total cost". The cost-sensitive error rate is
  E(f; D; cost) = (1/m) · ( Σ_{x_i ∈ D+} I(f(x_i) ≠ y_i) · cost_01 + Σ_{x_i ∈ D−} I(f(x_i) ≠ y_i) · cost_10 ),
  where class 0 is the positive class, class 1 is the negative class, and D+ and D− denote the positive and negative subsets of D.
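A small numpy sketch of the cost-sensitive error rate; the labels and the costs cost_01 = 5, cost_10 = 1 are made-up:

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # 1 marks the positive class (made-up)
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])
cost_01, cost_10 = 5.0, 1.0                    # missing a positive costs 5x a false alarm

m = len(y_true)
pos, neg = (y_true == 1), (y_true == 0)
total_cost = (np.sum((y_pred != y_true) & pos) * cost_01 +
              np.sum((y_pred != y_true) & neg) * cost_10)
print("cost-sensitive error rate:", total_cost / m)
```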
  Under unequal costs, the ROC curve cannot directly reflect the learner's expected total cost, but the "cost curve" can. The horizontal axis of the cost curve is the probability cost of the positive class, taking values in [0, 1]; the vertical axis is the normalized cost, also taking values in [0, 1]. With p denoting the probability that a sample is positive, they are defined as
  P(+)cost = p · cost_01 / (p · cost_01 + (1 − p) · cost_10),
  cost_norm = (FNR · p · cost_01 + FPR · (1 − p) · cost_10) / (p · cost_01 + (1 − p) · cost_10),
  where FPR is the false positive rate (as defined in the previous section) and FNR = 1 − TPR is the false negative rate.
Drawing the cost curve:
  Each point on the ROC curve corresponds to a line segment on the cost plane:
  take a point (FPR, TPR) on the ROC curve and compute the corresponding FNR;
  draw a line segment on the cost plane from (0, FPR) to (1, FNR); the area under this segment represents the expected total cost under that condition. Convert every point on the ROC curve into such a line segment, then take the lower envelope of all the segments; the area it encloses is the learner's expected total cost over all conditions.

4. Comparison tests (this section involves more mathematical background, so only a brief introduction is given here)

We cannot compare learners simply by comparing their performance-measure values directly, for the following reasons:

  1. What we want to compare is the generalization performance of the learners, but experimental evaluation only gives performance on a test set; the two are not necessarily the same.
  2. Performance on the test set also depends heavily on the choice of the test set itself: different test set sizes give different results, and even test sets of the same size but containing different samples give different results.
  3. Many learning algorithms are inherently random: even with identical parameters, multiple runs on the same test set can give different results.

For the above problems, we have the following methods to compare and test the performance of the learner.

1. Hypothesis testing

  The "hypothesis" in hypothesis testing is a certain judgment or guess about the generalization error rate distribution of the learner, such as "ε=ε 0 ". In real tasks, we do not know the generalization error rate of the learner, only its test error rate. The generalization error rate is not necessarily the same as the test error rate, but intuitively, the possibility of the two being close should be relatively high, and the possibility of being far apart is relatively small. The distribution of the generalization error rate can thus be derived from the test error rate estimate.
  Hypothesis testing is mainly about testing the generalization performance hypothesis of a single learner.
  Hypothesis testing involves knowledge of probability theory and mathematical statistics; readers who need more detail should consult a textbook.

2. Cross-validation t-test

  The cross-validation t-test compares the performance of different learners.
  For two learners A and B, using k-fold cross-validation we obtain test error rates ε1^A, ε2^A, ..., εk^A and ε1^B, ε2^B, ..., εk^B, where εi^A and εi^B are obtained on the same i-th training/test fold. A k-fold cross-validation "paired t-test" is then used for the comparison. The basic idea is that if the two learners perform the same, they should have the same test error rate on the same training/test fold, i.e., εi^A = εi^B.
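A sketch of the paired t-test on per-fold error rates using scipy; the two arrays are made-up stand-ins for εi^A and εi^B:

```python
import numpy as np
from scipy import stats

# Test error rates of learners A and B on the same 10 cross-validation folds (made-up).
err_A = np.array([0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.15, 0.10, 0.11, 0.13])
err_B = np.array([0.14, 0.13, 0.15, 0.12, 0.16, 0.13, 0.16, 0.12, 0.13, 0.15])

# Paired t-test on the per-fold differences; a small p-value suggests the
# two learners' performance differs significantly.
t_stat, p_value = stats.ttest_rel(err_A, err_B)
print(t_stat, p_value)
```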

3. McNemar test

  For a binary classification problem, if the two learners perform the same, then the number of samples that learner A classifies correctly while learner B classifies incorrectly should equal the number that B classifies correctly while A classifies incorrectly.
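A minimal sketch of the McNemar statistic from the two off-diagonal counts of the learners' contingency table; the counts are made-up, and the continuity-corrected chi-square form used here is one common variant:

```python
from scipy.stats import chi2

# e01: samples learner A got right but learner B got wrong (made-up count)
# e10: samples learner A got wrong but learner B got right (made-up count)
e01, e10 = 25, 12

# Continuity-corrected McNemar statistic, compared against chi-square with 1 dof.
stat = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
p_value = chi2.sf(stat, df=1)
print(stat, p_value)
```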

4. Friedman test and Nemenyi follow-up test

  Both the cross-validation t-test and the McNemar test compare the performance of two algorithms on a single data set. Often, however, we want to compare multiple algorithms across a set of data sets. There are then two approaches: compare the algorithms pairwise, or compare them all at once with the Friedman test, which is based on algorithm rankings.
When the Friedman test shows that the algorithms' performance differs significantly, a post-hoc test is needed to further distinguish the algorithms; the Nemenyi post-hoc test is commonly used.

5. Bias and Variance

Take the regression task as an example:
  Variance: for each prediction, take (the predicted value − the expected prediction)², sum over the predictions, and divide by their number.
  Bias: the difference between the expected prediction and the true label.

The generalization error can be decomposed into the sum of bias², variance, and noise, i.e., generalization error = bias² + variance + noise.

  In general, bias and variance conflict with each other; this is known as the bias-variance dilemma. Given a learning task, and assuming we can control how thoroughly the learning algorithm is trained: when training is insufficient, bias dominates the generalization error; as training deepens, variance gradually comes to dominate.
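A rough simulation of the bias-variance decomposition on synthetic data: many training sets are drawn, a model is fit to each, and bias² and variance of the predictions are estimated at fixed test points. The sine target, noise level, and tree depth are all illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
noise = 0.3
x_test = np.linspace(0, 1, 50)[:, None]           # fixed test points

def f_true(x):
    return np.sin(2 * np.pi * x)                  # assumed "true" function

preds = []
for _ in range(200):                              # 200 independent training sets
    x_tr = rng.uniform(0, 1, (30, 1))
    y_tr = f_true(x_tr).ravel() + rng.normal(0, noise, 30)
    model = DecisionTreeRegressor(max_depth=6).fit(x_tr, y_tr)
    preds.append(model.predict(x_test))
preds = np.array(preds)                           # shape (200, 50)

expected_pred = preds.mean(axis=0)                # expected prediction at each test point
bias2 = np.mean((expected_pred - f_true(x_test).ravel()) ** 2)
variance = np.mean(preds.var(axis=0))
print("bias^2:", bias2, "variance:", variance, "noise:", noise ** 2)
```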



Summary

Learning machine learning well requires a certain mathematical foundation. Lay a solid foundation in these basics first; the more advanced material will then be much easier to understand.

Origin blog.csdn.net/G_Shengn/article/details/127548510