Machine Learning (Zhihua Zhou) Reading Notes---Chapter 2 Model Evaluation and Selection

*2.1 Empirical error and overfitting*
Keywords:
error rate, accuracy, error, training error (empirical error), generalization error, overfitting, underfitting
Key concept explanation:
1. Common remedies for underfitting: expand branches in decision tree learning, or increase the number of training epochs in neural network learning


*2.2 Evaluation method*
Keywords:
test set, hold-out method, cross-validation method (k-fold cross-validation), bootstrap method, parameter tuning
Key concept explanation:
1. The test set should be mutually exclusive with the training set as far as possible
2. Hold-out method: directly partition the data set D into two mutually exclusive sets, one used as the training set S and the other as the test set T
3. Cross-validation method: first partition the data set D into k mutually exclusive subsets of similar size, each keeping the data distribution as consistent as possible; then in each round use the union of k-1 subsets as the training set and the remaining subset as the test set, giving k rounds of training and testing
4. Bootstrap method: sample from D with replacement m times to obtain the training set D'; the samples of D that never appear in D' (i.e. D \ D') are used as the test set
5. Parameter tuning: train a model for each candidate parameter configuration, then keep the parameters of the best-performing model as the result

2.2.1 Hold-out method
Notes:
1. The train/test split should keep the data distribution as consistent as possible, e.g. by stratified sampling (see the sketch below)
2. A single split may be unreliable, so use several random splits, repeat the experimental evaluation, and report the average as the hold-out result
3. There is no perfect split ratio; commonly about 2/3 to 4/5 of the samples are used for training and the rest for testing
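The notes contain no code; below is a minimal hold-out sketch using scikit-learn. The toy data, `test_size=0.3`, and `random_state` are my own assumptions, not from the book.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data set D: 100 samples, 2 features, binary labels (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# stratify=y gives the stratified sampling of note 1; test_size=0.3 keeps
# roughly 2/3 of D for training, in line with note 3. Repeating with different
# random_state values and averaging the evaluations corresponds to note 2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
print(len(X_train), len(X_test))   # 70 30
```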

2.2.2 Cross-validation method
Notes:
1. There are many ways to partition the data set into k subsets; to reduce the error introduced by any particular partition, repeat the partition p times with different random splits and average the results (e.g. "10 times 10-fold cross-validation"), as in the sketch below
2. Leave-one-out method: each subset contains a single sample; the evaluation is often quite accurate, but the overhead is prohibitive when the data set is large
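A minimal 10-fold cross-validation sketch (my own illustration, not from the book; the synthetic data and the choice of `LogisticRegression` are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# StratifiedKFold keeps the class distribution consistent in each subset.
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))
print("mean 10-fold CV error:", np.mean(errors))

# "10 times 10-fold CV" repeats the above with 10 different random partitions
# and averages the 100 results; leave-one-out corresponds to n_splits = len(X).
```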

2.2.3 Bootstrapping
Notes:
1. About 36.8% of the initial data set never appears in the sampled training set, since $(1-\frac{1}{m})^m \to \frac{1}{e} \approx 0.368$; these samples are used as the test set, and the resulting estimate is called the out-of-bag estimate (see the sketch below)
2. Bootstrapping is useful when the data set is small and it is hard to split training/test sets effectively, but it changes the distribution of the initial data set, which can introduce estimation bias
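A small sketch of bootstrap sampling and the out-of-bag test set (my own illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
D = np.arange(m)                              # indices of the m samples in D

boot = rng.choice(D, size=m, replace=True)    # sample m times with replacement -> D'
oob = np.setdiff1d(D, boot)                   # D \ D', used as the test set

# The fraction of samples never drawn approaches (1 - 1/m)^m -> 1/e ~ 36.8%.
print(len(oob) / m)
```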

2.2.4 Parameter tuning and the final model
Note:
1. The data set used for evaluation and testing during model selection is often called the validation set, while the data encountered when the learned model is actually used is called test data. In practice the training data is divided into a training set and a validation set; model selection and parameter tuning are based on the performance on the validation set, and the test set is used to estimate the generalization ability. After the model and parameters are chosen, the learner is retrained on the whole data set to obtain the final model (a sketch of this workflow follows below).
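A minimal sketch of tuning on a validation set and then retraining the final model on all training data. The data, the candidate values of `C`, and the use of `LogisticRegression` are my own assumptions, not from the book.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic training data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Split the training data into a training part and a validation part.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Train one model per candidate parameter value; keep the best on validation.
candidates = [0.01, 0.1, 1.0, 10.0]      # hypothetical candidate values for C
best_C = max(candidates,
             key=lambda C: LogisticRegression(C=C).fit(X_tr, y_tr).score(X_val, y_val))

# Final model: retrain with the chosen parameter on the full training data.
final_model = LogisticRegression(C=best_C).fit(X, y)
print("chosen C:", best_C)
```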

2.3 Performance Metrics
Keywords:
precision, recall, P-R curve, mean squared error, error rate, accuracy, break-even point (BEP), F1, Fβ, macro-F1, micro-F1, ROC, AUC, rank loss, unequal cost, cost-sensitive error rate, cost curve
Mean squared error for a given sample set:
Discrete (a sample set D with m samples): $E(f;D)=\frac{1}{m}\sum_{i=1}^{m}\big(f(x_i)-y_i\big)^2$
Continuous (a data distribution $\mathcal{D}$ with density $p(\cdot)$): $E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\big(f(x)-y\big)^2\,p(x)\,\mathrm{d}x$

2.3.1 Error rate and accuracy
Classification error rate:
Discrete: $E(f;D)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\big(f(x_i)\neq y_i\big)$
Continuous: $E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\mathbb{I}\big(f(x)\neq y\big)\,p(x)\,\mathrm{d}x$
Accuracy:
Discrete: $\mathrm{acc}(f;D)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\big(f(x_i)=y_i\big)=1-E(f;D)$
Continuous: $\mathrm{acc}(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\mathbb{I}\big(f(x)=y\big)\,p(x)\,\mathrm{d}x=1-E(f;\mathcal{D})$

2.3.2 Precision and recall
For the binary classification problem, examples can be divided, according to the combination of their true class and the class predicted by the learner, into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Precision: $P=\frac{TP}{TP+FP}$
Recall: $R=\frac{TP}{TP+FN}$
Precision and recall are conflicting measures: in general, raising one tends to lower the other.


Note:
1. If one learner's P-R curve is completely enveloped by another learner's curve, the latter can be asserted to perform better; when the curves cross, a common practice is to compare the areas under the two curves

2. Performance measures derived from P and R: BEP, F1, Fβ, macro-F1, micro-F1 (see the sketch after this list)
BEP (break-even point): the value at which P = R
F1: $F1=\frac{2\times P\times R}{P+R}$
Fβ: $F_\beta=\frac{(1+\beta^2)\times P\times R}{\beta^2\times P+R}$
β > 1 gives more weight to recall, β < 1 gives more weight to precision, and β = 1 reduces to F1.
Macro averaging: compute the precision and recall on each of the n confusion matrices, then average:
$\text{macro-}P=\frac{1}{n}\sum_{i=1}^{n}P_i,\quad \text{macro-}R=\frac{1}{n}\sum_{i=1}^{n}R_i,\quad \text{macro-}F1=\frac{2\times\text{macro-}P\times\text{macro-}R}{\text{macro-}P+\text{macro-}R}$
Micro averaging: average the corresponding elements of the confusion matrices first to get $\overline{TP},\overline{FP},\overline{TN},\overline{FN}$, then:
$\text{micro-}P=\frac{\overline{TP}}{\overline{TP}+\overline{FP}},\quad \text{micro-}R=\frac{\overline{TP}}{\overline{TP}+\overline{FN}},\quad \text{micro-}F1=\frac{2\times\text{micro-}P\times\text{micro-}R}{\text{micro-}P+\text{micro-}R}$
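A small sketch (my own, not from the book) of P, R, F1 and the macro/micro averages, given several binary confusion matrices as (TP, FP, TN, FN) tuples; the counts are made up for illustration:

```python
import numpy as np

def p_r_f1(TP, FP, TN, FN):
    # Precision, recall, F1 from one binary confusion matrix.
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    return P, R, 2 * P * R / (P + R)

matrices = [(40, 10, 45, 5), (30, 20, 40, 10), (35, 15, 42, 8)]  # made-up counts

# Macro: compute P and R on each matrix, then average.
stats = [p_r_f1(*m) for m in matrices]
macro_P = np.mean([s[0] for s in stats])
macro_R = np.mean([s[1] for s in stats])
macro_F1 = 2 * macro_P * macro_R / (macro_P + macro_R)

# Micro: average the matrix elements first, then compute P, R, F1.
TP, FP, TN, FN = np.mean(matrices, axis=0)
micro_P, micro_R, micro_F1 = p_r_f1(TP, FP, TN, FN)

print(macro_F1, micro_F1)
```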

2.3.3 ROC and AUC
Notes:
1. ROC (Receiver Operating Characteristic) curve: the true positive rate (TPR) is on the vertical axis and the false positive rate (FPR) is on the horizontal axis
TPR: $TPR=\frac{TP}{TP+FN}$
FPR: $FPR=\frac{FP}{TN+FP}$
2. On the ROC curve:
(0, 1) corresponds to the ideal model that ranks all positive examples before all negative examples;
(0, 0) corresponds to predicting every example as negative.
When two ROC curves cross, the area under the ROC curve, i.e. the AUC, is used to compare which learner is better (a sketch follows below).
AUC, from the points $\{(x_1,y_1),\dots,(x_m,y_m)\}$ on the curve: $AUC=\frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1}-x_i)(y_i+y_{i+1})$
Rank loss (the area above the ROC curve):
$\ell_{rank}=\frac{1}{m^{+}m^{-}}\sum_{x^{+}\in D^{+}}\sum_{x^{-}\in D^{-}}\Big(\mathbb{I}\big(f(x^{+})<f(x^{-})\big)+\frac{1}{2}\mathbb{I}\big(f(x^{+})=f(x^{-})\big)\Big)$
$AUC=1-\ell_{rank}$
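A small sketch (my own, not from the book) computing AUC as $1-\ell_{rank}$, i.e. by counting how the learner ranks positive/negative example pairs; the scores are made up for illustration:

```python
import numpy as np

scores_pos = np.array([0.9, 0.8, 0.6, 0.55])   # f(x+) for positive examples (assumed)
scores_neg = np.array([0.7, 0.5, 0.4])         # f(x-) for negative examples (assumed)

m_pos, m_neg = len(scores_pos), len(scores_neg)
wrong = (scores_pos[:, None] < scores_neg[None, :]).sum()   # positive ranked below a negative
ties = (scores_pos[:, None] == scores_neg[None, :]).sum()   # ties count half

l_rank = (wrong + 0.5 * ties) / (m_pos * m_neg)
auc = 1 - l_rank
print(auc)
```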

2.3.4 Cost-sensitive error rate and cost curve

Notes:
Cost-sensitive error rate (for binary classification, with $cost_{01}$ the cost of misclassifying a positive example as negative and $cost_{10}$ the cost of misclassifying a negative example as positive):
$E(f;D;cost)=\frac{1}{m}\Big(\sum_{x_i\in D^{+}}\mathbb{I}\big(f(x_i)\neq y_i\big)\times cost_{01}+\sum_{x_i\in D^{-}}\mathbb{I}\big(f(x_i)\neq y_i\big)\times cost_{10}\Big)$
Cost curve:
The horizontal axis is the probability cost of a positive example, where p is the prior probability that an example is positive:
$P(+)cost=\frac{p\times cost_{01}}{p\times cost_{01}+(1-p)\times cost_{10}}$
The vertical axis is the normalized cost:
$cost_{norm}=\frac{FNR\times p\times cost_{01}+FPR\times(1-p)\times cost_{10}}{p\times cost_{01}+(1-p)\times cost_{10}}$
Drawing method: each point (FPR, TPR) on the ROC curve corresponds to a line segment on the cost plane that connects (0, FPR) and (1, FNR), where FNR = 1 - TPR.
Geometric meaning: the area under a segment is the expected total cost under that condition, and the area under the lower envelope of all segments is the expected total cost of the learner over all conditions (a sketch of one segment follows below).
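A small sketch (my own, not from the book) of one cost-plane line segment, built from a single ROC point (FPR, TPR) and assumed misclassification costs:

```python
import numpy as np

FPR, TPR = 0.2, 0.8              # one ROC point (assumed)
FNR = 1 - TPR
cost01, cost10 = 5.0, 1.0        # assumed costs: missing a positive vs. a false alarm

p = np.linspace(0, 1, 101)       # prior probability that an example is positive
P_cost = p * cost01 / (p * cost01 + (1 - p) * cost10)            # horizontal axis
cost_norm = (FNR * p * cost01 + FPR * (1 - p) * cost10) \
            / (p * cost01 + (1 - p) * cost10)                    # vertical axis

# The segment runs from (0, FPR) to (1, FNR); the area under it is the expected
# total cost for this threshold. np.trapz integrates it numerically.
print(np.trapz(cost_norm, P_cost))
```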


2.4 Comparative Test
Keywords: statistical hypothesis test
2.4.1 Hypothesis testing
1. A hypothesis in hypothesis testing is some judgment or conjecture about the distribution of the learner's generalization error rate
2. The probability that a learner with generalization error rate $\epsilon$ is observed to have test error rate $\hat{\epsilon}$ on a test set of m samples:
$P(\hat{\epsilon};\epsilon)=\binom{m}{\hat{\epsilon}\times m}\epsilon^{\hat{\epsilon}\times m}(1-\epsilon)^{m-\hat{\epsilon}\times m}$
3. Binomial test: to test the hypothesis "$\epsilon\le\epsilon_0$" at significance level $\alpha$ (confidence $1-\alpha$), the critical value of the test error rate is
$\bar{\epsilon}=\min\epsilon\ \ \text{s.t.}\ \ \sum_{i=\epsilon\times m+1}^{m}\binom{m}{i}\epsilon_0^{i}(1-\epsilon_0)^{m-i}<\alpha$
4. If the observed test error rate $\hat{\epsilon}$ is less than the critical value $\bar{\epsilon}$, then with confidence $1-\alpha$ the generalization error rate of the learner is considered not greater than $\epsilon_0$; otherwise the hypothesis is rejected (a sketch of this test appears after this list).
5. When the hold-out method is repeated several times, k test error rates $\hat{\epsilon}_1,\dots,\hat{\epsilon}_k$ are obtained and a t-test can be used instead.
Average test error rate and variance:
$\mu=\frac{1}{k}\sum_{i=1}^{k}\hat{\epsilon}_i,\qquad \sigma^2=\frac{1}{k-1}\sum_{i=1}^{k}(\hat{\epsilon}_i-\mu)^2$
So the statistic
$\tau_t=\frac{\sqrt{k}\,(\mu-\epsilon_0)}{\sigma}$
obeys a t-distribution with k-1 degrees of freedom.
Use a two-sided hypothesis: if $|\tau_t|$ lies within the critical range $[-t_{\alpha/2},\,t_{\alpha/2}]$, the hypothesis "$\mu=\epsilon_0$" cannot be rejected at confidence $1-\alpha$; otherwise it is rejected.
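A small sketch (my own, not from the book) of the binomial test in note 3-4, using scipy; the values of m, eps0, alpha and the observed error rate are assumptions for illustration:

```python
from scipy.stats import binom

m = 100          # number of test samples (assumed)
eps0 = 0.3       # hypothesised upper bound on the generalization error rate
alpha = 0.05     # significance level
eps_hat = 0.26   # observed test error rate (assumed)

# Smallest error count k such that P(errors > k | eps = eps0) < alpha.
k_crit = int(binom.ppf(1 - alpha, m, eps0))
eps_bar = k_crit / m  # critical test error rate

if eps_hat * m <= k_crit:
    print(f"cannot reject H0: with confidence {1 - alpha:.2f}, eps <= {eps0}")
else:
    print("reject H0: the observed error rate exceeds the critical value", eps_bar)
```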
2.4.2 Cross-validation t-test
1. To compare the performance of two learners A and B, use the "paired t-test" with k-fold cross-validation: train both learners on the same folds to obtain k pairs of test error rates, compute the per-fold differences $\Delta_i=\hat{\epsilon}_i^{A}-\hat{\epsilon}_i^{B}$, and run a t-test on them; under the hypothesis that the two learners perform the same, the mean of the differences should be zero (see the sketch after this list).

2. A premise of the hypothesis test is that the test error rates are independent samples of the generalization error rate; in cross-validation the training sets overlap, so the error rates are not independent, which leads to overestimating the probability that the hypothesis holds. The 5x2 cross-validation method (five runs of 2-fold cross-validation, each with a different random partition) is used to alleviate this problem.
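A minimal sketch (my own, not from the book) of the paired t-test in note 1; the per-fold error rates below are made-up placeholders:

```python
import numpy as np
from scipy import stats

# Error rates of learners A and B on the same k = 10 folds (made up).
err_A = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.12, 0.14, 0.10, 0.11, 0.12])
err_B = np.array([0.14, 0.13, 0.15, 0.13, 0.14, 0.15, 0.16, 0.12, 0.13, 0.14])

d = err_A - err_B                       # per-fold differences
k = len(d)
mu, sigma = d.mean(), d.std(ddof=1)
tau_t = abs(np.sqrt(k) * mu / sigma)    # |t| statistic with k-1 degrees of freedom

# Equivalent call using scipy's paired t-test.
t_stat, p_value = stats.ttest_rel(err_A, err_B)

alpha = 0.05
crit = stats.t.ppf(1 - alpha / 2, df=k - 1)
print("reject 'same performance'" if tau_t > crit else "cannot reject", p_value)
```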

2.5 Bias and variance
1. Bias-variance decomposition is an important tool for explaining the generalization performance of learning algorithms.

For a regression task, let $\bar{f}(x)=\mathbb{E}_D[f(x;D)]$ be the expected prediction, $var(x)=\mathbb{E}_D[(f(x;D)-\bar{f}(x))^2]$ the variance, and $\varepsilon^2=\mathbb{E}_D[(y_D-y)^2]$ the noise.
The difference between the expected output and the true label is called the bias, $bias^2(x)=(\bar{f}(x)-y)^2$, and the generalization error can be decomposed as:
$E(f;D)=bias^2(x)+var(x)+\varepsilon^2$

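A small simulation (my own, not from the book) illustrating the decomposition: train the same class of model on many random training sets and split the expected squared error at fixed test points into bias, variance, and noise. The data-generating function, the degree-3 polynomial model, and the noise level are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)

def sample_training_set(m=30, noise_std=0.3):
    # One random training set D drawn from the assumed data distribution.
    x = rng.uniform(0, 2 * np.pi, m)
    y = true_fn(x) + rng.normal(0, noise_std, m)
    return x, y

x_test = np.linspace(0, 2 * np.pi, 50)
noise_std = 0.3
preds = []
for _ in range(200):                      # many independent training sets D
    x_tr, y_tr = sample_training_set(noise_std=noise_std)
    coef = np.polyfit(x_tr, y_tr, 3)      # assumed model class: degree-3 polynomial
    preds.append(np.polyval(coef, x_test))
preds = np.asarray(preds)                 # shape (200, 50)

f_bar = preds.mean(axis=0)                        # expected prediction f_bar(x)
bias2 = (f_bar - true_fn(x_test)) ** 2            # bias^2(x)
var = preds.var(axis=0)                           # var(x)
noise = noise_std ** 2                            # epsilon^2

# Generalization error ~ bias^2 + var + noise at each test point.
print(bias2.mean(), var.mean(), noise)
```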