Machine Learning: Model Evaluation and Selection (Chapter 2) After-Class Exercises

1 Exercises

1.1 Question 1

The data set contains 1000 samples: 500 positive examples and 500 negative examples. For hold-out evaluation, it is split into a training set containing 70% of the samples and a test set containing the remaining 30%. Estimate how many such splits there are.

Answer: The "hold-lout" method directly divides the data set D into two mutually exclusive sets. The division of training set and test set should maintain the consistency of data distribution as much as possible. According to the requirements of the question, 700 training samples need to be selected as the training set and 300 test samples as the test set. The ratio of positive examples to negative examples is 1:1, that is, there are 350 positive examples and 350 negative examples in the training set, and 150 in the test set. positive examples and 150 negative examples. Shared C_{500}^{350}\times C_{500}^{350}species.

1.2 Question 2

The data set contains 100 samples, half of which are positive and half negative examples. Assume the model produced by the learning algorithm predicts every new sample as the class with more training samples (guessing randomly in case of a tie). Evaluate the error rate using 10-fold cross-validation and the leave-one-out method respectively.

Answer: 10-fold cross-validation: since each training fold contains the same number of positive and negative examples, the model guesses randomly, so the probabilities of predicting positive and negative are equal and the expected error rate is 50%. Leave-one-out method: if a positive example is held out as the test sample, the training set contains positive and negative examples in the ratio 49:50, so the model predicts negative and the error is 1; if a negative example is held out, the ratio is 50:49, the model predicts positive, and the error is again 1. In summary, the leave-one-out error rate is 100%.
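The leave-one-out result is easy to verify with a short simulation (a sketch using scikit-learn's `LeaveOneOut`; the majority-class "learner" is hand-written here to match the question's assumption):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# 100 samples: 50 positives (1) and 50 negatives (0). The features are dummies
# because this "learner" only looks at the class counts of its training set.
X = np.zeros((100, 1))
y = np.array([1] * 50 + [0] * 50)
rng = np.random.default_rng(0)

def majority_predict(y_train):
    pos = int(np.sum(y_train == 1))
    neg = len(y_train) - pos
    if pos == neg:                        # tie -> random guess
        return int(rng.integers(2))
    return int(pos > neg)                 # otherwise the majority class

errors = sum(majority_predict(y[tr]) != y[te][0]
             for tr, te in LeaveOneOut().split(X))
print(f"leave-one-out error rate: {errors / len(y):.0%}")  # 100%
```

Every held-out positive leaves a 49:50 training split and is predicted negative (and vice versa), so all 100 predictions are wrong.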

1.3 Question 3

If the F1 value of learner A is higher than that of learner B, analyze whether the BEP value of A is also higher than that of B.

Answer:

An F1 value can be computed at any operating point on the P-R curve.

The BEP, by contrast, is one fixed point on the curve: the Break-Even Point (BEP) is the value at which precision equals recall.

Now consider the region to the right of the intersection of the P-R curves of learners A and B. Suppose A passes through the point (P, R) = (0.85, 0.5) and B passes through (0.9, 0.5). Then:

F1(A) = (2*0.85*0.5)/(0.85+0.5) = 0.630

F1(B) = (2*0.9*0.5)/(0.9+0.5) = 0.643

Thus F1(B) > F1(A).

Therefore, although F1(B) > F1(A) at these operating points, it does not follow that B's BEP is greater than A's: F1 depends on which point of the curve is chosen, while the BEP is fixed by the curve itself. In general, a higher F1 value does not imply a higher BEP.
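The arithmetic behind the counterexample (a minimal sketch; the two (P, R) points are the hypothetical ones chosen above):

```python
def f1(p, r):
    # F1 is the harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

print(f"F1(A) = {f1(0.85, 0.5):.3f}")  # 0.630
print(f"F1(B) = {f1(0.90, 0.5):.3f}")  # 0.643
```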

1.4 Question 4

Describe the relationship between true positive rate (TPR), false positive rate (FPR), precision rate (P), and recall rate (R).

Answer:

Recall (R): the proportion of actual positive examples that are predicted positive. R=TP/(TP+FN)

True positive rate (TPR): the proportion of actual positive examples that are predicted positive. TPR=TP/(TP+FN)

Clearly, recall (R) = true positive rate (TPR).

Precision (P): the proportion of predicted positive examples that are actually positive. P=TP/(TP+FP)

False positive rate (FPR): the proportion of actual negative examples that are predicted positive. FPR=FP/(FP+TN)

Precision and recall are generally in tension: the higher the precision, the lower the recall; the higher the recall, the lower the precision.
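The four definitions translate directly into code (a minimal sketch with made-up confusion-matrix counts):

```python
def metrics(tp, fp, fn, tn):
    # The three distinct quantities from a binary confusion matrix.
    return {
        "R = TPR": tp / (tp + fn),   # recall equals the true positive rate
        "P":       tp / (tp + fp),
        "FPR":     fp / (fp + tn),
    }

# Hypothetical counts, just to exercise the formulas.
print(metrics(tp=40, fp=10, fn=10, tn=40))
# {'R = TPR': 0.8, 'P': 0.8, 'FPR': 0.2}
```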

1.5 Question 5

Try to prove formula (2.22) of the book: AUC = 1 - \ell_{rank}.

Answer: see the proof in "Machine Learning (Xigua Book) Chapter 2 Model Evaluation and Selection After-Class Exercises" on Jianshu (jianshu.com).

1.6 Question 6

Describe the relationship between error rate and ROC curve.

Answer: The error rate is usually defined as the proportion of misclassified samples among the total number of samples.

Any point on the ROC curve corresponds to a pair (FPR, TPR), where

TPR=TP/(TP+FN)

FPR=FP/(FP+TN)

The error rate corresponds to \frac{FN+FP}{m_{+}+m_{-}}, where m_{+} and m_{-} denote the numbers of positive and negative samples.

Since FN=(1-TPR)m_{+} and FP=FPR\,m_{-}, each point on the ROC curve corresponds to the error rate

\frac{\left ( 1-TPR \right )m_{+}+FPR\,m_{-}}{m_{+}+m_{-}}
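The identity can be checked on hypothetical counts (a minimal sketch; the class sizes and confusion counts below are made up):

```python
m_pos, m_neg = 60, 40        # hypothetical class sizes m+ and m-
tp, fn = 45, 15              # fn = m_pos - tp
fp, tn = 8, 32               # tn = m_neg - fp

tpr, fpr = tp / m_pos, fp / m_neg
direct  = (fn + fp) / (m_pos + m_neg)                          # definition
via_roc = ((1 - tpr) * m_pos + fpr * m_neg) / (m_pos + m_neg)  # from (FPR, TPR)

assert abs(direct - via_roc) < 1e-12
print(direct, via_roc)       # 0.23 0.23
```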

1.7 Question 7

Try to prove that any ROC curve has a cost curve corresponding to it, and vice versa.

Answer:

True positive rate: TPR=\frac{TP}{TP+FN}

False positive rate: FPR=\frac{FP}{TN+FP}

False negative rate: FNR=1-TPR

By definition, as the classification threshold varies, TPR and FPR both increase from 0 to 1, so FNR decreases from 1 to 0.

Each point (FPR, TPR) on the ROC curve induces a line segment in the cost plane running from (0, FPR) to (1, FNR). Hence every ROC curve corresponds to a set of cost line segments: the ROC point (0, 0) gives the first segment, from (0, 0) to (1, 1), and the point (1, 1) gives the last, from (0, 1) to (1, 0).

Taking all these cost line segments together, the area under their lower envelope is the expected overall cost; the envelope itself is the cost curve, and it necessarily runs from (0, 0) to (1, 0).

With a finite number of samples the ROC curve is a polyline, and in this case the ROC curve cannot be recovered from the cost curve, because segments lying strictly above the lower envelope leave no trace in it. But with (theoretically) infinitely many samples the ROC curve is continuous, and so is the cost curve; the tangent at each point of the cost curve is one of the segments, whose values at x = 0 and x = 1 give FPR and FNR, and hence TPR. This recovers a unique ROC curve.
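The forward direction (ROC to cost curve) is mechanical to compute (a sketch with made-up ROC points; each point (FPR, TPR) becomes the segment FNR*p + FPR*(1-p) over the probability-cost axis p):

```python
import numpy as np

# Hypothetical ROC operating points (FPR, TPR) of a finite-sample classifier.
roc_points = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]

p = np.linspace(0.0, 1.0, 201)          # x-axis of the cost plane
# Each ROC point maps to a segment running from (0, FPR) to (1, FNR).
segments = np.array([(1 - tpr) * p + fpr * (1 - p) for fpr, tpr in roc_points])

cost_curve = segments.min(axis=0)       # lower envelope = expected overall cost
print(cost_curve[[0, 50, 100, 150, 200]].round(3))  # starts and ends at 0
```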

1.8 Question 8

Describe the advantages and disadvantages of min-max normalization and z-score normalization.

Answer:

Normalization converts the original metric values into dimensionless values: by scaling the attribute data, the entire value range of a given attribute is mapped through a function into a new value range, i.e., each old value is replaced by a new one.

Min-max normalization:

x^{'}=\frac{x-x_{min}}{x_{max}-x_{min}}\left(x_{max}^{'}-x_{min}^{'}\right)+x_{min}^{'}

Scaling factor: \frac{x_{max}^{'}-x_{min}^{'}}{x_{max}-x_{min}}; offset of the original data: x-x_{min}.

The formula rescales the offset of the original data in proportion, then adds the rescaled offset to the minimum of the new range to obtain the new value.

Advantages: the relative offsets among the original data are preserved; the target range after normalization can be specified; and it is the computationally cheapest method.

Disadvantages: the maximum and minimum of the target range must be known in advance; if new data fall outside the original [x_{min}, x_{max}] range, they go "out of bounds" and all previous results must be recomputed; and if the original data contain extreme outliers (i.e., the maximum or minimum lies far from the rest), most of the normalized data end up tightly clustered and hard to distinguish.

z-score normalization:

x^{'}=\frac{x-\bar{x}}{\sigma_{x}}

The formula subtracts the mean \bar{x} of all the original data from each value and divides by the standard deviation \sigma_{x}; the result is the normalized value.

Rearranging the formula shows that z-score normalization expresses each value's deviation from the mean as a multiple of the root-mean-square deviation of the whole data set.

Advantages and disadvantages: whenever data are added to or removed from the original set, the mean \bar{x} and standard deviation \sigma_{x} may change and must be recomputed, and the amount of computation is relatively large; on the other hand, the normalized data are distributed roughly within [-1, 1], the distribution is relatively dense, and the sensitivity to outliers is relatively low.
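Both normalizations are one-liners in NumPy (a minimal sketch; the data, including the outlier at 100, are made up to illustrate the points above):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])    # note the extreme outlier

# Min-max normalization onto the new range [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# z-score normalization: subtract the mean, divide by the standard deviation.
zscore = (x - x.mean()) / x.std()

print(minmax.round(3))   # non-outliers squeezed together near 0
print(zscore.round(3))   # outlier stands out, but the rest stay spread apart
```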

1.9 Question 9

Omitted.

1.10 Question 10

Omitted.

