Machine Learning Notes 2 - Comparison Test


1. Why introduce comparison tests

If we want to compare the performance of two learners, it is not enough to simply obtain a value of some performance measure for each and compare the two numbers quantitatively. In machine learning, the comparison of learner performance involves many factors, and these factors can sometimes conflict with one another.

The main factors we need to consider when evaluating learner performance are the following:

  • Generalization ability, that is, the ability of a learner to adapt to new samples. For example, when preparing for the college entrance examination, most of us work through the "5-3" practice books and answer a great many questions, yet on the actual exam some students do well and some do not. The reason is that the exam questions are never the exact questions we trained on; no one has seen them before. Students differ in their ability to draw inferences from the examples they have practiced, so their final results differ widely. Transferred to machine learning, this ability to draw inferences from one example is the generalization ability of the learner.
  • The choice of test set: test sets of different sizes, or containing different samples, may lead to different evaluation results.
  • The randomness of the algorithm: because machine learning is something of a black box, we cannot observe its exact running process, and the same test set, run with the same parameters on the same learner, may produce different results.

Therefore, we need to specifically study the comparative testing problem of machine learning.

2. Several comparison test methods

1. Hypothesis testing - binomial test

First introduce two concepts:

  • The first is the generalization error rate $e$: the probability that the learner misclassifies a sample in the general case. In practice, we cannot know its exact value;
  • The second is the test error rate $e'$: the learner misclassifies exactly $e'm$ of the $m$ test samples. This is generally the only value we can obtain;

The difference between the two is that the generalization error rate is a theoretical value that cannot be obtained directly, while the test error rate is a value we can measure. The method of statistical hypothesis testing is to use $e'$ to estimate the value of $e$.

Suppose a learner with generalization error rate $e$ is tested on a test set of size $m$. Each sample can only be classified correctly or incorrectly; exactly $e'm$ samples are misclassified and the rest are classified correctly, so this is a typical binomial setting. The distribution of the test error rate is then:

$$P(e';e)=\binom{m}{e'm}\, e^{e'm}(1-e)^{m-e'm}$$
Taking the first derivative of $P(e';e)$ with respect to $e$ and setting $\partial P(e';e)/\partial e=0$, we find that $P(e';e)$ is maximized when $e=e'$ and decreases as $|e-e'|$ increases, just as a binomial distribution should behave.
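
This peak can be checked numerically. Below is a quick sketch, assuming scipy is available; $m$ and the error count are made-up numbers.

```python
# A quick numerical check that P(e'; e) peaks at e = e', assuming scipy;
# m and the error count are hypothetical.
import numpy as np
from scipy.stats import binom

m = 10                              # test set size (hypothetical)
errors = 3                          # e'm = 3, i.e. e' = 0.3
es = np.linspace(0.01, 0.99, 99)    # candidate generalization error rates e
p = binom.pmf(errors, m, es)        # P(e'; e) for each candidate e
print("P(e'; e) peaks at e =", round(es[p.argmax()], 2))  # ~0.3, i.e. e = e'
```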

Then comes the testing process. Since the Xigua book's treatment is a bit obscure, I will try to describe the binomial hypothesis test in a simpler way.

Premise: for the distribution function, we know it is a binomial distribution, but the parameter of this binomial distribution (there is only one unknown parameter, the success probability $p$, which in this example is the generalization error rate) is unknown.

Formulating the hypothesis

  • We artificially specify a "threshold range" for the parameter value. In this example, we hypothesize that $e \leq e_0$;

  • Then, based on practical needs or some standard formula (I won't go into detail here; it depends on the specific problem), we artificially set a significance level $\alpha$ for the test. This variable represents the standard for an obvious error. For example, if the probability of more than 6 errors under this binomial distribution is $\alpha$, then setting the threshold at $\alpha$ means: if the learner makes more than 6 classification errors, I consider it an obvious error, and the hypothesis does not hold;

  • Conversely, $1-\alpha$ represents the standard that does not constitute an obvious error, that is, the level we can trust;

We now use this $1-\alpha$ to deduce whether, at this confidence level, our hypothesized $e_0$ is reasonable.

Judging whether the hypothesis is reasonable

There are two methods for judging this; I will mainly introduce the one in the Xigua book.

The general idea is: using the preset $1-\alpha$ as a constraint, find the largest region under the distribution whose probability mass does not exceed this value. In the book's example a bar chart is used: multiply each bar's height by its width, sum the areas bar by bar, find the largest total that does not exceed $1-\alpha$, and record the error rate $e_i$ at which this happens.

The calculation can be written as:

$$\bar{e}=\max\, e' \quad \text{s.t.} \quad \sum_{i=0}^{e'm}\binom{m}{i}\, e_0^{\,i}(1-e_0)^{m-i} \le 1-\alpha$$

which means: starting from the leftmost bar of the chart, accumulate the "height times width" areas bar by bar, stopping just before the total exceeds $1-\alpha$.

The $e'$ found at this point, i.e. the corresponding abscissa $\bar{e}$, is the acceptable critical value. We then compare the measured test error rate with it: if the test error rate is below this critical value, the hypothesis $e \le e_0$ is acceptable and reasonable at this confidence level; otherwise it is unreasonable and must be re-formulated.

In fact, at the end of the day, this is a somewhat "self-scripted, self-directed" process. To sum up: first get a rough threshold from the shape of the binomial distribution, then find a more precise critical point within that threshold. For different learners, we can use this method with manually chosen parameters to compare the quality of two learners. A runnable sketch of the procedure follows.
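
Below is a minimal sketch of the binomial test, assuming scipy is available; the function name and all numbers are hypothetical.

```python
# A minimal sketch of the binomial test; function name and numbers are made up.
from scipy.stats import binom

def binomial_critical(e0, m, alpha):
    """Critical test error rate for the hypothesis e <= e0."""
    # Smallest error count c whose left cumulative probability under
    # e = e0 reaches 1 - alpha; c/m is then the critical error rate.
    c = binom.ppf(1 - alpha, m, e0)
    return c / m

e0, m, alpha = 0.3, 20, 0.05
e_test = 0.25                      # measured test error rate (5 errors / 20)
critical = binomial_critical(e0, m, alpha)
print("critical value:", critical)
print("accept e <= e0" if e_test <= critical else "reject e <= e0")
```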

2. Hypothesis testing - t test

Often, to make the results more general and accurate, we run the estimation test many times. Suppose we obtain $k$ test error rates $e_1',\dots,e_k'$; we can then easily compute the mean error rate $\mu$ and the variance $\sigma^2$:

$$\mu=\frac{1}{k}\sum_{i=1}^{k}e_i', \qquad \sigma^2=\frac{1}{k-1}\sum_{i=1}^{k}(e_i'-\mu)^2.$$
Since the $k$ test error rates can be regarded as independent samples of the generalization error rate $e_0$, the statistic

$$T_t=\frac{\sqrt{k}\,(\mu-e_0)}{\sigma}$$

follows a t-distribution with $k-1$ degrees of freedom, where $e_0$ is the generalization error rate.

Next, just as in the binomial test, we set a significance level $\alpha$ and mark off an obvious-error region of probability $\frac{\alpha}{2}$ on each tail of the distribution curve.

Then, for the hypothesis $\mu=e_0$, compute the definite integral under the curve (the counterpart of the bar-chart accumulation in the binomial test); if $T_t$ falls inside the range corresponding to confidence level $1-\alpha$, i.e. $|T_t| < t_{\alpha/2}$, the hypothesis is acceptable.

Table of critical values commonly used for the two-sided t-test:

The overall process here is very similar to the binomial test. Apart from the formula and the distribution curve, the remaining steps carry over directly from the binomial test, so I won't repeat them; a short sketch follows.
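
Below is a minimal sketch of this t-test, assuming numpy/scipy; the $k$ error rates and $e_0$ are made-up numbers.

```python
# A minimal sketch of the one-sample t-test on k error rates; numbers are made up.
import numpy as np
from scipy.stats import t

e = np.array([0.21, 0.19, 0.23, 0.18, 0.20])  # k test error rates (hypothetical)
e0, alpha = 0.20, 0.05                        # hypothesized generalization error rate
k = len(e)
mu, sigma = e.mean(), e.std(ddof=1)           # mean and unbiased standard deviation

T_t = np.sqrt(k) * (mu - e0) / sigma          # t statistic, k-1 degrees of freedom
crit = t.ppf(1 - alpha / 2, df=k - 1)         # two-sided critical value
print("accept mu = e0" if abs(T_t) <= crit else "reject mu = e0")
```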

3. Cross-validation t-test

The above two methods mainly measure the performance of a single learner. To compare two different learners, the k-fold cross-validated t-test can be used.

For example, suppose the error rates of two learners A and B over $k$ tests are $e_1^A,\dots,e_k^A$ and $e_1^B,\dots,e_k^B$, where $e_i^A$ and $e_i^B$ are the results on the $i$-th fold of $k$-fold training. Take the difference of each pair of results: $\Delta_i=e_i^A-e_i^B$.

Assume that each test is independent of the others, so we can treat $\Delta_1,\dots,\Delta_k$ as $k$ independent samples of the performance difference. Compute the mean $\mu$ and variance $\sigma^2$ of this set of differences; we then have:

$$\text{the hypothesis is acceptable at level }\alpha \quad \text{s.t.} \quad T_t=\left|\frac{\sqrt{k}\,\mu}{\sigma}\right| < t_{\frac{\alpha}{2}}$$
The idea of this process is again to set a significance level $\alpha$ and then check whether the statistic falls within the acceptable range at that confidence level; see the sketch below.
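
A minimal sketch of the k-fold cross-validated t-test, assuming numpy/scipy; the per-fold error rates of A and B are made-up numbers.

```python
# A minimal sketch of the k-fold cross-validated paired t-test; numbers are made up.
import numpy as np
from scipy.stats import t

eA = np.array([0.20, 0.22, 0.18, 0.25, 0.21])  # learner A over k=5 folds (hypothetical)
eB = np.array([0.24, 0.21, 0.23, 0.26, 0.25])  # learner B over the same folds
d = eA - eB                                    # Delta_i = e_i^A - e_i^B
k = len(d)

T_t = abs(np.sqrt(k) * d.mean() / d.std(ddof=1))
crit = t.ppf(1 - 0.05 / 2, df=k - 1)
print("same performance" if T_t < crit else "different performance")
```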

In practice, however, the training sets of different folds overlap heavily, so the individual tests are not truly independent of each other.

We can roughly reason as follows: if the training and test data overlap across folds, the $k$ fold results will be correlated and similar. If we still treat these results as mutually independent, that is, if we treat a set of results that are actually "stuck together" as if they were not, then the acceptance region of our hypothesis becomes too large, and we overestimate the probability that the hypothesis holds.

So we can use "5×2 cross-validation" to alleviate this problem. "5×2" means performing 2-fold cross-validation 5 times, shuffling the data before each 2-fold split so that the partitions do not repeat. The basic idea is as follows:

  • For two learners A and B, each round of 2-fold cross-validation produces two pairs of test error rates. Taking their differences gives the difference $\Delta_i^1$ on the first fold and the difference $\Delta_i^2$ on the second fold;

  • To reduce the dependence between rounds, the mean uses only the two differences from the first round of 2-fold cross-validation, $\mu'=0.5(\Delta_1^1+\Delta_1^2)$, while a variance $\sigma_i^2=\left(\Delta_i^1-\frac{\Delta_i^1+\Delta_i^2}{2}\right)^2+\left(\Delta_i^2-\frac{\Delta_i^1+\Delta_i^2}{2}\right)^2$ is computed for every round;

  • The resulting statistic is:

    $$T_t'=\frac{\mu'}{\sqrt{0.2\sum_{i=1}^{5}\sigma_i^2}}$$

    which follows a t-distribution with 5 degrees of freedom.

Why does this alleviate the independence problem between test results? It's actually quite simple: the mean uses only the pair of error rates from the first round, and each round's variance is computed within that round, so the statistic depends far less on the correlations between rounds. But I think this method is only convincing if the first round's results are themselves representative. (This is my personal view; corrections are welcome.) A sketch:
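
A minimal sketch of the 5×2 cross-validated t-test, assuming numpy/scipy; all numbers are made up, and `d[i, j]` holds the difference on fold $j+1$ of round $i+1$.

```python
# A minimal sketch of the 5x2 cross-validated paired t-test; numbers are made up.
import numpy as np
from scipy.stats import t

# differences e^A - e^B on fold 1 and fold 2 of each of the 5 rounds (hypothetical)
d = np.array([[0.02, 0.03],
              [0.01, 0.04],
              [0.03, 0.02],
              [0.02, 0.02],
              [0.04, 0.01]])

mu = 0.5 * (d[0, 0] + d[0, 1])                 # mean from the first round only
round_means = d.mean(axis=1, keepdims=True)
sigma2 = ((d - round_means) ** 2).sum(axis=1)  # variance of each round

T_t = mu / np.sqrt(0.2 * sigma2.sum())
crit = t.ppf(1 - 0.05 / 2, df=5)               # two-sided, 5 degrees of freedom
print("same performance" if abs(T_t) <= crit else "different performance")
```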

4. McNemar test

The main idea of this testing method is to build a contingency table from the classification results of two binary classifiers, derive a chi-square statistic for the performance difference between the two learners, and then run a hypothesis test. The method's steps, roughly:

  • First, list the contingency table of the binary classification results of the two learners A and B, whose off-diagonal cells $e_{01}$ and $e_{10}$ count the samples on which exactly one of the two learners is correct (one count for each direction):

  • The test then considers the variable $|e_{01}-e_{10}|$, which should obey a normal distribution if the two learners perform the same; the test statistic is

    $$T_{\chi^2}=\frac{(|e_{01}-e_{10}|-1)^2}{e_{01}+e_{10}}$$

  • This statistic follows a chi-square distribution with 1 degree of freedom, and the remaining hypothesis-testing steps are the same as above: make a hypothesis, set the confidence level, derive the acceptable range, check whether the statistic falls within it, and draw a conclusion. A sketch follows this list.
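
A minimal sketch of the McNemar test, assuming scipy; the counts are made up.

```python
# A minimal sketch of the McNemar test; the counts are hypothetical.
from scipy.stats import chi2

e01, e10 = 12, 5   # samples on which exactly one learner is correct (hypothetical)
T = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)   # the continuity-corrected statistic
crit = chi2.ppf(1 - 0.05, df=1)               # chi-square critical value, 1 dof
print("same performance" if T <= crit else "different performance")
```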

5. Friedman test and Nemenyi follow-up test

The cross-validated t-test and the McNemar test are only suitable for pairwise comparisons. When comparing learners trained by multiple algorithms, comparing them pair by pair is obviously too cumbersome, so we introduce a test method that can compare multiple algorithms on the same set of data sets at once.

The basic idea of the Friedman test is to sort the learners by their test results on the same set of data sets and assign ordinal values 1, 2, 3, ...; learners with identical results share the average of their ordinal values, as in the example table in the Xigua book.

The better the performance, the smaller the assigned value. Then compute the average ordinal value $r_i$ of each algorithm over all data sets.

Note that the data in the table is just a special case. A smaller average ordinal value does not by itself mean the algorithm is better, because algorithm A may beat algorithm B on one data set and lose on another, and two algorithms may also perform similarly on the same data set. So we need a test threshold to determine which algorithm is better.

Suppose $k$ algorithms are compared on $N$ data sets. The mean and variance of $r_i$ are $(k+1)/2$ and $(k^2-1)/12N$ respectively, which yields the statistics (derivation omitted):

$$T_{\chi^2}=\frac{12N}{k(k+1)}\left(\sum_{i=1}^{k}r_i^2-\frac{k(k+1)^2}{4}\right)$$

$$T_F=\frac{(N-1)\,T_{\chi^2}}{N(k-1)-T_{\chi^2}}$$

$T_{\chi^2}$ is the statistic of the original Friedman test; $T_F$ is the commonly used variant, which follows an F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom. For this test the Xigua book provides a clear table of critical values:

The acceptability test proceeds exactly as above; the sketch below runs through the computation.
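
A minimal sketch of the Friedman test, assuming numpy/scipy; the rank matrix ($N=4$ data sets, $k=3$ algorithms) is made up.

```python
# A minimal sketch of the Friedman test; the rank matrix is hypothetical.
import numpy as np
from scipy.stats import f

ranks = np.array([[1.0, 2.0, 3.0],
                  [1.0, 2.5, 2.5],
                  [1.0, 2.0, 3.0],
                  [2.0, 1.0, 3.0]])   # ordinal values on each data set (made up)
N, k = ranks.shape
r = ranks.mean(axis=0)                # average ordinal value r_i of each algorithm

T_chi2 = 12 * N / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4)
T_F = (N - 1) * T_chi2 / (N * (k - 1) - T_chi2)
crit = f.ppf(1 - 0.05, dfn=k - 1, dfd=(k - 1) * (N - 1))
print("all algorithms perform the same?", T_F <= crit)
```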

If the hypothesis that all algorithms perform the same is rejected, a follow-up (post-hoc) test is needed to compare the algorithms further. The previous steps only tell us that not all of the $k$ algorithms perform the same; now we need to find the learners whose performance actually differs and decide which is better.

The following introduces the specific process of the Nemenyi test algorithm:

The basic principle is to compare the difference between the average ordinal values of two algorithms against the CD (critical difference) value. If the difference in means exceeds the CD value, the performance of the two learners differs significantly.

The CD value is computed as:

$$CD=q_\alpha\sqrt{\frac{k(k+1)}{6N}}$$
where the table of $q_\alpha$ values is as follows:


If the difference between two average ordinal values is greater than the CD value, then the algorithm with the smaller average ordinal value is, to that extent, better than the algorithm with the larger one.

So, looking back: the average ordinal value is only one point of reference. We also need the Friedman test and the Nemenyi follow-up test before we can rigorously use average ordinal values to measure the quality of algorithms. A sketch of the follow-up test:
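
A minimal sketch of the Nemenyi computation, continuing the made-up Friedman example above; the $q_\alpha$ value is an assumed table entry for $k=3$, $\alpha=0.05$ and should be checked against the book's table.

```python
# A minimal sketch of the Nemenyi follow-up test; numbers are hypothetical.
import numpy as np

r = np.array([1.25, 1.875, 2.875])   # average ordinal values from the sketch above
N, k = 4, 3
q_alpha = 2.344                      # assumed q-table value for k=3, alpha=0.05

CD = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
diffs = np.abs(r[:, None] - r[None, :])
print("CD =", round(CD, 3))
print("pairs with significant difference:", np.argwhere(diffs > CD))
```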

Next section: Machine Learning Notes 3 - Linear Regression
