Day2 "Machine Learning" Chapter 2 Study Notes

  This chapter is essentially a theory chapter on model evaluation and comparison. I have some basic probability theory, but at first I did not understand much of it. The definitions of several formulas and the derivation of model error should be familiar: they were mentioned in my probability theory course, but only superficially, so my memory of them is a bit vague.

Day2 Chapter 2 Model Evaluation and Selection

2.1 Empirical error and overfitting

  Usually we call the ratio of the number of misclassified samples to the total number of samples the "error rate": if a of the m samples are misclassified, the error rate is E = a/m, and 1 - a/m is called the "accuracy". The difference between the learner's actual output and the sample's true output is called the "error". The error measured on the training set during learning is called the "training error" or "empirical error", while the error on new samples is called the "generalization error". Obviously, in machine learning we hope to obtain a learner with a small generalization error.
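  To make these definitions concrete, here is a minimal sketch with made-up numbers (m and a are purely illustrative):

```python
m = 100   # total number of samples (hypothetical)
a = 5     # number of misclassified samples (hypothetical)

error_rate = a / m        # E = a/m
accuracy = 1 - a / m      # accuracy = 1 - a/m

print(error_rate)  # 0.05
print(accuracy)    # 0.95
```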

  A good learner should, as far as possible, learn from the training samples the "general laws" that apply to all potential samples; note the emphasis on "general" here. This leads to two concepts: "overfitting" and "underfitting". Overfitting means the learner has learned the training samples too well, treating peculiarities of the training samples themselves as general properties of all potential samples. Underfitting, as the name suggests, means the learner has not captured the general properties of the training samples well enough. There are many causes of overfitting; the most common is that the learning ability is too strong, so that characteristics specific to the training samples are also learned, while underfitting is usually caused by insufficient learning ability. Underfitting is easy to overcome, but overfitting can only be mitigated, not avoided (if overfitting could be completely avoided, we could constructively prove "P=NP", so as long as we believe "P≠NP", overfitting is inevitable), which makes it the more troublesome of the two. A rough illustration of both is sketched below.
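  The following sketch illustrates the idea with polynomial fitting; the data, noise level, and polynomial degrees are arbitrary choices for illustration, not anything from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: a smooth underlying "general law" plus noise
# (the function and noise level are arbitrary illustrations).
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)                       # fit a polynomial of this degree
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)   # mean squared training error
    print(degree, round(float(train_err), 4))

# Degree 1 underfits (it cannot capture the general shape), while degree 15
# drives the training error toward zero by also fitting the noise, i.e. it overfits.
```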

2.2 Evaluation method

  In general, we can make the choice by estimating the learner's generalization error through experimental evaluation.

  To do this, we need a "testing set" to evaluate the learner's discriminative ability on new samples, and then use the "testing error" on the test set as an approximation of the generalization error. The test set is drawn from the same data set but should be kept separate from the training set as much as possible; that is, test samples should, as far as possible, not appear in the training set or have been used during training. By analogy: if a teacher hands out 10 exercises as homework and then uses the same 10 exercises as the exam questions, the exam obviously cannot reflect how well the students have actually learned. However, if all we have is a single data set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} containing m samples, and we need both to train and to test, we must process D appropriately to produce a training set S and a test set T from it. The following are common approaches:

  2.2.1 Hold-out method

       The "hold-out method" directly divides the data set D into two mutually exclusive sets, the training set S and the test set T, that is, D=S∪T, S∩T=∅. We usually use about 2/3~4/5 samples of the dataset for training and the remaining samples for testing.

  2.2.2 Cross-validation method

       "Cross validation" first divides the dataset D into k mutually exclusive subsets of similar size, that is, D=D 1 ∪D 2 ∪…∪D k , Di i ∪ D j =∅ (i≠ D j =∅ j), and each subset D i keeps the original data distribution as consistent as possible. Then, each time the union of k-1 subsets is used as the training set, and the remaining subset is used as the test set, so that k groups of training/testing sets can be obtained, so as to perform k training and testing, and the final return is k Therefore, we usually refer to cross-validation as "k-fold cross validation". A special case of this is k=m: Leave-One-Out (LOO).

  2.2.3 Bootstrapping method

       We would like to evaluate the model trained on D, but in the hold-out and cross-validation methods some samples are reserved for testing, so the training set actually used for evaluation is smaller than D, which inevitably introduces estimation bias caused by the difference in training-set size. A better solution is "bootstrapping", which is based directly on bootstrap sampling: sampling D with replacement m times produces a training set of m samples, so both the model actually evaluated and the model we ultimately want use m training samples, while roughly 1/3 of the samples in D never appear in the training set and can be used for testing. Such test results are also called "out-of-bag estimates". The bootstrap method is useful when the data set is small and it is hard to partition it effectively into training/test sets.
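  A minimal sketch of bootstrap sampling with NumPy; the data set size m is arbitrary, and the out-of-bag fraction comes out close to the roughly 1/3 mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                          # size of the original data set D (arbitrary)
D = np.arange(m)                  # represent samples by their indices

# Bootstrap sampling: draw m samples from D with replacement -> training set D'.
boot_idx = rng.integers(0, m, size=m)
in_train = np.unique(boot_idx)

# Samples never drawn form the "out-of-bag" set and can be used for testing.
oob = np.setdiff1d(D, in_train)
print(len(oob) / m)               # roughly (1 - 1/m)^m ≈ 1/e ≈ 0.368
```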

  2.2.4 Tuning parameters and final model

       Most learning algorithms have some parameters that need to be set, and the performance of the learned models varies significantly with different parameter configurations.
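       A minimal sketch of parameter tuning on a held-out validation split, assuming scikit-learn; the model, the parameter C, and the candidate grid are my own illustrative choices, not anything prescribed by the book:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Try a few candidate values of the regularization parameter C on the
# validation split and keep the best one; the candidate grid is arbitrary.
best_C, best_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc
print(best_C, best_acc)
```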

2.3 Performance Metrics

Evaluating the generalization performance of a learner requires not only an effective and feasible experimental estimation method, but also an evaluation standard for measuring the generalization ability, which is the performance measure.

  2.3.1 Error rate and precision

  2.3.2 Precision, Recall and F1

  2.3.3 ROC and AUC

       The full name of ROC is the "Receiver Operating Characteristic" curve, which originated from radar signal analysis techniques developed for enemy aircraft detection during World War II and was later introduced from detection applications into the field of machine learning. AUC (Area Under ROC Curve) can be obtained by summing up the areas of the parts under the ROC curve.
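  A minimal sketch of computing ROC points and AUC, assuming scikit-learn; the labels and scores are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted scores (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.60, 0.30])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points on the ROC curve
auc = roc_auc_score(y_true, scores)                # area under the ROC curve
print(auc)
```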

  2.3.4 Cost-sensitive error rate and cost curve

2.4 Comparison test (I don't fully understand this part yet; I will reread it a few more times later)

With experimental evaluation methods and performance metrics in place, we can measure a learner's performance on some metric using some experimental evaluation method and then compare the results. But the question is how to carry out this "comparison".

Statistical "hypothesis testing" provides an important basis for the performance comparisons between learners described above.

  2.4.1 Hypothesis testing

  Concepts from probability theory and statistics

  2.4.2 Cross-validation t-test

  2.4.3 McNemar test

  2.4.4 Friedman test and Nemenyi follow-up test

2.5 Bias and variance

  Bootstrap sampling has important uses in machine learning. ROC curves were introduced into machine learning in the late 1980s, and AUC has been widely used in machine learning since the mid-1990s.

  (This concludes the Chapter 2 notes; on to the subsequent chapters.)
