Overview of Machine Learning Theory 2

6. We usually assume that test samples are also drawn i.i.d. from the true underlying distribution (the same distribution as the data as a whole). When splitting data into a training set and a test set, two points deserve attention. First, the test set should be mutually exclusive with the training set as far as possible: test samples should not appear in the training set and should not have been used during training. Second, the split should preserve the consistency of the data distribution as far as possible, to avoid extra bias introduced by the splitting process affecting the final result; in practice this means using stratified sampling where feasible.
In reality, however, neither point can be achieved completely and perfectly. A variety of strategies for splitting the data set have therefore been developed, including the hold-out method, cross-validation, and the bootstrap method.
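The three splitting strategies can be sketched as follows; this is a minimal illustration using only the standard library (in practice one would typically use a library such as scikit-learn):

```python
import random

def holdout_split(data, test_ratio=0.3, seed=0):
    """Hold-out: shuffle once, then cut into two mutually exclusive sets."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(data) * (1 - test_ratio))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

def kfold_indices(n, k=5):
    """k-fold cross-validation: k disjoint folds, each used once as the test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def bootstrap_sample(data, seed=0):
    """Bootstrap ("self-service" sampling): draw n samples with replacement;
    the items never drawn (about 36.8% on average) serve as the test set."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]
    seen = set(picked)
    return [data[i] for i in picked], [data[i] for i in range(n) if i not in seen]
```

Note how each strategy trades off differently: hold-out wastes data, cross-validation is costlier, and the bootstrap changes the training distribution slightly.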

7. To compare the performance of two learners, several metrics have been introduced, such as error rate, accuracy, precision, recall, and F1. To make such comparisons more intuitive, several kinds of plots are used as well, such as the P-R curve and the ROC curve. P31
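These basic metrics all follow from the four confusion-matrix counts; a minimal sketch for binary labels (1 = positive):

```python
def classification_metrics(y_true, y_pred):
    """Error rate, accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0      # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"error": 1 - accuracy, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}
```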

8. When drawing the ROC curve, we sort the test samples by the learner's predicted likelihood of being positive: the sample "most likely" to be positive at the top, and the sample "least likely" to be positive at the bottom. Classification is then equivalent to splitting this ranking at some cut-off point: everything before the cut-off is predicted positive, and everything after it negative. The advantage is that different application tasks can place the cut-off differently according to their requirements: if precision matters more, choose a cut-off higher up the ranking; if recall matters more, choose one further down.
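This ranking-plus-cutoff procedure can be sketched directly: sweeping the cut-off down the sorted list adds one (FPR, TPR) point per sample. A simplified sketch that assumes untied scores and at least one sample of each class:

```python
def roc_points(y_true, scores):
    """Sort by score descending; each prefix of the ranking is predicted
    positive, yielding one (FPR, TPR) point per cut-off position."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]          # cut-off above everything: nothing positive
    for _, label in pairs:
        if label == 1:
            tp += 1                # true positive gained: curve moves up
        else:
            fp += 1                # false positive gained: curve moves right
        points.append((fp / neg, tp / pos))
    return points
```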

9. With the "cost-sensitive error rate", we want to select the learner that minimizes the "total cost". Under such unequal costs, the ROC curve cannot directly reflect a learner's expected total cost, so the cost curve is introduced. P35
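A minimal sketch of the cost-sensitive error rate with hypothetical cost parameters; a full cost matrix cost_ij (cost of classifying class i as class j) generalizes these two numbers to arbitrary class pairs:

```python
def cost_sensitive_error(y_true, y_pred, cost_miss=1.0, cost_false_alarm=5.0):
    """Average cost per sample: cost_miss for predicting negative on a true
    positive, cost_false_alarm for predicting positive on a true negative.
    Correct predictions cost nothing."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += cost_miss
        elif t == 0 and p == 1:
            total += cost_false_alarm
    return total / len(y_true)
```

With equal costs this reduces to the ordinary error rate; with unequal costs, two learners with the same error rate can have very different total costs.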

10. In fact, comparing the performance of two models is far more complicated than most people think, because several important factors are involved. First, what we want to compare is generalization performance, but what experimental evaluation gives us is only performance on a test set; the two comparisons need not agree. Second, test-set performance depends heavily on the choice of the test set itself: test sets of different sizes give different results, and even test sets of the same size give different results if they contain different samples. Third, many algorithms have some inherent randomness; even with the same parameter settings and the same test set, multiple runs can produce different results.
It is precisely because of so many uncertain factors that we turn to statistics for help.
Generally speaking, we first need to verify, in a statistical sense, whether the measured performance of a single learner is reliable, and then verify, again in a statistical sense, whether learner A really performs better than learner B.

To establish that there is a real difference between two models, we perform a statistical significance test. What we ultimately hope to obtain is a statement such as: after testing, classifier A is indeed better than classifier B, and the error of this statement is within plus or minus 4%.
In general, for practical reasons (mainly computational overhead), the training set we use is not the entire data set. This brings various problems (fitting problems, unequal distributions, and so on), and various "remedies" have been proposed (different ways of splitting the data, different algorithmic improvements). The parameters we obtain from the training set may therefore not fully reflect the whole data set; that is, they may not be trustworthy.
So before actually evaluating a model based on these parameters, we must first make sure the parameters themselves are credible!
The underlying idea is to take seriously the possibility that the evaluation parameters may be wrong.
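As a concrete illustration of such a significance test, here is a minimal paired t-test sketch on per-fold error rates of two learners. The fold errors below are made-up numbers, and the final step (comparing |t| against a critical value from a t-table at k-1 degrees of freedom) is left out:

```python
import math

def paired_t_statistic(errors_a, errors_b):
    """t statistic over per-fold error differences of two learners evaluated
    on the same k folds. |t| near 0 suggests no significant difference."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # unbiased variance
    return mean / math.sqrt(var / k)
```

A strongly negative t here would suggest learner A's error is systematically lower than learner B's across folds; note that folds from one cross-validation overlap, so refinements such as the 5x2 cross-validated t-test are often preferred.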

11. The linear model is simple in form and easy to build, yet it contains some of the important basic ideas of machine learning. Many more powerful nonlinear models can be obtained by introducing hierarchical structure or high-dimensional mappings on top of the linear model.
The linear model is generally written in vector form: f(x) = wᵀx + b. For example: f_good_melon(x) = 0.2·x_color + 0.5·x_root + 0.3·x_knock + 1.
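A minimal sketch of evaluating such a linear model, with the melon weights from the example above treated as given constants:

```python
def linear_model(w, b, x):
    """f(x) = w^T x + b for a weight vector w and feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# The "good melon" scorer from the example: weights for color, root, knock.
w_melon, b_melon = [0.2, 0.5, 0.3], 1.0
```

The weights show each attribute's contribution directly (root matters most here), which is why linear models are prized for interpretability.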

12. To convert discrete nominal attributes into continuous values, consider two cases. First, if the attribute values have an inherent "order" relation, they can be mapped directly to continuous values; for example, {tall, short} can be converted to {1.0, 0.0}. Second, if there is no order relation among the attribute values, they are usually converted into k-dimensional vectors; for example, {watermelon, pumpkin, cucumber} can be converted to {1,0,0}, {0,1,0}, {0,0,1}.
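The two conversions can be sketched as:

```python
def ordinal_encode(value, order):
    """Ordered attribute -> evenly spaced value in [0, 1].
    E.g. ["short", "tall"] maps short -> 0.0, tall -> 1.0."""
    return order.index(value) / (len(order) - 1)

def one_hot_encode(value, categories):
    """Unordered attribute -> k-dimensional 0/1 (one-hot) vector."""
    return [1 if value == c else 0 for c in categories]
```

One-hot encoding avoids inventing a spurious order: mapping watermelon/pumpkin/cucumber to 0/1/2 would falsely tell the learner that pumpkin lies "between" the other two.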

13. Linear regression attempts to learn a linear model that predicts real-valued output labels as accurately as possible. The mean squared error is the most commonly used performance measure in regression tasks (it has a very intuitive geometric meaning), so among linear regression models we usually seek the one with the smallest mean squared error. Solving for the model by minimizing the mean squared error is called the "least squares method". The least squares method is of course very general and is not limited to linear regression.
In linear regression, the least squares method tries to find a straight line that minimizes the sum of the squared Euclidean distances from all samples to the line (i.e., the squared residuals).
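A minimal sketch of least squares for a single input variable, using the standard closed-form solution for w and b:

```python
def least_squares_fit(xs, ys):
    """Closed-form simple linear regression:
    w = sum((x - x_bar) * y) / (sum(x^2) - n * x_bar^2),  b = y_bar - w * x_bar.
    These are the stationary points of the mean squared error in w and b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w = sum((x - x_bar) * y for x, y in zip(xs, ys)) / (
        sum(x * x for x in xs) - n * x_bar ** 2)
    b = y_bar - w * x_bar
    return w, b
```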

14. Strategies for dealing with class imbalance.
The classification learning methods we have studied so far share a common basic assumption: the numbers of training examples of the different classes are comparable. A slight difference in class sizes usually matters little, but a large difference disturbs the learning process. In anomaly detection, for example, 99% of the data may be normal and only 1% abnormal, yet it is precisely that 1% of abnormal data that is valuable. If we train directly on such a set, the learned model is bound to be of little value, because the learner was not given enough minority-class examples to learn from. One could say the algorithm simply does not work here, but current learning algorithms have no fundamentally good solution for this kind of imbalance problem, so we can only improve things from the data side.
A basic strategy for class-imbalance learning is rescaling. In practice we make artificial adjustments to the training set (adding or removing samples) so that the classes in the training set become balanced.
We know the ideal assumption is that "the training set is an unbiased sample of the true population", but this assumption often does not hold in practice. The usual techniques for dealing with such problems are therefore oversampling (adding positive examples until the positive and negative counts are close) and undersampling (removing negative examples until the counts are close).
The time overhead of undersampling is usually much smaller than that of oversampling: undersampling discards many negative examples, so the classifier's training set is much smaller than the initial one, while oversampling adds many positive examples, making the training set larger than the initial one.
Note that oversampling must not simply duplicate the initial positive samples, which would cause severe overfitting; the usual approach is to generate additional positive examples by interpolating between positive examples in the training set (as in SMOTE). Likewise, undersampling must not randomly discard negative examples, since that may lose important information; the usual approach is ensemble learning: divide the large set of negative examples into subsets, each used by a different learner (as in EasyEnsemble).
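A sketch of interpolation-based oversampling in the spirit of SMOTE; this is a simplified illustration, not the real algorithm (true SMOTE interpolates each sample toward one of its k nearest minority-class neighbours, not toward an arbitrary pair):

```python
import random

def interpolate_oversample(positives, n_new, seed=0):
    """Create n_new synthetic positive feature vectors by linear interpolation
    between random pairs of existing positives, instead of duplicating them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(positives, 2)        # pick a random pair
        lam = rng.random()                     # random point on the segment a-b
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Each synthetic point lies on a segment between two real positives, so the minority class grows without exact repeats.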

15. Realities of machine learning:
1. Unless training is performed on every possible data point, there will always be multiple hypotheses with nonzero true error; that is, the learner cannot guarantee exact agreement with the target function.
2. Training samples are drawn at random, and randomly selected samples can always be misleading.
Therefore no machine learning algorithm can guarantee 100% accuracy, and so we introduce probability.
Our requirements for a learner are:
1. We do not require the learner to output a hypothesis with zero error, only that the error rate be bounded by some constant ε, which can be arbitrarily small.
2. We do not require the learner to succeed on every arbitrarily drawn sample, only that its probability of failure be bounded by some constant δ, which can be arbitrarily small.
In short, we only require the learner to probably learn a hypothesis that is approximately correct; this gives "probably approximately correct" learning, or PAC learning (the subject of computational learning theory).
A PAC-learnable learner must satisfy the following two conditions:
1. The learner must, with arbitrarily high probability, output a hypothesis with an arbitrarily low error rate.
2. The time of the learning process grows at most polynomially.
For PAC learning, the number of training samples and the computing resources required for learning are closely related.
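The relation between sample size and the accuracy and confidence parameters can be illustrated with the standard sample-complexity bound for a finite hypothesis class in the realizable case, where ε is the error bound and δ the failure probability from the conditions above:

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Standard PAC bound for a finite, realizable hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples suffice so that, with
    probability at least 1 - delta, a consistent learner outputs a
    hypothesis with true error at most epsilon."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)
```

Note the asymmetry: the required sample size grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε, so tightening the error bound is the expensive part.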
In summary, machine learning cannot find perfection in the philosophical sense, but it can work in a probabilistic sense. The fundamental reason is that the data set we use can never be the truly complete data set; what we can obtain is only an independent and identically distributed sample of it (and even the i.i.d. property is not complete in the strict sense; it too holds only probabilistically). PAC theory guarantees that machine learning is feasible in this probabilistic sense, and to reduce the impact of this gap as much as possible, you will see many countermeasures in specific algorithms.
Finally, one more remark: PAC can be seen as a theoretical framework and foundation for machine learning. The theory is somewhat old and controversial, but many people still work on it.

Origin blog.51cto.com/14492651/2550994