"Statistical Learning Methods" (Li Hang): model evaluation selection, regularization and cross-validation, generalization ability, generative model and discriminative model, supervised learning application

PS: These are reading notes; for more detailed content, please buy the original book.

1.4 Model evaluation and selection

1.4.1 Training Error and Testing Error

Training error: the error of the model when predicting on the training set.

Test error: the error of the model when predicting on the test set.

1.4.2 Overfitting and model selection

Over-fitting: the model selected during learning contains so many parameters that it predicts the known (training) data well but the unknown data poorly.

As the complexity of the model increases, the training error gradually decreases and approaches 0, while the test error first decreases and then increases. When the complexity of the selected model is too high, the model tends to depend too heavily on the training data. Therefore, in learning it is necessary to choose a model of appropriate complexity. There are two commonly used methods for this: regularization and cross-validation.
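The following minimal sketch (synthetic data, numpy only, not from the book) makes this concrete: as the polynomial degree grows, the training error keeps falling while the test error eventually rises.

```python
# A minimal sketch (synthetic data, not from the book): polynomial fits of
# increasing degree. Training MSE keeps shrinking with model complexity,
# while test MSE first drops and then rises again (over-fitting).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 20)
x_test = rng.uniform(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 9):
    coef = np.polyfit(x_train, y_train, degree)      # least-squares fit
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```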

1.5 Regularization and cross-validation

1.5.1 Regularization

The typical method of model selection is regularization. Regularization implements the structural risk minimization strategy: a regularization term (penalty term) is added to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity; the more complex the model, the larger its value. For example, the regularization term can be the norm of the model parameter vector.

The regularized objective generally has the following form:

\min_{f\in\mathcal{F}}\ \frac{1}{N}\sum_{i=1}^N L(y_i,f(x_i))+\lambda J(f)

Here the first term is the empirical risk, the second term is the regularization term, and \lambda \geq 0 is the coefficient that adjusts the trade-off between the two.
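As a concrete instance (a sketch under the assumption of squared loss with J(f)=\|w\|^2, i.e., ridge regression; not the book's example), the objective has a closed-form minimizer:

```python
# A sketch under the assumption of squared loss with J(f) = ||w||^2
# (ridge regression; illustrative, not the book's example). Setting the
# gradient of (1/N)*sum_i (y_i - w.x_i)^2 + lambda*||w||^2 to zero gives
# w = (X^T X + N*lambda*I)^{-1} X^T y.
import numpy as np

def ridge_fit(X, y, lam):
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 50)

# Larger lambda penalizes complexity harder and shrinks the parameters.
for lam in (0.0, 0.1, 10.0):
    print(f"lambda={lam:5.1f}  ||w|| = {np.linalg.norm(ridge_fit(X, y, lam)):.3f}")
```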

1.5.2 Cross Validation

If the given sample data is sufficient, a simple method for model selection is to randomly divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used for model selection, and the test set is used for the final evaluation of the learning method.

1. Simple cross-validation

First, the given data is randomly divided into two parts, one used as the training set and the other as the validation set (for example, 70% of the data for training and 30% for validation). The model is then trained on the training set under various conditions (for example, with different numbers of parameters), evaluated on the validation set, and the model with the smallest validation error is selected.

2. S-fold cross-validation

First, the given training data is randomly divided into S disjoint subsets of the same size. The model is then trained on the data of S-1 subsets, and the remaining subset is used to validate the model. This process is repeated for each of the S possible choices of validation subset, and the model with the smallest average validation error is selected, as in the sketch below.
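A minimal S-fold cross-validation sketch (the callables `fit` and `error` are hypothetical placeholders, not from the book):

```python
# A minimal S-fold cross-validation sketch. The callables `fit` and
# `error` are hypothetical placeholders: fit(X, y) returns a trained
# model, error(model, X, y) returns its validation error.
import numpy as np

def s_fold_cv(X, y, S, fit, error, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), S)  # S disjoint subsets
    scores = []
    for i in range(S):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
        model = fit(X[train_idx], y[train_idx])         # train on S-1 folds
        scores.append(error(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))                       # average validation error
```

Running this for each candidate hyperparameter setting and keeping the setting with the smallest returned average error completes the model selection.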

1.6 Generalization ability

1.6.1 Generalization error

The generalization ability of a learning method refers to the predictive ability of the model learned by that method on unknown data. The most common approach in practice is to evaluate generalization ability by the error on a test set, but this evaluation depends on the particular test set, and when test data are limited the result may be unreliable.

The generalization error is the error of the learned model on unknown data. The smaller the generalization error, the stronger the generalization ability and the better the model performs on unknown data.

1.6.2 Generalization Error Upper Bound

A learning method will have an upper bound on the generalization error. By comparing the upper bounds of the generalization error of different learning methods, the generalization ability of different learning methods can be compared.

The upper bound on the generalization error for the binary classification problem:

Theorem 1.1 (Upper bound of the generalization error for binary classification) For the binary classification problem, when the hypothesis space is a finite set of functions \Gamma = \{f_1, f_2, \ldots, f_d\}, then for any function f\in \Gamma, the following inequality holds with probability at least 1-\delta, 0<\delta<1:

R(f)\leq \hat{R}(f)+\varepsilon(d,N,\delta)

where: \varepsilon(d,N,\delta)=\sqrt{\frac{1}{2N}\left(\log d+\log\frac{1}{\delta}\right)}

R(f) on the left side of the inequality is the generalization error, and the right side is its upper bound. In this upper bound, the first term \hat{R}(f) is the training error: the smaller the training error, the smaller the generalization error. The second term \varepsilon(d,N,\delta) is a monotonically decreasing function of the sample size N, tending to 0 as N approaches infinity; it is also of order \sqrt{\log d}, so the more functions the hypothesis space contains, the larger its value.
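A quick numeric check (assuming the \frac{1}{2N} form above) shows both behaviors: \varepsilon shrinks as N grows and grows with d.

```python
# Numeric check of the bound (assuming the 1/(2N) form above):
# epsilon shrinks as the sample size N grows and grows with the size d
# of the hypothesis space.
import math

def epsilon(d, N, delta):
    return math.sqrt((math.log(d) + math.log(1 / delta)) / (2 * N))

for N in (100, 1000, 10000):
    print(f"N={N:6d}  d=10: {epsilon(10, N, 0.05):.4f}"
          f"  d=1000: {epsilon(1000, N, 0.05):.4f}")
```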

1.7 Generative model and discriminative model

Supervised learning methods can be divided into generative methods and discriminative methods. The learned models are called generative and discriminative models, respectively.

The generative method learns the joint probability distribution P(X,Y) from the data and then derives the conditional probability distribution P(Y|X) as the prediction model, i.e., the generative model:

P(Y|X)=\frac{P(X,Y)}{P(X)}

Such methods are called generative because the model represents a generative relationship that produces an output Y given an input X. Typical generative models include Naive Bayes and Hidden Markov Models.
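A toy illustration of the generative route (made-up probabilities, not from the book): store P(X,Y) as a table, then recover P(Y|X) by dividing by the marginal P(X).

```python
# Toy illustration (made-up probabilities, not from the book): store the
# joint distribution P(X, Y) as a table, then recover P(Y|X) by dividing
# by the marginal P(X) = sum_y P(X, y). Rows are x values, columns y values.
import numpy as np

joint = np.array([[0.3, 0.1],   # P(X=0, Y=0), P(X=0, Y=1)
                  [0.2, 0.4]])  # P(X=1, Y=0), P(X=1, Y=1)
p_x = joint.sum(axis=1, keepdims=True)  # marginal P(X)
posterior = joint / p_x                 # P(Y|X) = P(X, Y) / P(X)
print(posterior)            # each row sums to 1
print(posterior.argmax(1))  # predicted Y for each x: argmax over columns
```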

The discriminative method directly learns the decision function f(X) or the conditional probability distribution P(Y|X) from the data. The discriminative method is concerned with what output Y should be predicted for a given input X. Typical discriminative models include: the k-nearest-neighbor method, the perceptron, decision trees, the logistic regression model, the maximum entropy model, support vector machines, boosting methods, and conditional random fields.

1.8 Supervised Learning Applications

1.8.1 Classification problem (classification)

Classification is a central problem in supervised learning. In supervised learning, when the output variable Y takes a finite number of discrete values, the prediction problem becomes a classification problem. At this point the input X can be either discrete or continuous. Supervised learning learns a classification model, called a classifier, from data. When there are more than two classes, it is called a multi-class classification problem.

The indicators for evaluating classifiers are accuracy, precision and recall.

For binary classification, each prediction falls into one of four cases:

TP (true positive): the number of positive instances predicted as positive.

FN (false negative): the number of positive instances predicted as negative.

FP (false positive): the number of negative instances predicted as positive.

TN (true negative): the number of negative instances predicted as negative.

Precision (the proportion of true positives among the instances predicted as positive): P=\frac{TP}{TP+FP}

Recall (the proportion of actual positive instances that are correctly predicted as positive): R=\frac{TP}{TP+FN}

In addition, there is the F_1 score, the harmonic mean of precision and recall: \frac{2}{F_1}=\frac{1}{P}+\frac{1}{R}, that is, F_1=\frac{2TP}{2TP+FP+FN}
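Putting the three formulas together (illustrative counts, not from the book):

```python
# Computing the three metrics directly from the four counts
# (illustrative numbers, not from the book).
def classification_metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return precision, recall, f1

p, r, f1 = classification_metrics(tp=40, fn=10, fp=20, tn=30)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")  # P=0.667  R=0.800  F1=0.727
```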

1.8.2 Labeling problem (tagging)

The labeling problem can be thought of as a generalization of the classification problem. The input of the labeling problem is a sequence of observations, and the output is a sequence of labels or states.

1.8.3 Regression problem (regression)

Regression is another important problem in supervised learning. Regression is used to predict the relationship between input variables (independent variables) and output variables (dependent variables); specifically, how the value of the output variable changes when the value of the input variable changes. A regression model is simply a function that represents the mapping from input variables to output variables. Learning for regression problems is equivalent to function fitting: choosing a function curve that fits the known data well and predicts unknown data well.

According to the number of input variables, regression problems can be divided into unary regression and multiple regression; according to the type of relationship between input variables and output variables and the type of model, it can be divided into linear regression and nonlinear regression.
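A small function-fitting sketch (synthetic data; least squares is assumed here as the fitting criterion for illustration):

```python
# Function-fitting sketch for regression (synthetic data; least squares
# is assumed here as the fitting criterion for illustration).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 30)  # a true line plus noise

slope, intercept = np.polyfit(x, y, 1)      # degree-1 least-squares fit
print(f"fitted: y = {slope:.2f} * x + {intercept:.2f}")
print("prediction at x = 5:", np.polyval([slope, intercept], 5.0))
```

Here the fitted line is the function curve; multiple regression and nonlinear regression generalize the same idea to more input variables and richer function classes.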
