Statistical Learning Methods (Li Hang), Chapter 1: Introduction to Statistical Learning Methods, Notes 1


0 Categories of machine learning

Machine learning is a method that, given a training data set, learns rules and regularities from it, usually builds a model with many parameters, and uses this model to predict the output for a new input instance.

Note: Not all machine learning methods require a model.

0.1 Supervised learning

The training data set for supervised learning consists of pairs of input data (usually a vector) and expected output (also called the label value). The output of the model can be a continuous value (regression) or a class label (classification).
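A minimal sketch of both cases, on made-up toy data, assuming NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # toy input vectors
y_cont = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)  # continuous labels: a regression target
y_cls = (X[:, 0] > 0).astype(int)                    # class labels: a classification target

reg = LinearRegression().fit(X, y_cont)   # regression: predicts a continuous value
clf = LogisticRegression().fit(X, y_cls)  # classification: predicts a class label
print(reg.predict(X[:1]), clf.predict(X[:1]))
```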

A typical application scenario is handwriting recognition (a classification problem).

0.2 Unsupervised learning

The training data set consists of training data without label values; the model needs to cluster or group the input data.

The main applications of unsupervised learning include: clustering, association rules, and dimensionality reduction.

A typical application scenario is image noise reduction.
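As a minimal sketch of the clustering case (toy two-blob data, scikit-learn assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# unlabeled data drawn around two made-up centers
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])

# the model groups the inputs without ever seeing a label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])
```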

0.3 Semi-supervised learning

As the name implies, the training data set for semi-supervised learning consists of two kinds of data: (1) a small portion of labeled data, i.e. pairs of input data (usually a vector) and expected output (label value); (2) a much larger portion of unlabeled training data.

0.4 Reinforcement learning

Reinforcement learning describes and solves the problem of an agent learning a policy while interacting with an environment, so as to maximize its return or achieve a specific goal.

Reinforcement learning emphasizes how to act based on the environment so as to maximize the expected benefit.


The main content of Li Hang's book "Statistical Learning Methods" is the classification problem in supervised learning.

Required background knowledge:

(1) Mathematical analysis/advanced mathematics: integral, differential, function extreme value, etc.;

(2) Linear algebra/matrix analysis: matrix operations, matrix derivatives, etc.;

(3) Probability statistics: common distribution, conditional distribution, etc.;

(4) Programming language: able to understand simple programs (assignment, operation, loop, condition).


1 Introduction to statistical learning methods

1.1 The steps of supervised learning

1. Obtain a limited training data set ;

2. Determine the hypothesis space of the model, that is, the set of all candidate models;

3. Determine the criteria for model selection, that is, learning strategies ;

4. Implement an algorithm for solving for the optimal model;

5. Select the optimal model through learning methods ;

6. Use the learned optimal model to predict or analyze new data .

As shown in the figure below (pay attention to the correspondence of the symbols):
[Figure: the supervised learning framework: training set -> learning system -> model $\hat f$ or $\hat P(Y|X)$ -> prediction system -> output $y_{N+1}$ for a new input $x_{N+1}$]
Training set: $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ contains N training pairs (N training examples), where the input variable $x_i$ is generally a multi-dimensional vector and $y_i$ is the label value. All input variables $x_i$ constitute the input space, and all label values $y_i$ constitute the output space. In addition there is a feature space, which is generally the same as the input space; if the input is transformed in some way, the result lives in the feature space. For example, if the input x is mapped to $(x, x^2, x^3)$, the resulting three-dimensional space is the feature space.
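A tiny sketch of that feature-space mapping, assuming NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])                  # points in the 1-D input space
features = np.stack([x, x**2, x**3], axis=1)   # mapped into the 3-D feature space (x, x^2, x^3)
print(features)   # each row is one input expressed in feature-space coordinates
```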

Two kinds of models (the original figure had a typo: min should be max). A probabilistic model predicts $y_{N+1} = \arg\max_{y} \hat P(y \mid x_{N+1})$; a non-probabilistic model predicts $y_{N+1} = \hat f(x_{N+1})$.

1.2 Three elements of statistical learning

(1) Model: ① A decision-function model: the hypothesis space $\mathcal F$ consists of multiple candidate decision functions f, where X is the input space, Y is the output space, and θ is the parameter of the model; each θ corresponds to one candidate model f. ② A conditional-probability model: $\mathcal F$ consists of multiple conditional probability distributions P, where $P_\theta(Y \mid X)$ is the conditional probability distribution of the output Y given the input X.
$\mathcal F = \{ f \mid Y = f_\theta(X),\ \theta \in \mathbf R^n \}$ or $\mathcal F = \{ P \mid P_\theta(Y \mid X),\ \theta \in \mathbf R^n \}$
Example: the input is x and the output is y, and the hypothesis space is the set of one-dimensional linear functions, so $y = a_0 + a_1 x$. Here $\theta = (a_0, a_1)^T$.
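A minimal sketch of this parameterized family in code (the function name f is ours, for illustration):

```python
def f(x, theta):
    # one candidate model from the hypothesis space y = a0 + a1 * x
    a0, a1 = theta
    return a0 + a1 * x

# each choice of theta picks out one candidate model in the family F
print(f(2.0, theta=(1.0, 0.5)))   # -> 2.0
```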

(2) Strategy: how to evaluate the candidate models so as to select an optimal one from among them. The loss function is a function of the true value and the predicted value for each instance, so each instance in the data corresponds to one loss value.
Commonly used loss functions, where y is the true value and f(x) the prediction:

- 0-1 loss: $L(y, f(x)) = 1$ if $y \ne f(x)$, and $0$ otherwise;
- quadratic loss: $L(y, f(x)) = (y - f(x))^2$;
- absolute loss: $L(y, f(x)) = |y - f(x)|$;
- logarithmic loss: $L(y, P(y \mid x)) = -\log P(y \mid x)$.
For the N instances in the training data set there are N loss values in total. How do we judge them comprehensively so as to select the optimal model? There are two criteria, where $L(y_i, f(x_i))$ is the loss of the i-th instance:
Empirical risk: $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$. The first criterion, empirical risk minimization (ERM), solves $\min_{f \in \mathcal F} R_{emp}(f)$; the second is structural risk minimization, described next.
Structural risk minimization (SRM) is a strategy proposed to prevent over-fitting; it is equivalent to regularization. Structural risk adds to the empirical risk a regularization term (penalty term) that represents the complexity of the model. With the hypothesis space, the loss function, and the training data set fixed, structural risk is defined as:
$R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$
where $J(f)$ is the complexity of the model, a functional defined on the hypothesis space $\mathcal F$. The more complex the model f, the larger the complexity $J(f)$, and vice versa. That is, the complexity term penalizes complex models (simpler models are preferred, so the more complex the model, the greater the penalty). $\lambda \ge 0$ is a coefficient that trades off empirical risk against model complexity. A small structural risk requires the empirical risk and the model complexity to be small at the same time. Models with low structural risk tend to predict well on both the training data and unknown test data.

The strategy of structural risk minimization holds that the model with the smallest structural risk is the optimal model. Seeking the optimal model therefore amounts to solving the optimization problem:
$\min_{f \in \mathcal F} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$
In this way, the supervised learning problem becomes a problem of optimizing the empirical or structural risk function; the empirical or structural risk function is then the objective function of the optimization.
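A minimal numeric sketch of the two risks, assuming the quadratic loss and taking $J(f)$ to be the squared norm of the parameters (both choices are ours, for illustration):

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    # (1/N) * sum of quadratic losses (y_i - f(x_i))^2
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def structural_risk(y_true, y_pred, complexity, lam=0.1):
    # empirical risk + lambda * J(f)
    return empirical_risk(y_true, y_pred) + lam * complexity

y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
theta = np.array([0.5, 1.0])   # made-up model parameters
print(structural_risk(y_true, y_pred, complexity=np.sum(theta ** 2)))
```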

(3) Algorithm: the concrete computational method used to learn (solve for) the model.

1.3 Model evaluation

There are two types of errors, the training error and the test error:

Training error: $R_{emp}(\hat f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i))$ over the N training examples; test error: $e_{test} = \frac{1}{N'} \sum_{i=1}^{N'} L(y_i, \hat f(x_i))$ over the N' test examples.

The model should not only have a small error on the training data set; what matters more is its error on the test data set, and a model with a smaller test error should be evaluated more highly.

1.4 Cross validation

The purpose of cross-validation is to choose an appropriate model.

If the given sample data is sufficient, a simple method for model selection is to randomly divide the data set into three parts:

(1) Training set: used to train the model;

(2) Validation set: used for model selection; among the learned models of different complexity, select the one with the smallest prediction error on the validation set;

(3) Test set: used for the final evaluation of the learning method.

When the validation set has enough data, using it to select the model is effective. In practical applications, however, data is usually insufficient, so in order to select a good model the cross-validation method can be used.

The basic idea of cross-validation is to reuse the data: split the given data, combine the splits into training and test sets, and on that basis train, test, and select models repeatedly.

1.4.1 Simple cross-validation

First, randomly divide the given data into two parts, one as the training set and the other as the test set; for example, 70% of the data as the training set and 30% as the test set. Then use the training set to train models under various conditions (such as different numbers of parameters) to obtain different models, evaluate the test error of each model on the test set, and select the model with the smallest test error.
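A minimal sketch of this 70/30 split on made-up toy data, assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

# 70% training set, 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("test error:", 1.0 - model.score(X_test, y_test))
```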

1.4.2 S-fold cross-validation

First, randomly divide the given data into S disjoint subsets of the same size.

Use S-1 subsets of data to train the model, and use the remaining subset to test the model.

Repeat the above process for S possible choices. Finally, the model with the smallest average test error in the S evaluations is selected.
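A minimal sketch with S = 5, on the same kind of toy data, assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

# split into 5 disjoint subsets; train on 4, test on the held-out one, repeat 5 times
scores = cross_val_score(LogisticRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("average test accuracy over the 5 folds:", scores.mean())
```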

1.4.3 Leave-one-out cross-validation

The special case of S-fold cross-validation with S = N (N being the size of the given data set) is called leave-one-out cross-validation; it is often used when data is scarce.
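The same sketch with S = N (every example held out once), again on toy data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))              # a small toy data set, N = 30
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# S = N: each of the N examples serves once as the single-element test set
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())
```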

1.5 Generalization ability

The generalization ability of a learning method refers to the predictive ability, on unknown data, of the model learned by that method; it is an essential property of the learning method. In practice, the most widely used approach is to evaluate the generalization ability of a learning method by the test error. This evaluation depends on the test data set, and because the test data set is finite, the evaluation result may well be unreliable. Statistical learning theory attempts to analyze the generalization ability of learning methods theoretically.

First, the definition of generalization error: if the learned model is $\hat f$, then the error of using this model to predict unknown data is the generalization error:
$R_{exp}(\hat f) = E_P[L(Y, \hat f(X))] = \int_{\mathcal X \times \mathcal Y} L(y, \hat f(x))\, P(x, y)\, dx\, dy$
The generalization error reflects the generalization ability of the learning method: if the model learned by method A has a smaller generalization error than the model learned by method B, then method A is more effective. In fact, the generalization error is exactly the expected risk of the learned model.

The generalization ability of learning methods is often analyzed by studying a probabilistic upper bound on the generalization error, called the generalization error bound for short. Concretely, two learning methods are compared through their generalization error bounds. The generalization error bound typically has the following properties: it is a function of the sample size N, and as N grows the bound tends to 0; it is also a function of the capacity of the hypothesis space, i.e. the number of candidate models d; the larger the capacity d, the harder the model is to learn, and the larger the generalization error bound.

The following is the generalization error bound (for binary classification over a finite hypothesis space):
With probability at least $1 - \delta$, for every $f \in \mathcal F$: $R(f) \le \hat R(f) + \varepsilon(d, N, \delta)$, where $\varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\left(\log d + \log\frac{1}{\delta}\right)}$
N is the number of training examples, d is the number of functions in the hypothesis space, and the bound holds with probability at least $1 - \delta$. The meaning of the inequality: [the training error $\hat R(f)$ of the candidate model f] + ε gives [an upper bound on the generalization error of that candidate model]. The generalization error is the expected risk of the learned model, and the generalization ability of a learning method is usually evaluated by the test error.
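A small numeric sketch of this bound (the numbers are made up):

```python
import math

def generalization_bound(train_error, d, N, delta=0.05):
    # eps = sqrt((log d + log(1/delta)) / (2N)); R(f) <= train_error + eps,
    # holding with probability at least 1 - delta
    eps = math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))
    return train_error + eps

# a model with 10% training error, hypothesis space of d = 100 functions, N = 1000 examples
print(generalization_bound(0.10, d=100, N=1000))
```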

1.6 Generative models and discriminative models

The generative method learns P(X, Y) from the data, i.e. the joint probability distribution of X and Y, and then obtains the conditional probability distribution P(Y|X) as the predictive model; this is the generative model. It is called a "generative" method because the model represents the generative relationship by which a given input X produces the output Y. Typical generative models are: the naive Bayes method and the hidden Markov model.

The discriminative method directly learns the decision function f(X) or the conditional probability distribution P(Y|X) from the data as the predictive model; this is the discriminative model. The discriminative method is concerned with what output Y should be predicted for a given input X. Typical discriminative models include: the k-nearest neighbor method, the perceptron, decision trees, logistic regression, the maximum entropy model, support vector machines, boosting methods, and conditional random fields.
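A minimal side-by-side sketch (GaussianNB as a stand-in generative model, LogisticRegression as a discriminative one; the data is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X, Y) = P(Y) P(X|Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X) directly

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

gen = GaussianNB().fit(X, y)
disc = LogisticRegression().fit(X, y)
# both end up providing P(Y|X) for prediction, but arrive at it differently
print(gen.predict_proba(X[:1]), disc.predict_proba(X[:1]))
```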

1.7 Classification problems

Here is the confusion matrix:
An instance is counted as TP (true positive) if it is actually positive and predicted positive; FN (false negative) if actually positive but predicted negative; FP (false positive) if actually negative but predicted positive; and TN (true negative) if actually negative and predicted negative.

1.7.1 Accuracy

Accuracy is the proportion of all predictions, positive and negative alike, that are correct:
$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$

1.7.2 Precision

Precision (also called PPV, positive predictive value): the proportion of instances predicted positive that are actually positive:
$P = \frac{TP}{TP + FP}$

1.7.3 Recall rate

Recall (also called sensitivity, or TPR, true positive rate): the proportion of actually positive instances that are correctly predicted positive:
$R = \frac{TP}{TP + FN}$

1.7.4 F1

The F1 value is the harmonic mean of the precision P and the recall R; the larger F1 is, the better the model.
$\frac{2}{F_1} = \frac{1}{P} + \frac{1}{R}, \qquad F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}$
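All four metrics computed from the confusion-matrix counts, as a small sketch (the counts are made up):

```python
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```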

1.8 Labeling problems

In the labeling (tagging) problem, the input and the output are both sequences, and the two have the same length:
Input: an observation sequence $x = (x^{(1)}, x^{(2)}, \dots, x^{(n)})^T$; output: a tag sequence $y = (y^{(1)}, y^{(2)}, \dots, y^{(n)})^T$.
For example, information extraction: extracting basic noun phrases from English articles. To this end the article is tagged. An English word is an observation, and an English sentence is an observation sequence. The tags mark the "beginning", "end", or "other" of a noun phrase (denoted B, E, and O respectively), and the tag sequence indicates the locations of the basic noun phrases in the English sentence. During extraction, the words from a "beginning" tag through an "end" tag are taken as a noun phrase. For example, given the following observation sequence, i.e. an English sentence, the tagging system produces the corresponding tag sequence, i.e. the basic noun phrases in the sentence:
[Figure: an example English sentence together with its B/E/O tag sequence]

1.9 Regression problem

Regression is an important problem in supervised learning. It is used to predict the relationship between input variables (independent variables) and output variables (dependent variables), in particular how the value of the output variable changes when the value of the input variable changes.

The regression model is a function that represents the mapping from input variables to output variables. Learning a regression problem is equivalent to function fitting: choose a function curve that fits the known data well and predicts unknown data well.

The regression problem is divided into two processes, learning and prediction. First, a training data set is given:
$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$
Here $x_i \in \mathbf R^n$ is an input vector and $y_i \in \mathbf R$ is the corresponding output label, i = 1, 2, ..., N. The learning system builds a model from the training data, i.e. a function Y = f(X); for a new input $x_{N+1}$, the prediction system produces the corresponding output $y_{N+1}$ according to the learned model Y = f(X).

According to the number of input variables, regression is divided into simple (one-variable) regression and multiple regression; according to the type of relationship between the input and output variables, i.e. the type of model, it is divided into linear regression and nonlinear regression.

The most commonly used loss function in regression learning is the square loss function; in this case the regression problem can be solved by the method of least squares.
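A minimal least-squares sketch for the one-variable linear case $y = a_0 + a_1 x$ (the data points are made up), assuming NumPy:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up inputs
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # made-up outputs, roughly linear in x

# least squares: minimize sum_i (y_i - a0 - a1 * x_i)^2
A = np.stack([np.ones_like(x), x], axis=1)      # design matrix [1, x]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # theta = (a0, a1)
a0, a1 = theta
print(f"fitted model: y = {a0:.2f} + {a1:.2f} x")
```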


END
