Li Hang, Statistical Learning Methods, Notes 1: Introduction

Preface

This series of notes mainly records the seven commonly used machine learning classification algorithms in "Statistical Learning Methods": the perceptron, KNN, naive Bayes, decision trees, logistic regression and the maximum entropy model, SVM, and boosting.

Three algorithms in the book concern probabilistic model estimation and tagging problems. They are not yet in my study plan, so these notes do not cover them: the EM algorithm, the hidden Markov model, and the conditional random field (CRF).

So this series comprises nine notes in total:
an introduction (corresponding to Chapter 1 of the book)
seven algorithm notes (corresponding to Chapters 2-8)
a summary (corresponding to Chapter 12)

Statistical Learning

Learning: Herbert A. Simon defined "learning" as follows: "If a system can improve its performance by executing some process, that is learning."
Statistical learning: statistical learning is statistical machine learning, in which a computer system improves its performance by using data and statistical methods. When people speak of machine learning today, they usually mean statistical machine learning.
The premise of statistical learning: the basic assumption statistical learning makes about data is that data of the same kind have certain statistical regularities. Because of these regularities, the data can be handled with probabilistic and statistical methods. For example, random variables describe the features of the data, and probability distributions describe their statistical regularities.
Statistical learning includes: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. The book focuses on supervised learning.

Supervised learning

Three kinds of tasks: a prediction problem in which both the input and output variables are continuous is called regression; one in which the output variable takes a finite number of discrete values is called classification; one in which both the input and the output are sequences of variables is called tagging (sequence labeling).
Supervised learning assumption: the input and output random variables \(X\) and \(Y\) are assumed to follow a joint probability distribution \(P(X, Y)\). During learning, this joint distribution is assumed to exist, and the training data and test data are regarded as generated independently and identically distributed according to \(P(X, Y)\).
Independent and identically distributed: random variables are independent and identically distributed if they all follow the same distribution and are mutually independent (the value of X1 does not affect the value of X2, and the value of X2 does not affect the value of X1).

Three elements of statistical learning, part 1: model

Model and hypothesis space: the first question in statistical learning is what kind of model to learn. In supervised learning, the model is the conditional probability distribution or decision function to be learned, and the hypothesis space contains all possible conditional probability distributions or decision functions.
Family of decision functions: the hypothesis space can be defined as a set of decision functions, \(\mathcal{F} = \{f \mid Y = f_\theta(X), \theta \in \mathbf{R}^n\}\).
Family of conditional probability distributions: the hypothesis space can also be defined as a set of conditional probability distributions, \(\mathcal{F} = \{P \mid P_\theta(Y \mid X), \theta \in \mathbf{R}^n\}\).
Probabilistic and non-probabilistic models: a model expressed as a decision function is a non-probabilistic model, and a model expressed as a conditional probability is a probabilistic model. Some models admit both interpretations and can be viewed either way. For simplicity, discussions of a model sometimes use only one of the two forms.

Three elements of statistical learning, part 2: strategy (loss function)

Common loss functions:
(1) 0-1 loss: \[L(Y, f(X)) = \left\{\begin{matrix} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{matrix}\right.\]
(2) squared loss: \(L(Y, f(X)) = (Y - f(X))^2\)
(3) absolute loss: \(L(Y, f(X)) = |Y - f(X)|\)
(4) logarithmic loss: \(L(Y, P(Y \mid X)) = -\log P(Y \mid X)\)
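The four losses listed above can be written as small Python functions. This is only an illustrative sketch: the function names are my own, and `log_loss` assumes it is handed the probability the model assigns to the observed label.

```python
import math

def zero_one_loss(y, y_pred):
    """0-1 loss: 1 if the prediction is wrong, 0 if it is right."""
    return 0 if y == y_pred else 1

def squared_loss(y, y_pred):
    """Squared loss: (Y - f(X))^2."""
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    """Absolute loss: |Y - f(X)|."""
    return abs(y - y_pred)

def log_loss(p_observed):
    """Logarithmic loss: -log P(Y | X), where p_observed is the
    probability the model assigns to the label that was observed."""
    return -math.log(p_observed)
```

A confident correct prediction has a log loss near 0, and the loss grows without bound as the probability assigned to the observed label approaches 0.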

Expected risk: the formula below is the average loss, in theory, of the model \(f(X)\) with respect to the joint distribution \(P(X, Y)\), and is called the expected risk. The goal of learning is to select the model with the smallest expected risk, but in practice the joint distribution \(P(X, Y)\) is unknown. If \(P(X, Y)\) were known, the conditional probability distribution \(P(Y \mid X)\) could be obtained from it directly, and there would be nothing to learn.
\[R_{exp}(f) = E_P[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x)) P(x, y) \, dx \, dy\]

Empirical risk minimization (ERM): the formula below is the model's average loss on the training data, called the empirical risk. By the law of large numbers, as the sample size \(N\) tends to infinity, \(R_{emp}(f)\) approaches \(R_{exp}(f)\), so the empirical risk is commonly used to estimate the expected risk. Empirical risk minimization holds that the model with the smallest empirical risk is the best model.
\[R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)), \\ f^* = \arg\min_{f \in \mathcal{F}} R_{emp}(f)\]
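As a rough illustration of empirical risk minimization, the sketch below computes the empirical risk of each model in a small finite hypothesis space and returns the minimizer. The threshold classifiers, the toy data, and all function names are assumptions made up for this example; they are not from the book.

```python
def empirical_risk(f, data, loss):
    """Average loss of model f over the (x, y) pairs in data."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

def erm(hypotheses, data, loss):
    """Return the hypothesis with the smallest empirical risk."""
    return min(hypotheses, key=lambda f: empirical_risk(f, data, loss))

def zero_one(y, y_pred):
    return 0 if y == y_pred else 1

def make_threshold(t):
    """Hypothetical classifier: predict +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

# Toy training set of (x, y) pairs, invented for illustration.
data = [(0.5, -1), (1.5, -1), (2.5, 1), (3.5, 1)]
hypotheses = [make_threshold(t) for t in (1.0, 2.0, 3.0)]

risks = [empirical_risk(f, data, zero_one) for f in hypotheses]
best = erm(hypotheses, data, zero_one)
```

Here the threshold 2.0 separates the toy data perfectly, so it is the model ERM selects.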

Structural risk minimization (SRM): in reality the training sample is limited, sometimes very small, so \(R_{emp}\) needs to be corrected. Structural risk minimization is a strategy proposed to prevent overfitting. It adds to the empirical risk a regularization term (penalty term) that represents model complexity, as in the formula below. \(J(f)\) is the complexity of the model and acts as a penalty on complex models; \(\lambda \geqslant 0\) is a coefficient that trades off the empirical risk against model complexity.
\[R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)\]
A small structural risk requires both the empirical risk and the model complexity to be small. Models with small structural risk tend to predict well on both the training data and unknown test data.
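To see how \(\lambda\) trades empirical risk against complexity, here is a minimal sketch that selects among three candidate models by structural risk. The candidate names, their empirical risks, and their complexity values are invented numbers used only for illustration.

```python
def structural_risk(emp_risk, complexity, lam):
    """R_srm = R_emp + lambda * J(f)."""
    return emp_risk + lam * complexity

# Hypothetical candidates: (name, empirical risk, complexity J(f)).
candidates = [
    ("degree-1", 0.30, 2),
    ("degree-3", 0.10, 4),
    ("degree-9", 0.01, 10),
]

def select(candidates, lam):
    """Name of the candidate with the smallest structural risk."""
    best = min(candidates, key=lambda c: structural_risk(c[1], c[2], lam))
    return best[0]
```

With `lam = 0` the rule reduces to empirical risk minimization and picks the most complex candidate; as `lam` grows, simpler candidates win.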

Three elements of statistical learning, part 3: algorithm (optimization algorithm)

Algorithm: statistical learning starts from the training set (data) and, according to the learning strategy (loss), selects the optimal model from the hypothesis space (model). The last question is which computational method to use to solve for that optimal model (algorithm). At this point the statistical learning problem reduces to an optimization problem, and the algorithm of statistical learning becomes the algorithm for solving that optimization problem.

Optimization: if the optimization problem has an explicit analytical solution, it is relatively simple. Usually no analytical solution exists, and a numerical method is needed. How to guarantee finding the global optimum, and how to make the solution process efficient, become important issues. Statistical learning can use existing optimization algorithms (commonly gradient descent, Newton's method, and quasi-Newton methods), and sometimes needs optimization algorithms of its own.
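The sketch below shows the numerical route on a problem that also happens to have an analytical solution, so the two can be compared. It runs plain gradient descent on the squared-loss empirical risk of a one-parameter model \(f(x) = wx\). The model, the data, the learning rate, and the step count are all assumptions chosen for this example.

```python
def gradient_descent(xs, ys, lr=0.05, steps=500):
    """Minimize (1/N) * sum((w*x - y)^2) over the scalar w."""
    n = len(xs)
    w = 0.0
    for _ in range(steps):
        # Derivative of the average squared loss with respect to w.
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]

w_gd = gradient_descent(xs, ys)
# Analytical least-squares solution for this one-parameter model.
w_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

Because this objective is convex, gradient descent with a small enough learning rate reaches the same minimizer as the closed form; for non-convex problems no such guarantee holds.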

Model evaluation and model selection

Evaluation criteria: once the loss function is given, the training error and the test error of the model under that loss function naturally become the criteria for evaluating a learning method. Note that the loss function a learning method uses in training need not be the one used in evaluation, although it is ideal for the two to be the same. (In practice, because the 0-1 loss is not continuously differentiable, evaluation uses the 0-1 loss while training uses another loss; classification tasks, for example, mostly train with the logarithmic loss.)
Training error: the model's average loss on the training set (the empirical loss). The size of the training error is meaningful for judging whether a given problem is easy to learn, but it is essentially unimportant.
Test error: the model's average loss on the test set. (When the loss function is the 0-1 loss, the test error becomes the error rate on the test set, and the error rate plus the accuracy equals 1.) The test error reflects a learning method's ability to predict unknown test data; this ability to predict unknown data is usually called generalization ability.
Model selection: when the hypothesis space contains models of differing complexity (for example, different numbers of parameters), we face the problem of model selection and hope to learn a suitable model. If a "true" model exists in the hypothesis space, the selected model should approximate it.
Overfitting: if one blindly pursues predictive ability on the training data, the chosen model tends to be more complex than the "true" model. This phenomenon is called overfitting. Overfitting means that the learned model contains too many parameters, so that it predicts known data very well but unknown data poorly.
Model selection and overfitting: model selection aims to avoid overfitting and improve the model's predictive ability. Commonly used model selection methods are regularization and cross-validation.

Regularization:
Regularization is the implementation of the structural risk minimization strategy: a regularization term (penalty term) is added to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity, so the more complex the model, the larger the regularization value. The regularization term can be a norm of the model parameter vector, such as the L1 norm or the L2 norm. For a regression problem with squared loss plus the L2 norm, the objective is as follows.
\[L(w) = \frac{1}{N} \sum_{i=1}^{N} (f(x_i; w) - y_i)^2 + \frac{\lambda}{2} \left\| w \right\|_2^2\]
A model whose first term (the empirical risk) is small may be complex (many nonzero parameters), in which case the second term (the model complexity) will be large. The role of regularization is to select a model for which the empirical risk and the model complexity are both small.
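A minimal sketch of the L2-regularized squared-loss objective, assuming the simplest possible model \(f(x; w) = wx\) with a single scalar parameter. Setting the derivative \(\frac{2}{N}\sum_i x_i(wx_i - y_i) + \lambda w\) to zero gives the closed form used in `ridge_solution`. The function names and data are hypothetical.

```python
def ridge_objective(w, xs, ys, lam):
    """(1/N) * sum((w*x - y)^2) + (lam/2) * w^2 for the model f(x; w) = w*x."""
    n = len(xs)
    fit = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return fit + 0.5 * lam * w ** 2

def ridge_solution(xs, ys, lam):
    """Minimizer of ridge_objective: w = sum(x*y) / (sum(x^2) + N*lam/2)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 0.5 * n * lam)

# Toy data, invented for illustration; without regularization w = 2 fits exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = ridge_solution(xs, ys, lam=0.0)
w_ridge = ridge_solution(xs, ys, lam=1.0)
```

The regularized solution is pulled toward zero relative to the unregularized one, which is the penalty on complexity at work.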

Regularization accords with the principle of Occam's razor: among all the models that could be chosen, the one that explains the known data well and is very simple is the best model, and is the one that should be selected.

Training / validation / test: if samples are plentiful, a simple approach to model selection is to randomly split the data into a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used for model selection, and the test set is used for the final evaluation of the learning method. Among the learned models of varying complexity, select the one with the smallest prediction error on the validation set. Because the validation set has enough data, using it for model selection is effective.
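A random three-way split like the one described can be sketched with the standard library alone. The 60/20/20 proportions, the seed, and the function name are assumptions for illustration; the notes do not prescribe particular fractions.

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train / validation / test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(10))
```

The three lists are disjoint and together contain every sample exactly once.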

In many practical applications, however, data are not plentiful, and cross-validation can be used to select a good model. The basic idea is to reuse the data: split the given data, combine the pieces into training and test sets, and on that basis repeatedly train, test, and select models.

Simple cross-validation: randomly split the data into two parts, one used as the training set and the other as the test set (for example, a 70% / 30% split). Then train models on the training set under various conditions (for example, different numbers of parameters), evaluate each model's test error on the test set, and select the model with the smallest test error.

S-fold cross-validation (the most widely used): randomly partition the data into S disjoint subsets of equal size; train the model on S-1 of the subsets and test it on the remaining one; repeat this for all S possible choices and average the S test errors; select the model with the smallest average test error as the best model.

Leave-one-out cross-validation: the special case of S-fold cross-validation in which \(S = N\) (the sample size) is called leave-one-out cross-validation. It is often used when data are scarce.
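The S-fold procedure, with leave-one-out as the case `s = len(data)`, might be sketched as follows. The learner here (predict the training mean, score by mean squared error) is a deliberately trivial stand-in chosen for illustration, and all names are hypothetical.

```python
import random

def s_fold_indices(n, s, seed=0):
    """Randomly partition indices 0..n-1 into s disjoint folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::s] for i in range(s)]

def cross_validate(train_fn, error_fn, data, s, seed=0):
    """Average test error over the s train/test splits."""
    errors = []
    for fold in s_fold_indices(len(data), s, seed):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        model = train_fn(train)
        errors.append(error_fn(model, test))
    return sum(errors) / s

def fit_mean(train):
    """Toy learner: the 'model' is just the mean of the training targets."""
    return sum(train) / len(train)

def mse(model, test):
    """Mean squared error of the constant prediction on the test targets."""
    return sum((y - model) ** 2 for y in test) / len(test)

data = [1.0, 2.0, 3.0, 4.0]
loo_error = cross_validate(fit_mean, mse, data, s=len(data))  # leave-one-out
two_fold_error = cross_validate(fit_mean, mse, data, s=2)
```

For model selection, one would run `cross_validate` once per candidate model and keep the candidate with the smallest average error.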

Generalization

Generalization ability: the generalization ability of a learning method is the ability of the model it learns to predict unknown data.
Test error: in practice, the most common approach is to evaluate a learning method's generalization ability through its test error, but this evaluation depends on the test data set. Because the test data set is finite, the resulting evaluation may well be unreliable.
Generalization error: the generalization error is the expected risk of the learned model.
Generalization error bound: statistical learning theory tries to analyze the generalization ability of learning methods theoretically. This analysis is often carried out by studying a probabilistic upper bound on the generalization error, called the generalization error bound. Specifically, two learning methods are compared by comparing the sizes of their generalization error bounds.
Properties of the generalization error bound: as the sample size increases, the bound tends to 0. The larger the hypothesis space, the harder the model is to learn, and the larger the bound.
Generalization error bound for binary classification: for a binary classification problem, when the hypothesis space is a finite set of functions \(\mathcal{F} = \{f_1, f_2, ..., f_d\}\), for any function \(f \in \mathcal{F}\), the following inequality holds with probability at least \(1 - \delta\). (This is the simple case of a hypothesis space containing finitely many functions; bounds for general hypothesis spaces are not so simple to find.)
\[R(f) \leqslant \hat{R}(f) + \varepsilon(d, N, \delta) \\ \varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N} \left(\log d + \log \frac{1}{\delta}\right)}\]
In the inequality, the left side \(R(f)\) is the generalization error and the right side is its upper bound. The first term on the right is the training error: the smaller the training error, the smaller the bound. The second term is a monotonically decreasing function of \(N\) that tends to 0 as \(N\) tends to infinity. It is also of order \(\sqrt{\log d}\), so the more functions the hypothesis space \(\mathcal{F}\) contains, the larger its value.
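The two monotonicity properties of the second term are easy to check numerically. The sketch below evaluates \(\varepsilon(d, N, \delta)\) directly from the formula; the function names and the sample values of \(d\), \(N\), and \(\delta\) are arbitrary choices for illustration.

```python
import math

def epsilon(d, n, delta):
    """Second term of the bound: sqrt((log d + log(1/delta)) / (2N))."""
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * n))

def generalization_bound(train_error, d, n, delta):
    """Upper bound on R(f) that holds with probability at least 1 - delta."""
    return train_error + epsilon(d, n, delta)

eps_small_n = epsilon(d=10, n=100, delta=0.05)
eps_large_n = epsilon(d=10, n=10000, delta=0.05)
eps_large_d = epsilon(d=1000, n=100, delta=0.05)
```

Increasing `n` shrinks the term, while increasing `d` enlarges it, though only at the slow rate \(\sqrt{\log d}\).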

Generative models and discriminative models

Generative and discriminative approaches: supervised learning methods can be divided into generative approaches and discriminative approaches, and the models they learn are called generative models and discriminative models respectively.

Generative model: the generative approach learns the joint probability distribution \(P(X, Y)\) from the data and then derives the conditional probability distribution \(P(Y \mid X)\) as the prediction model, which is the generative model.
\[P(Y \mid X) = \frac{P(X, Y)}{P(X)}\]
It is called a generative approach because the model represents the generative relationship by which a given input X produces the output Y. Typical generative models are naive Bayes and the hidden Markov model.
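As a toy illustration of the generative route, the sketch below starts from a small joint distribution table, obtains \(P(Y \mid X)\) by dividing by the marginal \(P(X)\), and predicts the most probable label. The table values and the weather / activity labels are invented; a real generative method would first estimate the joint distribution from data.

```python
# Hypothetical joint distribution P(X, Y); the probabilities sum to 1.
joint = {
    ("sunny", "play"): 0.4,
    ("sunny", "stay"): 0.1,
    ("rainy", "play"): 0.1,
    ("rainy", "stay"): 0.4,
}

def conditional(joint, x):
    """P(Y | X = x) = P(x, Y) / P(x), with P(x) = sum over y of P(x, y)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

def predict(joint, x):
    """Label with the highest conditional probability given x."""
    cond = conditional(joint, x)
    return max(cond, key=cond.get)
```

A discriminative method would skip the joint table and model `conditional` (or the decision rule `predict`) directly.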

Discriminative model: the discriminative approach learns the decision function \(f(X)\) or the conditional probability distribution \(P(Y \mid X)\) directly from the data as the prediction model, which is the discriminative model. The discriminative approach is concerned with what output Y should be predicted for a given input X. Typical discriminative models include KNN, the perceptron, decision trees, logistic regression, the maximum entropy model, SVM, boosting, and the conditional random field.

Advantages of the generative approach:
(1) It can recover the joint probability distribution.
(2) Learning converges faster: as the sample size N increases, the learned model converges to the true model more quickly.
(3) When latent variables exist, the generative approach can still be used, whereas the discriminative approach cannot.

Advantages of the discriminative approach:
(1) It learns the conditional probability \(P(Y \mid X)\) or the decision function \(f(X)\) directly and so faces the prediction task directly, often with higher accuracy.
(2) Because it learns \(P(Y \mid X)\) or \(f(X)\) directly, the data can be abstracted to various degrees and features can be defined and used, which can simplify the learning problem.


Origin www.cnblogs.com/liaohuiqiang/p/10979545.html