Statistical Learning Methods: An Introduction to Statistical Learning

Statistical learning is also called statistical machine learning.

Statistical learning comprises supervised learning, unsupervised learning, reinforcement learning, and other paradigms.

A statistical learning method can be summarized as follows: starting from a given, finite training data set, and assuming the data are generated independently and identically distributed, posit that the model to be learned belongs to a certain set of functions, called the hypothesis space; apply some evaluation criterion to select from the hypothesis space the model that is optimal under that criterion, so that it gives the best predictions on both the known training data and unknown test data; the selection of the optimal model is carried out by an algorithm. A statistical learning method thus comprises the hypothesis space of models, the criterion for model selection, and the algorithm for learning the model. These are called the three elements of statistical learning methods, referred to as the model, the strategy, and the algorithm.
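As a minimal sketch of the three elements (all data and names here are hypothetical, for illustration only), take the hypothesis space to be linear functions of one variable, the strategy to be empirical risk under quadratic loss, and the algorithm to be closed-form least squares:

```python
import numpy as np

# Model: the hypothesis space is the set of linear functions f(x) = w*x + b.
def f(x, w, b):
    return w * x + b

# Strategy: quadratic loss, averaged over the training set (empirical risk).
def empirical_risk(w, b, x, y):
    return np.mean((y - f(x, w, b)) ** 2)

# Algorithm: ordinary least squares, solved in closed form.
def least_squares(x, y):
    A = np.stack([x, np.ones_like(x)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, b

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)                    # i.i.d. inputs, as assumed
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 100)    # noisy linear responses
w, b = least_squares(x, y)
print(w, b, empirical_risk(w, b, x, y))        # w near 2, b near 0.5
```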

Classification of statistical learning

1. Basic classification

① Supervised learning

Its essence is to learn the statistical law of the mapping from input to output.

The input variable X and the output variable Y may be of different types, continuous or discrete. Prediction problems in which both the input and output variables are continuous are called regression problems; prediction problems in which the output variable takes a finite number of discrete values are called classification problems; and prediction problems in which both the input and output variables are sequences of variables are called tagging problems.

Supervised learning assumes that the input and output random variables X and Y follow a joint probability distribution P(X, Y), where P(X, Y) denotes the distribution function or the distribution density function.

② Unsupervised learning

Its essence is to learn the statistical law or underlying structure of the data.

③ Reinforcement learning

Reinforcement learning is the machine-learning problem of an intelligent system learning an optimal behavior policy through continual interaction with its environment.

Its essence is to learn optimal sequential decisions.

④ Semi-supervised learning and active learning

2. Classification by model

① Probabilistic models and non-probabilistic models

In supervised learning, a probabilistic model takes the form of a conditional probability distribution P(Y | X), while a non-probabilistic model takes the form of a function Y = f(X). In unsupervised learning, a probabilistic model takes the form of a conditional probability distribution P(Z | X) or P(X | Z), while a non-probabilistic model takes the form of a function Z = g(X).
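A small sketch of the distinction, with hypothetical fixed parameters (pure NumPy, not any particular library's API): the probabilistic model returns a distribution over outputs, while the non-probabilistic model returns a single output value:

```python
import numpy as np

w, b = np.array([1.5, -0.7]), 0.2   # hypothetical learned parameters
x = np.array([0.4, 1.1])            # one input vector

# Probabilistic model: outputs the conditional distribution P(Y | X).
def p_y_given_x(x):
    p1 = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # logistic model for P(Y=1 | X)
    return {0: 1.0 - p1, 1: p1}

# Non-probabilistic model: outputs a value Y = f(X) directly.
def f(x):
    return 1 if w @ x + b > 0 else 0

print(p_y_given_x(x))   # a full distribution over the two labels
print(f(x))             # a single predicted label
```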

In supervised learning, probabilistic models are generative models, and non-probabilistic models are discriminative models.

② Linear models and nonlinear models

③ Parametric models and non-parametric models

3. Classification by algorithm

Statistical learning can be divided into online learning and batch learning.

Online learning means receiving one sample at a time, making a prediction, then updating the model, and repeating this operation continually.

Batch learning accepts all the data at once, learns the model, and then makes predictions.

Online learning is usually harder than batch learning: it is difficult to learn a model with high predictive accuracy, because each model update has only limited data available.
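A sketch of the two regimes on hypothetical data, with closed-form least squares as the batch learner and one stochastic-gradient step per sample as the online learner:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + rng.normal(0, 0.1, 200)

# Batch learning: accept all the data at once, learn the model, then predict.
w_batch = np.sum(x * y) / np.sum(x * x)    # closed-form least squares (no bias term)

# Online learning: receive one sample, predict, update the model, repeat.
w_online, lr = 0.0, 0.1
for xi, yi in zip(x, y):
    y_hat = w_online * xi                  # predict with the current model
    w_online += lr * (yi - y_hat) * xi     # one gradient step on quadratic loss

print(w_batch, w_online)   # the online estimate is typically the noisier of the two
```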


Loss functions and risk functions

A loss function measures how good a single prediction of the model is; a risk function measures how good the model's predictions are on average. The loss function is a non-negative real-valued function of f(X) and Y, denoted L(Y, f(X)).

1) 0-1 loss function

$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$

2) Quadratic loss function

$$L(Y, f(X)) = (Y - f(X))^2$$

3) Absolute loss function

$$L(Y, f(X)) = |Y - f(X)|$$

4) Logarithmic loss function, also called the log-likelihood loss function

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X)$$
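The four loss functions above can be written directly in code. A minimal sketch (the function names are ours; y is a true output and y_hat a prediction):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    return 1.0 if y != y_hat else 0.0    # 0-1 loss: 1 on a miss, 0 on a hit

def quadratic_loss(y, y_hat):
    return (y - y_hat) ** 2              # quadratic (squared) loss

def absolute_loss(y, y_hat):
    return abs(y - y_hat)                # absolute loss

def log_loss(p_y_given_x):
    # logarithmic loss: -log P(Y | X), where p_y_given_x is the
    # probability the model assigns to the true output
    return -np.log(p_y_given_x)
```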

Since the model's input and output (X, Y) are random variables following the joint distribution P(X, Y), the expectation of the loss function is

$$R_{\mathrm{exp}}(f) = E_P\left[L(Y, f(X))\right] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, P(x, y)\, \mathrm{d}x\, \mathrm{d}y$$

This is the average loss of the model f(X) with respect to the joint distribution P(X, Y), in theory, and it is called the risk function or the expected loss.

The goal of learning is to select the model with the smallest expected risk. Since the joint distribution P(X, Y) is unknown, R_exp(f) cannot be computed directly. In fact, if the joint distribution P(X, Y) were known, the conditional probability distribution P(Y | X) could be obtained from it directly, and there would be no need to learn at all. It is precisely because the joint probability distribution is unknown that learning is needed.

Given a training data set

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

the average loss of the model f(X) on the training data set is called the empirical risk or empirical loss, denoted R_emp:

$$R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$

The expected risk R_exp(f) is the model's expected loss with respect to the joint distribution, while the empirical risk R_emp(f) is the model's average loss over the training sample set. By the law of large numbers, as the sample size N tends to infinity, the empirical risk R_emp(f) tends to the expected risk R_exp(f). A natural idea, then, is to estimate the expected risk by the empirical risk. But because training data are limited in practice, and often scarce, this estimate is often not ideal, and the empirical risk must be corrected in some way. This leads to the two basic strategies of supervised learning: empirical risk minimization and structural risk minimization.
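A small simulation of this point, under a hypothetical joint distribution chosen so the expected risk is known exactly: as N grows, the empirical risk of a fixed model approaches its expected risk:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2.0 * x          # a fixed model, not learned from the data

def sample(n):
    # Hypothetical joint distribution P(X, Y): Y = 2X + unit Gaussian noise,
    # so the expected risk of f under quadratic loss is exactly 1.0.
    x = rng.normal(0, 1, n)
    y = 2.0 * x + rng.normal(0, 1, n)
    return x, y

for n in (10, 1000, 100000):
    x, y = sample(n)
    r_emp = np.mean((y - f(x)) ** 2)   # empirical risk under quadratic loss
    print(n, r_emp)                    # tends toward the expected risk 1.0
```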

Empirical risk minimization and structural risk minimization

The strategy of empirical risk minimization (ERM) holds that the model with the smallest empirical risk is the best model. Under this strategy, finding the optimal model amounts to solving the optimization problem

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$

where $\mathcal{F}$ is the hypothesis space.

When the sample size is large enough, empirical risk minimization can guarantee good learning performance. For example, maximum likelihood estimation is an instance of empirical risk minimization: when the model is a conditional probability distribution and the loss function is the logarithmic loss, empirical risk minimization is equivalent to maximum likelihood estimation.
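The equivalence is a one-line calculation. Writing $P_\theta(Y \mid X)$ for a model indexed by a parameter $\theta$ (notation introduced here for clarity), minimizing the empirical risk under logarithmic loss is exactly maximizing the likelihood:

$$\arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \bigl(-\log P_\theta(y_i \mid x_i)\bigr) = \arg\max_{\theta} \prod_{i=1}^{N} P_\theta(y_i \mid x_i)$$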

However, when the sample size is small, empirical risk minimization does not learn well and tends to produce over-fitting.

Structural risk minimization (SRM) is a strategy proposed to prevent over-fitting. Structural risk minimization is equivalent to regularization: the structural risk adds to the empirical risk a regularization term, or penalty term, for model complexity. Given the hypothesis space, the loss function, and the training data set, the structural risk is defined as

$$R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$$

where J(f) is the complexity of the model, a functional defined on the hypothesis space $\mathcal{F}$. The more complex the model f, the larger its complexity J(f); in other words, the complexity term expresses a penalty on complex models. The coefficient λ ≥ 0 trades off the empirical risk against the model complexity. A small structural risk requires both the empirical risk and the model complexity to be small, and a model with small structural risk tends to predict well on both the training data and unknown test data. For example, maximum a posteriori (MAP) estimation in Bayesian estimation is an instance of structural risk minimization: when the model is a conditional probability distribution, the loss function is the logarithmic loss, and the model complexity is expressed by the prior probability of the model, structural risk minimization is equivalent to maximum a posteriori estimation.
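This equivalence follows the same pattern as the maximum-likelihood case. Assuming i.i.d. data and a prior P(f) over models (a sketch of the standard argument), taking the negative logarithm of the posterior turns the prior into the complexity penalty:

$$\arg\max_{f} P(f \mid T) = \arg\max_{f} \prod_{i=1}^{N} P(y_i \mid x_i, f)\, P(f) = \arg\min_{f} \left( \sum_{i=1}^{N} -\log P(y_i \mid x_i, f) - \log P(f) \right)$$

Dividing by N, this has the form of the structural risk with logarithmic loss, $J(f) \propto -\log P(f)$, and $\lambda = 1/N$.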

The strategy of structural risk minimization holds that the model with the smallest structural risk is the best model. Finding the optimal model thus amounts to solving the optimization problem

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$$
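As a concrete sketch of this optimization problem on hypothetical data: take quadratic loss, linear models f(x) = w·x, and J(f) = ‖w‖². The structural-risk minimizer is then ridge regression, which has a closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (50, 5))    # 50 samples, 5 features
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(0, 0.1, 50)

def srm_linear(X, y, lam):
    # Minimize (1/N) * sum_i (y_i - w.x_i)^2 + lam * ||w||^2.
    # Setting the gradient to zero gives (X'X/N + lam*I) w = X'y/N.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print(srm_linear(X, y, lam=0.0))   # lam = 0: pure empirical risk minimization
print(srm_linear(X, y, lam=1.0))   # lam > 0: coefficients shrunk toward zero
```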
