Machine Learning Algorithms, Summary 1: Introduction to Statistical Learning Methods

A summary of notes from studying "Statistical Learning Methods".
Statistical learning is the discipline of using computers to build probabilistic statistical models from data in order to predict and analyze data.
Statistical learning is the study of data. Its basic assumption about data is that data of the same kind exhibit certain statistical regularities. Data are divided into continuous variables and discrete variables.
Statistical learning has three elements: the model, the strategy, and the algorithm.
1. Model
The hypothesis space of the model contains all possible decision functions or conditional probability distributions.
2. Strategy
A loss function measures how good or bad a single prediction of the model is; a risk function measures how good or bad the model's predictions are in an average sense.
The loss function is a non-negative real-valued function of f(X) and Y, denoted L(Y, f(X)).
Common loss functions:
(1) 0-1 loss function:
$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$
(2) Quadratic loss function (commonly used in linear regression):
$$L(Y, f(X)) = (Y - f(X))^2$$
(3) Absolute loss function:
$$L(Y, f(X)) = |Y - f(X)|$$
(4) Logarithmic loss function (commonly used in logistic regression):
$$L(Y, P(Y \mid X)) = -\log P(Y \mid X)$$
The smaller the value of the loss function, the better the model.
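As a concrete illustration, here is a minimal sketch of the four loss functions above in plain NumPy; the function names are my own, not from the book:

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """0-1 loss: 1 if the prediction is wrong, 0 if it is right."""
    return np.where(y != y_pred, 1.0, 0.0)

def quadratic_loss(y, y_pred):
    """Quadratic (squared) loss, typical for linear regression."""
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    """Absolute loss."""
    return np.abs(y - y_pred)

def log_loss(p_y_given_x):
    """Logarithmic loss: -log P(Y | X), typical for logistic regression."""
    return -np.log(p_y_given_x)
```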
The expectation of the loss function, i.e. the risk function (expected loss), is:
$$R_{\exp}(f) = E_P[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, P(x, y)\, \mathrm{d}x\, \mathrm{d}y$$
Definition of mathematical expectation (mean): the sum over all possible outcomes of each outcome multiplied by its probability; it reflects the average value of a random variable. The law of large numbers states that as the number of repetitions of an experiment approaches infinity, the arithmetic mean of the results almost surely converges to the expected value.
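A quick simulation (my own illustration, not from the book) shows the law of large numbers in action: the running mean of fair die rolls approaches the expectation 3.5 as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die: values 1..6
for n in (10, 1_000, 100_000):
    print(n, rolls[:n].mean())  # approaches E[X] = 3.5 as n grows
```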
Here, the goal of learning is to choose the model with the smallest expected risk, but the joint distribution P(X, Y) is unknown, so the empirical risk (empirical loss) is used to approximate the expected loss.
Empirical risk:
$$R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$
When the sample size is large enough, empirical risk minimization works well; when the sample size is very small, it tends to produce the "over-fitting" phenomenon, so structural risk minimization is used instead.
Structural risk is defined as:
$$R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$$
where the regularization term J(f) measures the complexity of the model and is a monotonically increasing function of that complexity, and the coefficient λ ≥ 0 trades off the empirical risk against the model complexity.
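A minimal sketch of empirical and structural risk for a linear model, assuming squared loss and a squared L2 norm as J(f) (the choice of norm is my own illustrative assumption):

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_emp(f): average squared loss of the linear model f(x) = X @ w."""
    return np.mean((y - X @ w) ** 2)

def structural_risk(w, X, y, lam):
    """R_srm(f) = empirical risk + lambda * J(f), here with J(f) = ||w||^2."""
    return empirical_risk(w, X, y) + lam * np.sum(w ** 2)
```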
3. Algorithm
Statistical learning problems are formulated as optimization problems, so the algorithms of statistical learning become the algorithms for solving those optimization problems. For example: least squares for linear regression, gradient descent for logistic regression.
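As an example of the "algorithm" element, here is a hedged sketch of batch gradient descent for logistic regression; the learning rate and iteration count are arbitrary illustrative choices, not values from the book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Minimize the average log loss by batch gradient descent.
    X: (N, d) feature matrix; y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)             # predicted P(Y = 1 | X)
        grad = X.T @ (p - y) / len(y)  # gradient of the average log loss
        w -= lr * grad
    return w
```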
Statistical learning includes supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Supervised learning learns a model from a collection of training data and uses the model to make predictions on test data. The training data consist of input (or feature vector) and output pairs, and the training set is usually written as:
$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$
Supervised learning problems can be divided into three categories: regression, classification, and tagging problems.
A prediction problem in which both the input and output variables are continuous is called a regression problem; a prediction problem in which the output variable takes a finite number of discrete values is called a classification problem; a prediction problem in which both the input and output are sequences of variables is called a tagging problem.
Supervised learning assumes that the input and output random variables X and Y follow a joint probability distribution P(X, Y), and that the training data and test data are generated independently and identically distributed according to this joint distribution P(X, Y).
Definition of the joint probability distribution: let (X, Y) be a two-dimensional random variable; for any real numbers x and y, the function F(x, y) = P{X ≤ x, Y ≤ y} is called the distribution function (joint distribution function) of (X, Y).
The relationship between joint probability, marginal probability, and conditional probability:
$$P(X, Y) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)$$
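A small worked example (the numbers are my own) illustrating this identity on a discrete joint distribution table:

```python
import numpy as np

# Joint distribution P(X, Y) over X in {0, 1} (rows) and Y in {0, 1} (columns).
P_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

P_x = P_xy.sum(axis=1)             # marginal P(X), summing out Y
P_y_given_x = P_xy / P_x[:, None]  # conditional P(Y | X)

# Check: P(X, Y) = P(Y | X) * P(X)
assert np.allclose(P_xy, P_y_given_x * P_x[:, None])
```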
Supervised learning uses the training data set to learn a model and then uses that model to make predictions on the test set, as shown below:
(Figure: a learning system produces a model from the training set, and a prediction system applies the model to new inputs to produce outputs.)
The purpose of supervised learning is to learn a mapping from input to output; this mapping is the model of supervised learning, which may be probabilistic or non-probabilistic, represented by a conditional probability distribution P(Y | X) or a decision function Y = f(X). In the prediction process, for a new input $x_{N+1}$, the model gives the corresponding output by
$$y_{N+1} = \arg\max_{y} \hat{P}(y \mid x_{N+1})$$
or
$$y_{N+1} = \hat{f}(x_{N+1}).$$
The training error is the average loss of the model on the training data set; the test error is the average loss of the model on the test data set.
Roughly speaking, the bias corresponds to the training error, and the variance corresponds to the difference between the test error and the training error.
Overfitting corresponds to high variance; underfitting corresponds to high bias. How can overfitting and underfitting be addressed?
Overfitting refers to the phenomenon in which the learned model contains so many parameters that it predicts the known data well but predicts unknown data poorly.
The relationship between training error, test error, and model complexity:
(Figure: as model complexity increases, the training error decreases monotonically, while the test error first decreases and then increases.)
Methods for dealing with overfitting:
1. Regularization
The regularization term is typically a norm of the parameter vector. The role of regularization is to select a model with both small empirical risk and low model complexity, in line with the principle of Occam's razor.
Occam's razor: among all candidate models, the model that explains the known data well and is as simple as possible is the best model, and it is the one that should be selected.
2. Cross-validation (applicable when data are scarce)
The basic idea of cross-validation: split the given data, combine the splits into training sets and test sets in different ways, and on that basis repeatedly train, test, and select models.
(1) Simple cross-validation: randomly split the data into a training set and a test set, e.g. a 7:3 split;
(2) S-fold cross-validation: first randomly partition the data into S disjoint subsets of equal size; train the model on S−1 of the subsets and test it on the remaining subset; repeat this process for each of the S possible choices of test subset; finally select the model with the smallest average test error over the S evaluations (see the sketch after this list);
(3) Leave-one-out cross-validation: the special case of S-fold cross-validation with S = N.
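A minimal sketch of S-fold cross-validation, assuming a generic `fit(X, y)` / `error(model, X, y)` interface (both names are placeholders of my own, not a real library API):

```python
import numpy as np

def s_fold_cv(X, y, fit, error, S=5, seed=0):
    """Return the average test error over S folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, S)
    errors = []
    for i in range(S):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
        model = fit(X[train_idx], y[train_idx])                # train on S-1 subsets
        errors.append(error(model, X[test_idx], y[test_idx]))  # test on the remaining one
    return np.mean(errors)
```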
The generalization ability of a model refers to its ability to predict unknown data.
Supervised learning methods are divided into generative methods and discriminative methods. A generative method learns the joint probability distribution P(X, Y) from the data; typical generative models are the naive Bayes method and hidden Markov models. A discriminative method learns the decision function f(X) or the conditional probability distribution P(Y | X) directly from the data; typical discriminative models are k-nearest neighbors, the perceptron, decision trees, logistic regression, maximum entropy models, support vector machines, boosting methods, and conditional random fields.
A classification model or classification decision function learned from data by supervised learning is called a classifier. The usual metric for evaluating classifier performance is the classification accuracy, defined as the ratio of the number of correctly classified samples to the total number of samples in a given test data set.
Other evaluation metrics: precision, recall, and the F1 value (the harmonic mean of precision and recall).
TP: the number of positive instances predicted as positive; FN: the number of positive instances predicted as negative; FP: the number of negative instances predicted as positive; TN: the number of negative instances predicted as negative.
Precision: $P = \frac{TP}{TP + FP}$; recall: $R = \frac{TP}{TP + FN}$; F1 value: $F_1 = \frac{2TP}{2TP + FP + FN}$.
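A short sketch computing these metrics from predicted and true binary labels (my own helper, not from the book; it assumes at least one predicted and one true positive so the denominators are nonzero):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return p, r, f1
```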
Statistical learning methods for classification: k-nearest neighbors, the perceptron, naive Bayes, decision trees, decision lists, logistic regression, support vector machines, boosting methods, Bayesian networks, neural networks, Winnow, and so on.
Statistical learning methods for tagging: hidden Markov models, conditional random fields (CRFs).
Regression is equivalent to function fitting: select a function curve so that it fits the known data well and predicts unknown data well. Regression is divided into linear regression and nonlinear regression.
Reference: Li Hang, "Statistical Learning Methods".
