Machine Learning (5): Ensemble Learning (RF / AdaBoost / GBDT)

1 The Idea of Ensemble Learning

The idea of ensemble learning is to combine several learners (classifiers & regressors) to produce a new, stronger learner. A weak classifier (weak learner) is a classifier whose accuracy is only slightly better than random guessing (error rate < 0.5).

A successful ensemble algorithm must ensure diversity among the weak classifiers. Ensembles built from unstable algorithms tend to obtain more significant performance gains.

Common ensemble learning approaches include: Bagging, Boosting, and Stacking.

1.1 Bagging

Bagging, also known as bootstrap aggregating, is an ensemble technique that resamples the original dataset with replacement to build S new datasets and trains S classifiers on them. In other words, the training data of these models is allowed to contain duplicate samples.

When a Bagging-trained model predicts the class of a new sample, it uses majority voting (for classification) or averaging (for regression) to obtain the final result.

The weak learner in Bagging can be essentially any model, e.g.: Linear Regression, Ridge, Lasso, Logistic Regression, Softmax, ID3, C4.5, CART, SVM, KNN, and so on.

NOTE: Bagging samples with replacement, and the number of samples in each subset must equal the number of samples in the original dataset, but duplicate samples are allowed within a subset.
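
As a rough illustration, here is a minimal Bagging sketch using scikit-learn's BaggingClassifier; the dataset and parameter values are illustrative assumptions:

```python
# A minimal Bagging sketch with scikit-learn (parameters and dataset are illustrative)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True: sample with replacement; max_samples=1.0: each bootstrap subset
# has the same size as the training set, with duplicates allowed.
# The default base learner is a decision tree.
bag = BaggingClassifier(n_estimators=50, max_samples=1.0, bootstrap=True, random_state=42)
bag.fit(X_train, y_train)
print("Bagging accuracy:", bag.score(X_test, y_test))  # prediction = majority vote
```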

Improvements built on the Bagging strategy: Bagging → Random Forest (RF) → RF variants: Extra Trees / Totally Random Trees Embedding (TRTE) / Isolation Forest.

1.2 Boosting

Boosting is a machine learning technique that can be used for both classification and regression problems. At each step it builds a weak prediction model (e.g., a decision tree) and adds it to the total model with a weight; if the weak model generated at each step follows the gradient of the loss function, the method is called gradient boosting.

The significance of boosting: if we only have a weak prediction model, we can obtain a strong prediction model by boosting it.
Common models: AdaBoost, Gradient Boosting (GBT / GBDT / GBRT).

1.3 Stacking

Stacking refers to training a model to combine (Combine) the outputs of other models (model ensemble / ensemble of learners). That is, several different models are trained first, and then the outputs of these previously trained models are used as inputs to train a new model, which yields the final model. A single-layer logistic regression is commonly used as the combining model.
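
A minimal Stacking sketch with scikit-learn's StackingClassifier; the base models below are chosen only for illustration:

```python
# A minimal Stacking sketch with scikit-learn (base models chosen only for illustration)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First-level models are trained first; their outputs become the inputs
# of the final (combining) model, here a single logistic regression.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print("Stacking training accuracy:", stack.score(X, y))
```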

2 Random Forest (RF)

2.1 Algorithm Flow

  1. Draw n samples from the sample set using Bootstrap sampling (with replacement);
  2. Randomly select K attributes from all attributes, and choose the best of them as the split attribute to build a decision tree node;
  3. Repeat the above two steps m times, i.e., build m decision trees;
  4. The m trees form the random forest; the class of a data point is decided by voting among the trees (see the sketch below).
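
A minimal sketch of this flow with scikit-learn's RandomForestClassifier; the dataset and parameter values are illustrative assumptions:

```python
# A minimal Random Forest sketch with scikit-learn (parameter values are illustrative)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = m trees; max_features = K attributes considered per split;
# bootstrap sampling of training rows is enabled by default.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("RF accuracy:", rf.score(X_test, y_test))              # class chosen by voting
print("Feature importances:", rf.feature_importances_[:5])   # importance per feature
```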

2.2 Extra Trees

The basic principle is the same as RF; the differences are as follows:

  1. RF trains each sub-tree on a random sample of the training set, while each decision tree in Extra Trees is trained on the original training set;
  2. When splitting a node, RF behaves like a conventional decision tree: it chooses the optimal split feature/value based on information gain, information gain ratio, Gini index, standard deviation, and so on; Extra Trees instead chooses the feature value to split on at random.

Because Extra Trees selects the split feature value at random, the generated decision trees are generally larger than those generated by RF. In other words, the variance of the Extra Trees model is further reduced relative to RF. In some cases, Extra Trees generalizes better than RF.
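
A small sketch contrasting the two (illustrative parameters; ExtraTreesClassifier keeps the original training set and randomizes split thresholds):

```python
# Extra Trees vs. RF sketch (illustrative): no bootstrap, random split thresholds
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# bootstrap=False (the default): every tree is trained on the full original set;
# split thresholds are drawn at random for each candidate feature.
et = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

print("Extra Trees CV accuracy:", cross_val_score(et, X, y, cv=5).mean())
print("Random Forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```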

2.3 TRTE

TRTE (Totally Random Trees Embedding) is an unsupervised data transformation method. It maps low-dimensional data into a high-dimensional space, so that the high-dimensionally mapped data can be used more effectively in classification and regression models.

The TRTE transformation algorithm is similar to RF: it builds T decision trees to fit the data. Once the trees are built, the leaf-node position of each sample in each of the T trees is determined, and encoding these leaf-node positions as a feature vector completes the transformation.
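
A minimal sketch using scikit-learn's RandomTreesEmbedding, which implements this kind of leaf-position encoding (parameter values are illustrative assumptions):

```python
# A minimal TRTE sketch using scikit-learn's RandomTreesEmbedding (illustrative parameters)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Each sample is encoded by the leaf it falls into in each of the T trees,
# producing a sparse, high-dimensional one-hot feature vector.
trte = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=42)
X_embedded = trte.fit_transform(X)
print(X.shape, "->", X_embedded.shape)  # e.g. (200, 5) -> (200, total number of leaves)
```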

2.4 Isolation Forest (IForest)

IForest is an outlier (anomaly) detection algorithm that uses an RF-like approach to detect outliers. The differences between the IForest algorithm and the RF algorithm are:

  1. In the random sampling step, only a small amount of data is generally needed;
  2. During tree construction, IForest randomly selects a split feature and then randomly selects a split threshold for that feature;
  3. The decision trees built by the IForest algorithm generally have a relatively small max_depth.

The reason for these differences: the goal is anomaly detection, so as long as anomalies can be distinguished, a large amount of data is not needed; in addition, the trees used for outlier detection generally do not need to be very deep.

To determine whether a point is an outlier, the test sample x is passed through the T decision trees. For each tree, the depth h_t(x) of the leaf node that x falls into is computed, and from these the average depth H(x) is obtained. The following formula then gives the anomaly score of sample point x; p(x, m) lies in the range [0, 1], and the closer it is to 1, the greater the probability that the point is an outlier:

p(x, m) = 2^(-H(x) / c(m)), where c(m) is the average path length of an unsuccessful search in a binary search tree built on m samples.
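
A minimal sketch with scikit-learn's IsolationForest (the data and parameter values are illustrative assumptions):

```python
# A minimal Isolation Forest sketch (parameters and data are illustrative)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # normal points
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))      # a few anomalies
X = np.vstack([X_inliers, X_outliers])

# Small sub-samples and shallow trees are enough for anomaly detection.
iso = IsolationForest(n_estimators=100, max_samples=256, random_state=42)
iso.fit(X)
labels = iso.predict(X)        # +1 = normal, -1 = outlier
scores = iso.score_samples(X)  # lower score = shorter average path = more anomalous
print("points flagged as outliers:", int(np.sum(labels == -1)))
```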

2.5 Advantages and Disadvantages of Random Forest

Main advantages of RF:

  1. Training can be parallelized, which gives a speed advantage on large-scale training samples;
  2. Because the features considered at each split are randomly selected from a subset, training still performs well when the sample dimensionality is relatively high;
  3. It can provide the importance of each feature after training;
  4. Because of the random sampling, the trained model has small variance and strong generalization ability;
  5. RF is simple to implement;
  6. It is insensitive to partially missing features.

Main disadvantages of RF:

  1. On some datasets where the noise is relatively large, the RF model is vulnerable to overfitting;
  2. Features with many distinct values tend to have a greater influence on RF's decisions, which may affect the model's results.

3 AdaBoost

3.1 Algorithm Principle

Adaptive Boosting (AdaBoost) is an iterative algorithm. Each iteration trains a new learner on the training set and then uses that learner to predict all samples, in order to assess how informative each sample is. In other words, the algorithm assigns a weight to each sample; after each trained learner labels/predicts the individual samples, a sample that is predicted correctly more often has its weight reduced, and otherwise its weight is increased. Samples with higher weights occupy a larger proportion in the next training iteration, which means the samples that are harder to distinguish become more important during training.

The iterative process continues until the error rate is small enough or a given number of iterations is reached.

AdaBoost takes a linear combination of the basic classifiers as the strong classifier, giving larger weights to basic classifiers with smaller classification error rates and smaller weights to basic classifiers with larger error rates. The linear combination is constructed as:

f(x) = Σ_{m=1}^{M} α_m · G_m(x)

The final classifier is obtained by applying the sign function to this linear combination:

G(x) = sign(f(x)) = sign(Σ_{m=1}^{M} α_m · G_m(x))

3.2 Algorithm Build Process

  1. Suppose the training set is T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, with y_i ∈ {-1, +1}.
  2. Initialize the weight distribution of the training data:
    D_1 = (w_11, ..., w_1i, ..., w_1n), w_1i = 1/n, i = 1, 2, ..., n
  3. Learn a basic classifier from the training set with weight distribution D_m:
    G_m(x): X → {-1, +1}
  4. Compute the classification error of G_m(x) on the training set:
    e_m = Σ_{i=1}^{n} w_mi · I(G_m(x_i) ≠ y_i)
  5. Compute the weight coefficient α_m of model G_m(x):
    α_m = (1/2) · ln((1 - e_m) / e_m)
  6. Update the weight distribution of the training data:
    w_{m+1,i} = (w_mi / Z_m) · exp(-α_m · y_i · G_m(x_i)), i = 1, 2, ..., n
  7. where Z_m is the normalization factor:
    Z_m = Σ_{i=1}^{n} w_mi · exp(-α_m · y_i · G_m(x_i))
  8. Build the linear combination of basic classifiers:
    f(x) = Σ_{m=1}^{M} α_m · G_m(x)
  9. Obtain the final classifier:
    G(x) = sign(f(x)) = sign(Σ_{m=1}^{M} α_m · G_m(x))
    ==Ultimate goal==: the α_m and G_m that the AdaBoost algorithm ultimately solves for are those that minimize the following exponential-loss expression:
    (α_m, G_m(x)) = argmin_{α, G} Σ_{i=1}^{n} exp(-y_i · (f_{m-1}(x_i) + α · G(x_i)))
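
To tie the steps above together, here is a minimal from-scratch sketch of the build process (illustrative only; it assumes labels encoded as -1/+1 and uses a depth-1 decision tree as the weak learner). In practice, scikit-learn's AdaBoostClassifier implements the same idea.

```python
# A minimal from-scratch AdaBoost sketch (illustrative only; assumes labels in {-1, +1})
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # step 2: uniform initial weights
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # step 3: weak learner on D_m
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # step 4: e_m
        alpha = 0.5 * np.log((1 - err) / err)      # step 5: model weight alpha_m
        w = w * np.exp(-alpha * y * pred)          # step 6: re-weight the samples
        w /= w.sum()                               # step 7: normalize by Z_m
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    f = sum(a * s.predict(X) for s, a in zip(stumps, alphas))  # step 8: f(x)
    return np.sign(f)                                          # step 9: G(x) = sign(f(x))

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
y = 2 * y - 1                                      # encode labels as -1 / +1
stumps, alphas = adaboost_fit(X, y)
print("training accuracy:", np.mean(adaboost_predict(X, stumps, alphas) == y))
```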

3.3 Summary

Advantages:

  1. Can handle both continuous and discrete feature values;
  2. The model is relatively robust;
  3. Strong interpretability and a simple structure.

Drawbacks:
Sensitive to outlier samples; outlier samples may receive higher weights during the iterations, which ultimately affects the model's results.

4 GBDT

Gradient Boosting Decision Tree (GBDT) is also a Boosting algorithm; it differs from AdaBoost as follows:
AdaBoost uses the error of the previous weak learner to update the sample weights and then iterates round by round; GBDT also iterates, but its weak learners are required to be CART regression trees, and during training GBDT requires that the loss of the model's predictions on the samples be as small as possible.

GBDT consists of three parts: DT (Regression Decision Tree), GB (Gradient Boosting) and Shrinkage (decay, i.e., the learning rate).

It is composed of multiple decision trees, and the summed results of all the trees give the final result.

Differences between boosted (iterative) trees and random forests:

Random Forest uses different samples to construct different subtrees, i.e., building the m-th tree is unrelated to the results of the previous m-1 trees.

A boosted tree, by contrast, uses the residuals of the previously built subtrees as the input data for building the next subtree; at prediction time, the subtrees are evaluated in the order they were built and their predictions are summed.

4.1 Algorithm Principle

  1. Given training samples of input vectors X and output variable Y: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n); the goal is to find an approximate function F(X) that minimizes the value of the loss function L(Y, F(X)).
  2. The loss function L generally uses the least-squares loss or the absolute-value loss:
    L(y, F(x)) = (1/2) · (y - F(x))^2   or   L(y, F(x)) = |y - F(x)|
  3. The optimal solution is:
    F*(x) = argmin_F E_{(x,y)}[L(y, F(x))]
  4. Assume F(X) is a weighted sum of a family of optimal basis functions f_i(X):
    F(X) = Σ_{i=1}^{M} β_i · f_i(X)
  5. Extending this with the greedy-algorithm idea to obtain F_m(X) and find the optimal f:
    F_m(X) = F_{m-1}(X) + argmin_f Σ_{i=1}^{n} L(y_i, F_{m-1}(x_i) + f(x_i))
    Because it is still difficult to select the optimal basis function f at each step with the greedy method, gradient descent is used as an approximation; given a constant initial function F_0(X):
    F_0(X) = argmin_c Σ_{i=1}^{n} L(y_i, c)
    the negative gradient (pseudo-residual) is then computed in the direction of steepest descent:
    α_im = -[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F(X) = F_{m-1}(X)
  6. Using the data (x_i, α_im) (i = 1, ..., n), fit a CART regression tree to these residuals, obtaining the m-th tree f_m(X).
  7. Update the model:
    F_m(X) = F_{m-1}(X) + f_m(X)   (in practice a shrinkage / learning-rate factor ν is applied: F_m(X) = F_{m-1}(X) + ν · f_m(X))
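
As a rough illustration of steps 5-7 with the squared-error loss (for which the negative gradient is simply the residual y - F(x)), here is a minimal from-scratch gradient-boosting sketch; the parameter values are illustrative assumptions:

```python
# A minimal from-scratch gradient boosting sketch with squared-error loss
# (illustrative only; for this loss the negative gradient equals the residual y - F(x))
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=100, lr=0.1, max_depth=3):
    F0 = y.mean()                                  # constant initial model F_0(x)
    F = np.full(len(y), F0, dtype=float)
    trees = []
    for m in range(M):
        residual = y - F                           # pseudo-residual (negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                      # fit a CART regression tree to residuals
        F += lr * tree.predict(X)                  # Shrinkage: scale each new tree by lr
        trees.append(tree)
    return F0, trees

def gbdt_predict(X, F0, trees, lr=0.1):            # use the same lr as in gbdt_fit
    return F0 + lr * sum(t.predict(X) for t in trees)

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=42)
F0, trees = gbdt_fit(X, y)
pred = gbdt_predict(X, F0, trees)
print("training MSE:", np.mean((pred - y) ** 2))
```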

4.2 GBDT Regression Algorithm and Classification Algorithm

The only difference between the two is the choice of loss function.

The regression algorithm typically chooses the mean squared error (least squares) or the mean absolute error as the loss function; the classification algorithm generally chooses a logarithmic (log-loss) function to represent the loss.
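
In scikit-learn this difference shows up only in the `loss` parameter; a hedged sketch (the loss names below follow recent scikit-learn versions and may differ in older releases):

```python
# Loss choice in scikit-learn's gradient boosting (loss names follow recent
# scikit-learn versions and may differ in older releases)
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Regression: squared error (least squares) or absolute error
reg = GradientBoostingRegressor(loss="squared_error", n_estimators=100, learning_rate=0.1)

# Classification: log-loss; only the loss function differs from the regressor
clf = GradientBoostingClassifier(loss="log_loss", n_estimators=100, learning_rate=0.1)
```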

4.3 Summary

Advantages of GBDT:

  1. Can handle both continuous and discrete feature values;
  2. Good prediction performance even with relatively little parameter tuning;
  3. The model is relatively robust.

Disadvantages of GBDT:
Because of the dependencies between the weak learners, it is difficult to parallelize the training of the model.


Origin www.cnblogs.com/tankeyin/p/12144312.html