The Past and Present of Boosting Algorithms (Part I)

WeChat Official Account: AIKaggle. Suggestions and criticism are welcome; leave a message on the account if you need resources.




This series of articles traces the development of Boosting algorithms. Starting from the general Boosting framework, it introduces the AdaBoost, GBDT/GBRT, XGBoost, LightGBM, CatBoost, and ThunderGBM algorithms, covering the principles, frameworks, and derivations of the Boosting family. This article, The Past and Present of Boosting Algorithms (Part I), introduces the AdaBoost algorithm and gradient boosting trees; the next article will cover the XGBoost, LightGBM, CatBoost, and ThunderGBM algorithms in detail. Follow the official account AIKaggle for algorithm updates.

1. Introduction

  • Boosting is a family of algorithms that can promote weak learners into a strong learner. Its individual learners depend strongly on one another and must be generated sequentially, making it a representative class of ensemble learning. Because it completes the task by combining multiple learners, such methods are also called multi-classifier systems (multi-classifier system) or committee-based learning (committee-based learning).
  • A Boosting algorithm involves two components: an additive model and the forward stagewise algorithm. The additive model means the strong classifier is obtained as a linear combination of a series of weak classifiers, usually written as:
    \[f_M(x;P) = \sum_{m=1}^{M} \beta_m h(x;a_m)\]
  • Here \(h(x;a_m)\) is a weak classifier, \(a_m\) is the optimal parameter learned for that weak classifier, \(\beta_m\) is the weight of the weak learner within the strong classifier, and \(P\) is the collection of all \(a_m\) and \(\beta_m\). Adding up these weighted weak classifiers yields the strong classifier.
  • The forward stagewise algorithm means that during training, the classifier produced in the next iteration is trained on top of the result of the previous round, i.e. it can be written as:
    \[f_m(x) = f_{m-1}(x) + \beta_m h_m(x;a_m)\]
  • Because different loss functions can be used, there are different kinds of Boosting algorithms; AdaBoost is the Boosting algorithm whose loss function is the exponential loss. A minimal sketch of the generic forward stagewise loop follows this list.
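To make the forward stagewise idea concrete, here is a minimal sketch of the generic loop, assuming hypothetical `fit_weak_learner` and `choose_weight` helpers (not part of the original article); each concrete Boosting algorithm fills these in according to its own loss function.

```python
# Minimal sketch of forward stagewise additive modeling (illustrative only).
# `fit_weak_learner` and `choose_weight` are hypothetical placeholders; each
# concrete Boosting algorithm defines them via its own loss function.

def forward_stagewise(X, y, M, fit_weak_learner, choose_weight):
    """Build f_M(x) = sum_m beta_m * h_m(x) one weak learner at a time."""
    ensemble = []                      # list of (beta_m, h_m) pairs

    def f(x):                          # current additive model f_{m-1}
        return sum(beta * h(x) for beta, h in ensemble)

    for m in range(M):
        h_m = fit_weak_learner(X, y, f)        # fit h_m against the current model
        beta_m = choose_weight(X, y, f, h_m)   # pick its coefficient beta_m
        ensemble.append((beta_m, h_m))         # f_m = f_{m-1} + beta_m * h_m
    return f
```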

2. AdaBoost

  • We now describe the AdaBoost algorithm. Assume we are given a training data set for binary classification

\[T = \{(x_1,y_1), (x_2,y_2),..., (x_N,y_N)\}\]

  • where each sample consists of an instance and a label: the instance \(x_i \in X \subset R^n\) and the label \(y_i \in Y = \{-1,+1\}\), with \(X\) the instance space and \(Y\) the label set. AdaBoost uses the following algorithm to learn a series of weak classifiers (base classifiers) from the training data and combine them linearly into a strong classifier.

2.1 AdaBoost algorithm framework

Input: training data set \(T = \{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}\) and a weak learning algorithm, where \(x_i \in X \subset R^n\) and \(y_i \in Y = \{-1,+1\}\)
Process:

(1) Initialize the weight distribution of the training data
\[D_1 = (w_{11},...,w_{1i},...,w_{1N}), \quad w_{1i} = \frac{1}{N}, \quad i = 1,2,...,N\]

(2) For \(m = 1,2,...,M\):
(2.a) Learn a base classifier from the training data weighted by distribution \(D_m\):
\[G_m(x): X \rightarrow \{-1,+1\}\]
(2.b) Compute the classification error rate of \(G_m(x)\) on the training data set:
\[e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i)\]
(2.c) Compute the coefficient of \(G_m(x)\):
\[\alpha_m = \frac12 \log \frac{1-e_m}{e_m}\]
Here \(\log\) is the natural logarithm.
(2.d) Update the weight distribution of the training data:
\[D_{m+1} = (w_{m+1,1},...,w_{m+1,i},...,w_{m+1,N})\]
\[w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp(-\alpha_m y_i G_m(x_i)), \quad i = 1,...,N\]
Here \(Z_m\) is the normalization factor, which makes \(D_{m+1}\) a probability distribution:
\[Z_m = \sum_{i=1}^{N} w_{mi} \exp(-\alpha_m y_i G_m(x_i))\]

(3) Construct the linear combination of the base classifiers:
\[f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)\]
to obtain the final classifier
\[G(x) = sign(f(x)) = sign(\sum_{m=1}^{M} \alpha_m G_m(x))\]
Output: the final classifier \(G(x)\)
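As one possible from-scratch illustration of steps (1)–(3), assuming scikit-learn decision stumps play the role of the base classifiers \(G_m\) and an arbitrary number of rounds `M` (these choices are mine, not the article's):

```python
# Minimal AdaBoost sketch following steps (1)-(3) above.
# Assumes scikit-learn is available; depth-1 decision trees (stumps) act as the
# base classifiers G_m. Illustrative, not production code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """X: (N, d) array, y: labels in {-1, +1}. Returns a list of (alpha_m, G_m)."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)            # (1) D_1: uniform weight distribution
    ensemble = []
    for m in range(M):
        # (2.a) fit the base classifier on the weighted data
        G_m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = G_m.predict(X)
        # (2.b) weighted classification error e_m
        e_m = np.sum(w * (pred != y))
        e_m = np.clip(e_m, 1e-10, 1 - 1e-10)   # avoid division by zero / log(0)
        # (2.c) coefficient alpha_m = 1/2 * ln((1 - e_m) / e_m)
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)
        # (2.d) update and renormalize the weight distribution D_{m+1}
        w = w * np.exp(-alpha_m * y * pred)
        w /= w.sum()                   # divide by the normalization factor Z_m
        ensemble.append((alpha_m, G_m))
    return ensemble

def adaboost_predict(ensemble, X):
    # (3) G(x) = sign(sum_m alpha_m * G_m(x))
    f = sum(alpha * G.predict(X) for alpha, G in ensemble)
    return np.sign(f)
```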

2.2 Reflection

  1. What kind of weak classifier can be used as the base classifier? In other words, what are the requirements when choosing a base classifier?
  2. Why is the coefficient of the base classifier chosen as \[\alpha_m = \frac12 \log \frac{1-e_m}{e_m}\]?

3. Gradient Boosting Trees (GBDT/GBRT)

  • Boosting methods that use decision trees as the base learner are called boosting trees. For classification problems the decision tree is a binary classification tree; for regression problems it is a binary regression tree. The boosting tree model can be represented as an additive model of decision trees:
    \[f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)\]
  • Here \(T(x;\Theta_m)\) denotes a decision tree, \(\Theta_m\) its parameters, and \(M\) the number of trees.
  • For binary classification, the boosting tree algorithm is simply the AdaBoost algorithm with the base classifiers restricted to binary classification trees.

3.1 The gradient boosting tree algorithm framework

Input: training data set \(T = \{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}\) and loss function \(L(y, f(x))\), where \(x_i \in X \subset R^n\) and \(y_i \in Y \subset R\)
Process:

(1) Initialize \(f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)\)

(2) For \(m = 1,2,...,M\):
(2.a) For \(i = 1,2,...,N\), compute the negative gradient \(r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}\)
(2.b) Fit a regression tree to the values \(r_{mi}\), obtaining the leaf regions of the \(m\)-th tree, \(R_{mj}, j = 1,...,J\)
(2.c) For \(j = 1,2,...,J\), compute:
\[c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)\]
(2.d) Update \(f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})\)

(3) Obtain the regression tree \(\hat f(x) = f_M(x) = \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})\)
Output: the regression tree \(\hat f(x)\)
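As an illustration, here is a minimal sketch of steps (1)–(3) for the squared-error loss \(L(y,f) = \frac12(y-f)^2\), where the negative gradient \(r_{mi}\) reduces to the residual \(y_i - f_{m-1}(x_i)\); the use of scikit-learn regression trees and the depth and learning-rate settings are assumptions made for this example, not prescriptions from the article.

```python
# Minimal gradient boosting sketch for squared-error loss (illustrative only).
# With L(y, f) = (y - f)^2 / 2, the negative gradient r_mi is the residual
# y_i - f_{m-1}(x_i); fitting a regression tree to these residuals already
# produces the optimal leaf constants c_mj (the leaf means), so step (2.c)
# is handled by the tree itself.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, M=100, max_depth=3, learning_rate=0.1):
    f0 = np.mean(y)                            # (1) f_0(x) = argmin_c sum_i L(y_i, c)
    f = np.full_like(y, f0, dtype=float)       # current predictions f_{m-1}(x_i)
    trees = []
    for m in range(M):
        r = y - f                              # (2.a) negative gradient = residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # (2.b)/(2.c)
        f += learning_rate * tree.predict(X)   # (2.d) update f_m
        trees.append(tree)
    return f0, trees

def gbrt_predict(f0, trees, X, learning_rate=0.1):
    # (3) f_M(x) = f_0 + sum_m sum_j c_mj * I(x in R_mj)
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```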

3.2 Reflection:

  1. In the boosting tree model, how should the weight of each tree be understood? Is every tree assigned the same weight?

3.3 Common loss functions in GBDT

3.3.1 Classification Algorithm

For classification, the loss function is generally one of two kinds, the exponential loss or the logarithmic loss:

  1. Exponential loss, with loss function
    \[L(y, f(x)) = \exp(-yf(x))\]
  2. Logarithmic loss, which comes in binary and multi-class versions; the binary case and its negative gradient are sketched below.
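As an illustration of the binary case under one common convention (labels \(y \in \{-1,+1\}\); this convention is an assumption, since the original text does not spell it out), the logarithmic loss can be written as
\[L(y, f(x)) = \log(1+\exp(-yf(x)))\]
and its negative gradient, the fitting target \(r_{mi}\) in step (2.a) of the algorithm above, is
\[r(y_i, f(x_i)) = \frac{y_i}{1+\exp(y_i f(x_i))}\]
The multi-class case is handled analogously, with one score function per class combined through a softmax.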
3.3.2 Regression Algorithm

For regression, the following four loss functions are commonly used:

  1. Squared error, the most common regression loss function:
    \[L(y, f(x)) = (y-f(x))^2\]
  2. Absolute loss, which is also common:
    \[L(y, f(x)) = |y-f(x)|\]
    with corresponding negative gradient
    \[sign(y_i - f(x_i))\]
  3. Huber loss, a compromise between the squared error and the absolute loss: it uses the absolute loss for points far from the center (outliers) and the squared error for points near the center. The threshold is usually set using a quantile of the residuals. The loss function is:

\[L(y,f(x))= \begin{cases} \frac12(y-f(x))^2, & |y-f(x)|\leq\delta \\ \delta(|y-f(x)|-\frac{\delta}{2}), & |y-f(x)|>\delta \end{cases}\]

The corresponding negative gradient is:

\[r(y_i, f(x_i)) = \begin{cases} y_i -f(x_i), & |y_i - f(x_i)|\leq \delta \\ \delta\, sign(y_i-f(x_i)), & |y_i-f(x_i)|>\delta \end{cases}\]

  4. Quantile loss, the loss function used in quantile regression. Its expression is

\[L(y, f(x)) = \sum_{y\geq f(x)} \theta|y-f(x)|+\sum_{y<f(x)}(1-\theta)|y-f(x)|\]

where \(\theta\) is the quantile, which must be specified before regression. The corresponding negative gradient is:

\[r(y_i, f(x_i)) = \begin{cases} \theta, & y_i\geq f(x_i) \\ \theta -1, & y_i< f(x_i) \end{cases}\]

Huber loss and quantile loss are mainly used for robust regression, that is, to reduce the influence of outliers on the loss function; a short sketch of their negative gradients follows.
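As a small sketch (my own, with arbitrary example values for \(\delta\) and \(\theta\)), the negative gradients above translate directly into vectorized code:

```python
# Negative gradients (pseudo-residuals) for the robust regression losses above.
# Illustrative sketch; delta and theta are arbitrary example values.
import numpy as np

def huber_negative_gradient(y, f, delta=1.0):
    """r_i = y_i - f(x_i) if |y_i - f(x_i)| <= delta, else delta * sign(y_i - f(x_i))."""
    diff = y - f
    return np.where(np.abs(diff) <= delta, diff, delta * np.sign(diff))

def quantile_negative_gradient(y, f, theta=0.9):
    """r_i = theta if y_i >= f(x_i), else theta - 1."""
    return np.where(y >= f, theta, theta - 1.0)

# Example: these pseudo-residuals replace r_mi in step (2.a) of the GBDT loop.
y = np.array([1.0, 2.0, 10.0])
f = np.array([1.5, 2.2, 3.0])
print(huber_negative_gradient(y, f))     # large residuals are clipped to +/- delta
print(quantile_negative_gradient(y, f))  # theta or theta - 1 per sample
```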

3.4 Regularization of gradient boosting trees

As with AdaBoost, GBDT also needs to be regularized to prevent overfitting. There are three main ways to regularize GBDT.

  • The first is a regularization term similar to AdaBoost's: the step size (learning rate), denoted \(v\). For the iteration over weak learners

\[f_k(x) = f_{k-1}(x)+h_k(x)\]

adding the regularization term gives

\[f_k(x) = f_{k-1}(x)+v\,h_k(x)\]

where \(v\) takes values in \(0 < v \leq 1\). For the same training set, a smaller \(v\) means more weak learner iterations are needed to reach the same fit. Usually the step size and the maximum number of iterations are tuned together to control the fitting behavior of the algorithm.

  • The second kind of regularization is the subsampling ratio (subsample), which takes values in \((0,1]\). Note that this subsampling differs from that of random forests: random forests sample with replacement, while here the sampling is without replacement. If the ratio is 1, all samples are used and no subsampling takes place. If the ratio is less than 1, only that fraction of the samples is used to fit each tree in GBDT. A ratio below 1 reduces variance, i.e. prevents overfitting, but increases bias, so the value should not be too low; a value in \([0.5, 0.8]\) is recommended.

GBDT with subsampling is sometimes called stochastic gradient boosting (Stochastic Gradient Boosting Tree, SGBT). Because of subsampling, the sampled fitting work can be distributed across different tasks during the boosting iterations and then combined into the new tree, which partially mitigates the weakness that weak learners are hard to train in parallel.

  • The third is regularization of the weak learner itself, i.e. pruning of the CART regression trees; refer to the principles of decision trees, which are not expanded on here. A short scikit-learn sketch of all three knobs follows this list.
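To tie the three regularization methods together, here is a minimal usage sketch with scikit-learn's GradientBoostingRegressor, whose learning_rate, subsample, and tree-size parameters correspond to the step size, the subsampling ratio, and the complexity of the weak learner respectively; the specific values are arbitrary examples, not recommendations from the article.

```python
# Sketch: the three GBDT regularization knobs in scikit-learn's implementation.
# Parameter values are arbitrary examples for illustration.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=300,      # maximum number of boosting iterations M
    learning_rate=0.1,     # step size v (a smaller v usually needs a larger M)
    subsample=0.7,         # subsampling ratio in (0, 1]; < 1 gives SGBT-style boosting
    max_depth=3,           # limits tree complexity (regularization of the weak learner)
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))   # R^2 on the training data
```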

