Data Mining Algorithms and Practice (18): Ensemble Learning Algorithms (Boosting, Bagging)

Previously, we mainly covered how to understand and use individual machine learning algorithms. In real scenarios, ensemble learning, that is, combining multiple learners, is often used to achieve the best results. There are two families of ensemble learning: Boosting and Bagging. The former builds a strong learner from multiple weak learners trained serially (GBDT, XGBoost, LightGBM); the latter builds the model by letting multiple decision trees trained in parallel vote (random forest, RF). Competitions generally use ensemble learning directly, because it guarantees model performance to the greatest extent;

Real scenarios and competitions generally use ensemble algorithms directly, because a single algorithm, like a single decision maker in real life, is prone to errors. Ensemble algorithms minimize the final model error by sampling data/features or by strategies that deliberately drive down the error rate. They fall into two types: ① the weak learners have strong dependencies on each other and must be generated serially, represented by Boosting; ② the weak learners have no strong dependencies and can be generated in parallel, represented by Bagging and Random Forest. For details, refer to the sklearn Chinese community's write-up on ensemble learning algorithms.

1. Bagging and random forest

Bagging (bootstrap aggregating) is ensemble learning that trains weak learners in parallel on sampled subsets of the data, typically minimizing the mean squared error as the objective. After training a series of weak learners, a combination strategy (voting or averaging) forms the final model. Random forest builds parallel decision trees by sampling data and features with replacement, yielding a model with higher accuracy. The algorithm flow is as follows:

Bagging uses random sampling (bootstrap) to draw m samples with replacement from the training set, and repeats this T times, so the m samples drawn in each round differ. For any single sample, the probability of being selected in one draw is 1/m and the probability of not being selected is 1 − 1/m, so the probability of never being selected in m draws is (1 − 1/m)^m; as m → ∞, (1 − 1/m)^m → 1/e ≈ 0.368. In other words, in each round of Bagging's random sampling, about 36.8% of the training data is never drawn. This unsampled portion is called the Out Of Bag (OOB) data. Since it does not participate in fitting the model, it can be used to assess the model's generalization ability.
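To make the 36.8% figure concrete, here is a small sketch (added for illustration, assuming only numpy) that simulates one round of bootstrap sampling and measures the out-of-bag fraction:

# Simulate one bootstrap round and measure the out-of-bag (OOB) fraction
import numpy as np

rng = np.random.default_rng(0)
m = 10000                           # training-set size (arbitrary for the demo)
drawn = rng.integers(0, m, size=m)  # m draws with replacement
oob = np.ones(m, dtype=bool)
oob[drawn] = False                  # anything drawn at least once is not out-of-bag
print(f"OOB fraction: {oob.mean():.3f}")  # close to 1/e ≈ 0.368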

Like AdaBoost, Bagging places no restriction on the type of weak learner; decision trees and neural networks are commonly used. Bagging's combination strategy is also relatively simple: for classification, simple voting is used and the category with the most votes (or one of them, in case of a tie) is the final output; for regression, simple averaging is used and the arithmetic mean of the T weak learners' predictions is the final output. Because Bagging trains each model on a sample of the data, it generalizes well and is very effective at reducing the variance of the model.
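The sketch below (added here for illustration, assuming scikit-learn and synthetic data) bags decision trees for a classification problem, so both the voting strategy and the OOB estimate described above show up directly:

# Bagging of decision trees with an out-of-bag estimate of generalization
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
bag = BaggingClassifier(   # the default base learner is a decision tree
    n_estimators=50,       # T = 50 weak learners trained on bootstrap samples
    oob_score=True,        # score on the ~36.8% of samples each tree never saw
    random_state=42,
)
bag.fit(X, y)
print("OOB score:", bag.oob_score_)   # class predictions are combined by voting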

Random Forest (RF) is an evolution of Bagging. First, like GBDT, it uses CART decision trees as weak learners. Second, at each split RF only considers a random subset of the features and picks the best of them (by Gini index) as the splitting feature. Apart from these two points, RF is no different from the ordinary Bagging algorithm;
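A minimal random-forest sketch (again an added illustration with scikit-learn and synthetic data) that exposes the two points above, many parallel trees and a random feature subset at each split:

# Random forest: parallel CART trees, each split chosen from a random feature subset
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees grown in parallel
    max_features="sqrt",   # size of the random feature subset tried at each split
    oob_score=True,        # estimate generalization from out-of-bag data
    n_jobs=-1,             # build the trees in parallel
    random_state=42,
)
rf.fit(X, y)
print("OOB score:", rf.oob_score_)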

2. Boosting and GBDT, XGBoost, LightGBM

The Boosting strategy first trains weak learner 1 with the initial sample weights, then updates the training-sample weights according to that learner's error rate, so that samples the previous weak learner got wrong receive higher weights and more attention from later weak learners. The next weak learner is then trained on the reweighted training set. This repeats until the number of weak learners reaches a pre-specified number T, and finally the T weak learners are combined with a set strategy into the final strong learner. Any Boosting algorithm therefore has to answer four questions: ① How is the learning error rate calculated? ② How is each weak learner's weight coefficient obtained? ③ How are the sample weights updated? ④ Which combination strategy is used?
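AdaBoost is the classic concrete answer to these four questions: the error rate is the weighted misclassification rate, each learner's weight is a function of its error rate (lower error, larger weight), misclassified samples are re-weighted upward, and the combination is a weighted vote. A minimal sketch (added for illustration, assuming scikit-learn and synthetic data):

# AdaBoost: serial boosting of decision stumps with sample re-weighting
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
ada = AdaBoostClassifier(   # the default weak learner is a depth-1 decision stump
    n_estimators=100,       # the pre-specified number T of weak learners
    learning_rate=0.5,      # shrinks each weak learner's weight coefficient
    random_state=42,
)
ada.fit(X, y)
print("training accuracy:", ada.score(X, y))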

GBDT is an iterative algorithm that uses decision trees as its learners. Note that the trees in GBDT are regression trees, not classification trees. "Boost" means "boosting": a Boosting algorithm is an iterative process in which each round of training improves on the previous one. The core of GBDT is that each tree learns the residual of the sum of all previous trees' conclusions; the residual is the amount that, added to the current prediction, gives the true value. For example, if A's true age is 18 and the first tree predicts 12, the difference, i.e., the residual, is 6. In the second tree we therefore set A's target to 6 and learn again. If the second tree really can place A in the 6-year-old leaf node, then adding the two trees' conclusions gives A's true age; if the second tree's conclusion is 5, A still has a residual of 1, so in the third tree A's target becomes 1, and learning continues;
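The residual-fitting loop can be written out directly. The sketch below (added for illustration with scikit-learn and numpy, using squared-error loss and a learning rate of 1) fits each new regression tree to whatever the previous trees have not yet explained:

# GBDT core idea: each tree fits the residual left by the sum of the previous trees
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

prediction = np.zeros_like(y)      # start from a prediction of 0
trees = []
for _ in range(10):                # 10 boosting rounds
    residual = y - prediction      # what is still unexplained (e.g. 18 - 12 = 6)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    prediction += tree.predict(X)  # the trees' conclusions are summed

print("mean squared error:", np.mean((y - prediction) ** 2))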

Reference: detailed description

XGBoost, from the University of Washington, improves on GBDT in several ways: ① it adds regularization and pruning to control model complexity; ② it widens the choice of base learners; ③ it supports parallelism;

# XGBoost parameters
params = {
    'booster': 'gbtree',          # tree-based booster
    'min_child_weight': 100,      # minimum sum of instance weight (hessian) needed in a child
    'eta': 0.02,                  # learning rate (step-size shrinkage)
    'colsample_bytree': 0.7,      # fraction of features sampled for each tree
    'max_depth': 12,              # maximum tree depth
    'subsample': 0.7,             # fraction of training samples used for each tree
    'alpha': 1,                   # L1 regularization term on weights
    'gamma': 1,                   # minimum loss reduction required to make a split
    'silent': 1,                  # deprecated in newer versions (use 'verbosity')
    'objective': 'reg:linear',    # squared-error regression ('reg:squarederror' in newer versions)
    'verbose_eval': True,         # normally passed to xgb.train() rather than placed in params
    'seed': 12                    # random seed
}
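A hypothetical training call using the dictionary above (assuming the xgboost package and synthetic data; note that 'reg:linear' and 'silent' are older aliases, and newer versions prefer 'reg:squarederror' and 'verbosity'):

# Hypothetical usage of the parameter dict above on synthetic regression data
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dvalid, 'valid')],
    early_stopping_rounds=50,   # stop once the validation metric stops improving
)
pred = model.predict(dvalid)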

LightGBM is Microsoft's open-source algorithm and an improved version of XGBoost. XGBoost's main shortcomings are: ① each iteration has to traverse the entire training data multiple times; loading all of it into memory limits the size of the training data, while not loading it costs a lot of time in repeatedly reading and writing the data; ② the pre-sorted split-finding method consumes a lot of time and space;

LightGBM addresses this with histogram-based split finding and sparse-feature optimizations, which reduce parallel communication overhead and lower the complexity of data splitting;

# LightGBM parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',   # type of boosting
    'objective': 'regression', # objective function
    'metric': {'l2', 'auc'},   # evaluation metrics
    'num_leaves': 31,          # number of leaves per tree
    'learning_rate': 0.05,     # learning rate
    'feature_fraction': 0.9,   # fraction of features used to build each tree
    'bagging_fraction': 0.8,   # fraction of samples used to build each tree
    'bagging_freq': 5,         # k: perform bagging every k iterations
    'verbose': 1               # <0: fatal only, =0: errors (warnings), >0: info
}
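And a matching hypothetical training call (assuming the lightgbm package and synthetic data; a 0/1 target is used so the 'auc' metric stays meaningful alongside the regression objective):

# Hypothetical usage of the parameter dict above on synthetic data
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.2 * rng.normal(size=500) > 0).astype(float)  # noisy 0/1 target

train_set = lgb.Dataset(X[:400], label=y[:400])
valid_set = lgb.Dataset(X[400:], label=y[400:], reference=train_set)

model = lgb.train(
    params,
    train_set,
    num_boost_round=200,
    valid_sets=[valid_set],   # l2 and auc are evaluated on this set
)
pred = model.predict(X[400:])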

Origin blog.csdn.net/yezonggang/article/details/112675370