机器学习-LightGBM算法分类器-附python代码

LightGBM官方地址：
https://lightgbm.readthedocs.io/en/v3.3.2/

1.1 原理解释

LightGBM，轻量的梯度提升机(Light Gradient Boosting Machine)，由微软开源的一款SOTA Boosting算法框架。LightGBM与XGBoost 算法类似，其基本思想都是对所有特征都按照特征的数值进行排序，找到一个特征上的最好分割点，将数据分裂成左右子节点。

两种算法都有很多的优点，比如更快的训练效率、更高的准确率、支持并行化学习、大规模数据的处理等，但XGBOOST也有一些明显的缺点，如在选择树的分隔节点时，需要遍历所有的特征值，计算量大，内存占用量也大，还有易产生过拟合等。

而LightGBM基于此还有几个方面的优化：

通过Leaf-wise策略选择分裂收益最大的节点进行分裂，如此递归进行，但也易过拟合，需限制树的深度。
引入Hitogram算法，通过遍历直方图就行了，减少了分裂点数量，减少了计算量，也降低了内存的占用。
通过GOSS、EFB等算法，减少样本的数量和特征的数量等。
直接支持类别特征，在对离散特征分裂时，每个取值都当作一个桶，分裂时的增益算的是判断是否属于某个分类的gain。

1.2 Parameters

核心参数

(1) task, default=train, type=enum, options=train, predict, convert_model

任务

(2) boosting, default=gbdt, type=enum, options=gbdt, rf, dart, goss, alias=boost, boosting_type

设置提升类型

(3) objective, default=regression, type=enum, options=regression, regression_l1, huber, fair, poisson, quantile, quantile_l2, binary, multiclass, multiclassova, xentropy, xentlambda, lambdarank, alias=objective, app , application

目标函数，默认回归

(4) num_leaves, default=31, type=int, alias=num_leaf

一棵树上的叶子数，控制了叶节点的数目，控制树模型复杂度的主要参数。

(5) learning_rate, default=0.1, type=double, alias=shrinkage_rate，

学习率

(6) num_iterations, default=100, type=int, alias=num_iteration, num_tree, num_trees, num_round, num_rounds, num_boost_round

boosting 的迭代次数

用于控制模型学习过程的参数

(1) max_depth, default=-1, type=int

树的最大深度限制，防止过拟合

(2) min_data_in_leaf, default=20, type=int, alias=min_data_per_leaf , min_data, min_child_samples

一个叶子上数据的最小数量. 可以用来处理过拟合.

(3) min_sum_hessian_in_leaf, default=1e-3, type=double, alias=min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight

一个叶子上的最小 hessian 和. 类似于 min_data_in_leaf, 可以用来处理过拟合.

(4) min_data_in_leaf, default=20, type=int

叶子节点最小样本数，防止过拟合

(5) feature_fraction, default=1.0, type=double, 0.0 < feature_fraction < 1.0

随机选择特征比例，加速训练及防止过拟合

(6) feature_fraction_seed, default=2, type=int

随机种子数，保证每次能够随机选择样本的一致性

(7) bagging_fraction, default=1.0, type=double

类似随机森林，它将在不进行重采样的情况下随机选择部分数据，可以用来加速训练，也可以用来处理过拟合

(8) bagging_freq, default=0, type=int, alias=subsample_freq

bagging 的频率, 0 意味着禁用 bagging. k 意味着每 k 次迭代执行bagging，为了启用 bagging, bagging_fraction 设置适当

(9) early_stopping_round, default=0, type=int, alias=early_stopping_rounds, early_stopping

如果一个验证集的度量在 early_stopping_round 循环中没有提升, 将停止训练

(10) lambda_l1, default=0, type=double

L1正则

(11) lambda_l2, default=0, type=double

L2正则

(12) min_split_gain, default=0, type=double

最小切分的信息增益值

(13) top_rate, default=0.2, type=double

大梯度树的保留比例

(14) other_rate, default=0.1, type=int

小梯度树的保留比例

(15) min_data_per_group, default=100, type=int

每个分类组的最小数据量

(16) max_cat_threshold, default=32, type=int

分类特征的最大阈值

IO 参数

(1) max_bin, default=255, type=int

工具箱的最大数特征值决定了容量工具箱的最小数特征值可能会降低训练的准确性, 但是可能会增加一些一般的影响（处理过度学习）
LightGBM 将根据 max_bin 自动压缩内存。例如, 如果 maxbin=255, 那么 LightGBM 将使用 uint8t 的特性值

(2) save_binary, default=false, type=bool, alias=is_save_binary, is_save_binary_file

如果设置为 true LightGBM 则将数据集(包括验证数据)保存到二进制文件中。可以加快数据加载速度。

(3) verbosity, default=1, type=int, alias=verbose

一般默认是1，显示信息。# <0 显示致命的, =0 显示错误 (警告), >0 显示信息

度量参数

(1) metric, default={l2 for regression}, {binary_logloss for binary classification}, {ndcg for lambdarank}, type=multi-enum, options=l1, l2, ndcg, auc, binary_logloss, binary_error …

评估函数，二分类问题多用l2, auc, binary_logloss等

-

由于LightGBM参数众多，本文只是列举一些常用重要的参数，其他更多详细参数见官方文档。

-

3.代码示例

class lightGbmModel():
    def __int__(self):
        pass
    def loadData(self):
        # load and describle dataset
        initial_file = pd.read_excel('D:/program/card_clients.xls', engine='xlrd', skiprows = [0], index_col=[0])
        print(initial_file.head(), initial_file.describe(), initial_file.info())
        # split train and test
        x = initial_file.drop('default payment next month', axis=1)
        y = initial_file['default payment next month']
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3, random_state = 123)
        return x_train, x_test, y_train, y_test

    def trainModel(self,x_train, x_test, y_train, y_test):
        # training model
        lgb_train = lgb.Dataset(x_train, y_train)
        lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)
        evals_result = {
    
    }
        params = {
    
    
            'task': 'train',
            'boosting_type': 'gbdt', 
            'objective': 'binary', 
            'metric': {
    
    'l2', 'auc', 'binary_logloss'}, 
            'num_leaves': 50, 
            'max_depth': 6,
            'max_bin': 200,
            'learning_rate': 0.01,
            'feature_fraction': 1,
            'bagging_fraction': 0.7, 
            'bagging_freq': 5,
            'verbose': 1
        }
        lgbm = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=lgb_eval, evals_result=evals_result, early_stopping_rounds=100)
        joblib.dump(lgbm, 'lgbm_model.pkl')
        return lgbm, evals_result

    def plog_importance(lgbm):
        # plot feature importance
        fig, ax = plt.subplots(figsize=(5, 10))
        lgb.plot_importance(lgbm, max_num_features=10, ax=ax, importance_type='split')
        plt.show()

    def proba_to_label(self, y_pred):
        for i in range(0, len(y_pred)):
            if y_pred[i] >= 0.5:
               y_pred[i] = 1
            else:
               y_pred[i] = 0
        return y_pred

    def predict_metrics_score(self, y_true, y_pred):
        accuracy = metrics.accuracy_score(y_true, y_pred)
        auc = metrics.roc_auc_score(y_true, y_pred)
        return accuracy, auc

    def predict_evaluate(self):
        x_train, x_test, y_train, y_test = self.loadData()
        lgbm, evals_result = self.trainModel(x_train, x_test, y_train, y_test)
        # predict test
        y_train_pred = lgbm.predict(x_train, num_iteration=lgbm.best_iteration)
        y_test_pred = lgbm.predict(x_test, num_iteration=lgbm.best_iteration)
        # evaluate model
        train_result = self.predict_metrics_score(y_train, self.proba_to_label(y_train_pred))
        test_result = self.predict_metrics_score(y_test, self.proba_to_label(y_test_pred))
        return train_result, test_result

结果展示
在这里插入图片描述

4.调参小技巧：

(1) 提高准确率的参数：

使用更大的训练数据，数据越多误差越小，但达到一定水平会收敛
使用较大的 num_leaves （可能导致过拟合）
使用较大的 max_bin （学习速度可能变慢）
使用较大的 num_iterations（计算可能变慢）
使用较小的 learning_rate （计算可能变慢）

(2) 提高训练速度的参数：

使用较小的 max_bin
使用较小的 feature_fraction
使用较小的bagging_fraction
save_binary设置为 true

(3) 缓解过拟合的参数：

使用较小的 max_depth
使用较小的 max_bin
使用较小的 num_leaves
使用较小的 feature_fraction
使用较小的bagging_fraction
使用 min_data_in_leaf 和 min_sum_hessian_in_leaf
使用 lambda_l1, lambda_l2 和 min_gain_to_split 来使用正则