Machine Learning Series: LightGBM Visual Parameter Tuning

Hello everyone. In 100 Days to Get Machine Learning | Day63 Completely Master LightGBM, I introduced the principles behind LightGBM along with a minimal example. Recently I found that Huggingface and Streamlit work quite well together, so I built a simple LightGBM visual parameter-tuning tool, aimed at helping everyone understand LightGBM more deeply.

URL:

huggingface.co/spaces/beihai/LightGBM-parameter-tuning

I only exposed a handful of parameters for now; adjusting them shows the changes in the model's evaluation metrics in real time. The code is included in this article, and if you have good ideas for improving it, please leave a comment. The implementation process is described in detail below.

Parameters of LightGBM

Once a model is built, its performance has to be evaluated, and the model's parameters, features, or algorithm adjusted according to the evaluation results until they are satisfactory.

LightGBM has core parameters, learning control parameters, IO parameters, objective parameters, metric parameters, network parameters, GPU parameters, and model parameters. The ones I modify most often are the core parameters, learning control parameters, and metric parameters.

Control parameters (meaning and usage; a short code sketch follows the list):

max_depth: maximum depth of a tree. When the model overfits, consider reducing max_depth first.
min_data_in_leaf: minimum number of records a leaf may have. Default 20; raise it to combat overfitting.
feature_fraction: fraction of features randomly selected to build each tree; 0.8 means 80% of the features are sampled at each iteration. Used when boosting is set to random forest.
bagging_fraction: fraction of the data used at each iteration. Used to speed up training and reduce overfitting.
early_stopping_round: training stops if a metric on the validation data has not improved in the most recent early_stopping_round rounds. Speeds up experiments and avoids excessive iterations.
lambda: regularization strength, typically 0~1.
min_gain_to_split: minimal gain required to perform a split. Controls the number of useful splits in a tree.
max_cat_group: finds split points on category-group boundaries. With many categories, finding split points directly tends to overfit.
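
To make the table concrete, here is a minimal sketch (not from the original article) of passing these control parameters to lgb.train; the values are illustrative, not recommendations:

import lightgbm as lgb

# Illustrative anti-overfitting settings drawn from the table above
params = {
    'objective': 'binary',
    'max_depth': 6,             # lower this first when overfitting
    'min_data_in_leaf': 20,     # default 20; raise to fight overfitting
    'feature_fraction': 0.8,    # sample 80% of the features per iteration
    'bagging_fraction': 0.8,    # sample 80% of the data per iteration
    'bagging_freq': 5,          # bagging_fraction only takes effect when this is > 0
    'lambda_l2': 0.5,           # L2 regularization
    'min_gain_to_split': 0.1,   # minimal gain required to split
}
# Assuming lgb_train / lgb_eval Datasets exist, early stopping works like this:
# gbm = lgb.train(params, lgb_train, valid_sets=lgb_eval,
#                 callbacks=[lgb.early_stopping(stopping_rounds=50)])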

Core parameters (meaning and usage; a sketch follows the list):

task: what the data is used for; choose train or predict.
application: what the model is used for; regression for regression, binary for binary classification, multiclass for multi-class classification.
boosting: the algorithm to use; gbdt, rf (random forest), dart (Dropouts meet Multiple Additive Regression Trees), or goss (Gradient-based One-Side Sampling).
num_boost_round: number of boosting iterations, usually 100+.
learning_rate: how much each tree's output contributes to the final prediction (the shrinkage rate). Common values: 0.1, 0.001, 0.003…
num_leaves: maximum number of leaves in one tree. Default 31.
device: cpu or gpu.
metric: mae (mean absolute error), mse (mean squared error), binary_logloss (loss for binary classification), multi_logloss (loss for multi-class classification).
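
As a quick sketch of these core parameters in use (values are assumptions; adjust to your task):

# Core-parameter sketch for a binary classification task
core_params = {
    'objective': 'binary',       # 'application' is an alias of 'objective'
    'boosting': 'gbdt',          # or 'rf', 'dart', 'goss'
    'learning_rate': 0.1,
    'num_leaves': 31,            # default
    'metric': 'binary_logloss',
    'device': 'cpu',
}
# gbm = lgb.train(core_params, lgb_train, num_boost_round=100)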

Tuning directions for faster speed, better accuracy, and less over-fitting (translated into example presets in the sketch after this list):

For faster speed: use a smaller max_bin; use feature_fraction for sub-sampling; use bagging_fraction with bagging_freq; use save_binary to speed up data loading; use parallel learning.
For better accuracy: use a larger max_bin; use a larger num_leaves; use more training data; use categorical features directly; try dart; use a larger num_iterations with a smaller learning_rate.
Against over-fitting: use a smaller max_bin; use a smaller num_leaves; use feature_fraction; set bagging_fraction and bagging_freq; use more training data; use min_data_in_leaf and min_sum_hessian_in_leaf; use lambda_l1, lambda_l2 and min_gain_to_split for regularization; use max_depth to limit tree depth.
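
A rough translation of these directions into illustrative parameter presets (values are assumptions, not tuned recommendations):

# Illustrative presets; values are assumptions, not recommendations
faster_speed     = {'max_bin': 63, 'feature_fraction': 0.8,
                    'bagging_fraction': 0.8, 'bagging_freq': 5}
better_accuracy  = {'max_bin': 511, 'num_leaves': 63,
                    'num_iterations': 1000, 'learning_rate': 0.01}
less_overfitting = {'max_bin': 63, 'num_leaves': 15, 'max_depth': 6,
                    'min_data_in_leaf': 50, 'min_sum_hessian_in_leaf': 1e-2,
                    'lambda_l1': 0.1, 'lambda_l2': 0.1, 'min_gain_to_split': 0.1}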

Model Evaluation Metrics

Taking a classification model as an example, the common evaluation metrics are as follows:

Confusion Matrix
The confusion matrix gives a fairly comprehensive picture of model performance, and many other metrics can be derived from it.
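
For example, with scikit-learn (labels and predictions below are made up):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]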

ROC Curve
The ROC curve (Receiver Operating Characteristic curve) plots the false positive rate (FPR) at different thresholds on the x-axis against the recall (TPR) at different thresholds on the y-axis. It shows how much of the majority class gets misclassified as the model tries to capture as much of the minority class as possible.
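
A sketch of how the curve's coordinates are obtained with scikit-learn (probabilities are made up):

from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
# One (fpr, tpr) point per threshold; plotting tpr against fpr draws the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, probs)
print(fpr)  # [0.  0.  0.5 0.5 1. ]
print(tpr)  # [0.  0.5 0.5 1.  1. ]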

AUC
AUC (Area Under the ROC Curve) is often used as the most important metric for assessing model stability in binary classification. The larger the area under the ROC curve, the closer the curve is to the top-left corner and the better the model.
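
Correspondingly, AUC can be computed directly (same made-up data as above):

from sklearn.metrics import roc_auc_score

print(roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75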

Streamlit Implementation

I won't introduce Streamlit again; long-time readers should already be very familiar with it from the small tools I have built with it before.

The core code is below. I have put the full code on GitHub; a Star would be much appreciated:

github.com/tjxj/visual…

from definitions import *  # definitions.py is expected to provide the imports (st, lgb, np, pd, sklearn metrics) and the sidebar parameter widgets

st.set_option('deprecation.showPyplotGlobalUse', False)
st.sidebar.subheader("Please select model parameters :sunglasses:")

# Load the data
breast_cancer = load_breast_cancer()
data = breast_cancer.data
target = breast_cancer.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Convert to LightGBM Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Model training: the parameter values come from the sidebar widgets
params = {'num_leaves': num_leaves, 'max_depth': max_depth,
          'min_data_in_leaf': min_data_in_leaf,
          'feature_fraction': feature_fraction,
          'min_data_per_group': min_data_per_group,
          'max_cat_threshold': max_cat_threshold,
          'learning_rate': learning_rate,
          'max_bin': max_bin, 'num_iterations': num_iterations
          }

# Note: in LightGBM >= 4.0 pass callbacks=[lgb.early_stopping(500)] instead of early_stopping_rounds
gbm = lgb.train(params, lgb_train, num_boost_round=2000, valid_sets=lgb_eval, early_stopping_rounds=500)
probs = gbm.predict(X_test, num_iteration=gbm.best_iteration)  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_test, probs)
st.write('------------------------------------')
st.write('Confusion Matrix:')
st.write(confusion_matrix(y_test, np.where(probs > 0.5, 1, 0)))

st.write('------------------------------------')
st.write('Classification Report:')
report = classification_report(y_test, np.where(probs > 0.5, 1, 0), output_dict=True)
report_matrix = pd.DataFrame(report).transpose()
st.dataframe(report_matrix)

st.write('------------------------------------')
st.write('ROC:')

plot_roc(fpr, tpr)
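
The parameter variables (num_leaves, max_depth, min_data_in_leaf, and so on) and the plot_roc helper all come from definitions.py, which the article does not show. Here is a minimal sketch of what that file might contain; the names match the code above, but the imports, widget ranges, and defaults are my assumptions:

# Hypothetical sketch of definitions.py (not the original file)
import lightgbm as lgb
import numpy as np
import pandas as pd
import streamlit as st
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, confusion_matrix, classification_report

# Sidebar widgets feeding the params dict; ranges and defaults are assumptions
num_leaves = st.sidebar.slider('num_leaves', 2, 256, 31)
max_depth = st.sidebar.slider('max_depth', 1, 30, 6)
min_data_in_leaf = st.sidebar.slider('min_data_in_leaf', 1, 100, 20)
feature_fraction = st.sidebar.slider('feature_fraction', 0.1, 1.0, 1.0)
min_data_per_group = st.sidebar.slider('min_data_per_group', 1, 200, 100)
max_cat_threshold = st.sidebar.slider('max_cat_threshold', 1, 64, 32)
learning_rate = st.sidebar.slider('learning_rate', 0.001, 0.5, 0.1)
max_bin = st.sidebar.slider('max_bin', 16, 512, 255)
num_iterations = st.sidebar.slider('num_iterations', 10, 2000, 100)

def plot_roc(fpr, tpr):
    # Draw the ROC curve into the Streamlit page; calling st.pyplot() without a
    # figure argument is why the main script disables the global-use warning
    plt.plot(fpr, tpr)
    plt.xlabel('FPR')
    plt.ylabel('TPR (Recall)')
    st.pyplot()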

Uploading to Huggingface

I already introduced Huggingface in a previous article (I put this Tencent algorithm online, feel free to play with it!), so here I'll just go over the steps again.

step1: Register a Huggingface account

step2: Create a Space, and remember to select Streamlit as the SDK

step3: Clone the newly created Space repository, then push the modified code into it

git clone https://huggingface.co/spaces/<username>/<space-name>  # your own Space URL
git lfs install
git add .
git commit -m "commit from $beihai"
git push

When pushing, you will be asked for your user name (the email address you registered with) and your password. To stop git from prompting for them on every push, run: git config --global credential.helper store

Once the push finishes, you're done. Go back to the corresponding Space on your profile page and you can see the result.

https://huggingface.co/spaces/beihai/LightGBM-parameter-tuning

Origin: juejin.im/post/7083060518184812551