Hello everyone. In the article "100 Days to Get Machine Learning | Day63 Completely Master LightGBM", I introduced the principles behind the LightGBM model and walked through a minimal example. Recently I found that Huggingface and Streamlit work well together, so I built a simple visual parameter-tuning tool for LightGBM, aimed at helping everyone understand LightGBM more deeply.
URL:
I only exposed a handful of parameters, picked more or less at random. Adjusting them shows the change in the model's evaluation metrics in real time. The code is included in this article, so if you have ideas for improving it, feel free to leave a comment. The implementation is described in detail below:
Parameters of LightGBM
Once a model has been built, its performance must be evaluated, and the parameters, features, or algorithm adjusted according to the evaluation results until the results are satisfactory.
LightGBM has core parameters, learning control parameters, IO parameters, objective parameters, metric parameters, network parameters, GPU parameters, and model parameters. The ones I modify most often are the core, learning control, and metric parameters.
Control Parameters | Meaning | Usage |
---|---|---|
max_depth | Maximum depth of a tree | When the model is overfitting, consider reducing max_depth first |
min_data_in_leaf | Minimum number of records a leaf may have | Default 20, used to deal with overfitting |
feature_fraction | For example, 0.8 means that 80% of the features are randomly selected to build the tree in each iteration | Used when boosting is random forest |
bagging_fraction | Fraction of the data used in each iteration | Used to speed up training and reduce overfitting |
early_stopping_round | If a metric on the validation data has not improved in the most recent early_stopping_round rounds, training stops | Speeds up analysis and avoids unnecessary iterations |
lambda_l1, lambda_l2 | Regularization strength | 0~1 |
min_gain_to_split | Minimum gain required to make a split | Controls which splits of the tree are useful |
max_cat_group | Finds split points on group boundaries | With a large number of categories, searching for split points overfits easily |
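To make the early_stopping_round row above more concrete, here is a minimal sketch of the stopping rule in plain Python. It assumes a lower-is-better validation metric. The function name `early_stopping_iteration` and the loss values are hypothetical and used only for illustration. This is not LightGBM's own code.

```python
def early_stopping_iteration(valid_losses, early_stopping_round):
    """Return (stop_index, best_index) for a lower-is-better validation metric.

    Training stops at the first round whose distance from the best round
    reaches `early_stopping_round`. If that never happens, stop_index is
    the last round.
    """
    best_loss = float("inf")
    best_index = -1
    for i, loss in enumerate(valid_losses):
        if loss < best_loss:
            # New best validation score: remember where it happened.
            best_loss, best_index = loss, i
        elif i - best_index >= early_stopping_round:
            # No improvement for `early_stopping_round` rounds: stop here.
            return i, best_index
    return len(valid_losses) - 1, best_index


# Hypothetical validation losses, one per boosting round.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50, 0.51]
stop_at, best_at = early_stopping_iteration(losses, early_stopping_round=3)
print(f"stopped at round {stop_at}, best round was {best_at}")
```

The best round, not the stopping round, is the one worth using for prediction, which is why the training code later passes the best iteration to predict.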
Core Parameters | Meaning | Usage |
---|---|---|
task | What the data is used for | Choose train or predict |
application | What the model is used for | regression for regression, binary for binary classification, multiclass for multi-class classification |
boosting | Boosting algorithm to use | gbdt, rf: random forest, dart: Dropouts meet Multiple Additive Regression Trees, goss: Gradient-based One-Side Sampling |
num_boost_round | Number of boosting iterations | Usually 100+ |
learning_rate | Shrinkage rate applied to each tree's contribution | Commonly 0.1, 0.001, 0.003… |
num_leaves | Maximum number of leaves in one tree | Default 31 |
device | Device used for training | cpu or gpu |
metric | Evaluation metric | mae: mean absolute error, mse: mean squared error, binary_logloss: loss for binary classification, multi_logloss: loss for multi-class classification |
Faster Speed | Better Accuracy | Over-fitting |
---|---|---|
Use a smaller max_bin | Use a larger max_bin | Use a smaller max_bin |
- | Use a larger num_leaves | Use a smaller num_leaves |
Use feature_fraction for sub-sampling | - | Use feature_fraction |
Use bagging_fraction and bagging_freq | - | Set bagging_fraction and bagging_freq |
- | Use more training data | Use more training data |
Use save_binary to speed up data loading | Use categorical features directly | Use min_data_in_leaf and min_sum_hessian_in_leaf |
Use parallel learning | Use dart | Use lambda_l1, lambda_l2, and min_gain_to_split for regularization |
- | Use a larger num_iterations with a smaller learning_rate | Use max_depth to limit tree depth |
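As a rough sketch of how the over-fitting column above could translate into a parameter dictionary: the parameter names are LightGBM's, but every value below is an illustrative assumption of mine, not a tuned or recommended setting, and the name `overfit_params` is hypothetical.

```python
# Illustrative values only: assumptions for demonstration, not tuned settings.
overfit_params = {
    "objective": "binary",
    "max_bin": 127,                   # smaller than the default 255
    "num_leaves": 15,                 # smaller than the default 31
    "max_depth": 6,                   # limit tree depth
    "min_data_in_leaf": 40,           # larger than the default 20
    "min_sum_hessian_in_leaf": 1e-2,  # larger than the default 1e-3
    "feature_fraction": 0.8,          # use 80% of the features per tree
    "bagging_fraction": 0.8,          # use 80% of the rows per bagging round
    "bagging_freq": 1,                # bagging is only active when this is > 0
    "lambda_l1": 0.1,                 # L1 regularization
    "lambda_l2": 0.1,                 # L2 regularization
    "min_gain_to_split": 0.01,        # reject splits with a smaller gain
}

# A tree of depth d has at most 2**d leaves, so a larger num_leaves
# would never be reached under this max_depth.
max_leaves_for_depth = 2 ** overfit_params["max_depth"]
print(overfit_params["num_leaves"] <= max_leaves_for_depth)
```

A dictionary like this would be passed as the params argument when training. One detail that is easy to miss is that bagging_fraction has no effect unless bagging_freq is also set to a positive value.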
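Model Evaluation Metrics
Taking classification models as an example, the common evaluation metrics are the following:
Confusion Matrix
The confusion matrix gives a fairly complete picture of a model's performance, and many other metrics can be derived from it.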
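To show what "derived from it" means, here is a small self-contained sketch in plain Python that counts the four cells of a binary confusion matrix and derives accuracy, precision, and recall from them. The function name `confusion_counts` and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
print(tp, fp, tn, fn, accuracy, precision, recall)
```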
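ROC Curve
ROC stands for Receiver Operating Characteristic curve. It plots the false positive rate (FPR) at different thresholds on the horizontal axis against the recall (true positive rate) at the same thresholds on the vertical axis. It shows how the number of majority-class samples wrongly flagged changes as the model tries to capture as many minority-class samples as possible.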
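The construction can be sketched in a few lines of plain Python: sweep the threshold over the predicted scores from high to low, and at each threshold record the FPR and recall. The function name `roc_points` and the data are hypothetical. In practice a library routine such as scikit-learn's roc_curve does this work.

```python
def roc_points(y_true, scores):
    """Return a list of (fpr, tpr) pairs, one per distinct threshold.

    A sample is predicted positive when its score is >= the threshold.
    The list starts at (0, 0), where nothing is predicted positive.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels and predicted scores for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  recall={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both coordinates grow monotonically and the curve always ends at (1, 1).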
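AUC
AUC (Area Under the ROC Curve) is often treated as the most important metric in the evaluation stage of binary classification problems, where it is used to measure how stable the model is. It is the area under the ROC curve: the larger the AUC, the closer the ROC curve is to the top-left corner and the better the model.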
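One way to build intuition for AUC is its equivalent ranking interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. The function name `auc_score` and the data are made up, and the pairwise loop is only meant for small examples.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted scores for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_score(y_true, scores)
print(round(auc, 4))
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. A perfect ranking gives 1.0 and a random one about 0.5.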
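Streamlit Implementation
I won't introduce Streamlit again; regular readers should already be very familiar with it. Here is a list of a few small tools I built earlier:
The core code is below. The complete code is on GitHub, and a Star would be welcome.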
# definitions is assumed to provide st, lgb, np, pd, the sklearn helpers,
# plot_roc, and the parameter values read from the sidebar widgets
from definitions import *
st.set_option('deprecation.showPyplotGlobalUse', False)
st.sidebar.subheader("Select model parameters :sunglasses:")
# Load the data
breast_cancer = load_breast_cancer()
data = breast_cancer.data
target = breast_cancer.target
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
# Convert to the LightGBM Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
# Train the model
params = {'objective': 'binary',  # without this, LightGBM defaults to regression
          'num_leaves': num_leaves, 'max_depth': max_depth,
          'min_data_in_leaf': min_data_in_leaf,
          'feature_fraction': feature_fraction,
          'min_data_per_group': min_data_per_group,
          'max_cat_threshold': max_cat_threshold,
          'learning_rate': learning_rate,
          'max_bin': max_bin, 'num_iterations': num_iterations
          }
# num_iterations in params takes precedence over num_boost_round, so that argument is omitted.
# The lgb.early_stopping callback replaces the early_stopping_rounds argument removed in LightGBM 4.0.
gbm = lgb.train(params, lgb_train, valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=500)])
probs = gbm.predict(X_test, num_iteration=gbm.best_iteration)  # predicted probabilities
y_pred = np.where(probs > 0.5, 1, 0)  # threshold at 0.5 to get class labels
fpr, tpr, thresholds = roc_curve(y_test, probs)
st.write('------------------------------------')
st.write('Confusion Matrix:')
st.write(confusion_matrix(y_test, y_pred))
st.write('------------------------------------')
st.write('Classification Report:')
report = classification_report(y_test, y_pred, output_dict=True)
report_matrix = pd.DataFrame(report).transpose()
st.dataframe(report_matrix)
st.write('------------------------------------')
st.write('ROC:')
plot_roc(fpr, tpr)
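Uploading to Huggingface
I already introduced Huggingface in the previous article ("I put this Tencent algorithm online, feel free to play with it!"), so here I will just briefly go over the steps again.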
step1: Register a Huggingface account
step2: Create a Space, and remember to select Streamlit as the SDK
step3: Clone the newly created Space repository, then push the modified code to it
git lfs install
git add .
git commit -m "commit from beihai"
git push
When pushing, you will be prompted for your username (your registered email address) and password. To avoid entering them on every push, run: git config --global credential.helper store
Once the push finishes, you're done. Go back to the corresponding Space on your Huggingface page to see the result.