1. Overview of the LightGBM framework
GBDT (Gradient Boosting Decision Tree) is an enduring model in machine learning. Its main idea is to train weak learners (decision trees) iteratively and combine them into a strong model, which generally trains well and is not prone to overfitting. GBDT is widely used in industry for tasks such as multi-class classification, click-through-rate prediction, and search ranking, and it is also a formidable weapon in data mining competitions: reportedly, more than half of the winning solutions in Kaggle competitions have been based on GBDT.
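To make the iterative idea concrete, here is a minimal sketch of gradient boosting for squared-error regression built from plain scikit-learn trees. This is illustrative only, not how LightGBM is actually implemented, and the parameter values are arbitrary.

# Minimal sketch of the gradient-boosting idea behind GBDT (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    # Start from a constant prediction, then repeatedly fit a shallow tree
    # to the current residuals (the negative gradient of squared error)
    # and add its shrunken prediction to the ensemble.
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def gbdt_predict(base, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred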
LightGBM (Light Gradient Boosting Machine) is a framework that implements the GBDT algorithm using tree-based learning. It has the following advantages:
1. Faster training speed and higher efficiency.
2. Lower memory usage.
3. Better accuracy.
4. Support for parallel, distributed, and GPU learning.
5. Ability to handle large-scale data.
Comparative experiments on public datasets show that LightGBM outperforms existing boosting frameworks in both efficiency and accuracy, while significantly reducing memory consumption. Moreover, distributed learning experiments show that LightGBM can achieve a linear speedup when training on multiple machines under specific settings.
Documentation: see the official LightGBM documentation.
Installation:
pip install lightgbm
Related paper: "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (Ke et al., NIPS 2017).
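For orientation before the competition example below, a minimal run with LightGBM's native Dataset/train API on synthetic data might look like this. It is a sketch only; the data is synthetic and the parameter values are illustrative, not recommendations.

# Minimal sketch of LightGBM's native training API (lgb.Dataset + lgb.train).
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.1,
    "num_leaves": 31,
}

booster = lgb.train(
    params,
    train_set,
    num_boost_round=200,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=50)],
)
print("best iteration:", booster.best_iteration)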
2. Simple example
The example is based on the Kaggle Tabular Playground Series - Feb 2022 competition; see the competition page for details of the task and data.
import lightgbm as lgb
import pandas as pd
import pickle
print("LGB test")
clf = lgb.LGBMClassifier(
boosting_type='gbdt', num_leaves=55, reg_alpha=0.0, reg_lambda=1,
max_depth=15, n_estimators=6000, objective='binary',
subsample=0.8, colsample_bytree=0.8, subsample_freq=1,
learning_rate=0.06, min_child_weight=1, random_state=20, n_jobs=-1
)
# Load the training features and labels
X = pd.read_csv('data/train_data.csv')
label = pd.read_csv('data/train_label.csv')
y = label.target
# Train; note that log_evaluation only prints results when an eval_set is provided
clf.fit(X, y, callbacks=[lgb.log_evaluation(period=1, show_stdv=True)])
# pre = clf.predict(testdata)
# Save the trained model with pickle
with open('lightgbm_v2.model', 'wb') as f:
    pickle.dump(clf, f)
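With 6000 estimators and no validation set, it is hard to tell when the model starts to overfit. A common variant (a sketch, assuming the same clf, X, and y as above) is to hold out a validation split and let early stopping choose the effective number of boosting rounds:

# Optional variant (assumes clf, X, y from above): hold out a validation split
# and let early stopping pick the number of boosting rounds actually used.
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=20)
clf.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric='auc',
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=100)],
)
print("best iteration:", clf.best_iteration_)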
Test code:
print("这是lightgbm")
f2 = open('lightgbm_v2.model', 'rb')
s2 = f2.read()
model1 = pickle.loads(s2)
test_X = pd.read_csv('data/test.csv')
predictions = model1.predict(test_X)
preds = []
for pred in predictions:
preds.append(week_day_dict[pred])
res = pd.DataFrame()
res['target'] = preds
res.to_csv("predict_lightgbm_v2.csv")
The data was not preprocessed at all; after training, the predictions were submitted to Kaggle and scored 0.95169. That score is unsatisfactory, so further tuning is needed.
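One straightforward way to start that tuning is a cross-validated search over the main parameters. Below is a minimal sketch with scikit-learn's RandomizedSearchCV; the search space, budget, and scoring metric are illustrative assumptions rather than recommendations, and X, y are the training data loaded above.

# Hypothetical tuning sketch (assumes X, y from the training snippet above);
# the parameter ranges, n_iter, and scoring below are illustrative only.
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

param_space = {
    "num_leaves": [31, 63, 127],
    "max_depth": [-1, 8, 15],
    "learning_rate": [0.03, 0.06, 0.1],
    "n_estimators": [500, 1000, 2000],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

search = RandomizedSearchCV(
    lgb.LGBMClassifier(objective="binary", random_state=20, n_jobs=-1),
    param_distributions=param_space,
    n_iter=20,
    scoring="accuracy",
    cv=3,
    random_state=20,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)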