XGBoost and LightGBM

This article draws mainly on the following sources:
https://cloud.tencent.com/developer/article/1389899
https://cloud.tencent.com/developer/article/1052678
https://cloud.tencent.com/developer/article/1052664

XGBoost
1. Key Parameters
booster [default=gbtree]: which booster to use; options are gbtree and gblinear
nthread: number of parallel threads
eta [default=0.3]: shrinkage step size used in each boosting update; smaller values make training more robust to overfitting
max_depth [default=6]: maximum depth of a tree
min_child_weight [default=1]: minimum sum of instance weights (hessian) required in a child node
subsample [default=1]: fraction of the training set sampled to grow each tree
lambda [default=1]: L2 regularization weight
alpha [default=0]: L1 regularization weight
objective [default=reg:linear]: the learning task and corresponding objective (reg:linear was later renamed reg:squarederror)
The available objectives include:
"reg:linear": linear regression.
"reg:logistic": logistic regression.
"binary:logistic": logistic regression for binary classification; outputs probabilities.
"binary:logitraw": logistic regression for binary classification; outputs the raw score w^T x before the logistic transformation.
"count:poisson": Poisson regression for count data; outputs the mean of the Poisson distribution. With this objective, max_delta_step defaults to 0.7.
"multi:softmax": multiclass classification with the softmax objective; requires setting num_class (the number of classes).
"multi:softprob": same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into an ndata-by-nclass matrix; each row gives the probability of that sample belonging to each class.
"rank:pairwise": set XGBoost to do ranking tasks by minimizing the pairwise loss.
eval_metric [default according to objective]: the evaluation metric used on validation data; the built-in options include the following (a usage sketch follows this list):
"rmse": root mean squared error
"logloss": negative log-likelihood
"error": binary classification error rate
"merror": multiclass classification error rate
"mlogloss": multiclass log loss
"auc": area under the curve, for ranking evaluation
"ndcg": normalized discounted cumulative gain
"map": mean average precision
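
As a minimal sketch of how these options fit together, a parameter dictionary for a binary classifier might look like the following (the exact values are illustrative assumptions, not recommendations from the original post):

params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',  # binary classification, probability output
    'eval_metric': 'auc',            # metric reported on the validation data
    'eta': 0.1,                      # shrinkage step size
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 0.8,
    'lambda': 1,                     # L2 regularization
    'alpha': 0,                      # L1 regularization
    'nthread': 4,
}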

2. Basic Usage
a. Loading data
XGBoost can load:
text data in libsvm format;
two-dimensional NumPy arrays;
XGBoost's own binary buffer files.
The loaded data is stored in a DMatrix object:
train = xgb.DMatrix('train.txt')
train = xgb.DMatrix(data, label=label)
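
A slightly fuller sketch of the three loading paths (the file names and dummy arrays here are assumptions for illustration):

import numpy as np
import xgboost as xgb

# from a libsvm-format text file
dtrain = xgb.DMatrix('train.txt')

# from a NumPy array, with labels
data = np.random.rand(100, 10)          # 100 samples, 10 features (dummy data)
label = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(data, label=label)

# save to / reload from XGBoost's binary buffer format
dtrain.save_binary('train.buffer')
dtrain = xgb.DMatrix('train.buffer')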
b. Setting parameters
Parameters are stored as a key-value dictionary:
params = {'booster': 'gbtree', 'objective': 'multi:softmax', 'num_class': 10}
c. Training the model
bst = xgb.train(params, train, num_round)
d. Prediction
preds = bst.predict(test)
e. Saving the model
bst.save_model('test.model')
f. Loading a model
bst = xgb.Booster({'nthread': 4})
bst.load_model('model.bin')
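
Putting steps a through f together, a minimal end-to-end run might look like the following sketch (the file names and num_boost_round value are illustrative assumptions):

import xgboost as xgb

dtrain = xgb.DMatrix('train.txt')                    # a. load training data
params = {'booster': 'gbtree',                       # b. set parameters
          'objective': 'multi:softmax', 'num_class': 10}
bst = xgb.train(params, dtrain, num_boost_round=10)  # c. train
dtest = xgb.DMatrix('test.txt')
preds = bst.predict(dtest)                           # d. predict
bst.save_model('test.model')                         # e. save
bst2 = xgb.Booster({'nthread': 4})                   # f. reload
bst2.load_model('test.model')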

3. Worked Examples
a. Classification with the native XGBoost interface

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

iris = load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)

params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,  # removed in newer XGBoost releases; use 'verbosity': 0 instead
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 500
model = xgb.train(params, dtrain, num_rounds)

dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)

cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1

print("Accuracy: %.2f %% " % (100 * cnt1 / (cnt1 + cnt2)))

plot_importance(model)
plt.show()
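
The manual counting above can be replaced by scikit-learn's accuracy_score; a brief equivalent, assuming the ans and y_test arrays from the example:

from sklearn.metrics import accuracy_score

# multi:softmax returns predicted class labels (as floats), so cast before comparing
print("Accuracy: %.2f%%" % (100 * accuracy_score(y_test, ans.astype(int))))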

b. Regression with the native XGBoost interface

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

data = []
labels = []
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

X = []
for row in data:
    row = [float(x) for x in row]
    X.append(row)

y = [float(x) for x in labels]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 300
model = xgb.train(params, dtrain, num_rounds)

dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)

plot_importance(model)
plt.show()
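
The original example stops at plotting feature importance; to judge the fit numerically, one might add an RMSE check (a sketch assuming the ans and y_test arrays above):

import numpy as np

rmse = np.sqrt(np.mean((ans - np.array(y_test)) ** 2))
print("RMSE: %.4f" % rmse)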

c. Classification with the scikit-learn interface

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

iris = load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# note: the silent argument was removed in newer XGBoost releases; use verbosity=0 instead
model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=160,
                          silent=True, objective='multi:softmax')
model.fit(X_train, y_train)

ans = model.predict(X_test)

cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1

print("Accuracy: %.2f %% " % (100 * cnt1 / (cnt1 + cnt2)))

plot_importance(model)
plt.show()

d. Regression with the scikit-learn interface

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

data = []
labels = []
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

X = []
for row in data:
    row = [float(x) for x in row]
    X.append(row)

y = [float(x) for x in labels]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='reg:gamma')
model.fit(X_train, y_train)

ans = model.predict(X_test)

plot_importance(model)
plt.show()
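
A practical extension of the scikit-learn interface is early stopping against a validation set. A hedged sketch follows; note that early_stopping_rounds was a fit() keyword in XGBoost 1.x and moved to the constructor in 2.x, and the form below targets 2.x:

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160,
                         objective='reg:gamma', early_stopping_rounds=10)
# training stops once the validation score has not improved for 10 rounds
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
ans = model.predict(X_test)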

LightGBM
1. Key Parameters
task: default=train, options=train, predict; the task to perform: training or prediction;
application: default=regression; the learning objective, with options including:
regression: regression;
binary: binary classification;
multiclass: multiclass classification;
lambdarank: lambdarank ranking;
data: type=string; the training data LightGBM will learn from;
num_iterations: default=100, type=int; the number of boosting iterations, i.e. the number of trees built;
num_leaves: default=31, type=int; the number of leaves per tree;
device: default=cpu, options=cpu, gpu; the device used to train the model; choosing gpu makes training faster;
min_data_in_leaf: the minimum number of samples in each leaf;
feature_fraction: default=1; the fraction of features used in each iteration;
bagging_fraction: default=1; the fraction of data used in each iteration, typically used to speed up training and avoid overfitting;
min_gain_to_split: default=0; the minimum information gain required to perform a split;
max_bin: the maximum number of bins that feature values are bucketed into;
min_data_in_bin: the minimum number of samples per bin;
num_threads: default=OpenMP_default, type=int; the number of threads LightGBM runs with;
label: type=string; specifies the label column;
categorical_feature: type=string; specifies which features to treat as categorical during training;
num_class: default=1, type=int; needed only for multiclass classification. (A minimal training sketch using several of these parameters follows.)
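
As a minimal sketch of these parameters in use (the dummy data and parameter values are illustrative assumptions):

import numpy as np
import lightgbm as lgb

# dummy regression data for illustration
X = np.random.rand(500, 10)
y = np.random.rand(500)

train_data = lgb.Dataset(X, label=y)
params = {
    'application': 'regression',  # alias of 'objective'
    'num_leaves': 31,
    'min_data_in_leaf': 20,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,            # bagging_fraction only takes effect when bagging_freq > 0
    'max_bin': 255,
    'num_threads': 4,
}
booster = lgb.train(params, train_data, num_boost_round=100)
preds = booster.predict(X)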

2. Worked Example

# assumes pandas DataFrames `train` and `test`, a target Series `target`, a list of
# feature column names `features`, a `submission` DataFrame, and a cross-validation
# splitter `folds` (e.g. sklearn's KFold) defined beforehand
import lightgbm as lgb

for fold_, (train_idx, validate_idx) in enumerate(folds.split(train.values, target.values)):
    train_data = lgb.Dataset(train.iloc[train_idx][features], label=target.iloc[train_idx])
    validate_data = lgb.Dataset(train.iloc[validate_idx][features], label=target.iloc[validate_idx])
    param = {
        'task': 'train',
        'boosting': 'goss',               # gradient-based one-side sampling
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.01,
        'subsample': 0.9855232997390695,  # note: bagging is ignored when boosting='goss'
        'max_depth': 7,
        'top_rate': 0.9064148448434349,   # goss-specific: fraction of large-gradient samples kept
        'num_leaves': 63,
        'min_child_weight': 41.9612869171337,
        'other_rate': 0.0721768246018207, # goss-specific: fraction of small-gradient samples kept
        'reg_alpha': 9.677537745007898,
        'colsample_bytree': 0.5665320670155495,
        'min_split_gain': 9.820197773625843,
        'reg_lambda': 8.2532317400459,
        'min_data_in_leaf': 21,
        'verbose': -1,
        'seed': int(2 ** fold_),
        'bagging_seed': int(2 ** fold_),
        'drop_seed': int(2 ** fold_)
    }
    num_round = 1000
    # note: LightGBM >= 4.0 removed verbose_eval and early_stopping_rounds from
    # lgb.train; there, use callbacks=[lgb.early_stopping(200), lgb.log_evaluation(0)]
    clf = lgb.train(param, train_data, num_round,
                    valid_sets=[train_data, validate_data],
                    verbose_eval=-1, early_stopping_rounds=200)

# only the model from the last fold predicts here; averaging the per-fold predictions
# (with num_iteration=clf.best_iteration) is the more common pattern
submission['target'] = clf.predict(test, num_iteration=100)

3. Parameter Tuning
For a better fit:
num_leaves: sets the number of leaves in each tree.
min_data_in_leaf: another very important parameter for controlling overfitting; setting it too small can cause overfitting, so it needs to be chosen with care. For large datasets, values in the hundreds to thousands are reasonable.
max_depth: the maximum depth of each tree, i.e. an upper bound on how many levels it can grow.
For faster speed:
bagging_fraction: trains each iteration on a random subsample of the data (bagging), which speeds up training;
feature_fraction: sets the subset of features used in each iteration;
max_bin: smaller values of max_bin save time, since bucketing feature values into fewer bins is computationally cheaper.
For higher accuracy:
num_leaves: larger values yield deeper, more expressive trees and higher accuracy, but also cause overfitting, so it should not be set too high.
max_bin: raising it has an effect similar to raising num_leaves, and it also slows training down. (A hedged tuning sketch follows this list.)
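
To put this advice into practice, a small grid search over the competing knobs is one option. A sketch using LightGBM's scikit-learn wrapper and dummy data (the grid values and data are illustrative assumptions):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# dummy regression data for illustration
X = np.random.rand(500, 10)
y = np.random.rand(500)

# num_leaves trades accuracy against overfitting; min_child_samples
# (min_data_in_leaf) counteracts overfitting; max_bin trades accuracy against speed
grid = {
    'num_leaves': [31, 63, 127],
    'min_child_samples': [20, 100, 500],
    'max_bin': [63, 255],
}
search = GridSearchCV(lgb.LGBMRegressor(n_estimators=100, learning_rate=0.05),
                      grid, cv=3, scoring='neg_root_mean_squared_error')
search.fit(X, y)
print(search.best_params_, search.best_score_)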

Reposted from blog.csdn.net/chenguiyuan1234/article/details/87913290