Python: XGBoost in Practice (Classification)

These notes are drawn from '跟着迪哥学python数据分析与机器学习实战', with my own additions and reorganization; they are intended for personal review only.


XGBoost is a Boosting ensemble algorithm; XGBClassifier is used as the example here.
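To make "Boosting" concrete, here is a minimal sketch of the general idea in plain Python: each round fits a weak learner (a one-split stump) to the current residuals and adds a shrunken copy of it to the ensemble. It assumes squared-error loss on a made-up 1-D dataset and is not XGBoost's actual algorithm, which also uses second-order gradients and regularization. The helper names `fit_stump` and `boost` are invented for this illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additively fit stumps to residuals; return the training MSE per round."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Shrink each new learner by the learning rate before adding it.
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        history.append(mse)
    return history


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error shrinks round by round, and a smaller `learning_rate` makes each step more cautious, which is why it is usually paired with more boosting rounds.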

import xgboost
from numpy import loadtxt
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Loading the data

dataset=pd.read_csv(r'PimaIndiansdiabetes.csv')
dataset.head(2)


dataset.isnull().sum()
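Some widely circulated copies of the Pima Indians diabetes CSV have no header row, and these notes do not show whether that applies to `PimaIndiansdiabetes.csv`. If it does, `pd.read_csv` with default settings would consume the first record as column names. The sketch below shows one way to handle that case, using two in-memory sample records instead of the real file; the column names are illustrative assumptions, not taken from the original data.

```python
import io
import pandas as pd

# Two sample records in the 9-column layout (8 features + 1 label), no header.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# Hypothetical column names, chosen only for readability.
columns = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "bmi", "pedigree", "age", "label"]

# header=None keeps the first record as data; names= supplies the labels.
dataset = pd.read_csv(sample_csv, header=None, names=columns)
missing_per_column = dataset.isnull().sum()
print(dataset.shape)
print(int(missing_per_column.sum()))
```

If the local file already has a header row, the plain `pd.read_csv(path)` call is the right one and `header=None` should not be used.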


Building a baseline classification model

X=dataset.iloc[:,0:8]
Y=dataset.iloc[:,8]

X_train,X_test,y_train,y_test=train_test_split(X,Y,
                                              test_size=0.33,
                                              random_state=7)
# Build the model
model=XGBClassifier()
model.fit(X_train,y_train)
# Predict
y_pred=model.predict(X_test)
# predict() already returns class labels, so round() leaves them unchanged
predictions=[round(value) for value in y_pred]

# Accuracy
accuracy=accuracy_score(y_test,predictions)
print('accu:%.2f%%' % (accuracy *100))

accu:74.02%
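For reference, the accuracy reported here is simply the fraction of test samples whose predicted label equals the true label. The sketch below recomputes that quantity by hand on made-up labels; the `accuracy` helper is written for this illustration and is not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]
print('accu:%.2f%%' % (accuracy(y_true, y_hat) * 100))  # accu:75.00%
```

Accuracy alone can be misleading when the classes are imbalanced, which is one reason the tuning example later in these notes scores models by log loss instead.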

Baseline model with the training process displayed (evaluation set added)

# In recent XGBoost releases, early_stopping_rounds and eval_metric are set on
# the estimator; older releases accepted them as fit() arguments instead.
model=XGBClassifier(early_stopping_rounds=10,
                    eval_metric='logloss')
eval_set=[(X_test,y_test)]
model.fit(X_train,y_train,
         eval_set=eval_set,
         verbose=True)

y_pred=model.predict(X_test)
predictions=[round(value) for value in y_pred]

accuracy=accuracy_score(y_test,predictions)
print('accu: %.2f%%' % (accuracy*100))
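With `early_stopping_rounds=10`, training halts once the evaluation metric has gone 10 consecutive rounds without improving on its best value. The stopping rule can be sketched in plain Python as follows. This is a simplified illustration with made-up logloss values and a patience of 3, not the library's internal code; the function name is hypothetical.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_index, stop_index) for a metric where lower is better.

    Training stops at the first round that is `patience` rounds past the best
    score seen so far; if that never happens, it runs through every round.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score < scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(scores) - 1


# Made-up validation logloss per boosting round, for illustration only.
val_logloss = [0.66, 0.63, 0.61, 0.60, 0.59, 0.595, 0.60, 0.61, 0.62, 0.63]
best, stopped = best_round_with_early_stopping(val_logloss, patience=3)
print(best, stopped)  # 4 7
```

Note that the example in these notes uses the test split as the evaluation set, so the reported accuracy is slightly optimistic; a separate validation split avoids that.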

Feature importance:

from xgboost import plot_importance
import matplotlib.pyplot as plt

plot_importance(model)
plt.show()


Parameter tuning example

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# Candidate values, converted to dict format
# Tune the learning rate
learning_rate=[0.0001,0.001,0.01,0.1,0.2,0.3]
param_grid=dict(learning_rate=learning_rate)

# Cross-validation
kfold=StratifiedKFold(n_splits=10,shuffle=True,
                     random_state=7)

model=XGBClassifier()
grid_search=GridSearchCV(model,param_grid,
                        scoring='neg_log_loss',n_jobs=-1,
                        cv=kfold)
grid_result=grid_search.fit(X,Y)

print('best:%f using %s' % (grid_result.best_score_,
                           grid_result.best_params_))
means=grid_result.cv_results_['mean_test_score']
params=grid_result.cv_results_['params']
for mean,param in zip(means,params):
    # The score is negative log loss (not MAE): closer to 0 is better
    print('neg_log_loss %f with %r' % (mean,param))
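The grid above tunes a single parameter, so it has six candidates, each evaluated on 10 folds. Adding parameters makes the grid grow multiplicatively, which is worth estimating before launching a search. The sketch below enumerates the combinations with `itertools`. The `max_depth` values are hypothetical and added only to show the effect of a second parameter.

```python
from itertools import product

# learning_rate values come from the example above; max_depth values are
# hypothetical, added only to show how a second parameter multiplies the grid.
param_grid = {
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
    'max_depth': [3, 5, 7],
}
n_splits = 10  # folds, as in the StratifiedKFold setup

names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * n_splits
print(len(candidates), total_fits)  # 18 180
```

For this reason it is common to tune one or two related parameters at a time instead of searching every parameter jointly.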

Parameter notes:

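- learning_rate: usually set to a small value
- tree
- - max_depth
- - min_child_weight
- - subsample: fraction of training rows randomly sampled for each tree (e.g. 0.8 samples 80%), as in a random forest; 1.0 means no sampling
- - colsample_bytree: fraction of features randomly sampled for each tree
- - gamma: the coefficient on the number of leaves T in the penalty term; it controls model complexity
- regularization parameters
- - lambda
- - alpha
- objective: the loss function, which has to be specified (many options are available)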
Example:

xgb1=XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic', 
    nthread=4,
    scale_pos_weight=1,
    seed=27)
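The example leaves `scale_pos_weight` at 1, which treats both classes equally. For imbalanced binary data, a commonly suggested starting value is the ratio of negative to positive samples. The sketch below computes that ratio on made-up labels; the helper name is hypothetical, and the result is a heuristic to start from, not a tuned setting.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) samples in a binary label list."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive samples")
    return counts[0] / counts[1]


# Made-up labels: six negatives and three positives.
labels = [0] * 6 + [1] * 3
print(suggested_scale_pos_weight(labels))  # 2.0
```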


Reposted from blog.csdn.net/qq_43165880/article/details/108575702