Python implementation of the XGBoost algorithm (beginner-friendly, step-by-step)

Summary

        The XGBoost (eXtreme Gradient Boosting) algorithm is widely used in Kaggle competitions, mathematical modeling contests, and big data applications. This article explains the XGBoost algorithm in detail: its principle, a Python implementation, sensitivity analysis, and practical application.

Table of contents

0 Introduction

1. Material preparation

2. Algorithm principle

3. Algorithm implementation in Python

        3.1 Data loading

        3.2 Encode the categorical target variable

        3.3 Split the data into training and test sets

        3.4 Train the XGBoost model

        3.5 Test the model

        3.6 Output the model's confusion matrix

        3.7 Output the model accuracy

        3.8 Plot the confusion matrix

        3.9 Complete code implementation

        3.10 Example output

4. Sensitivity analysis and practical application of the XGBoost algorithm

        4.1 Sensitivity analysis

        4.2 Algorithm application

5. Conclusion

6. Remarks

0 Introduction

        In competitions such as data mining and mathematical modeling, beyond implementing the algorithm itself, the data must also be preprocessed sensibly: handling missing values, handling outliers, encoding categorical features, and removing redundant features. This article assumes the reader's data has already been preprocessed; data preprocessing methods may be covered in a later article.

1. Material preparation

        Python IDE: PyCharm Community or Professional edition, etc.

        Training data set: Here we use the attached data of Question C of the 2022 Shuwei Cup International College Student Mathematical Contest in Modeling as an example.

        Data processing: after preliminary data cleaning and correlation analysis, an initial feature set was obtained; a decision tree was then used to rank feature importance and complete a second round of dimensionality reduction, yielding the three independent-variable features 'CDRSB_bl', 'PIB_bl', and 'FBB_bl', with DX_bl as the categorical target feature.

2. Algorithm principle

     XGBoost is an ensemble method based on decision trees. It follows the Boosting paradigm, extends the Gradient Boosting algorithm, and uses gradient boosting techniques to improve the accuracy and generalization ability of the model.

        Base learners are added one after another, and each new tree is trained to focus on the samples that the ensemble built so far has predicted poorly. At iteration t, the objective function of XGBoost is:

        L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C        (1)

        where l is the loss function; \Omega(f_t) is the regularization term, used to control the complexity of the tree; C is a constant term; f_t(x_i) is the prediction of the new tree; and \hat{y}_i^{(t-1)} is the accumulated prediction of the first t-1 trees.
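
        For completeness, the standard XGBoost derivation (from the original paper; this step is not spelled out in the article itself) approximates (1) with a second-order Taylor expansion of the loss plus an explicit complexity penalty:

        L^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2

        where g_i and h_i are the first and second derivatives of l with respect to \hat{y}_i^{(t-1)}, T is the number of leaves of the new tree, and w is its vector of leaf weights. Minimizing this expression leaf by leaf is what yields XGBoost's split-gain formula.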

3. Algorithm implementation in Python

3.1 Data loading

        Load the data required for this article: DataX.xlsx holds the independent variables, and DataY.xlsx holds the target variable (DX_bl).

import pandas as pd
X = pd.read_excel('DataX.xlsx').values  # input features (read_excel already returns a DataFrame)
y = pd.read_excel('DataY.xlsx').values  # target variable
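
        As a quick sanity check (optional; it assumes the two workbooks contain one aligned row per sample), the shapes can be verified before going further:

# Optional sanity check: both files should have the same number of rows
print(X.shape, y.shape)
assert X.shape[0] == y.shape[0], "DataX and DataY row counts must match"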

3.2 Encode the categorical target variable

Here the five classes are simply mapped to the integers 0-4. Because the labels are used only for prediction and not for operations such as correlation analysis, ordinary label encoding is sufficient; if the encoded values were needed for correlation analysis or other numerical computations, one-hot encoding would be recommended instead.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y.ravel())  # flatten the (n, 1) column to 1-D for LabelEncoder
label_mapping = {0: 'AD', 1: 'CN', 2: 'EMCI', 3: 'LMCI', 4: 'SMC'}
# kept so the confusion matrix can later be printed with the original label names
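
        Note that LabelEncoder assigns integer codes in sorted order of the class names, so the mapping above need not be hard-coded; it can be rebuilt from the fitted encoder (a small convenience, not part of the original code):

# Equivalent mapping derived from the fitted encoder itself
label_mapping = dict(enumerate(le.classes_))  # {0: 'AD', 1: 'CN', ...}
# le.inverse_transform(y_pred) recovers the original string labels when needed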

3.3 Split the data into training and test sets

        The original sample data is randomly shuffled, with 70% used as training data and 30% as test data. This is a common split ratio; readers can experiment with different splits to find the best accuracy and F1-score.

from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state=42)
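
        If the five diagnostic classes are imbalanced, a stratified split (a variant of the call above, not what the original code uses) keeps the class proportions the same in both subsets:

# Stratified variant: preserve the class distribution in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)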

3.4 Train the XGBoost model

        The model is trained on the 70% training subset. Python's xgboost library implements the algorithm, so it can be called very conveniently.

import xgboost as xgb
# Train the XGBoost classifier
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# xgb.plot_tree(model)  # optionally visualize one of the boosted trees
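
        XGBClassifier is used here with its default settings; its most commonly tuned hyperparameters can be passed to the constructor. The values below are illustrative placeholders, not values tuned for this dataset:

# Illustrative hyperparameters (placeholders, not tuned for this dataset)
model = xgb.XGBClassifier(
    n_estimators=200,      # number of boosted trees
    max_depth=4,           # maximum depth of each tree
    learning_rate=0.1,     # shrinkage applied to each new tree
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of features sampled per tree
)
model.fit(X_train, y_train)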

3.5 Test the model

        Use the remaining 30% of the sample data to evaluate the model's accuracy, precision, recall, and F1-score.

# Predict class labels on the test data
y_pred = model.predict(X_test)
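
        Besides hard class labels, the classifier can also return per-class probabilities, which are useful for inspecting borderline predictions:

# Per-class probabilities: one column per encoded class (0-4)
y_proba = model.predict_proba(X_test)
print(y_proba[:5])  # probabilities for the first five test samples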

3.6 Output the model's confusion matrix

        The way the confusion matrix is printed here differs slightly from the earlier random forest and KNN articles, because the random forest model can be applied directly to the categorical labels without encoding them first.

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_test, y_pred)
# Print the confusion matrix row by row with label names
for i, true_label in enumerate(label_mapping.values()):
    row = ''
    for j, pred_label in enumerate(label_mapping.values()):
        row += f'{cm[i, j]} ({pred_label})\t'
    print(f'{row} | {true_label}')

# Print the classification report (precision, recall, F1 per class)
print(classification_report(y_test, y_pred, target_names=['AD', 'CN', 'EMCI', 'LMCI', 'SMC']))

3.7 Output the model accuracy

# the imports for this snippet were already made in the previous code block
print("Accuracy:")
print(accuracy_score(y_test, y_pred))
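
        A single 70/30 split can be lucky or unlucky; as a sketch of a more stable estimate (not part of the original code), k-fold cross-validation averages accuracy over several different splits:

from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy over the full dataset (illustrative)
scores = cross_val_score(xgb.XGBClassifier(), X, y, cv=5, scoring='accuracy')
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))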

3.8 Plot the confusion matrix

        Draw and save the confusion matrix figure; placing it in a paper improves the paper's presentation and credibility.

import matplotlib.pyplot as plt
import numpy as np
label_names = ['AD', 'CN', 'EMCI', 'LMCI', 'SMC']
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=label_names, yticklabels=label_names,
       title='Confusion matrix',
       ylabel='True label',
       xlabel='Predicted label')

# Annotate each cell with its count
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
plt.savefig('XGBoost_Conclusion.png', dpi=300)

        The code above first computes the confusion matrix, then visualizes it with matplotlib's imshow function, adds the cell counts with the text function, and finally displays or saves the image with show/savefig. The resulting output is shown in Figure 3.1.

Figure 3.1 Confusion matrix plot

3.9 Complete code implementation

# Import the required libraries
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import numpy as np

le = LabelEncoder()
label_mapping = {0: 'AD', 1: 'CN', 2: 'EMCI', 3: 'LMCI', 4: 'SMC'}
X = pd.read_excel('DataX.xlsx').values  # input features
y = pd.read_excel('DataY.xlsx').values  # target variable
y = le.fit_transform(y.ravel())  # flatten the (n, 1) column to 1-D for LabelEncoder
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state=42)
# Train the XGBoost classifier
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# xgb.plot_tree(model)  # optionally visualize one of the boosted trees
# Predict class labels on the test data
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
# Print the confusion matrix row by row with label names
for i, true_label in enumerate(label_mapping.values()):
    row = ''
    for j, pred_label in enumerate(label_mapping.values()):
        row += f'{cm[i, j]} ({pred_label})\t'
    print(f'{row} | {true_label}')

# Print the classification report (precision, recall, F1 per class)
print(classification_report(y_test, y_pred, target_names=['AD', 'CN', 'EMCI', 'LMCI', 'SMC']))
print("Accuracy:")
print(accuracy_score(y_test, y_pred))


# label_names is the list of class names
label_names = ['AD', 'CN', 'EMCI', 'LMCI', 'SMC']
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=label_names, yticklabels=label_names,
       title='Confusion matrix',
       ylabel='True label',
       xlabel='Predicted label')

# Annotate each cell with its count
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
plt.savefig('XGBoost_Conclusion.png', dpi=300)
# The code first computes the confusion matrix, visualizes it with matplotlib's imshow,
# annotates it with the text function, and displays/saves the image with show/savefig.

3.10 Example output

       

 Figure 3.2 Example of result output

4. Sensitivity analysis and practical application of the XGBoost algorithm

 4.1 Sensitivity analysis

         Sensitivity analysis, also called stability analysis, can be approached statistically: run the experiment hundreds of times, record the accuracy, precision, recall, and F1-score of each run, and compute statistics such as the median, mean, maximum, and minimum, from which the sensitivity of the model can be judged. The results show that the original model holds up and passes the sensitivity analysis. The same procedure applies to the earlier random forest and KNN algorithms.
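
         The paragraph above describes the procedure without code; the following is a minimal sketch of that idea, assuming X and y are the arrays loaded in Section 3 (the 100-trial count and the macro-averaged F1 are illustrative choices):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Repeat the train/test experiment with a different random split each time
acc, f1 = [], []
for seed in range(100):  # "hundreds of tests" in the text above
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pred = xgb.XGBClassifier().fit(X_tr, y_tr).predict(X_te)
    acc.append(accuracy_score(y_te, pred))
    f1.append(f1_score(y_te, pred, average='macro'))

# Summary statistics for the stability analysis
for name, s in [('accuracy', np.array(acc)), ('macro F1', np.array(f1))]:
    print(f'{name}: median={np.median(s):.3f}, mean={s.mean():.3f}, '
          f'max={s.max():.3f}, min={s.min():.3f}')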

 4.2 Algorithm application

         XGBoost can be applied to big data analysis, forecasting, and similar tasks, and it shines in big data competitions (Kaggle, Alibaba Tianchi, and others); it is also the algorithm I currently consider the best.

5. Conclusion

        Starting from the XGBoost algorithm, this article has given a concrete walkthrough of data preprocessing, the algorithm principle, its implementation, sensitivity analysis, and applications, and should suit most beginners in machine learning.

6. Remarks

        This is an original article and reprinting is prohibited; violators will be held responsible. If you need the original data, like and bookmark this post, then message the author privately or leave your email address in the comment section to receive a copy of the training data.

Origin: blog.csdn.net/m0_61399808/article/details/129718219