Machine Learning: XGBoost Classification Prediction Based on Heart Disease Dataset

Table of contents

1. Introduction

Principle:

2. Practical exercises

1. Data preparation

2. Data reading/loading

3. Data preprocessing

4. Visual processing

5. Coding discrete variables

6. Model training and prediction

7. Feature Selection

8. Get better results by adjusting parameters

Core parameter tuning

Grid tuning method

1. Introduction

XGBoost (eXtreme Gradient Boosting) is an efficient implementation of gradient boosting decision trees (Gradient Boosting Decision Tree, GBDT). It is one of the most popular machine learning algorithms and is widely used in tasks such as classification, regression and ranking. Developed by Tianqi Chen (the accompanying paper was published in 2016), it belongs to the Boosting family of algorithms: the model is trained incrementally, with each new tree gradually improving the accuracy of the ensemble.

Unlike a single traditional decision tree, XGBoost is built on gradient boosting. Gradient boosting is a sequential ensemble method that trains many weak learners (decision trees) one after another, making the ensemble progressively stronger. In each iteration it computes the negative gradient of the loss function as the new training target and fits a weak learner to that target. Finally, all weak learners are combined into a single strong classifier.

The advantages of XGBoost are its efficiency and accuracy. It can handle large-scale datasets and high-dimensional feature spaces, performs well on sparse data, trains quickly, and has repeatedly placed first in machine learning competitions.

In short, XGBoost is a powerful and efficient machine learning algorithm that is widely used in various fields, especially in competitions and in actual business.

Principle:

Under the hood, XGBoost implements the GBDT algorithm and adds a series of optimizations on top of it:

  1. It performs a second-order Taylor expansion of the objective function, which fits the error more accurately than using the gradient alone (see the formula sketch after this list).
  2. It proposes an approximate algorithm for finding split points, which speeds up the construction of the CART trees and can also handle sparse data.
  3. It parallelizes the split search within each tree (across features) to speed up each iteration.
  4. It includes low-level optimizations for distributed training of the model.
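As a quick sketch of point 1 (this is the standard form from the XGBoost paper, not something specific to this article): at iteration t the objective is approximated by a second-order Taylor expansion around the current predictions,

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\Big[g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^2(x_i)\Big] + \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where g_i and h_i are the first and second derivatives of the loss with respect to the current prediction for sample i, f_t is the new tree, T is its number of leaves and w_j are its leaf weights.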

XGBoost is an ensemble model built from CART trees. Its idea is to chain multiple decision trees together so that they make the decision jointly.

So how are the trees chained together? XGBoost iteratively fits each tree to the prediction error left by the previous trees. A popular example: suppose we want to predict that a car is worth 3,000 yuan. We build decision tree 1, which after training predicts 2,600 yuan, leaving an error of 400 yuan. The training target of decision tree 2 is therefore 400 yuan; if tree 2 predicts 350 yuan, there is still an error of 50 yuan, which is handed to the third tree, and so on. Each tree estimates the residual error of all previous trees, and the sum of the predictions of all trees is the final prediction. A minimal sketch of this idea is shown below.
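To make the residual idea concrete, here is a minimal sketch of residual boosting with plain scikit-learn regression trees and a squared-error loss (this only illustrates the idea, it is not XGBoost itself; the data is randomly generated):

# A minimal sketch of residual boosting (squared-error loss), not XGBoost itself.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 3)                              # toy features
y = 3000 + 500 * X[:, 0] + np.random.randn(100) * 50    # toy "car price" target

prediction = np.zeros_like(y)
trees = []
for _ in range(10):                            # 10 boosting rounds
    residual = y - prediction                  # error left by the previous trees
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    prediction += tree.predict(X)              # each new tree corrects the remaining error

print('mean absolute error after boosting:', np.abs(y - prediction).mean())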

The base model of XGBoost is the CART regression tree, which has two characteristics: (1) a CART tree is a binary tree; (2) it is a regression tree, so each individual tree fits a continuous value (even when the overall task is classification).

Specifically, XGBoost uses decision trees as base classifiers, and each decision tree is trained by a gradient boosting algorithm. During the training process, XGBoost will calculate the negative gradient of the loss function, and use this negative gradient to train a new decision tree. Through continuous iteration, a strong classifier with strong generalization ability is finally obtained.

To prevent overfitting, XGBoost introduces regularization, including L1 and L2 penalties on the leaf weights. L1 regularization makes the model sparser, while L2 regularization keeps the leaf weights from becoming too large; both help avoid overfitting. A small example of setting these penalties is sketched below.
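In the scikit-learn wrapper these penalties are exposed through the reg_alpha (L1) and reg_lambda (L2) parameters. A minimal sketch with arbitrary illustrative values, not tuned recommendations:

# Sketch: L1/L2 regularization on leaf weights via the sklearn wrapper.
# The values below are arbitrary examples, not tuned recommendations.
from xgboost import XGBClassifier

clf_reg = XGBClassifier(
    reg_alpha=0.1,    # L1 penalty on leaf weights (encourages sparsity)
    reg_lambda=1.0,   # L2 penalty on leaf weights (shrinks large weights)
)
# clf_reg.fit(X, y) would then be called as usual.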

In addition, XGBoost also uses some optimization techniques, such as cache access technology, data compression technology, multi-threaded parallel computing, etc., which makes XGBoost highly efficient in terms of training and prediction speed.

2. Practical exercises

1. Data preparation

The original tutorial uses a weather dataset provided by Alibaba Cloud; running the following code in PyCharm downloads and saves it. (The original text is about weather forecasting; this article uses a heart disease dataset instead, following the same steps by analogy.)

import requests

url = 'https://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/7XGBoost/train.csv'
response = requests.get(url)
with open('train.csv', 'wb') as f:
    f.write(response.content)

The columns of the heart disease dataset are: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, time, and DEATH_EVENT (the label).

The original text predicts whether it will rain tomorrow; here we predict whether the patient dies (DEATH_EVENT).

2. Data reading/loading

Put the CSV file (saved here as heart.csv) in the same directory as the script and read it directly.

## Basic libraries
import numpy as np 
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

## Use Pandas' built-in read_csv function to read the file and convert it to a DataFrame

data = pd.read_csv('heart.csv')

The data can then be printed and inspected:

## Use .info() to view the overall information of the data
data.info()

The columns are basically integer and floating-point types.

3. Data preprocessing

The heart disease data has no missing values, so this step is not strictly needed here; the following shows what would be done.

Take a quick look at the data, and if there are missing values (NaN), fill them with -1.

## Take a quick look at the data: .head() shows the first rows, .tail() the last rows
data.head()

data = data.fillna(-1)
data.tail()

If the number of negative samples in the dataset is much larger than the number of positive samples, we have the common "data imbalance" problem, which in some cases needs special handling. (Here the positive class, death, has 96 samples and the negative class has 203, so no special handling is needed.)

print(pd.Series(data['DEATH_EVENT']).value_counts())

## Some statistical description of the features
data.describe()

4. Visual processing

For convenience, first record the numeric features and non-numeric features:

numerical_features = [x for x in data.columns if data[x].dtype == np.float64]
category_features = [x for x in data.columns if data[x].dtype != np.float64 and x != 'DEATH_EVENT']
## Scatter-plot visualization of three features combined with the label
sns.pairplot(data=data[['age',
'creatinine_phosphokinase',
'ejection_fraction'] + ['DEATH_EVENT']], diag_kind='hist', hue='DEATH_EVENT')
plt.show()

From the figure above we can see, in 2D, how the scatter of each feature combination is distributed with respect to whether the heart disease patient died, and roughly how well each combination separates the classes. The combinations involving ejection_fraction appear to be the most discriminative (although this is hard to judge precisely from the plot).

for col in data[numerical_features].columns:
    if col != 'DEATH_EVENT':
        sns.boxplot(x='DEATH_EVENT', y=col, saturation=0.5, palette='pastel', data=data)
        plt.title(col)
        plt.show()

This prints one boxplot per numeric feature.

The boxplots show how the distribution of each feature differs between the two classes.

We can also do some data analysis, for example looking at the relationship between smoking and death:

tlog = {}
for i in category_features:
    tlog[i] = data[data['DEATH_EVENT'] == 1][i].dropna().value_counts()

flog = {}
for i in category_features:
    flog[i] = data[data['DEATH_EVENT'] == 0][i].dropna().value_counts()



plt.figure(figsize=(10,2))
plt.subplot(1,2,1)
plt.title('DEATH')
sns.barplot(x = pd.DataFrame(tlog['smoking'][:2]).sort_index()['smoking'], y = pd.DataFrame(tlog['smoking'][:2]).sort_index().index, color = "red")
plt.subplot(1,2,2)
plt.title('Not DEATH')
sns.barplot(x = pd.DataFrame(flog['smoking'][:2]).sort_index()['smoking'], y = pd.DataFrame(flog['smoking'][:2]).sort_index().index, color = "blue")
plt.show()

 5. Coding discrete variables

Since XGBoost cannot handle string-valued features directly, we need some way to convert strings into numbers. One of the simplest methods is label encoding: all values of the same category are mapped to the same integer, e.g. female = 0, male = 1, dog = 2, so the encoded feature values are integers in [0, number of categories − 1]. There are also one-hot encoding, sum (deviation) encoding, leave-one-out encoding and other methods that can give better results (a one-hot sketch follows the code below).

The code is as follows; however, the heart disease dataset used in this article contains only integer and floating-point columns, so nothing actually needs to be encoded here.

## Encode all values of the same category as the same integer
def get_mapfunction(x):
    mapp = dict(zip(x.unique().tolist(),
         range(len(x.unique().tolist()))))
    def mapfunction(y):
        if y in mapp:
            return mapp[y]
        else:
            return -1
    return mapfunction
for i in category_features:
    data[i] = data[i].apply(get_mapfunction(data[i]))
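As an aside, the one-hot encoding mentioned above can be done directly with pandas. A small sketch on a made-up string column (the column name 'sex_str' is hypothetical, not part of this dataset):

# Sketch: one-hot encoding with pandas (the column 'sex_str' is hypothetical).
import pandas as pd

df = pd.DataFrame({'sex_str': ['female', 'male', 'male', 'female']})
one_hot = pd.get_dummies(df['sex_str'], prefix='sex_str')   # one column per category
df = pd.concat([df.drop(columns=['sex_str']), one_hot], axis=1)
print(df)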

6. Model training and prediction

## To correctly evaluate model performance, split the data into a training set and a test set: train on the training set and validate on the test set.
from sklearn.model_selection import train_test_split

## Separate the label (DEATH_EVENT) from the feature columns
data_target_part = data['DEATH_EVENT']
data_features_part = data[[x for x in data.columns if x != 'DEATH_EVENT']]

## The test set is 20% of the data (an 80%/20% split)
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part, test_size = 0.2, random_state = 2020)
# Take a look at the label data
print(y_train[0:2], y_test[0:2])

Import XGBoost model

## Import the XGBoost model
from xgboost.sklearn import XGBClassifier
## Define the XGBoost model
clf = XGBClassifier(use_label_encoder=False)
# Train the XGBoost model on the training set
clf.fit(x_train, y_train)

Note: when installing xgboost from the console with pip, turn off any VPN/proxy ("ladder") first!

Otherwise, there will be this kind of error: WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', timeout ('_ssl.c:1112: The handshake operation timed out'))': /pypi/web/simple/xgboost/

## Use the trained model to predict on the training set and the test set respectively
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
from sklearn import metrics

## Use accuracy (the proportion of correctly predicted samples) to evaluate the model
print('The train accuracy of XGBoost is:', metrics.accuracy_score(y_train, train_predict))
print('The test accuracy of XGBoost is:', metrics.accuracy_score(y_test, test_predict))

## Look at the confusion matrix (counts of true vs. predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the result with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
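Because the classes are somewhat imbalanced (96 deaths vs. 203 survivors), accuracy alone can be misleading. As an addition to the original tutorial, scikit-learn's classification_report gives per-class precision, recall and F1:

# Precision / recall / F1 per class, as a complement to plain accuracy.
from sklearn.metrics import classification_report

print(classification_report(y_test, test_predict, target_names=['survived', 'died']))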

 7. Feature Selection

Feature selection with XGBoost is an embedded feature-selection method. In XGBoost, the feature_importances_ attribute can be used to inspect the importance of each feature.

plt.figure(figsize=(8, 6))
sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)
plt.show()

From the figure we can see that time (the follow-up period) is the most important feature for predicting death.

In addition to feature_importances_, XGBoost provides the following importance types for evaluating features (a query sketch follows below):

  • weight: the number of times a feature is used to split the data across all trees
  • gain: the average gain (reduction in the loss) obtained when the feature is used in a split
  • cover: the average cover of the splits that use the feature, i.e. the (second-order-gradient weighted) number of samples affected by those splits
  • total_gain: the total gain over all splits that use the feature
  • total_cover: the total cover over all splits that use the feature

 acc= 0.7833333333333333

 These plots can also help us better understand other important features.
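A small sketch of how these importance types can be queried from the trained model's underlying booster (clf is the XGBClassifier trained in section 6):

# Query different importance types from the trained model's underlying booster.
booster = clf.get_booster()
for imp_type in ['weight', 'gain', 'cover', 'total_gain', 'total_cover']:
    print(imp_type, booster.get_score(importance_type=imp_type))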

8. Get better results by adjusting parameters

The following are several important parameters

1. learning_rate: also called eta; the default is 0.3. The step size of each iteration is very important: if it is too large, accuracy suffers; if it is too small, training is slow.
2. subsample: default 1. This parameter controls the fraction of samples randomly drawn for each tree. Decreasing it makes the algorithm more conservative and helps avoid overfitting; the value ranges from 0 to 1.
3. colsample_bytree: default 1; we generally set it to around 0.8. It controls the fraction of columns (features) randomly sampled for each tree.
4. max_depth: default 6; values between 3 and 10 are commonly used. This is the maximum depth of a tree and is used to control overfitting: the larger max_depth is, the more specific and local the patterns the model learns. A small example of setting these parameters together is sketched below.
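A minimal sketch of passing these parameters to the scikit-learn wrapper; the values are illustrative, not tuned for this dataset:

# Illustrative parameter settings for XGBClassifier; the values are examples, not tuned.
from xgboost import XGBClassifier

clf_demo = XGBClassifier(
    learning_rate=0.1,      # eta: step size of each boosting iteration
    subsample=0.8,          # fraction of samples drawn for each tree
    colsample_bytree=0.8,   # fraction of features drawn for each tree
    max_depth=5,            # maximum tree depth, controls overfitting
    n_estimators=100,       # number of boosting rounds
)
clf_demo.fit(x_train, y_train)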

Core parameter tuning

1. eta [default 0.3]
Shrinks the contribution of each tree, which makes the model more robust.
Typical values are 0.01-0.2.

2. min_child_weight [default 1]
determines the minimum sum of leaf node sample weights.
This parameter can avoid overfitting. When its value is large, it can prevent the model from learning local special samples.
But if this value is too high, it will lead to underfitting of the model.

3. max_depth [default 6]
This value is also used to avoid overfitting: the larger max_depth is, the more specific and local the patterns the model learns.
Typical value: 3-10

4. max_leaf_nodes
The maximum number of leaves (terminal nodes) in a tree.
It can play the same role as max_depth; if it is defined, max_depth is ignored.
(This name comes from scikit-learn's gradient boosting; the corresponding XGBoost parameter is max_leaves.)

5. gamma [default 0]
When a node splits, the node will only be split if the value of the loss function decreases after the split. Gamma specifies the minimum loss function drop value required for node splits.
The larger the value of this parameter, the more conservative the algorithm. The value of this parameter is closely related to the loss function.

6. max_delta_step [default 0]
This parameter limits the maximum step size of each tree weight change. If the value of this parameter is 0, it means that there is no constraint. If it is given some positive value, it makes the algorithm more conservative.
But it is very helpful for classification problems when the samples of each category are very unbalanced.

7. subsample [default 1]
This parameter controls the proportion of random sampling for each tree.
Decreasing the value of this parameter will make the algorithm more conservative and avoid overfitting. However, if this value is set too small, it may cause underfitting.
Typical value: 0.5-1

8. colsample_bytree [default 1]
is used to control the ratio of the number of columns randomly sampled by each tree (each column is a feature).
Typical value: 0.5-1

9. colsample_bylevel [default 1]
Controls the fraction of columns sampled at each level (depth) of the tree.
Since subsample and colsample_bytree already play a similar role, this parameter is rarely used.

10. lambda [default 1]
L2 regularization term on the weights (similar to Ridge regression).
It controls the regularization part of XGBoost. Although many data scientists rarely touch it, it can be quite useful for reducing overfitting.

11. alpha [default 0]
L1 regularization term on the weights (similar to Lasso regression).
It can be useful in very high-dimensional settings, where it can also make the algorithm faster.

12. scale_pos_weight [default 1]
When the classes are very unbalanced, setting this parameter to a positive value (a common choice is the ratio of negative to positive samples) helps the algorithm converge faster. A small sketch follows below.
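As a sketch for point 12, the ratio can be computed from the training labels prepared in section 6 (this is an illustrative addition, not something tuned in the original article):

# Sketch: set scale_pos_weight to the negative/positive ratio of the training labels.
from xgboost import XGBClassifier

ratio = float((y_train == 0).sum()) / (y_train == 1).sum()
clf_balanced = XGBClassifier(scale_pos_weight=ratio)
clf_balanced.fit(x_train, y_train)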

 Grid tuning method

Common methods for tuning model parameters include greedy search, grid search, Bayesian optimization, etc. Here we use grid search. Its basic idea is exhaustive search: loop over all candidate parameter combinations, try every possibility, and take the combination with the best performance as the final result.

## Import the grid search function from sklearn
from sklearn.model_selection import GridSearchCV

## Define the candidate parameter values
learning_rate = [0.1, 0.3,]
subsample = [0.8]
colsample_bytree = [0.6, 0.8]
max_depth = [3,5]

parameters = { 'learning_rate': learning_rate,
              'subsample': subsample,
              'colsample_bytree':colsample_bytree,
              'max_depth': max_depth}
model = XGBClassifier(n_estimators = 20)

## Run the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)

clf = clf.fit(x_train, y_train)
## Print the best parameter combination found by the grid search
print(clf.best_params_)

## Use the best model parameters to predict on the training and test sets

## Define the XGBoost model with the chosen parameters
## (note: max_depth=8 and subsample=0.9 are not in the grid above; in practice, use the values printed by best_params_)
clf = XGBClassifier(colsample_bytree = 0.6, learning_rate = 0.3, max_depth= 8, subsample = 0.9)
# Train the XGBoost model on the training set
clf.fit(x_train, y_train)

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Use accuracy (the proportion of correctly predicted samples) to evaluate the model
print('The train accuracy of XGBoost is:', metrics.accuracy_score(y_train, train_predict))
print('The test accuracy of XGBoost is:', metrics.accuracy_score(y_test, test_predict))

## Look at the confusion matrix (counts of true vs. predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the result with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

 For more parameter tuning skills, please refer to: [Machine Learning Notes] [Random Forest] [Adjusting Parameters on Breast Cancer Data]_n_estimators_桜キャンドル彦's Blog-CSDN Blog


Original: A. Machine learning entry algorithm (6) XGBoost classification prediction based on weather data set_Ting, artificial intelligence blog-CSDN blog


Origin blog.csdn.net/m0_62237233/article/details/130176412