Regression on the Diabetes Dataset Based on the Random Forest Algorithm

About the Authors

Li Yige, female, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2021)
Research direction: EEG emotion recognition
Email: [email protected]

Meng Liping, female, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2021), member of Zhang Hongwei's artificial intelligence research group
Research direction: machine vision and artificial intelligence
Email: [email protected]

1. Principle of random forest algorithm

1.1 Decision tree and Bagging

(1) Decision tree
The decision tree algorithm is a supervised machine learning method. In classification problems, its essence is to divide samples into different categories by summarizing the classification rules contained in the data. A decision tree classifies instances by their features using a tree-shaped structure, as shown in Figure 1-1. A complete decision tree consists of a root node, a number of internal nodes, and leaf nodes. The root node (triangle) sits at the top level of the tree; it is the starting point of dataset partitioning and holds the entire dataset. The internal nodes (circles) in the middle are attribute nodes, each representing a test on a particular attribute. The leaf nodes (squares) are category nodes, representing the final decision for the data.
A decision tree is constructed recursively: at each step the optimal feature is selected and the training data are partitioned according to it. The steps are as follows:
1) Construct the root node. All training samples start at the root node;
2) Determine the optimal feature. The training samples are split into subsets using the optimal feature so that each subset is classified as well as possible. Two cases must then be considered: if a subset already meets the classification criterion, a corresponding leaf node is constructed; if a subset does not yet meet the criterion, it is split further;
3) Continue recursively until all training samples are well classified or no suitable features remain.

[Figure 1-1: Structure of a decision tree]
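To make the construction above concrete, here is a minimal sketch using sklearn's DecisionTreeClassifier (an illustration, not part of the original experiment; the iris dataset is an arbitrary choice, used only because it is small):

from sklearn.datasets import load_iris           # small demo dataset (any labeled data works)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# The classifier recursively picks the best splitting feature at each node,
# growing internal nodes until the leaves are pure or max_depth is reached
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())       # inspect the resulting tree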

(2) Bagging algorithm
Bagging (Bootstrap aggregating), also known as the bagging algorithm, is an ensemble learning method in machine learning, originally proposed by Leo Breiman in 1996. Bagging can be combined with other classification and regression algorithms to improve their accuracy and stability while reducing the variance of the results and thus avoiding overfitting. The basic flow of the bagging algorithm is shown in Figure 1-2.

[Figure 1-2: Basic flow of the bagging algorithm]

The bagging algorithm draws training samples with replacement, so some samples from the original set may appear multiple times in a bootstrap sample while others never appear. This means that even though each generated dataset has the same size as the original, certain data points are excluded from it. Each generated dataset is therefore different, and this is how diversity arises in the ensemble model. The probability that a given data point is never sampled is

(1 − 1/N)^N

where N is the number of samples in the original dataset. As N grows, this probability approaches 1/e ≈ 0.368, i.e., about 36.8% of the samples are not selected. These samples are called out-of-bag data and can be used as a validation set for an "out-of-bag estimate" of generalization performance.
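This value is easy to check with a quick simulation (an illustrative sketch, not part of the original article; N is an arbitrary choice):

import numpy as np

N = 10000                      # size of the original dataset
rng = np.random.default_rng(0)
idx = rng.integers(0, N, N)    # one bootstrap sample: N draws with replacement
oob_fraction = 1 - len(np.unique(idx)) / N   # fraction of points never drawn
print(oob_fraction, (1 - 1/N) ** N)          # both are close to 1/e ≈ 0.368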

1.2 Random Forest Algorithm

Random Forest (RF) is the most representative Bagging-type algorithm. On top of a Bagging ensemble built from decision-tree base learners, random forest additionally introduces random attribute selection into the training of each tree: at every node of a base decision tree, it first randomly selects a subset of that node's attribute set and then chooses the optimal attribute within this subset for the split.
The basic principle of a random forest is to generate multiple different data subsets by sampling the dataset and to train one classification tree on each subset. The predictions of the individual trees are then combined, with the final prediction of the random forest determined by majority vote.
The training process of the random forest algorithm is summarized as follows:
1) Let the size of the training set be N. To construct each tree, N training samples are drawn from the training set at random with replacement and used as that tree's training set;
2) Let the feature dimension of each sample be M. At each split of the tree, m features (m << M) are selected at random to form a candidate feature set from which the split is chosen; the value of m can be tuned using the out-of-bag error rate;
3) Repeat steps 1) and 2) k times to obtain k decision trees;
4) Use the random forest formed by the k trees to determine the final result, by majority vote for classification (or by averaging for regression).

[Figure 1-3: Modeling process of a random forest]

Figure 1-3 shows the modeling process of a random forest. As the description above makes clear, the randomness of a random forest comes mainly from two sources: the random training sample drawn for each tree and the random candidate attributes at each node split. The main advantages of random forests are: they produce accurate, well-performing classifiers for a wide variety of data; the out-of-bag estimate of the generalization error is unbiased, and the model generalizes well; and the trees do not need to be pruned to achieve good results.
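As an illustrative sketch (not from the original article), steps 1)–4) can be written out directly with sklearn's DecisionTreeRegressor; the regression version shown here averages the trees' outputs instead of voting, and X, y are assumed to be NumPy arrays:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_simple_forest(X, y, k=10, m=3, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))   # step 1: bootstrap sample of size N
        # step 2: max_features=m makes every split consider only m random features
        tree = DecisionTreeRegressor(max_features=m, random_state=seed)
        tree.fit(X[idx], y[idx])
        trees.append(tree)                      # step 3: collect k trees
    return trees

def forest_predict(trees, X):
    # step 4: aggregate the individual predictions (mean for regression)
    return np.mean([t.predict(X) for t in trees], axis=0)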

2. Experimental procedure

2.1 Diabetes dataset

The diabetes dataset used in this experiment is the one shipped with sklearn. It contains 442 samples, each with ten features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], corresponding to age, sex, body mass index, average blood pressure, and six blood serum measurements s1 through s6. The target is a quantitative measure of disease progression one year after baseline, with values ranging from 25 to 346.

[Figure: preview of the diabetes dataset]
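The description above can be checked directly (an illustrative snippet, not part of the original code):

from sklearn.datasets import load_diabetes

data = load_diabetes()
print(data.feature_names)                     # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']
print(data.data.shape)                        # (442, 10)
print(data.target.min(), data.target.max())  # 25.0 346.0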

2.2 Experimental process

a. Import the modules needed for the experiment

import matplotlib.pyplot as plt                         # plotting
import numpy as np                                      # numerical arrays
from sklearn.metrics import r2_score                    # evaluate results with the coefficient of determination R^2
from sklearn.model_selection import train_test_split    # split the dataset
from sklearn.ensemble import RandomForestRegressor      # random forest regression model
from sklearn.datasets import load_diabetes              # the diabetes dataset

b. Load the diabetes dataset and split the data
Since a small test set yields an inaccurate estimate of the model's generalization error, a relatively large test set is used: the data are split with a test:train ratio of 4:6.

# Load the data
data = load_diabetes()
X = data.data    # features
y = data.target  # target values


# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
# test:train ratio is 4:6; the smaller the test set, the less accurate
# the estimate of the model's generalization error
n = np.arange(0, X_test.shape[0], 1)  # sample indices, used later for plotting
print('shape of n{}'.format(n.shape))            # show the shapes of the index vector,
print('shape of X_test{}'.format(X_test.shape))  # the test features,
print('shape of y_test{}'.format(y_test.shape))  # and the test targets

c. Train the random forest model

# Random forest model
regressor = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
# random forest regression with 100 decision trees of maximum depth 3
# random_state fixes the forest's internal randomness so that each run builds
# the same forest (it does not affect the unseeded train/test split above)
regressor.fit(X_train, y_train)  # fit the model
result = regressor.predict(X_test)
print('score:{}'.format(r2_score(y_test, result)))  # goodness of fit (R^2) between predictions and test targets
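Since section 1.1 introduced out-of-bag estimation, it is worth noting that sklearn exposes it directly; a possible variant of the model above (not part of the original experiment):

# Estimate generalization performance from the out-of-bag samples
regressor_oob = RandomForestRegressor(n_estimators=100, max_depth=3,
                                      oob_score=True, random_state=0)
regressor_oob.fit(X_train, y_train)
print('OOB R^2: {}'.format(regressor_oob.oob_score_))  # R^2 on out-of-bag predictions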

d. Plot the predicted and true curves

# Plot the predicted and true values
plt.figure(figsize=(8, 5))                             # figure size
plt.plot(n, result, c='r', label='prediction', lw=2)   # predicted values
plt.plot(n, y_test, c='b', label='true', lw=2)         # true values
plt.axis('tight')       # limit the axes to the region containing data
plt.legend(loc='best')  # show the labels defined above
plt.title("RandomForestRegressor")
plt.show()

2.3 Experimental results

[Figure: predicted vs. true values on the test set]

As shown in the figure above, the score is the coefficient of determination R², which indicates how well the model predicts the test samples; the closer it is to 1, the better the prediction.
From the comparison of predicted and true values, the score is about 0.422, which means the model explains roughly 42% of the variance of the true values (note that R² is not a correlation coefficient).
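For reference, R² can be recomputed from its definition, R² = 1 − SS_res/SS_tot, which is equivalent to sklearn's r2_score (an illustrative check using the y_test and result variables from the code above):

import numpy as np

ss_res = np.sum((y_test - result) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)                       # matches r2_score(y_test, result)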

2.4 Complete experimental code

import matplotlib.pyplot as plt                         # plotting
import numpy as np                                      # numerical arrays
from sklearn.metrics import r2_score                    # evaluate results with the coefficient of determination R^2
from sklearn.model_selection import train_test_split    # split the dataset
from sklearn.ensemble import RandomForestRegressor      # random forest regression model
from sklearn.datasets import load_diabetes              # the diabetes dataset

# Load the data
data = load_diabetes()
X = data.data    # features
y = data.target  # target values


# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
# test:train ratio is 4:6; the smaller the test set, the less accurate
# the estimate of the model's generalization error
n = np.arange(0, X_test.shape[0], 1)  # sample indices, used later for plotting
print('shape of n{}'.format(n.shape))            # show the shapes of the index vector,
print('shape of X_test{}'.format(X_test.shape))  # the test features,
print('shape of y_test{}'.format(y_test.shape))  # and the test targets

# Random forest model
regressor = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
# random forest regression with 100 decision trees of maximum depth 3
# random_state fixes the forest's internal randomness so that each run builds
# the same forest (it does not affect the unseeded train/test split above)
regressor.fit(X_train, y_train)  # fit the model
result = regressor.predict(X_test)
print('score:{}'.format(r2_score(y_test, result)))  # goodness of fit (R^2) between predictions and test targets

# Plot the predicted and true values
plt.figure(figsize=(8, 5))                             # figure size
plt.plot(n, result, c='r', label='prediction', lw=2)   # predicted values
plt.plot(n, y_test, c='b', label='true', lw=2)         # true values
plt.axis('tight')       # limit the axes to the region containing data
plt.legend(loc='best')  # show the labels defined above
plt.title("RandomForestRegressor")
plt.show()
