[SkLearn classification, regression algorithm] DecisionTreeRegressor regression tree



DecisionTreeRegressor regression tree

class sklearn.tree.DecisionTreeRegressor (criterion='mse', splitter='best', max_depth=None,
min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None,
random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
presort=False)

Almost all parameters, attributes and interfaces are exactly the same as those of the classification tree. Note that in a regression tree there is no question of whether the label distribution is balanced, so there is no class_weight parameter.


① Important parameters, attributes and interfaces

  • The criterion parameter measures the quality of a split in the regression tree. Three criteria are supported (a worked example follows this list):

    • 1) Enter "mse" to use the mean squared error (MSE). The difference in MSE between the parent node and its leaf nodes is used as the criterion for feature selection; this method minimizes the L2 loss by using the mean value of each leaf node:

      MSE = (1/N) * Σ_{i=1..N} (f_i - y_i)²

      where N is the number of samples, i indexes each data sample, f_i is the value predicted by the model, and y_i is the actual numeric label of sample i. The essence of MSE is therefore the difference between the samples' true values and the regression results. In the regression tree, MSE is not only our split-quality criterion, but also the indicator we most commonly use to measure the quality of the regression itself: when we use cross-validation or other methods to evaluate a regression tree, we usually choose the mean squared error as the evaluation metric (in the classification tree, this role is played by the prediction accuracy returned by score). In regression, the smaller the MSE, the better. Note, however, that the score interface of the regression tree returns R², not MSE. R² is defined as follows:

      R² = 1 - u/v,  where u = Σ_{i=1..N} (f_i - y_i)²  and  v = Σ_{i=1..N} (y_i - ȳ)²

      Here u is the residual sum of squares (MSE × N), v is the total sum of squares, N is the number of samples, i indexes each data sample, f_i is the value predicted by the model, y_i is the actual numeric label of sample i, and ȳ is the mean of the true numeric labels. R² can be positive or negative (if the model's residual sum of squares is much greater than its total sum of squares, the model is very bad and R² will be negative), whereas the mean squared error is always positive.
      ★ It is worth mentioning that although the mean squared error is always positive, when the mean squared error is used as a scoring criterion in sklearn it is computed as the negative mean squared error (neg_mean_squared_error). This is because when sklearn computes a model evaluation metric it takes the nature of the metric itself into account: the mean squared error is an error, so sklearn classifies it as a loss of the model and therefore represents it as a negative number. The true MSE is simply neg_mean_squared_error with the negative sign removed.

    • 2) Enter "friedman_mse" to use the Friedman mean squared error; this criterion uses Friedman's improved mean squared error for problems in potential splits.

    • 3) Enter "mae" to use the mean absolute error (MAE); this criterion uses the median value of the leaf nodes to minimize the L1 loss.

  • The most important attribute is still feature_importances_, and the core interfaces are still apply, fit, predict and score.
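
As a quick check of the relations above, the following sketch fits a regression tree on a small made-up data set and compares the value returned by score (R²) with the mean squared error; the data and variable names here are illustrative assumptions, not part of the library documentation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Toy 1-D regression data (illustrative only)
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = reg.predict(X)

# score() returns R², the same value as r2_score(y_true, y_pred)
print(reg.score(X, y), r2_score(y, pred))

# MSE itself is always positive; the scorer "neg_mean_squared_error"
# reports its negative because sklearn treats MSE as a loss
print(mean_squared_error(y, pred))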



② Cross validation

Cross-validation is a method for observing the stability of a model. We divide the data into n parts, use one part as the test set and the remaining n-1 parts as the training set, and compute the model's accuracy multiple times to evaluate its average performance. How the training and test sets are split interferes with the model's results, so the average of the n cross-validation results is a better measure of how well the model performs.
(Figure: schematic of n-fold cross-validation)
See the blog post: Machine Learning | Cross-Validation

def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
                    n_jobs=None, verbose=0, fit_params=None,
                    pre_dispatch='2*n_jobs', error_score=np.nan):
estimator: the estimator object (classifier or regressor) --- the model to be trained
X: the feature data (Features)
y: the target labels (Labels)
scoring: the scoring method to use (including accuracy, mean_squared_error, etc.)
cv: the number of folds for cross-validation
n_jobs: the number of CPUs working in parallel (-1 means all)

♦ Basic usage

from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Load the dataset
boston = load_boston()
x = boston.data
y = boston.target

# Build the model
regressor = DecisionTreeRegressor(random_state=0)

# Basic use of cross-validation
cross_score = cross_val_score(regressor,x,y,cv=10,scoring="neg_mean_squared_error")
cross_score

Cross-validation results:
array([-16.41568627, -10.61843137, -18.30176471, -55.36803922,
       -16.01470588, -44.70117647, -12.2148    , -91.3888    ,
       -57.764     , -36.8134    ])
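
Because scoring="neg_mean_squared_error" reports each fold's loss as a negative number, the actual MSE per fold is obtained by flipping the sign; a minimal continuation of the snippet above:

# Negate to recover the true MSE of each fold, then average it
mse_per_fold = -cross_score
print(mse_per_fold.mean())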



③ Example: plotting a one-dimensional sine regression

An ideal, noise-free sine data set

import numpy as np
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Create a random number generator with a fixed seed
rng = np.random.RandomState(1)
# The fixed seed always produces the same random numbers (80 values in [0, 5))
# Keep x two-dimensional when generating it ---- sklearn does not accept 1-D feature arrays
# Sort x so that the curve is plotted from small to large values
x = np.sort(5*rng.rand(80,1),axis=0)
# Target values under the ideal sine curve
# Used as the target values when plotting, so they should be converted to 1-D
# ravel() flattens a multi-dimensional array
y = np.sin(x).ravel()

plt.figure(figsize=(10,6))
plt.scatter(x,y,s=20,edgecolor='black',c='darkorange',label='data')
plt.show()

(Figure: scatter plot of the noise-free sine data)

Add noise to the data set

# Add noise to every 5th y value (16 points in total)
y[::5] += 3 * (0.5 - rng.rand(16))

# Plot again
plt.figure(figsize=(10,6))
plt.scatter(x,y,s=20,edgecolor='black',c='darkorange',label='data')
plt.show()

Comparing the two plots, the distinction introduced by adding noise to the data set is obvious.

(Figure: scatter plot of the sine data with noise added)

Create test set

[:, np.newaxis] is used to add a dimension to a one-dimensional array (turning it into a column vector)
[np.newaxis, :] is used to add a dimension to a one-dimensional array as a row vector (the transpose of the above)
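
A quick illustration of the two forms (the array values here are arbitrary):

import numpy as np

a = np.array([1, 2, 3])
print(a[:, np.newaxis].shape)   # (3, 1) -- column vector
print(a[np.newaxis, :].shape)   # (1, 3) -- row vector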

# Create the test set
xtest = np.arange(0,5,0.01)[:,np.newaxis]
xtest

(Output: a 500×1 array of test points from 0 to 5 in steps of 0.01)

Fit the two models and make predictions
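
The two models regr_1 and regr_2 below are the regression trees of depth 2 and depth 5 discussed under the final plot; a minimal sketch for fitting them on the noisy data (the max_depth values are assumed from that discussion):

# Fit two regression trees of different depth on the noisy sine data
# (max_depth=2 and max_depth=5 are assumed from the comparison below)
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(x, y)
regr_2.fit(x, y)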

  • Use the predict interface to make predictions
y_1 = regr_1.predict(xtest)
y_2 = regr_2.predict(xtest)

Plot the results

# Plot the noisy data together with the two fitted curves
plt.figure(figsize=(10,6))
plt.scatter(x,y,s=20,edgecolor='black',c='darkorange',label='data')
plt.plot(xtest,y_1,c='red',linestyle='-',linewidth=1.5,label='regr_1')
plt.plot(xtest,y_2,c='green',linestyle='-.',linewidth=1.5,label='regr_2')
plt.legend()
plt.title('Decision Tree Regression')
plt.show()
  • It can be seen from the plot that for the model with a tree depth of 2 the overall fitted trend follows the sin() function, while the model with a tree depth of 5 follows a similar overall trend but overfits.
    (Figure: decision tree regression fits of the two trees over the noisy sine data)
    It can be seen that the regression tree learns local linear regressions that approximate the sine curve. If the maximum depth of the tree (controlled by the max_depth parameter) is set too high, the decision tree learns too finely: it picks up many details of the training data, including the noise, which makes the model deviate from the true sine curve and overfit.
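
One way to see this effect numerically is to cross-validate trees of increasing depth on the noisy data and watch the error stop improving; a minimal sketch reusing x and y from above (the depth values tried here are chosen arbitrarily):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Compare the average cross-validated MSE for several tree depths
for depth in [2, 3, 5, 8]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    neg_mse = cross_val_score(tree, x, y, cv=5,
                              scoring="neg_mean_squared_error")
    print(depth, (-neg_mse).mean())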




Origin blog.csdn.net/qq_45797116/article/details/113547952