[Machine Learning] Ensemble Learning (Practice)




This practice-oriented article complements the theory article, reinforcing the concepts through hands-on work (all code below runs in a Jupyter notebook).




1. Preparation (plot font sizes in the Jupyter notebook, warnings, random seed, etc.)

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(43)



2. The basic idea of the ensemble algorithm


During training, multiple classifiers are fitted to the same task:

Figure: the basic idea of the ensemble algorithm (training)

At test time, each classifier predicts the sample in question, and the individual results are then aggregated into a final answer:

Figure: aggregating the classifiers' outputs at test time



3. A simple ensemble: hard voting and soft voting

  • Hard voting: each classifier casts one vote for a class, and the class with the most votes wins (the minority yields to the majority)
  • Soft voting: take a (possibly weighted) average of the classifiers' predicted probabilities, which requires every classifier to be able to output a probability value; a hand-rolled sketch of both strategies follows
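To make the two strategies concrete, here is a minimal hand-rolled sketch (a sketch only, not how VotingClassifier is implemented internally; it assumes classifiers is a list of already-fitted models, all of which implement predict, plus predict_proba for the soft case):

import numpy as np

# Hard voting: every classifier casts one vote; the most frequent label wins
def hard_vote(classifiers, X):
    votes = np.stack([clf.predict(X) for clf in classifiers])   # shape: (n_classifiers, n_samples)
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# Soft voting: average the predicted class probabilities, then take the argmax
def soft_vote(classifiers, X):
    probas = np.mean([clf.predict_proba(X) for clf in classifiers], axis = 0)
    return probas.argmax(axis = 1)

VotingClassifier, used below, packages the same logic and additionally supports per-estimator weights.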

1. Build a test data set

# Import the train/test splitting utility
from sklearn.model_selection import train_test_split

# Import the "two moons" dataset generator
from sklearn.datasets import make_moons

# Build the test data
X, y = make_moons(n_samples = 500, noise = 0.3, random_state = 43)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 43)

# Plot the generated dataset
plt.plot(X[:,0][y==0], X[:,1][y==0], 'yo', alpha = 0.7)
plt.plot(X[:,0][y==1], X[:,1][y==1], 'bs', alpha = 0.7)

[Out]
	Figure: scatter plot of the two-moons dataset


2. Hard voting

# Import the classifier models and a voting aggregator
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Create three different classifiers
log_clf = LogisticRegression(random_state = 6)
rnd_clf = RandomForestClassifier(random_state = 6)
svm_clf = SVC(random_state = 6)

# Put the three classifiers into the voting aggregator and pick the voting mode: hard or soft
# From here on, voting_clf can be treated as a single ensemble model
voting_clf = VotingClassifier(estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting = 'hard')
# Import the accuracy metric for classification
from sklearn.metrics import accuracy_score

# Compare the scores of each individual classifier and of the ensemble
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    
    # Train the model
    clf.fit(X_train, y_train)
    
    # Test the model
    y_pred = clf.predict(X_test)
    
    # Print the result
    print("Classifier {} scored: {}".format(clf.__class__.__name__, accuracy_score(y_test, y_pred)))
[Out]
	Classifier LogisticRegression scored: 0.864
	Classifier RandomForestClassifier scored: 0.896
	Classifier SVC scored: 0.92
	Classifier VotingClassifier scored: 0.912

What the results show: hard voting trades extra training time for (usually) better classification, but in this example the gain is modest — the ensemble even scores slightly below the SVM alone.


3. Soft voting

# Import the classifier models and a voting aggregator
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Create three different classifiers
log_clf = LogisticRegression(random_state = 6)
rnd_clf = RandomForestClassifier(random_state = 6)
# Soft voting needs a probability from every classifier, so the SVM must be
# configured to return probabilities via the probability parameter
svm_clf = SVC(probability = True, random_state = 6)

# Put the three classifiers into the voting aggregator and pick the voting mode: hard or soft
# From here on, voting_clf can be treated as a single ensemble model
voting_clf = VotingClassifier(estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting = 'soft')
# Import the accuracy metric for classification
from sklearn.metrics import accuracy_score

# Compare the scores of each individual classifier and of the ensemble
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    
    # Train the model
    clf.fit(X_train, y_train)
    
    # Test the model
    y_pred = clf.predict(X_test)
    
    # Print the result
    print("Classifier {} scored: {}".format(clf.__class__.__name__, accuracy_score(y_test, y_pred)))
[Out]
	Classifier LogisticRegression scored: 0.864
	Classifier RandomForestClassifier scored: 0.896
	Classifier SVC scored: 0.92
	Classifier VotingClassifier scored: 0.896

What the results show: in theory soft voting should beat hard voting, since it uses each classifier's confidence rather than just its vote, but in this example the soft-voting strategy did not show an advantage.




4. Ensemble learning: the Bagging model

  • First, resample the training set several times, so that each round of sampling yields a different subset of the data.
  • Train multiple homogeneous models (for example, tree models) on these subsets, one model per subset.
  • At prediction time, collect the predictions of all models and aggregate them (a hand-rolled sketch follows the figure below).

Figure: the Bagging workflow
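Before turning to sklearn's implementation, the resample-train-aggregate loop can be sketched by hand (a sketch only; it reuses the X_train, y_train, X_test built in section 3 and relies on the labels being 0/1):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
trees = []
for _ in range(10):
    # Sample with replacement, so every round sees a different subset of the data
    idx = rng.randint(0, len(X_train), size = len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote over the individual tree predictions
all_preds = np.stack([tree.predict(X_test) for tree in trees])   # (n_trees, n_samples)
y_vote = (all_preds.mean(axis = 0) > 0.5).astype(int)            # labels are 0/1, so mean > 0.5 is a majority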

1. Experiment: comparing Bagging with a single traditional model

# Import the Bagging classifier
from sklearn.ensemble import BaggingClassifier

# Import the decision tree
from sklearn.tree import DecisionTreeClassifier

# Build a Bagging classifier
# 1st argument:  use a decision tree as the base learner
# n_estimators:  number of base learners
# max_samples:   maximum number of samples drawn for each base learner
# bootstrap:     whether to sample the data with replacement
# n_jobs:        parallelism (-1 means use all CPU cores)
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                  n_estimators = 500,
                  max_samples = 100,
                  bootstrap = True,
                  n_jobs = -1,
                  random_state = 42
)

# Train the classifier
bag_clf.fit(X_train, y_train)

# Predict
y_pred = bag_clf.predict(X_test)

# Check the accuracy
accuracy_score(y_test, y_pred)
[Out]
	0.912
# Define a plain tree model (for comparison)
tree_clf = DecisionTreeClassifier(random_state = 42)

# Train the model
tree_clf.fit(X_train, y_train)

# Predict
y_pred_tree = tree_clf.predict(X_test)

# Check the accuracy
accuracy_score(y_test, y_pred_tree)
[Out]
	0.872

What the results show: the ensemble clearly improves on a single base learner (0.912 versus 0.872 here).

The difference is easiest to see visually, by drawing the decision boundaries:

# Import the colormap utilities
from matplotlib.colors import ListedColormap

# clf:     the classifier whose decision boundary is drawn
# X, y:    the dataset (features and labels)
# axes:    the plotting range
# alpha:   transparency of the data points
# contour: whether to draw the boundary contours as lines
def plot_decision_boundary(clf, X, y, axes = [-1.5, 2.5, -1, 1.5], alpha = 0.5, contour = True):
    
    # Build a grid over the plotting range
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    
    # Predict the class of every grid point
    y_pred = clf.predict(X_new).reshape(x1.shape)
    
    # Fill the decision regions
    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    plt.contourf(x1, x2, y_pred, cmap = custom_cmap, alpha = 0.3)
    
    # Optionally draw the boundary contours as lines
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
        plt.contour(x1, x2, y_pred, cmap = custom_cmap2, alpha = 0.8)
    
    # Plot the original data points (both classes)
    plt.plot(X[:,0][y==0], X[:,1][y==0], 'yo', alpha = alpha)
    plt.plot(X[:,0][y==1], X[:,1][y==1], 'bs', alpha = alpha)
    plt.axis(axes)
    plt.xlabel('x1')
    plt.ylabel('x2')

Note: for the colour parameter settings, see this post: https://blog.csdn.net/zhaogeng111/article/details/78419015

# Plot the comparison
plt.figure(figsize = (12,5))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title('Decision Tree')
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title('Decision Tree With Bagging')

Figure: decision boundary of a single tree (left) vs. Bagging (right)

What the results show: the single decision tree's boundary is very jagged, a sign of overfitting, while the Bagging model's boundary is smoother and more stable, i.e. it generalizes better.


2. The OOB (out-of-bag) strategy

With Bagging, each bootstrap round leaves out a certain fraction of the samples: they never appear in that round's resampled set, and so take no part in building that tree. This left-out portion can stand in for a test set, and it is called the out-of-bag (OOB) data.
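How large is that left-out portion? When n samples are drawn with replacement from n, each sample is missed with probability (1 − 1/n)^n, which approaches 1/e ≈ 36.8% as n grows. A quick standalone check (not part of the original notebook):

n = len(X_train)                      # number of training samples
rng = np.random.RandomState(0)
idx = rng.randint(0, n, size = n)     # one bootstrap sample
print(1 - len(np.unique(idx)) / n)    # empirical out-of-bag fraction
print((1 - 1/n) ** n)                 # theoretical value, close to 1/e ≈ 0.368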

# Build a Bagging classifier (with oob_score set to True)
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                  n_estimators = 500,
                  max_samples = 100,
                  bootstrap = True,
                  n_jobs = -1,
                  random_state = 42,
                  oob_score = True
)

# Train the classifier
bag_clf.fit(X_train, y_train)

# Check the score measured on the out-of-bag data
bag_clf.oob_score_
[Out]
	0.8933333333333333
# Use the trained classifier to predict the test data
y_pred = bag_clf.predict(X_test)

# Check the score measured on the test data
accuracy_score(y_test, y_pred)
[Out]
	0.912
# The oob_decision_function_ attribute gives each sample's class probabilities,
# estimated from the trees that did not see that sample
bag_clf.oob_decision_function_

Output: an (n_samples, n_classes) array of per-sample class probabilities


3. Random Forest

Random forest is the classic representative of the Bagging family. It exposes a particularly useful attribute, feature_importances_, which scores how important each feature of the data set is. The experiment below uses the iris data set.

# Import the random forest
from sklearn.ensemble import RandomForestClassifier

# Import the iris dataset
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Build a random-forest classifier
rf_clf = RandomForestClassifier(n_estimators = 500, n_jobs = -1)

# Train it
rf_clf.fit(iris['data'], iris['target'])

# Inspect the importance of each feature
for name, score in zip(iris['feature_names'], rf_clf.feature_importances_):
    print(name, score)
[Out]
	sepal length (cm) 0.10786529772603491
	sepal width (cm) 0.026114910898121808
	petal length (cm) 0.44377248611730075
	petal width (cm) 0.42224730525854254

The output above gives the relative importance of each feature (the values sum to 1). In the iris data set, petal length and petal width are clearly the most informative features.
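That these values are proportions can be checked directly:

# The importances are normalized to sum to one
rf_clf.feature_importances_.sum()     # ≈ 1.0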

To see the influence of individual features more vividly, the heat map below highlights the more important features (pixels) of the Mnist data set.

# Importing Mnist through sklearn can be problematic:
# from sklearn.datasets import fetch_openml
# mnist = fetch_openml('MNIST original')

# from sklearn.datasets import fetch_openml
# mnist = fetch_openml('mnist_784')
# x = mnist.data
# y = mnist.target

# Load a locally downloaded copy of Mnist instead (each image is 28 × 28 = 784 pixels)
import scipy.io
mnist = scipy.io.loadmat('./resources/mnist-original.mat')
# Build a random-forest classifier
rf_clf = RandomForestClassifier(n_estimators = 500, n_jobs = -1)

# Train the classifier
rf_clf.fit(mnist['data'].T, mnist['label'].T)


# Check the shape of feature_importances_ for the Mnist data
rf_clf.feature_importances_.shape
[Out]
	(784,)

This attribute holds the relative importance of each of the 784 pixels in the Mnist images. Next we reshape it back to 28 × 28 and draw a heat map from these values.

# Define a plotting helper
def plot_digit(data):
    # Reshape the flat vector back into an image
    image = data.reshape(28, 28)
    
    # Draw the image with a heat colormap
    plt.imshow(image, cmap = matplotlib.cm.hot)
    
    # Hide the axes
    plt.axis('off')
    
# Call the helper to draw the importance map
plot_digit(rf_clf.feature_importances_)

# Draw a colorbar (to explain what dark and bright mean)
colorbar = plt.colorbar(ticks = [rf_clf.feature_importances_.min(), rf_clf.feature_importances_.max()])

# Label the colorbar
colorbar.ax.set_yticklabels(['Not important', 'Very important'])

[Out]

Feature importance heat map




5. Ensemble learning: Boosting models

1. AdaBoost algorithm

The samples that were misclassified in the previous round get extra attention in the next round (much like the notebook of mistakes we kept back in school).
That is: at each step, the current ensemble increases the weights of the observations it predicted wrongly and decreases the weights of those it predicted correctly.

Figure: the AdaBoost reweighting idea
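For reference, full (discrete) AdaBoost reweights in a slightly more principled way than the simplified rule used in the demo below: each round computes the weak learner's weighted error rate, derives an estimator weight from it, and boosts only the misclassified samples. A minimal sketch of that update (binary labels, assuming 0 < err < 1):

import numpy as np

def adaboost_reweight(sample_weights, y_true, y_pred, learning_rate = 1.0):
    # Weighted error rate of the current weak learner
    err = sample_weights[y_pred != y_true].sum() / sample_weights.sum()
    # Estimator weight: large when the learner is accurate
    alpha = learning_rate * np.log((1 - err) / err)
    # Boost the weights of the misclassified samples, then renormalize
    sample_weights = sample_weights * np.exp(alpha * (y_pred != y_true))
    return sample_weights / sample_weights.sum(), alpha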

Let's use an SVM as the base learner to walk through the AdaBoost flow. (The SVM is itself a machine-learning algorithm; its details don't matter here and a later post will cover it — for now, just treat it as a classifier.)

# Import the SVM
from sklearn.svm import SVC

# Number of training samples
m = len(X_train)

# Plot what the ensemble strategy does at each step
plt.figure(figsize=(14,5))

# Loop over two learning rates
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    
    # Initialize the weights: at the start, every sample gets the same weight
    sample_weights = np.ones(m) 
    
    # Select the subplot
    plt.subplot(subplot)
    
    # Build the model 5 times (drawing 5 decision-boundary curves)
    for i in range(5):
        
        # SVM with a Gaussian (RBF) kernel; C = 0.05 gives a soft margin (controls overfitting)
        svm_clf = SVC(kernel = 'rbf', C = 0.05, random_state = 43)
        
        # Train the classifier with the current sample weights
        svm_clf.fit(X_train, y_train, sample_weight = sample_weights)
        
        # Predict on the training set
        y_pred = svm_clf.predict(X_train)
        
        # Update the weights: boost the misclassified samples
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        
        # Draw the decision boundary
        plot_decision_boundary(svm_clf, X, y, alpha = 0.2)
        
    # Title the subplot
    plt.title('learning_rate = {}'.format(learning_rate))
    
    # Annotate which boundary belongs to which round
    if subplot == 121:
        plt.text(-0.5, -0.65, "1", fontsize = 15)
        plt.text(-0.6, -0.30, "2", fontsize = 15)
        plt.text(-0.5, 0.10, "3", fontsize = 15)
        plt.text(-0.4, 0.55, "4", fontsize = 15)
        plt.text(-0.3, 0.90, "5", fontsize = 15)

Its output is as follows:

Implementing AdaBoost Algorithm with SVM

The process above mimics the AdaBoost algorithm by hand on top of a base learner. Below we use the ready-made, encapsulated implementation instead:

# Call the AdaBoost library directly
from sklearn.ensemble import AdaBoostClassifier

# Build an AdaBoost-based classifier
# max_depth:     depth of each base decision tree
# n_estimators:  number of boosting rounds (base learners)
# learning_rate: learning rate
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 2),
            n_estimators = 200,
            learning_rate = 0.5, random_state = 42
)

# Train the model
ada_clf.fit(X_train, y_train)

# Draw the decision boundary
plot_decision_boundary(ada_clf, X, y)

Directly using the AdaBoost method


2. Gradient Boosting Algorithm

Each new learner fits what the current ensemble still gets wrong, and the final prediction adds up the outputs of all learners.

Figure: the Gradient Boosting idea


(1) Algorithm flow of Gradient Boosting

The following walks through the Gradient Boosting flow using regression trees:

# Build a new sample dataset
np.random.seed(42)
X = np.random.rand(100,1) - 0.5
y = 3*X[:,0]**2 + 0.05*np.random.randn(100)
# Import the regression tree
from sklearn.tree import DecisionTreeRegressor

# Fit the first regression tree
tree_reg1 = DecisionTreeRegressor(max_depth = 2)

# Train it
tree_reg1.fit(X, y)


# Compute the residuals
y2 = y - tree_reg1.predict(X)

# Fit a second regression tree to the residuals y2
tree_reg2 = DecisionTreeRegressor(max_depth = 2)

# Train it
tree_reg2.fit(X, y2)


# Compute the next residuals
y3 = y2 - tree_reg2.predict(X)

# Fit a third regression tree
tree_reg3 = DecisionTreeRegressor(max_depth = 2)

# Train it
tree_reg3.fit(X, y3)


# Build a test point
X_new = np.array([[0.25]])

# Predict it by summing the three trees' outputs
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

# Check the prediction
print("The true value at {} is roughly {}; the prediction is {}".format(X_new[0], 3*X_new[0]**2, y_pred))
[Out]
	The true value at [0.25] is roughly [0.1875]; the prediction is [0.17052257]
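The three trees above unroll a general loop: each new tree fits the residual left by the current ensemble, and the prediction is the sum over all trees. A compact sketch of that loop (squared loss only, with no learning rate and no initial constant, both of which sklearn's GradientBoostingRegressor adds):

from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees = 3, max_depth = 2):
    trees, residual = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth = max_depth).fit(X, residual)
        trees.append(tree)
        residual = residual - tree.predict(X)   # what the ensemble still gets wrong
    return trees

def gradient_boost_predict(trees, X):
    return sum(tree.predict(X) for tree in trees)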

The experiment above walks through the Gradient Boosting workflow step by step. To make the process more intuitive, it is visualized below.


(2) Visually display the Gradient Boosting process

# Define a plotting helper
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    
    # Sample points along the x-axis
    x1 = np.linspace(axes[0], axes[1], 500)
    
    # Ensemble prediction: the sum over all regressors
    y_pred = sum(regressor.predict(x1.reshape(-1,1)) for regressor in regressors)
    
    # Plot the sample data points
    plt.plot(X[:,0], y, data_style, label=data_label)
    
    # Plot the prediction curve
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    
    # Draw the legend
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
        
    # Set the axes
    plt.axis(axes)
# Now draw the six panels
plt.figure(figsize=(11,11))

plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5,0.5,-0.1,0.8], label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5,0.5,-0.1,0.8], label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5,0.5,-0.5,0.5], label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)

plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5,0.5,-0.1,0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5,0.5,-0.5,0.5], label="$h_3(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)

plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5,0.5,-0.1,0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.show()

[Out]
Figure: the Gradient Boosting process, panel by panel

In the figure above, the predictions of the first fitted tree roughly follow the distribution of the original data (panel 1). The second tree, built on top of the first, fits the residuals, and its curve roughly follows the residual distribution (panel 3); likewise, the third tree fits the residuals left by the first two (panel 5). Panels 2, 4 and 6 show the ensemble obtained by summing the trees built so far.

In practice there are many ready-made frameworks implementing Gradient Boosting, which can simply be imported when needed, for example:

  • First generation: GBDT in sklearn (rarely the first choice nowadays)
  • Second generation: XGBoost
  • Third generation: LightGBM
  • ……

The demonstration below uses the ready-made GBDT model from sklearn.


(3) Experiment: using the ready-made GBDT model in sklearn

# Import the GBDT model
from sklearn.ensemble import GradientBoostingRegressor

# Build a GBDT model
# max_depth:     maximum depth of each tree
# n_estimators:  number of trees
# learning_rate: learning rate (not the same as in gradient descent: here it
#                mainly controls the weight each tree contributes to the ensemble)
# The remaining parameters mirror those of a tree model
gbdt = GradientBoostingRegressor(max_depth = 2,
                         n_estimators = 3,
                         learning_rate = 1.0,
                         random_state = 42
                         )

# Train the model
gbdt.fit(X, y)


# For comparison, build two more GBDT models
gbdt_slow_1 = GradientBoostingRegressor(max_depth = 2,
                         n_estimators = 3,
                         learning_rate = 0.1,
                         random_state = 42
                         )

# Train it
gbdt_slow_1.fit(X, y)

# Build the second comparison model
gbdt_slow_2 = GradientBoostingRegressor(max_depth = 2,
                         n_estimators = 200,
                         learning_rate = 0.1,
                         random_state = 42
                         )

# Train it
gbdt_slow_2.fit(X, y)


# Comparison 1: same number of trees, different learning rates
plt.figure(figsize = (11,4))
plt.subplot(121)
plot_predictions([gbdt], X, y, axes=[-0.5,0.5,-0.1,0.8], label='GBDT Predictions')
plt.title('learning_rate={}, n_estimators={}'.format(gbdt.learning_rate, gbdt.n_estimators))

plt.subplot(122)
plot_predictions([gbdt_slow_1], X, y, axes=[-0.5,0.5,-0.1,0.8], label='GBDT Predictions')
plt.title('learning_rate={}, n_estimators={}'.format(gbdt_slow_1.learning_rate, gbdt_slow_1.n_estimators))

[Out]
	Figure: same number of trees, different learning rates

The figure above shows that with only a few trees, a larger learning rate is needed to reach a decent fit. In practice, however, we usually keep the learning rate low (to guard against overfitting) and use more trees to recover the fitting power.

# Comparison 2: same learning rate, different numbers of trees
plt.figure(figsize = (11,4))
plt.subplot(121)
plot_predictions([gbdt_slow_2], X, y, axes=[-0.5,0.5,-0.1,0.8], label='GBDT Predictions')
plt.title('learning_rate={}, n_estimators={}'.format(gbdt_slow_2.learning_rate, gbdt_slow_2.n_estimators))

plt.subplot(122)
plot_predictions([gbdt_slow_1], X, y, axes=[-0.5,0.5,-0.1,0.8], label='GBDT Predictions')
plt.title('learning_rate={}, n_estimators={}'.format(gbdt_slow_1.learning_rate, gbdt_slow_1.n_estimators))

[Out]
	Figure: same learning rate, different numbers of trees

The figure above shows that at a low learning rate (and with shallow trees), adding more trees yields a better fit.


(4) Early stopping

In GBDT, as n_estimators (the number of trees in the ensemble) grows, the loss does not necessarily keep strictly decreasing; moreover, an ensemble with many trees is not always worth its cost compared with a smaller one (the extra training time may buy only a negligible improvement). We therefore want to find the right moment to stop the algorithm and save time.

# Import the mean-squared-error metric
from sklearn.metrics import mean_squared_error

# Re-split the data into new training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 49)

# Build a new GBDT model
gbdt = GradientBoostingRegressor(max_depth = 2,
                         n_estimators = 120,
                         random_state = 42
                         )

# Train it
gbdt.fit(X_train, y_train)

# Evaluate the model stage by stage, i.e. as if it had 1, 2, 3, … trees
# staged_predict yields the prediction after each boosting stage
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbdt.staged_predict(X_val)]

# Pick the stage with the smallest mean squared error (the best validation score)
bst_n_estimators = np.argmin(errors) + 1
min_error = np.min(errors)
print("Out of {} trees in the GBDT model,\nthe best validation score is reached with {} trees,\nwith a validation MSE of {}.".format(len(errors), bst_n_estimators, min_error))
[Out]
	Out of 120 trees in the GBDT model,
	the best validation score is reached with 56 trees,
	with a validation MSE of 0.002712853325235463.
# Now refit a GBDT with the best number of trees and plot its fit
# (all other parameters unchanged)
gbdt_best = GradientBoostingRegressor(max_depth = 2,
                         n_estimators = bst_n_estimators,
                         random_state = 42
                         )

# Train the new model
gbdt_best.fit(X_train, y_train)

# Plot the results
plt.figure(figsize = (11,4))

plt.subplot(121)

# Staged validation error of the original GBDT model
plt.plot(errors, 'b-')

# Dashed guides marking the best number of trees
plt.plot([bst_n_estimators, bst_n_estimators], [0, min_error], 'k--')
plt.plot([0, 120], [min_error, min_error], 'k--')
plt.axis([0, 120, 0, 0.01])
plt.title('Val Error')

# Fit of the best GBDT model
plt.subplot(122)
plot_predictions([gbdt_best], X, y, axes=[-0.5,0.5,-0.1,0.8])
plt.title('Best Model(%d trees)' % bst_n_estimators)

[Out]
	Figure: staged validation error (left) and the fit of the best model (right)

How to implement early stopping?

The straightforward way is what was just demonstrated: train the full set of trees, find the stage with the smallest validation MSE among the staged predictions, and then retrain a model with that number as its n_estimators. But this approach still trains the full ensemble before cutting it back. A better way is incremental: train one tree at a time, measure the current ensemble on the validation set, then in the next round add a single new tree to the ensemble built so far and measure again. This incremental training only requires turning on the warm-start parameter warm_start when building the GBDT model:

# Early stopping via the "warm start" parameter
gbdt_auto = GradientBoostingRegressor(max_depth = 2,
                         # With warm_start on, n_estimators is not fixed up front;
                         # instead we raise it step by step inside the loop
                         warm_start = True,
                         random_state = 42
                         )

# Maximum number of consecutive rounds the error may go up before we stop
MAX_FLAW = 5

# How many rounds the error has currently been going up
error_going_up = 0

# Track the smallest validation MSE seen so far; start sufficiently large
min_val_error = float('inf')

# The n_estimators value at which the smallest MSE was reached
min_val_error_estimators = 0

# Loop to find the best number of trees
for n_estimators in range(1, 120):
    
    # Raise the model's n_estimators by one step
    gbdt_auto.n_estimators = n_estimators
    
    # Train (with warm start, only the newly added tree is fitted)
    gbdt_auto.fit(X_train, y_train)
    
    # Predict on the validation set
    y_pred = gbdt_auto.predict(X_val)
    
    # Compute the validation MSE
    val_error = mean_squared_error(y_val, y_pred)
    
    # If this is a new best, record it and reset the counter
    if val_error < min_val_error:
        min_val_error = val_error
        min_val_error_estimators = n_estimators
        error_going_up = 0
        
    # Otherwise the newly added tree made the ensemble worse on the validation set;
    # count these rounds, and stop once the error has gone up MAX_FLAW times in a row
    else:
        error_going_up += 1
        if error_going_up == MAX_FLAW:
            break

# Print the best number of trees found
print("The smallest validation MSE is reached with {} trees: {}.".format(min_val_error_estimators, min_val_error))
[Out]
	The smallest validation MSE is reached with 56 trees: 0.002712853325235463.
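For completeness: sklearn 0.20 and later builds this pattern in. Setting n_iter_no_change makes GradientBoostingRegressor hold out validation_fraction of its training data and stop once the validation loss stops improving; a sketch (the stopping point may differ slightly from the 56 found above, because the internal split is different):

# Built-in early stopping (sklearn >= 0.20)
gbdt_auto2 = GradientBoostingRegressor(max_depth = 2,
                         n_estimators = 120,           # upper bound; training may stop earlier
                         n_iter_no_change = 5,         # stop after 5 rounds without improvement
                         validation_fraction = 0.1,    # share of training data held out internally
                         random_state = 42
                         )
gbdt_auto2.fit(X_train, y_train)
gbdt_auto2.n_estimators_                               # the number of trees actually fitted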



6. Ensemble learning: the Stacking model


The Stacking strategy works in two main stages:

  • Feed the original data set to L heterogeneous weak learners, each of which makes its own prediction;
  • Use the L weak learners' predictions as input features for a meta-model, which aggregates them and outputs the final result.

Figure: the Stacking architecture

Phase 1: training the heterogeneous learners

# Import the library for loading local resources
import scipy.io

# Load the local Mnist dataset (a ten-class task)
# If you don't have this file, see the earlier post in this column on fixing
# the failed sklearn Mnist import for where to download it
mnist = scipy.io.loadmat('./resources/mnist-original.mat')

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    mnist['data'].T, mnist['label'].T, test_size = 10000, random_state = 42)

# Pick several different classifiers (imports)
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

# Build the heterogeneous weak learners
random_forest_clf = RandomForestClassifier(random_state = 42)
extra_trees_clf = ExtraTreesClassifier(random_state = 42)
svm_clf = LinearSVC(random_state = 42)
mlp_clf = MLPClassifier(random_state = 42)

# Collect them in a list
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]

# Train each of them
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)
[Out]
	Training the RandomForestClassifier(random_state=42)
	Training the ExtraTreesClassifier(random_state=42)
	Training the LinearSVC(random_state=42)
	Training the MLPClassifier(random_state=42)

Phase 2: Using the predictions of heterogeneous learners as input, train a meta-model that combines them

# Array to hold the predictions of the different learners
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype = np.float32)

# Collect each learner's predictions (on the validation set — be sure to use
# a different data set from the one used for training above)
for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)
    
# Inspect the predictions
X_val_predictions
[Out]
	array([[7., 7., 7., 7.],
	       [8., 8., 8., 8.],
	       [6., 6., 6., 6.],
	       ...,
	       [9., 9., 9., 9.],
	       [1., 1., 1., 1.],
	       [6., 6., 6., 6.]], dtype=float32)
# Build the meta-model that combines the heterogeneous learners' outputs
# (a random forest is used here)
rnd_forest_blender = RandomForestClassifier(n_estimators = 200, oob_score = True, random_state = 42)

# Train the aggregating meta-model on the validation-set predictions
rnd_forest_blender.fit(X_val_predictions, y_val)

# Check the OOB score
rnd_forest_blender.oob_score_
[Out]
	0.9701
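Since version 0.22, sklearn also packages this two-phase pattern as StackingClassifier, which builds the meta-features from internal cross-validated predictions instead of a separate hold-out split. A minimal sketch reusing the learners defined above (fitting it retrains everything, which is slow on the full Mnist data):

from sklearn.ensemble import StackingClassifier

stack_clf = StackingClassifier(
    estimators = [('rf', random_forest_clf),
                  ('et', extra_trees_clf),
                  ('svc', svm_clf),
                  ('mlp', mlp_clf)],
    final_estimator = RandomForestClassifier(n_estimators = 200, random_state = 42),
    cv = 5    # meta-features come from 5-fold cross-validated predictions
)
# stack_clf.fit(X_train, y_train.ravel())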

END

