scikit-learn ensemble learning: code annotations and related exercises

1. Code comments

Code from: https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html#sphx-glr-auto-examples-ensemble-plot-adaboost-twoclass-py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_gaussian_quantiles
from sklearn.datasets import make_classification

# Generate the dataset
# X1, y1 = make_gaussian_quantiles(cov=2.0, n_samples=200, n_features=2, n_classes=2, random_state=1)
# X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=300, n_features=2, n_classes=2, random_state=1)
# X = np.concatenate((X1, X2))
# y = np.concatenate((y1, -y2 + 1))
X, y = make_classification(n_samples=1000, n_features=2,  n_redundant=0, random_state=6)

# Declare the AdaBoostClassifier estimator
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=3000)
# bdt = AdaBoostClassifier(SGDClassifier(loss='hinge'), algorithm="SAMME", n_estimators=1000)
# bdt = AdaBoostClassifier(LogisticRegression(), algorithm="SAMME", n_estimators=3000)
bdt.fit(X, y)

plot_colors = "br"
plot_step = 0.02
class_names = "AB"
plt.figure(figsize=(10, 5))

plt.subplot(121)
# The plotting approach here is the same as for KNN
# Plot the decision boundary, using different colors for the classes
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
# np.meshgrid: generates grid-point coordinate matrices (the form pcolormesh/contourf expects)
# np.arange: start, stop, step
# xx and yy correspond to the two features
# To fill in the background color, every (x, y) combination is evaluated to compute the color at that point. X and y above act as the training set; this grid is effectively an unlabeled test set
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step))
# ravel() flattens each array into one dimension
# np.c_ stacks the two arrays column-wise (side by side); they must have the same number of rows
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# contourf: draws a filled contour plot (visually similar to pcolormesh)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Overlay the training data points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(
        X[idx, 0],
        X[idx, 1],
        c=c,
        cmap=plt.cm.Paired,
        s=20,
        edgecolor="k",
        label="Class %s" % n,
    )
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc="upper right")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Decision Boundary")

# Plot the decision scores of the two classes
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.subplot(122)
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(
        twoclass_output[y == i],
        bins=10,
        range=plot_range,
        facecolor=c,
        label="Class %s" % n,
        alpha=0.5,
        edgecolor="k",
    )
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc="upper right")
plt.ylabel("Samples")
plt.xlabel("Score")
plt.title("Decision Scores")
plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()

2. Change the number of base learners

AdaBoostClassifier works by first training one weak learner and evaluating its results. Samples that this model classifies correctly receive less attention, while samples it gets wrong receive more, so each subsequent model concentrates on the cases the earlier models could not handle. When all the models are finally combined into one large ensemble, it contains models that deal with the easy cases as well as models that deal with the hard ones, which improves the overall performance.
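To make this reweighting idea concrete, here is a minimal hand-rolled sketch of the two-class SAMME update, assuming {0, 1} labels and a few depth-1 trees; it only illustrates the mechanism and is not scikit-learn's actual implementation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy illustration of sample reweighting, not scikit-learn internals
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=6)
w = np.full(len(y), 1.0 / len(y))                    # start with uniform sample weights

stumps, alphas = [], []
for m in range(5):                                   # a handful of boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y
    err = np.sum(w * miss) / np.sum(w)               # weighted error of this round
    alpha = np.log((1.0 - err) / max(err, 1e-10))    # learner weight (two-class SAMME)
    w = w * np.exp(alpha * miss)                     # raise the weight of misclassified samples
    w = w / w.sum()                                  # renormalize
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted vote over all weak learners
votes = sum(a * np.where(s.predict(X) == 1, 1, -1) for a, s in zip(alphas, stumps))
print("training accuracy of the toy ensemble:", ((votes > 0).astype(int) == y).mean())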
Following this principle, the task asks for the number of base learners to be changed. At first I thought this meant putting several different models into the base_estimator of AdaBoostClassifier, as in stacking, but AdaBoostClassifier cannot do that. A reply on Stack Overflow noted that it is theoretically possible, but because AdaBoost only requires its weak learners to be slightly better than random guessing, in practice the same classifier is simply reused.
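For comparison, when the goal really is to combine different kinds of learners, scikit-learn's StackingClassifier is the intended tool. A minimal sketch follows; the particular base estimators and meta-learner are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, random_state=6)

# Heterogeneous base learners plus a meta-learner that combines their predictions
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
print("stacking CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())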
Changing the number of base learners therefore means changing n_estimators, the maximum number of boosting iterations, which is also the maximum number of weak learners. In my tests, if it is too small the model underfits and the decision surface is fairly regular; if it is too large the model overfits. Around 1000 seemed appropriate here.
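One way to see this trade-off is to cross-validate the same ensemble for several values of n_estimators. The sketch below uses an arbitrary grid of values and the same make_classification data as in the code above; the exact scores will depend on the data.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, random_state=6)

for n in (10, 100, 1000, 3000):                      # arbitrary grid of ensemble sizes
    bdt = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=n
    )
    score = cross_val_score(bdt, X, y, cv=5).mean()  # 5-fold cross-validation accuracy
    print("n_estimators=%4d  CV accuracy=%.3f" % (n, score))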

3. Change the type of base learner

base_estimator: Defaults to DecisionTreeClassifier. In theory any classification or regression learner can be chosen, but it must support sample weights; KNN and MLP, for example, do not, and raise "xxx doesn't support sample_weight.". I then chose the logistic regression model, but it kept raising "BaseClassifier in AdaBoostClassifier ensemble is worse than random, ensemble can not be fit.", and there is almost no explanation for this online. The same happened with SGD, not just logistic regression. After repeated attempts I found that if the make_gaussian_quantiles data generation shown in the commented-out code above is replaced with make_classification, the code runs successfully; the reason is not clear.
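As a rough check, the sketch below inspects whether a candidate base learner's fit() accepts sample_weight, and then fits AdaBoost with LogisticRegression on make_classification data, the combination that worked in the experiment above. The supports_sample_weight helper is my own illustrative function, not part of scikit-learn.

import inspect
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

def supports_sample_weight(est):
    # illustrative helper: does this estimator's fit() take a sample_weight argument?
    return "sample_weight" in inspect.signature(est.fit).parameters

for est in (LogisticRegression(), SGDClassifier(loss="hinge"), KNeighborsClassifier()):
    print(type(est).__name__, "supports sample_weight:", supports_sample_weight(est))

# With make_classification data, the linear base learner fits without the
# "worse than random" error described above
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, random_state=6)
bdt = AdaBoostClassifier(LogisticRegression(), algorithm="SAMME", n_estimators=100)
print("LogisticRegression base learner, training accuracy:", bdt.fit(X, y).score(X, y))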

Origin blog.csdn.net/Fishermen_sail/article/details/131855136