《机器学习公式推导与代码实现》chapter10-AdaBoost

《机器学习公式推导与代码实现》学习笔记，记录一下自己的学习过程，详细的内容请大家购买作者的书籍查阅。

AdaBoost

将多个单模型组合成一个综合模型的方式早已成为现代机器学习模型采用的主流方法-集成模型(ensemble learning)。AdaBoost是集成学习中Boosting框架的一种经典代表。

1 Boosting

集成学习将多个弱分类器组合成一个强分类器，这个强分类器能取所有弱分类器之所长，达到相对最优性能。我们将Boosting理解为一类将弱分类器提升为强分类器的算法，Boosting算法也叫提升算法。简单来说，Boosting就是串行训练一系列弱分类器，使得被先前弱分类器分类错误的样本在后续得到更多关注，最后将这些分类器组合成最优强分类器的过程。

2 AdaBoost算法原理

AdaBoost全称Adaptive Boosting，翻译为自适应提升算法。AdaBoost是一种通过改变训练样本权重来学习多个弱分类器并线性组合成强分类器的Boosting算法。
Boosting方法的两个关键问题：一是在训练过程中如何改变样本的权重或者概率分布，二是如何将多个弱分类器组合成一个强分类器。
AdaBoost一是提高前一轮被弱分类器分类错误的样本的权重，而降低分类正确的样本的权重，二是对多个弱分类器进行线性组合，提高分类效果好的弱分类器的权重，降低分类误差率高的弱分类器的权重。
在这里插入图片描述

AdaBoost是以加性模型为模型、指数函数为损失函数、前向分步为算法的分类学习模型。
加性模型(additive model)是由多个基模型求和的形式构造起来的模型。

3 AdaBoost算法实现

首先需要先定义基分类器，一般可用一棵决策树或者决策树桩(dicision stump)作为基分类器，决策树桩是一种仅具有单层决策结构的决策树，它仅在一个特征上进行分类决策。而决策树则是一种多层决策结构的树形分类模型，它可以在多个特征上进行分类决策。

# 定义决策树桩类
class DecisionStump: # 作为AdaBoost的弱分类器
    def __init__(self):
        self.label = 1 # 基于划分阈值决定将样本分类为1还是-1
        self.feature_index = None # 特征索引
        self.threshold = None # 特征划分阈值
        self.alpha = None # 基分类器的权重

AdaBoost的经典版算法流程，包括权重初始化、训练弱分类器、计算当前分类误差、计算弱分类器的权重和更新训练样本权重。

import numpy as np
class Adaboost:

    def __init__(self, n_estimators=5): # 弱分类器个数
        self.n_estimators = n_estimators
    
    def fit(self, X, y): # adaboost拟合函数
        m, n = X.shape
        w = np.full(m, (1/m)) # (1)初始化权重分布为均匀分布1/N
        self.estimators = [] # 初始化基分类器列表
        for _ in range(self.n_estimators):
            estimator = DecisionStump() # (2.a) 训练一个弱分类器：决策树桩
            min_error = float('inf') # 设定一个最小化误差
            for i in range(n): # 遍历数据集特征，根据最小分类误差率选择最优特征
                unique_values = np.unique(X[:, i])
                for threshold in unique_values: # 尝试将每一个特征值作为分类阈值
                    p = 1
                    pred = np.ones(np.shape(y)) # 初始化所有预测值为1
                    pred[X[:, i] < threshold] = -1 # 小于分类阈值的预测值为-1
                    error = sum(w[y != pred]) # (2.b) 计算分类误差率
                    if error > 0.5: # 如果分类误差率大于0.5，则进行正负预测反转，例如error = 0.6 => (1 - error) = 0.4
                        error = 1 - error
                        p = -1
                    if error < min_error: # 一旦获得最小误差，则保存相关参数配置
                        estimator.label = p
                        estimator.threshold = threshold
                        estimator.feature_index = i
                        min_error = error
                        
            estimator.alpha = 0.5 * np.log((1.0 - min_error) / (min_error + 1e-9)) # (2.c)计算基分类器的权重
            preds = np.ones(np.shape(y)) # 初始化所有预测值为1
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold) # 获取所有小于阈值的负类索引
            preds[negative_idx] = -1 # 将负类设为-1
            w *= np.exp(-estimator.alpha * y * preds) # (10-5)
            w /= np.sum(w)
            self.estimators.append(estimator) # 保存该弱分类器

    def predict(self, X): # 定义adaboost预测函数
        m = len(X)
        y_pred = np.zeros((m, 1))
        for estimator in self.estimators: # 计算每个基分类器的预测值
            predictions = np.ones(np.shape(y_pred)) # 初始化所有预测值为1
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold) # 获取所有小于阈值的负类索引
            predictions[negative_idx] = -1 # 将负类设为-1
            y_pred += estimator.alpha * predictions # 对每个基分类器的预测结果进行加权
        y_pred = np.sign(y_pred) # 返回最终预测结果
        return y_pred

from sklearn.model_selection import train_test_split
from sklearn.datasets._samples_generator import make_blobs # 导入模拟二分类数据生成模块
from sklearn.metrics import accuracy_score
X, y = make_blobs(n_samples=150, n_features=2, centers=2, cluster_std=1.2, random_state=40) # 生成模拟二分类数据集
y_ = y.copy()
y_[y_==0] = -1
y_ = y_.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y_, test_size=0.3, random_state=40)
clf = Adaboost(n_estimators=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.9777777777777777

4.基于sklearn实现AdaBoost算法

AdaBoost分类模型在sklearn的ensemble的AdaBoostClassifier模块下调用。

from sklearn.ensemble import AdaBoostClassifier
clf_ = AdaBoostClassifier(n_estimators=5, random_state=40)
y_pred_ = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.9777777777777777

使用最广泛的弱分类器是决策树和神经网络，决策树使用CART。
笔记本_Github地址