"Machine Learning Formula Derivation and Code Implementation" chapter10-AdaBoost

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

AdaBoost

Combining multiple single models into one comprehensive model has long been a mainstream approach in modern machine learning; this is ensemble learning. AdaBoost is a classic representative of the Boosting framework within ensemble learning.

1 Boosting

Ensemble learning combines multiple weak classifiers into one strong classifier, which draws on the strengths of all the weak classifiers to achieve relatively optimal performance. Boosting can be understood as a class of algorithms that promote weak classifiers to a strong classifier, hence the name boosting algorithms. In short, Boosting serially trains a series of weak classifiers so that samples misclassified by earlier weak classifiers receive more attention in later rounds, and finally combines these classifiers into the best strong classifier, as shown below.
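For binary labels $y \in \{-1, +1\}$, the strong classifier produced this way is a sign-thresholded weighted vote over the $M$ weak classifiers (standard notation; the book's own equations appear in figures that are not reproduced in these notes):

$$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$

where $G_m$ is the $m$-th weak classifier and $\alpha_m$ is its weight in the combination.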

2 AdaBoost algorithm principle

AdaBoost, short for Adaptive Boosting, is the adaptive boosting algorithm. It is a Boosting algorithm that learns multiple weak classifiers by changing the weights of the training samples and then linearly combines them into a strong classifier.
There are two key issues in the Boosting approach: one is how to change the weights or probability distribution of the samples during training, and the other is how to combine multiple weak classifiers into a strong classifier.
AdaBoost answers them as follows: first, it increases the weights of samples misclassified by the previous round's weak classifier and decreases the weights of correctly classified samples; second, it linearly combines the weak classifiers, increasing the weights of weak classifiers with good classification performance and decreasing the weights of those with high classification error rates. The corresponding formulas are given below.
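In standard AdaBoost notation (a reconstruction of the usual textbook formulas rather than the book's exact figures): let $e_m$ be the weighted error rate of the $m$-th weak classifier $G_m$ on the training set. Its combination coefficient and the sample-weight update are

$$\alpha_m = \frac{1}{2}\ln\frac{1 - e_m}{e_m}$$

$$w_{m+1,i} = \frac{w_{m,i}}{Z_m}\exp\left(-\alpha_m y_i G_m(x_i)\right), \qquad Z_m = \sum_{i=1}^{N} w_{m,i}\exp\left(-\alpha_m y_i G_m(x_i)\right)$$

Since $\alpha_m$ grows as $e_m$ shrinks, more accurate weak classifiers receive larger weights; and a misclassified sample ($y_i G_m(x_i) = -1$) has its weight multiplied by $e^{\alpha_m} > 1$, so it receives more attention in the next round.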
AdaBoost is a classification learning method whose model is an additive model, whose loss function is the exponential function, and whose learning algorithm is the forward stagewise algorithm.
An additive model is a model constructed as the sum of multiple base models, as formalized below.
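In standard notation (again a reconstruction, not the book's exact figures), an additive model has the form

$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)$$

where $b(x; \gamma_m)$ is a base model with parameters $\gamma_m$ and $\beta_m$ is its coefficient. For AdaBoost, the loss function is the exponential loss

$$L(y, f(x)) = \exp(-y f(x))$$

and the forward stagewise algorithm fits one base model at a time, keeping all previously fitted terms fixed.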

3 AdaBoost algorithm implementation

First, we need to define the base classifier. Generally a decision tree or a decision stump can be used as the base classifier. A decision stump is a decision tree with only a single layer of decision structure: it makes a classification decision on just one feature. A decision tree, in contrast, is a tree classification model with a multi-layer decision structure that can make classification decisions over multiple features.

# Define the decision stump class
class DecisionStump: # serves as the weak classifier for AdaBoost
    def __init__(self):
        self.label = 1 # polarity: whether samples below the split threshold are classified as 1 or -1
        self.feature_index = None # index of the feature used for the split
        self.threshold = None # split threshold on that feature
        self.alpha = None # weight of this base classifier

The classic AdaBoost algorithm flow includes weight initialization, training a weak classifier, computing the current classification error, computing the weak classifier's weight, and updating the training sample weights.

import numpy as np

class Adaboost:

    def __init__(self, n_estimators=5): # number of weak classifiers
        self.n_estimators = n_estimators

    def fit(self, X, y): # AdaBoost fitting function
        m, n = X.shape
        w = np.full(m, (1/m)) # (1) initialize the weight distribution uniformly to 1/m
        self.estimators = [] # initialize the list of base classifiers
        for _ in range(self.n_estimators):
            estimator = DecisionStump() # (2.a) train one weak classifier: a decision stump
            min_error = float('inf') # current minimum classification error
            for i in range(n): # iterate over the features and select the best one by minimum weighted classification error
                unique_values = np.unique(X[:, i])
                for threshold in unique_values: # try every feature value as the split threshold
                    p = 1
                    pred = np.ones(np.shape(y)) # initialize all predictions to 1
                    pred[X[:, i] < threshold] = -1 # predict -1 for samples below the threshold
                    error = sum(w[y != pred]) # (2.b) weighted classification error rate
                    if error > 0.5: # if the error exceeds 0.5, flip the polarity, e.g. error = 0.6 => (1 - error) = 0.4
                        error = 1 - error
                        p = -1
                    if error < min_error: # whenever a smaller error is found, save the configuration
                        estimator.label = p
                        estimator.threshold = threshold
                        estimator.feature_index = i
                        min_error = error

            estimator.alpha = 0.5 * np.log((1.0 - min_error) / (min_error + 1e-9)) # (2.c) compute the base classifier's weight
            preds = np.ones(np.shape(y)) # initialize all predictions to 1
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold) # indices predicted as the negative class, respecting polarity
            preds[negative_idx] = -1 # set the negative class to -1
            w *= np.exp(-estimator.alpha * y * preds) # update the sample weights by exponential reweighting
            w /= np.sum(w) # normalize so the weights sum to 1
            self.estimators.append(estimator) # save this weak classifier

    def predict(self, X): # define the AdaBoost prediction function
        m = len(X)
        y_pred = np.zeros((m, 1))
        for estimator in self.estimators: # accumulate each base classifier's prediction
            predictions = np.ones(np.shape(y_pred)) # initialize all predictions to 1
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold) # indices predicted as the negative class, respecting polarity
            predictions[negative_idx] = -1 # set the negative class to -1
            y_pred += estimator.alpha * predictions # weight each base classifier's predictions by its alpha
        y_pred = np.sign(y_pred) # the final prediction is the sign of the weighted sum
        return y_pred

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs # module for generating simulated binary classification data
from sklearn.metrics import accuracy_score

X, y = make_blobs(n_samples=150, n_features=2, centers=2, cluster_std=1.2, random_state=40) # generate a simulated binary classification dataset
y_ = y.copy()
y_[y_ == 0] = -1 # relabel class 0 as -1, since AdaBoost expects labels in {-1, 1}
y_ = y_.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y_, test_size=0.3, random_state=40)
clf = Adaboost(n_estimators=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.9777777777777777
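As a quick sanity check (not from the book), each trained stump can be inspected through the attributes stored by the DecisionStump class defined above:

# Inspect what each boosting round learned, via the attributes saved in DecisionStump
for i, est in enumerate(clf.estimators):
    print(f"round {i}: feature={est.feature_index}, threshold={est.threshold:.3f}, "
          f"polarity={est.label}, alpha={est.alpha:.3f}")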

4 AdaBoost algorithm implementation based on sklearn

The AdaBoost classification model is called via AdaBoostClassifier, under sklearn's ensemble module.

from sklearn.ensemble import AdaBoostClassifier
clf_ = AdaBoostClassifier(n_estimators=5, random_state=40)
clf_.fit(X_train, y_train) # fit the sklearn AdaBoost model
y_pred_ = clf_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_)
print(accuracy)
0.9777777777777777

The most widely used weak classifiers are decision trees and neural networks; when decision trees are used, they are usually CART trees, as sketched below.
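In sklearn, AdaBoostClassifier's default base learner is in fact a depth-1 decision tree, i.e. a CART decision stump, and it can also be set explicitly. A minimal sketch, assuming sklearn >= 1.2 where the parameter is named estimator (older versions call it base_estimator), reusing the train/test split from the demo above:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Explicitly use a CART decision stump (max_depth=1) as the weak classifier;
# in sklearn versions before 1.2 the keyword is base_estimator instead of estimator
stump = DecisionTreeClassifier(max_depth=1)
clf_stump = AdaBoostClassifier(estimator=stump, n_estimators=5, random_state=40)
clf_stump.fit(X_train, y_train)
print(accuracy_score(y_test, clf_stump.predict(X_test)))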
Notebook GitHub address

Origin: blog.csdn.net/cjw838982809/article/details/131231711