Boosting and AdaBoost for machine learning

1 Introduction to Boosting and AdaBoost

1.1 Ensemble learning

The basic idea of the Ensemble Learning algorithm is to combine multiple classifiers into an ensemble classifier with better predictive performance.

  • Ensemble learning solves a single prediction problem by building several models. It works by generating multiple classifiers/models that each learn and make predictions independently. These predictions are finally combined into a single prediction, which is therefore better than the prediction of any single model.
  • Ensemble learning is an idea, not a specific algorithm

 1.1.1 Classification of ensemble learning

Ensemble algorithms can be roughly divided into Bagging, Boosting, Stacking, and other types.

  • bagging: parallel; the learners are independent of each other and can be trained in parallel
  • boosting: serial; each learner depends on the previous one
  • stacking: the outputs of multiple learners are used as the input of the next learner

1.1.2 Boosting and Bagging

Both Boosting and Bagging are popular ensemble learning techniques used to improve the performance of machine learning models by combining multiple base learners. Although both methods aim to reduce overfitting and increase generalization ability, they differ in approach and characteristics.

(1) Boosting

Boosting is an iterative ensemble technique in which base learners are trained sequentially and each subsequent model focuses on correcting the mistakes of the preceding models. It gives higher weight to misclassified instances in the training set, forcing the next model to focus on these intractable cases.

  • Base learners: Boosting algorithms typically use weak learners, models that perform only slightly better than random guessing, for example a depth-limited decision tree (a depth-one tree is often called a "stump") or a simple linear model.
  • Weighted voting: At prediction time, each base learner is weighted according to its performance in the earlier iterations; during training, misclassified instances receive higher weights in subsequent iterations.

AdaBoost (Adaptive Boosting) and Gradient Boosting Machine (GBM) are common boosting algorithms.

(2) Bagging

Bagging is an ensemble technique that uses parallel training of multiple base learners, each trained independently on a randomly selected subset of the training data, sampled with replacement. The final prediction is obtained by averaging (regression problems) or voting (classification problems) the predictions of all base learners.

  • Base learner: Bagging typically uses the same base learning algorithm for each model, each trained independently on a different subset of the data.
  • Bootstrap sampling: For each base learner, randomly sample a subset of the training data with replacement. This means that some instances may appear multiple times in the subset, while others may not appear at all. 

Random forest is a common bagging algorithm that uses decision trees as base learners.
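
To make the contrast concrete, the sketch below trains a bagging ensemble and a boosting ensemble with scikit-learn on a synthetic dataset; the dataset and all parameter choices are illustrative assumptions, not part of the original text.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (arbitrary illustrative parameters).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: full trees trained independently on bootstrap samples, then voted.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Boosting: shallow stumps trained sequentially, each focusing on the previous mistakes.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))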

(3) Differences between the two techniques

Diversity of base learners:

  • Boosting: Typically uses a sequence of weak learners, each of which corrects the mistakes of its predecessors. This sequential dependence introduces diversity into the ensemble.
  • Bagging: Each model is trained independently on a randomly sampled subset of the data, which provides the ensemble's diversity.

Weighted voting vs simple average/vote:

  • Boosting: The final prediction is obtained by combining the predictions of all models and weighting them according to their performance. More accurate models have more influence in the final prediction.
  • Bagging: All models contribute equally to the final prediction because they are just averaged (regression problems) or voted (classification problems).

Robustness against overfitting:

  • Boosting: Boosting is more prone to overfitting if the weak learners are complex enough to memorize the training data. Careful parameter tuning, limiting the complexity of the weak learners, and early stopping are needed to prevent overfitting.
  • Bagging: Bagging helps reduce overfitting by averaging out the errors (mainly the variance) across models, and usually provides more stable and reliable predictions.

Performance:

  • Boosting: If the weak learners are well chosen and trained properly, boosting often achieves higher accuracy than bagging.
  • Bagging: While not necessarily as accurate as boosting, bagging generally provides more consistent and reliable results.

Computational complexity:

  • Boosting: Computationally more expensive than bagging, since the models are trained sequentially on weighted instances.
  • Bagging: The models can be trained in parallel, which makes bagging more efficient on large datasets.

Both boosting and bagging are effective ensemble methods, and which one to choose depends on the specific problem, the nature of the data, and the trade-off between accuracy and computational resources. Boosting focuses more on correcting errors, potentially leading to higher accuracy; while bagging is simpler and more computationally efficient, providing more stable predictions that are less prone to overfitting.

1.2 Boosting algorithm

The Boosting algorithm is a kind of ensemble learning algorithm that promotes a weak learner to a strong learner. It learns multiple classifiers by changing the weights of the training samples, and linearly combines these classifiers to improve generalization performance.

The working mechanism of the Boosting family of algorithms is similar. The general idea is: first train a base learner from the initial training set, then adjust the distribution of the training samples according to the performance of that base learner, so that the training samples the previous base learner got wrong receive more attention in the following rounds; then train the next base learner on the adjusted sample distribution. This is repeated until the number of base learners reaches a value specified in advance or the accuracy of the combined learner reaches 100%, and finally the linear combination of these base learners gives the final learner.

From the working mechanism of Boosting, we can find that there are two key problems to be solved for the boosting method:

  • How to change the weight distribution of training data in each round of training
  • How to combine weak learners into strong learners

1.3 AdaBoost Algorithm

AdaBoost (Adaptive Boosting), also known as the adaptive boosting algorithm, is the best-known Boosting algorithm. The classic AdaBoost algorithm can only be used for binary classification problems, so we only discuss its application to binary classification tasks here. The basic idea is: at the beginning, the weights of the training samples are initialized to equal values and the first weak classifier is trained (the weak learner here is usually a single-level decision tree, i.e. a decision stump, which splits the data directly with a single condition), and its error rate is calculated. From the second round of training onward, the weight of each sample is readjusted according to the performance of the weak classifier obtained in the previous round: the weights of the samples that were classified correctly last time are reduced, and the weights of the samples that were misclassified last time are increased, and then the next weak classifier is trained. This cycle continues until the number of weak classifiers reaches a given value or the prediction accuracy of the ensemble reaches 100%. Finally these weak classifiers are combined, with the combination strategy assigning a larger weight to weak classifiers with a small classification error rate and a smaller weight to weak classifiers with a large classification error rate, forming a stronger final classifier (a strong classifier).

From the basic idea of the AdaBoost algorithm we can see that:

  • AdaBoost uses re-weighting to change the weight distribution of the training samples, increasing the weights of the samples that were misclassified by the previous round's weak classifier and reducing the weights of the samples that were classified correctly. In this way, the data that were not correctly classified receive more attention from the weak classifier in the next round because of their increased weights, and the classification problem is "divided and conquered" by a series of weak classifiers.
  • AdaBoost uses a linear combination to combine the weak learners into a strong learner, giving a larger weight to weak classifiers with a small classification error rate so that they play a greater role in the final vote, and a smaller weight to weak classifiers with a large classification error rate so that they play a smaller role in the vote.

1.3.1 AdaBoost classification problem

Taking binary classification as an example, suppose we are given a binary classification training data set \chi = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}, where x_i denotes a sample point and y_i is the corresponding label, whose possible values are {-1, 1}. The AdaBoost algorithm serially learns a series of weak learners from the training data and linearly combines these weak learners into a strong learner. The AdaBoost algorithm is described as follows:

Input: training data set \chi = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}

Output: final strong classifier G(x)

(1) Initialize the weight distribution of the training data (D_m denotes the sample weight distribution used for the m-th weak learner); that is, all initial weights are 1/N:

 D_1 = (w_{11}, \dots, w_{1i}, \dots, w_{1N}), \quad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \dots, N

(2) For each of the M weak learners, m = 1, 2, ..., M:

        a) Learn from the training data set with weight distribution D_m and obtain the basic classifier G_m(x), whose output values are in {-1, 1};

        b) Calculate the classification error rate e_m of the weak classifier G_m(x) on the training data set (the smaller this value, the greater the role the base classifier plays in the final classifier):

                       e_m = \sum_{i=1}^{N} w_{mi}\, I(G_m(x_i) \neq y_i)

Here I(G_m(x_i) \neq y_i) takes the value 0 or 1: 0 means the sample is classified correctly and 1 means it is classified wrongly, so e_m is simply the sum of the weights of the misclassified samples.

        c) Calculate the weight coefficient \alpha_m of the weak classifier G_m(x) (the logarithm here is the natural logarithm):

                       \alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}

In general, e_m should be less than 0.5, because even random guessing on a binary classification problem gives an error rate of 0.5, and a classifier that actually learns should do slightly better than that. As e_m decreases, \alpha_m increases, which is exactly what we want: the weak classifier with the smaller classification error rate receives the larger weight and has the greater influence on the final prediction, so this choice of \alpha_m is intuitively reasonable. The formal derivation of this formula is given in Section 1.3.3.

        d) Update the sample weight distribution of the training data set:

                       w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp(-\alpha_m y_i G_m(x_i)), \quad i = 1, 2, \dots, N

For binary classification, both the weak classifier output G_m(x) and the label y_i take values in {-1, 1}, so y_i G_m(x_i) > 0 when the classification is correct and y_i G_m(x_i) < 0 when it is wrong. Consequently w_{m+1,i} becomes smaller when the classification is correct and larger when it is wrong, and the training samples with larger weights receive more attention from the subsequent weak learners, which is what we want.

Here Z_m is the normalization factor; its main role is to rescale the weights w_{m+1,i} to lie in [0, 1] so that \sum_{i=1}^{N} w_{m+1,i} = 1:

                       Z_m = \sum_{i=1}^{N} w_{mi} \exp(-\alpha_m y_i G_m(x_i))
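
To make steps b) to d) concrete, here is a minimal numpy sketch of a single AdaBoost round; the labels and the weak classifier's predictions are made-up illustrative values, not data from this article.

import numpy as np

# Toy labels and one weak classifier's predictions (illustrative values only).
y   = np.array([ 1,  1, -1, -1,  1, -1])
G_m = np.array([ 1, -1, -1, -1,  1,  1])   # misclassifies two of the six samples
N = len(y)
w = np.full(N, 1.0 / N)                    # step (1): uniform initial weights

e_m = np.sum(w * (G_m != y))               # step b): weighted error rate
alpha_m = 0.5 * np.log((1 - e_m) / e_m)    # step c): classifier weight

w_new = w * np.exp(-alpha_m * y * G_m)     # step d): re-weight the samples
Z_m = w_new.sum()                          # normalization factor
w_new /= Z_m                               # the weights sum to 1 again

print(e_m, alpha_m)                        # 1/3 and 0.5 * ln(2)
print(w_new)                               # misclassified samples now carry more weight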

(3) Above we introduced how to calculate the weight coefficient \alpha of the weak learner, how to update the sample weights w, and how to calculate the error rate e. The last question is the combination strategy of the weak learners. For classification problems, AdaBoost uses the weighted average (voting) method: a linear combination of the basic classifiers is constructed as

                       f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)

to get the final classifier:

                       G(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\!\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)
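
As a small illustration of this combination (with made-up toy numbers), the weighted votes of three weak classifiers are summed and passed through the sign function:

import numpy as np

# Toy outputs of M = 3 weak classifiers on 4 samples, and their alpha weights
# (all values are illustrative assumptions).
G = np.array([[ 1,  1, -1, -1],    # G_1(x)
              [ 1, -1, -1,  1],    # G_2(x)
              [ 1,  1,  1, -1]])   # G_3(x)
alpha = np.array([0.42, 0.65, 0.35])

f = alpha @ G                      # f(x) = sum_m alpha_m * G_m(x)
G_final = np.sign(f)               # G(x) = sign(f(x))
print(f)         # weighted vote for each sample
print(G_final)   # final +1 / -1 predictions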

1.3.2 AdaBoost regression problem

There are many variants of AdaBoost for regression; here we take the AdaBoost.R2 algorithm as the representative.

(1) Let's first look at the error rate of the regression problem. For the m-th weak learner, calculate its maximum error on the training set (that is, after each round of training, the maximum absolute difference between the predicted value and the true value):

                       E_m = \max_i |y_i - G_m(x_i)|, \quad i = 1, 2, \dots, N

Then calculate the relative error of each sample (the purpose of the relative error is to normalize the error to [0, 1]):

  e_{mi} = \frac{|y_i - G_m(x_i)|}{E_m}, \quad \text{obviously } 0 \le e_{mi} \le 1

This is the case when the error loss is linear. If we use the squared error, then e_{mi} = \frac{(y_i - G_m(x_i))^2}{E_m^2}; if we use the exponential error, then e_{mi} = 1 - \exp\!\left(-\frac{|y_i - G_m(x_i)|}{E_m}\right).

Finally, the error rate of the m-th weak learner is obtained: e_m = \sum_{i=1}^{N} w_{mi}\, e_{mi}, i.e. the weighted sum of the relative errors of all sample points is the error of the weak learner.

(2) Let's take a look at the weight coefficient \alpha_m of the weak learner:

                   \alpha_m = \frac{e_m}{1 - e_m}

(3) For the sample weight update of the regression problem, the sample weight of the (m+1)-th weak learner is:

                w_{m+1,i} = \frac{w_{mi}}{Z_m}\, \alpha_m^{\,1 - e_{mi}}

where Z_m is the normalization factor: Z_m = \sum_{i=1}^{N} w_{mi}\, \alpha_m^{\,1 - e_{mi}}

(4) The last is the combination strategy. Unlike the classification problem, the combination strategy of the regression problem takes a weighted median of the weak learners. The final strong regressor is G(x) = g(x), where g(x) is the weighted median of the weak learners' outputs G_m(x), m = 1, 2, \dots, M, taken with weights \ln\frac{1}{\alpha_m}.
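
The following minimal numpy sketch runs one AdaBoost.R2 round with the linear loss; all targets, predictions and weights are made-up illustrative values, not data from this article.

import numpy as np

# One AdaBoost.R2 round with the linear loss (illustrative numbers only).
y      = np.array([3.0, -1.0, 2.5, 0.0])   # true targets
y_pred = np.array([2.9, -0.8, 1.0, 0.1])   # predictions of the m-th weak regressor
N = len(y)
w = np.full(N, 1.0 / N)                    # current sample weights

E_m  = np.max(np.abs(y - y_pred))          # maximum error on the training set
e_mi = np.abs(y - y_pred) / E_m            # relative (linear) error, in [0, 1]
e_m  = np.sum(w * e_mi)                    # weighted error rate of the weak learner

alpha_m = e_m / (1 - e_m)                  # weight coefficient (smaller means a better learner)
w_new = w * alpha_m ** (1 - e_mi)          # re-weight the samples
w_new /= w_new.sum()                       # normalize by Z_m

print(E_m, e_m, alpha_m)
print(w_new)   # samples with larger relative error end up with relatively larger weights

For practical use, scikit-learn's AdaBoostRegressor implements this scheme and exposes the linear, square and exponential losses through its loss parameter.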

This concludes the description of the AdaBoost regression algorithm. One problem is still unsolved: how the weight coefficient \alpha_m of the weak learner in the classification problem is rigorously derived.

1.3.3 AdaBoost and the forward stagewise algorithm

In the previous two sections we introduced AdaBoost for classification and regression, but one problem from the classification case remains: how the weight coefficient of the weak learner, \alpha_m = \frac{1}{2}\ln\frac{1-e_m}{e_m}, is derived. The forward stagewise algorithm is the main tool used here, and we introduce it next.

From another perspective, the AdaBoost algorithm is what we obtain for a classification problem when the model is an additive model, the loss function is the exponential loss, and the learning algorithm is the forward stagewise algorithm. Here, the additive model means that our final strong classifier is a weighted combination of several weak classifiers, as follows:

f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)

The loss function is an exponential function, as follows:

L(y, f(x)) = \exp(-y f(x))

The learning algorithm is the forward stagewise algorithm. Let's introduce how AdaBoost uses it to learn:

(1) Assume that after m-1 rounds of iteration the forward stagewise algorithm has obtained f_{m-1}(x):

f_{m-1}(x) = f_{m-2}(x) + \alpha_{m-1} G_{m-1}(x)

                    = \alpha_1 G_1(x) + \dots + \alpha_{m-1} G_{m-1}(x)

In the m-th round of iteration we obtain \alpha_m, G_m(x) and f_m(x), where

          f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)

The goal is to find the \alpha_m and G_m(x) that make f_m(x) minimize the exponential loss on the training data set, that is

         (\alpha_m, G_m(x)) = \arg\min_{\alpha, G} \sum_{i=1}^{N} \exp\!\left(-y_i \left(f_{m-1}(x_i) + \alpha G(x_i)\right)\right)

                               = \arg\min_{\alpha, G} \sum_{i=1}^{N} \bar{w}_{mi} \exp\!\left(-y_i\, \alpha\, G(x_i)\right)          (1)

The above formula is the loss function minimized by the forward stagewise learning algorithm, where \bar{w}_{mi} = \exp(-y_i f_{m-1}(x_i)). Since \bar{w}_{mi} depends neither on \alpha nor on G, it is irrelevant to the minimization in the m-th iteration. However, \bar{w}_{mi} does depend on f_{m-1}(x) and therefore changes with each round of iteration.

The \alpha_m^{*} and G_m^{*}(x) that achieve this minimum are exactly the \alpha_m and G_m(x) obtained by the AdaBoost algorithm.

(2) First find the classifier G_m^{*}(x):

We know that the binary classifier G(x) outputs -1 or 1; y_i \neq G_m(x_i) means the prediction is wrong and y_i = G_m(x_i) means it is correct. Each sample point carries a weight, so the weighted classification error of a weak classifier is \sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \neq G_m(x_i)). Our goal is to minimize the loss, so the m-th weak classifier that minimizes the loss is:

G_m^{*}(x) = \arg\min_{G} \sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \neq G(x_i)), \quad \text{where } \bar{w}_{mi} = \exp(-y_i f_{m-1}(x_i))

Why is the objective written with the indicator I(y_i \neq G(x_i)) instead of a concrete formula for the weak classifier? Because AdaBoost does not restrict the type of weak learner, so the actual expression of G depends on which kind of weak learner is used.

This classifier is exactly the basic classifier G_m(x) of the AdaBoost algorithm, because it is the classifier that minimizes the classification error rate on the weighted training data of the m-th round.

(3) Then solve for \alpha_m^{*}.

Substituting G_m^{*}(x) into the loss function (1), we obtain the loss as a function of \alpha alone:

L(\alpha) = \sum_{i=1}^{N} \bar{w}_{mi} \exp\!\left(-y_i\, \alpha\, G_m^{*}(x_i)\right)

Our goal is to minimize the above expression and find the corresponding \alpha_m^{*}. Expanding it:

              \sum_{i=1}^{N} \bar{w}_{mi} \exp\!\left(-y_i\, \alpha\, G_m^{*}(x_i)\right)

         = \sum_{y_i = G_m(x_i)} \bar{w}_{mi}\, e^{-\alpha} + \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi}\, e^{\alpha}

         = e^{-\alpha} \sum_{y_i = G_m(x_i)} \bar{w}_{mi} + e^{\alpha} \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi}

         = e^{-\alpha} \left( \sum_{i=1}^{N} \bar{w}_{mi} - \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi} \right) + e^{\alpha} \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi}

         = \left(e^{\alpha} - e^{-\alpha}\right) \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi} + e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi}

         = \left(e^{\alpha} - e^{-\alpha}\right) \sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \neq G_m(x_i)) + e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi}        (2)

Because e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}},

Note: here the sample weights \bar{w}_{mi} have not been normalized, so in general \sum_{i=1}^{N} \bar{w}_{mi} \neq 1.

formula (2) becomes:   \left(e^{\alpha} - e^{-\alpha}\right) e_m \sum_{i=1}^{N} \bar{w}_{mi} + e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi}

Then our goal is to find:

         \alpha_m = \arg\min_{\alpha}\ \left(e^{\alpha} - e^{-\alpha}\right) e_m \sum_{i=1}^{N} \bar{w}_{mi} + e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi}

Taking the partial derivative of the above expression with respect to \alpha and setting it equal to 0, we get:

       \left(e^{\alpha} + e^{-\alpha}\right) e_m \sum_{i=1}^{N} \bar{w}_{mi} - e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi} = 0

Dividing both sides by \sum_{i=1}^{N} \bar{w}_{mi} gives:

       \left(e^{\alpha} + e^{-\alpha}\right) e_m - e^{-\alpha} = 0, and therefore \alpha_m^{*} = \frac{1}{2} \ln \frac{1 - e_m}{e_m}, where e_m is the error rate:

       e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}} = \sum_{i=1}^{N} w_{mi}\, I(y_i \neq G(x_i))
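
As a quick sanity check on this result (an illustrative verification, not part of the original derivation), the sketch below minimizes formula (2), divided by the constant \sum_{i=1}^{N}\bar{w}_{mi}, numerically over \alpha for an assumed error rate and compares the minimizer with \frac{1}{2}\ln\frac{1-e_m}{e_m}:

import numpy as np

# L(alpha) = (e^alpha - e^-alpha) * e_m + e^-alpha  (formula (2) divided by sum of w_bar)
e_m = 0.2                                   # an assumed weighted error rate
alphas = np.linspace(0.01, 3.0, 100000)
loss = (np.exp(alphas) - np.exp(-alphas)) * e_m + np.exp(-alphas)

alpha_numeric = alphas[np.argmin(loss)]
alpha_closed = 0.5 * np.log((1 - e_m) / e_m)
print(alpha_numeric, alpha_closed)          # both are approximately 0.693 = ln 2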

(4) Finally, look at the update of the sample weights.

Using f_m(x) = f_{m-1}(x) + \alpha_m G_m(x) from above, together with the weight \bar{w}_{mi} = \exp(-y_i f_{m-1}(x_i)),

the following relation can be obtained:

       \bar{w}_{m+1,i} = \bar{w}_{mi} \exp(-y_i \alpha_m G_m(x_i))

which, after normalization by Z_m, is exactly the sample weight update formula given earlier.

1.3.4 Regularization of the AdaBoost algorithm

In order to prevent overfitting, a regularization term is also added to the AdaBoost algorithm. This regularization term is usually called the step size or the learning rate, denoted v. For the iteration of the weak learners:

f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)

Adding the regularization term, it becomes as follows:

f_m(x) = f_{m-1}(x) + v\, \alpha_m G_m(x)

The value range of v is (0, 1]. To reach the same learning effect on the training set, a smaller v means that more weak learner iterations are needed. Usually the learning rate and the maximum number of iterations are tuned together to control how well the algorithm fits.
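
As an illustration of this trade-off, the sketch below uses scikit-learn's AdaBoostClassifier, whose learning_rate argument plays the role of v; the data set and parameter pairs are assumptions chosen only for demonstration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A smaller learning rate (v) usually needs more weak learners for a comparable fit.
for n_estimators, learning_rate in [(50, 1.0), (500, 0.1)]:
    clf = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, random_state=0)
    clf.fit(X_train, y_train)
    print(n_estimators, learning_rate, clf.score(X_test, y_test))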

2 Advantages and disadvantages of AdaBoost

AdaBoost (Adaptive Boosting) is an ensemble learning method that combines weak learners (usually decision trees) to create a strong classifier. It focuses on hard-to-classify instances by giving more weight to misclassified data points in each iteration. AdaBoost has some advantages and disadvantages:

2.1 Advantages of AdaBoost

  • Improved Accuracy: AdaBoost is generally able to achieve better classification accuracy than individual learning algorithms. By combining multiple weak learners, it can effectively handle complex classification problems.

  • Diversity: It can be used with a variety of learning algorithms as a weak learner, thus allowing flexibility in choosing an appropriate base model for the problem at hand.

  • Reduce overfitting: AdaBoost reduces the risk of overfitting because it focuses more on misclassified samples, allowing the algorithm to generalize better on unseen data.

  • No need to tune many hyperparameters: Unlike some other complex models, AdaBoost typically has fewer hyperparameters to tune, making it simpler to use and implement.

  • Efficiently handles imbalanced datasets: AdaBoost performs well even on imbalanced datasets where one class is significantly more prevalent than the others.

2.2 Disadvantages of AdaBoost

  • Sensitive to noisy data: AdaBoost is sensitive to noisy data and outliers. Noisy data can lead to overfitting, reducing the performance of the model.

  • High computational complexity: Since the algorithm needs to iteratively train multiple weak learners, it may be computationally expensive.

  • Requires sufficient training data: AdaBoost's performance may degrade if there is insufficient training data or if the weak learner cannot outperform random guesses.

  • May lead to overfitting: Although AdaBoost reduces overfitting to a certain extent, if the weak learner is too complex or the number of iterations (boosting rounds) is too high, overfitting may still occur.

  • Preference for complex models: AdaBoost tends to favor complex weak learners, which can lead to long training times and can lead to overfitting if not properly controlled.

AdaBoost is a powerful ensemble learning technique that can significantly improve classification accuracy by combining multiple weak learners. However, it requires careful handling of noisy data, tuning the number of boosting rounds, and choosing an appropriate weak learner to achieve the best results.

3 Application Scenarios of AdaBoost

AdaBoost performs well in many fields and application scenarios, especially when dealing with classification problems. The following are some common application scenarios of AdaBoost:

  • Image recognition and computer vision: AdaBoost can be used for computer vision tasks such as image classification, object detection, and face recognition, by combining multiple weak classifiers to improve the accuracy of image recognition.

  • Natural Language Processing (NLP): In NLP tasks such as text classification, sentiment analysis, and spam filtering, AdaBoost can be used to improve the performance of text classification.

  • Medical diagnosis: In medical image processing and diagnosis, AdaBoost can be used to assist doctors in diagnosis, such as tumor detection and disease classification.

  • Financial field: AdaBoost can be applied to problems in the financial field such as credit risk assessment, fraud detection, and stock market forecasting.

  • Recommendation system: In e-commerce and online recommendation systems, AdaBoost can be used for personalized recommendation and user behavior prediction.

  • Speech recognition: In speech recognition applications, AdaBoost can be used for tasks such as voiceprint recognition and speech classification.

  • Bioinformatics: AdaBoost can play an important role in bioinformatics problems such as analyzing biological data, DNA sequence classification and protein structure prediction.

  • Remote sensing image analysis: AdaBoost can be used for remote sensing image classification and object recognition, such as land use classification and environmental monitoring.

  • Behavior recognition: In behavior analysis and behavior recognition applications, AdaBoost can be used to recognize actions, behaviors, and activities.

Although AdaBoost performs well in many scenarios, it is not suitable for all problems. In practical applications, it is necessary to select the appropriate machine learning algorithm and parameter adjustment method according to the characteristics of the specific problem and the characteristics of the data set.

4 AdaBoost code implementation

4.1 Dataset Introduction

The breast cancer data set is a classic binary classification data set built into sklearn.

It contains 569 samples with 30 features each: 357 positive samples (benign, label 1) and 212 negative samples (malignant, label 0).
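
These numbers can be verified with a minimal sketch using the sklearn loader:

import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)          # (569, 30)
print(np.bincount(y))   # [212 357]: 212 malignant samples (label 0), 357 benign (label 1)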

4.2 Code implementation

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
from tqdm import tqdm


class CancerAdaboost:
    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.clfs = [lambda x: 0 for i in range(self.n_estimators)]
        self.alphas = [0 for i in range(self.n_estimators)]
        self.weights = None

    # Build the decision function g(X) of a weak classifier: a decision stump on
    # feature index fi with threshold fv and a split direction
    def _G(self, fi, fv, direct):
        assert direct in ["positive", "negative"]

        def _g(X):
            if direct == "positive":
                predict = (X[:, fi] <= fv) * -1  # samples with feature <= fv get -1, others 0
            else:
                predict = (X[:, fi] > fv) * -1  # samples with feature > fv get -1, others 0
            predict[predict == 0] = 1  # the remaining samples get +1
            return predict

        return _g

    # Choose the best split, i.e. find the feature index fi, threshold fv and direction
    # that minimize the weighted classification error
    def _best_split(self, X, y, w):
        best_err = 1e10
        best_fi = None
        best_fv = None
        best_direct = None
        for fi in range(X.shape[1]):
            series = X[:, fi]
            for fv in np.sort(series):
                predict = np.zeros_like(series, dtype=np.int32)
                # direction = positive: -1 for feature values <= fv, +1 otherwise
                predict[series <= fv] = -1
                predict[series > fv] = 1
                err = np.sum((predict != y) * 1 * w)  # weighted classification error
                #                 print("err = {} ,fi={},fv={},direct={}".format(err,fi,fv,"postive"))
                if err < best_err:
                    best_err = err
                    best_fi = fi
                    best_fv = fv
                    best_direct = "positive"

                # direction = negative: flip the predictions (+1 for <= fv, -1 otherwise)
                predict = predict * -1
                err = np.sum((predict != y) * 1 * w)
                if err < best_err:
                    best_err = err
                    best_fi = fi
                    best_fv = fv
                    best_direct = "negative"
        #                 print("err = {} ,fi={},fv={},direct={}".format(err,fi,fv,"nagetive"))
        return best_err, best_fi, best_fv, best_direct

    def fit(self, X_train, y_train):
        self.weights = np.ones_like(y_train) / len(y_train)
        for i in tqdm(range(self.n_estimators)):
            err, fi, fv, direct = self._best_split(X_train, y_train, self.weights)

            # Compute the coefficient alpha of G(x); if err is 0 the stump is already
            # perfect, so give it a fixed weight and stop early below
            alpha = 0.5 * np.log((1 - err) / err) if err != 0 else 1
            #             print("alpha:",alpha)
            self.alphas[i] = alpha

            # Build the weak classifier G_m(x) from the best split and store it
            g = self._G(fi, fv, direct)
            self.clfs[i] = g

            if err == 0: break

            # Update the sample weights: w <- w * exp(-alpha * y * G(x)), then normalize
            self.weights = self.weights * np.exp(-1 * alpha * y_train * g(X_train))
            self.weights = self.weights / np.sum(self.weights)

    def predict(self, X_test):
        y_p = np.array([self.alphas[i] * self.clfs[i](X_test) for i in range(self.n_estimators)])
        y_p = np.sum(y_p, axis=0)
        y_predict = np.zeros_like(y_p, dtype=np.int32)
        y_predict[y_p >= 0] = 1
        y_predict[y_p < 0] = -1
        return y_predict

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        return np.sum(y_predict == y_test) / len(y_predict)


if __name__ == "__main__":
    breast_cancer = load_breast_cancer()
    X = breast_cancer.data
    y = breast_cancer.target
    y[y == 0] = -1

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    print(X_train.shape, X_test.shape)

    clf = CancerAdaboost(200)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))

4.3 Result display

(1) n_estimators=50: the classification accuracy reaches about 93%

  0%|          | 0/50 [00:00<?, ?it/s](426, 30) (143, 30)
100%|██████████| 50/50 [00:24<00:00,  2.04it/s]
0.9300699300699301

(2) n_estimators=100: the classification accuracy reaches about 94%

  0%|          | 0/100 [00:00<?, ?it/s](426, 30) (143, 30)
100%|██████████| 100/100 [00:54<00:00,  1.83it/s]
0.9440559440559441

(3) n_estimators=200: the classification accuracy reaches about 97%

(426, 30) (143, 30)
100%|██████████| 200/200 [02:09<00:00,  1.54it/s]
0.972027972027972

The larger n_estimators is, the longer the training takes and, on this data set, the higher the accuracy.
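
For reference, the same task can be run with scikit-learn's AdaBoostClassifier using decision stumps as base learners; this is a comparison sketch only, and the exact score depends on the random split and the library version, so it will not match the numbers above exactly.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 depth-one stumps, mirroring the hand-written CancerAdaboost above.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))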

Origin blog.csdn.net/lsb2002/article/details/131843024