This article explains boosting methods in machine learning: Boosting and AdaBoost

Ensemble methods such as random forests are very popular in Kaggle competitions and other machine learning tasks because they are incredibly powerful. Boosting techniques in particular have recently become popular in predictive analytics. This article introduces the basic Boosting concept, walks through the AdaBoost algorithm in detail, and shows how to implement it; these are the stepping stones to the whole family of ensemble methods.

This article will cover:

  • A quick review of bagging

  • Limitations of bagging

  • Conceptual details of Boosting

  • Computational efficiency of boosting

  • A code example

Limitations of Bagging

Consider a binary classification problem in which we classify an observation as either 0 or 1. Bagging is not the focus of this article, but let's briefly review the concept for clarity.

Bagging stands for "Bootstrap Aggregating". The idea is to draw T bootstrap samples, fit a classifier on each sample, and train these models in parallel; in a random forest, for example, decision trees are trained in parallel. The results of all the classifiers are then averaged to obtain a bagging classifier:

[Figure: formula for the bagging classifier]
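The original formula image is not available; in common notation (a sketch, not the article's own figure), the bagged prediction is the average of the T models fitted on the bootstrap samples:

\hat{f}_{\mathrm{bag}}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{f}_t(x)

For classification, the average is typically replaced by a majority vote over the T classifiers.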

The process can be illustrated as follows. Consider 3 classifiers whose predictions can be right or wrong. If we plot the results of the 3 classifiers, there will be regions in which each classifier is wrong. In the image below, such regions are shown in red:

[Figure: example of a bagging scenario with three classifiers]

This example illustrates the favourable case: one classifier is wrong while the other two are correct, so a majority vote still yields high classification accuracy. But as you might guess, bagging stops working well when all the classifiers are wrong in the same region.

[Figure: case where all three classifiers are wrong in the same region]

For this reason, the intuitive idea behind the boosting method is:

  • We need to train the models sequentially, not in parallel.

  • Each model needs to focus on where the previous classifier underperformed.

Introduction to Boosting

Concept

The above idea can be interpreted as:

  • Train the model h1 on the entire dataset

  • Increase the weight of the data in the regions where h1 performs poorly, and train model h2 on this reweighted data

  • Increase the weight of the data in the regions where h1 and h2 disagree (h1 ≠ h2), and train model h3 on this data

Instead of parallel training, we can train these models serially. This is the essence of Boosting!

Boosting trains a sequence of low-performing models, called weak learners, by adjusting the error weights over time. A weak learner is a model whose error rate is only slightly below 50%, as shown in the following figure:

[Figure: weak classifier with an error rate slightly below 50%]

Weighted error

How can we implement such a classifier? In fact, we do it by weighting the errors throughout the iterations. In this way, we will give more weight to the regions where the previous classifier performed poorly.

Think of data points in a 2D plane. Some points will be classified correctly, some will not. Typically, when computing the error rate, each error is weighted 1/n, where n is the number of data points to classify.

[Figure: unweighted error]

Now let's weight the error!

[Figure: weighted error]
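Written out (a sketch in common notation, not taken from the figures), the weighted error at iteration t is:

\epsilon_t = \frac{\sum_{i=1}^{n} w_{t,i}\,\mathbf{1}\{h_t(x_i) \neq y_i\}}{\sum_{i=1}^{n} w_{t,i}}

where w_{t,i} is the weight of point i at iteration t; the unweighted case corresponds to w_{t,i} = 1/n for every point.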

By now, you may have noticed that we gave higher weights to data points that were not well classified. The weighting process is shown in the following figure:

[Figure: example of the weighting process]

Ultimately, we want to build a strong classifier like the one shown below:

[Figure: strong classifier]

Decision stumps

You may ask: how many classifiers do we need for the whole boosting system to work well, and how do we choose a classifier at each step?

The answer is the so-called "decision stump"! A decision stump is a single-level decision tree. The main idea is that at each step we have to find the best stump (i.e. get the best data partition) that minimizes the overall error. You can think of a stump as a test, where we assume that all data points on one side of the stump belong to class 1 and all data points on the other side belong to class 0.

There are many possible decision stumps. How many stump combinations are there in this simple example?

[Figure: 3 data points to be split]

In fact, there are 12 stump combinations in this example! This may seem surprising, but it's actually quite easy to understand.

[Figure: the 12 possible decision stumps]

We can perform 12 possible "tests" in this situation. The number "2" next to each dividing line simply reflects the fact that the points on one side of the line could be labeled either class 0 or class 1, so each dividing line embeds 2 "tests".

At each iteration t, we select the weak classifier ht that best splits the data, i.e. the one that minimizes the overall error rate. Recall that the error rate here is the weighted error introduced earlier.

Find the best division

As described above, the optimal split is found by identifying, at each iteration t, the best weak classifier ht, usually a decision stump (a decision tree with 1 node and 2 leaves). Suppose we are trying to predict whether a person who wants to borrow money will be a good repayer:

[Figure: finding the best division]

In this case, the optimal division at time t is to use the "payment history" as the stump, since this division has the smallest weighted error.

Note that, in practice, the decision tree classifier may be deeper than a simple stump; that depth is a hyperparameter.
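To make the search concrete, here is a minimal brute-force sketch (the function name `best_stump` and its exact interface are illustrative, not from the article): it tries every feature, every observed threshold, and both polarities, and keeps the stump with the smallest weighted error.

import numpy as np

def best_stump(X, y, w):
    """Brute-force search for the decision stump with the smallest weighted error.
    X: (n, d) feature array, y: labels in {0, 1}, w: sample weights."""
    best = {"error": np.inf}
    for j in range(X.shape[1]):                 # try every feature
        for thr in np.unique(X[:, j]):          # try every observed value as a threshold
            for polarity in (1, -1):            # which side of the line predicts class 1
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, 0)
                err = np.sum(w[pred != y])      # weighted error of this "test"
                if err < best["error"]:
                    best = {"feature": j, "threshold": thr,
                            "polarity": polarity, "error": err}
    return best

Libraries implement this search far more efficiently, but the principle is the same.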

Fusing the classifiers

Naturally, the next step is to fuse these classifiers into a sign classifier: a data point is classified as 0 or 1 depending on which side of the dividing line it falls on. This process can be achieved as follows:

[Figure: fused classifier]

Did you discover a possible way to improve the performance of the classifier?

By weighting each classifier, we avoid giving every classifier the same importance.
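In common notation (a sketch, not the article's own formula), the weighted combination of the T weak learners is:

H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)

where α_t is larger for the classifiers with smaller weighted error, so the more accurate classifiers count more in the vote.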

[Figure: AdaBoost]

Summary

Let's summarize what we've covered so far in this article in a little pseudocode.

[Figure: AdaBoost pseudocode]

The key points to remember are:

  • Z is a constant that normalizes the weights so that they add up to 1!

  • α_t is the weight applied to each classifier

You're done! This algorithm is called "AdaBoost". If you want to fully understand all boosting methods, then this is the most important algorithm you need to understand.
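To recap in runnable form, here is a minimal from-scratch sketch of the loop (an illustration assuming binary labels encoded as -1/+1 and using scikit-learn stumps as weak learners; all variable names are mine):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train T weighted stumps. y must be a NumPy array of labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start from uniform weights
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # weak learner on the weighted data
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])               # weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))   # classifier weight alpha_t
        w = w * np.exp(-alpha * y * pred)        # increase weights of misclassified points
        w = w / w.sum()                          # divide by Z so the weights sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(agg)                          # the fused sign classifier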

Computational efficiency

The Boosting algorithm trains very fast, which is great. But we consider every possible stump and compute exponentials recursively, so why is training so fast?

Here comes the magic! If we choose α_t and Z appropriately, the weights that are supposed to change at each step reduce to the following simple form:

[Figure: weights obtained after choosing appropriate α and Z]
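The original figure is not available; in standard AdaBoost notation (a sketch, which may differ slightly from the article's figure), the usual choices and the resulting simplification are:

\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right), \qquad Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}

w_{t+1,i} = \frac{w_{t,i}\,e^{-\alpha_t y_i h_t(x_i)}}{Z_t} =
\begin{cases}
w_{t,i}\,/\,2\epsilon_t & \text{if } h_t(x_i) \neq y_i \\
w_{t,i}\,/\,2(1-\epsilon_t) & \text{if } h_t(x_i) = y_i
\end{cases}

so the misclassified points together always carry total weight \sum_{i:\,h_t(x_i)\neq y_i} w_{t+1,i} = \tfrac{1}{2}.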

This is a very strong result, and it does not contradict the claim that the weights change across iterations: since the number of misclassified training samples decreases, their total weight remains 0.5! As a consequence:

  • No need to calculate Z

  • No need to calculate alpha

  • No need to calculate exponents

Another little trick: any classifier that tries to split two data points that are already well classified cannot be optimal, so we don't even need to evaluate it.

Let's try programming!

Now let's walk through a quick code example that uses AdaBoost for handwritten digit recognition in Python.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve

from sklearn.datasets import load_digits

First, load the data:

dataset = load_digits()
X = dataset['data']
y = dataset['target']

X contains arrays of length 64, which are flattened 8x8 images; the task is handwritten digit recognition. The image below shows an example digit:

[Figure: example handwritten digit from the dataset]
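If you want to display such a digit yourself, a minimal sketch (using the `images` attribute of the same dataset object):

plt.imshow(dataset['images'][0], cmap='gray_r')   # first digit as an 8x8 grayscale image
plt.title(f"Label: {y[0]}")
plt.show()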

If we stick with a decision tree classifier of depth 1 (a decision stump), here is how the AdaBoost classifier is implemented in this case:

# AdaBoost with decision stumps (depth-1 trees) as weak learners
reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1))
# Mean accuracy over 6-fold cross-validation
scores_ada = cross_val_score(reg_ada, X, y, cv=6)
scores_ada.mean()

The classification accuracy obtained this way is about 26%, which leaves plenty of room for improvement. One of the key parameters is the depth of the sequentially trained decision tree classifiers. How does classification accuracy improve as the depth of the trees changes?

score = []
for depth in [1, 2, 10]:
    # AdaBoost with deeper base trees as weak learners
    reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
    scores_ada = cross_val_score(reg_ada, X, y, cv=6)
    score.append(scores_ada.mean())
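To see which depth wins, a possible follow-up once the loop above has run:

# Print the cross-validated accuracy for each tested depth
for depth, acc in zip([1, 2, 10], score):
    print(f"max_depth={depth}: mean CV accuracy = {acc:.3f}")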

In this simple example, the classifier reaches its highest classification accuracy, 95.8%, when the depth of the base decision tree is 10.

Epilogue

Researchers have studied whether AdaBoost overfits. It has been shown that AdaBoost can overfit at some point, and users should be aware of this. AdaBoost can also be used as a regression algorithm.
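For completeness, here is a minimal sketch of the scikit-learn regression variant on synthetic data (this example is mine, not from the original article):

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Boost shallow regression trees on a toy regression problem
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=100)
print(cross_val_score(reg, X_reg, y_reg, cv=5).mean())  # mean R^2 across the folds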

Reference link: https://towardsdatascience.com/boosting-and-adaboost-clearly-explained-856e21152d3e

