Ensemble methods such as random forests are very popular in Kaggle competitions and other machine learning tasks, and they are incredibly powerful. This article introduces the basic idea of Boosting and then works up to the AdaBoost algorithm in detail, showing how to implement it along the way. These are the stepping stones to the whole family of ensemble methods.
This article will cover:
- A quick review of bagging
- Limitations of bagging
- Conceptual details of Boosting
- Computational efficiency of Boosting
- A code example
Limitations of Bagging
Consider a binary classification problem in which we classify each observation as 0 or 1. Although bagging is not the focus of this article, let's briefly review the concept for clarity.

Bagging stands for "Bootstrap Aggregating". The essence is to draw T bootstrap samples, fit a classifier on each sample, and train these models in parallel; in a random forest, for example, the decision trees are trained in parallel. The results of all the classifiers are then averaged to obtain the bagging classifier:

h_bag(x) = (1/T) · (h_1(x) + h_2(x) + … + h_T(x))
The process can be illustrated as follows. Consider 3 classifiers that each produce a result that may be right or wrong. If we plot the results of the 3 classifiers, there will be regions where a classifier is wrong. In the image below, such regions are shown in red:
_Example of a bagging application scenario_
This example illustrates the point well: one classifier is wrong while the other two are correct, so a majority vote still yields the right answer and the ensemble achieves high accuracy. But as you might guess, the bagging mechanism breaks down when all the classifiers are wrong in the same region.
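The voting intuition can be sketched in a few lines. The predictions below are made up purely for illustration: each classifier errs on a different observation, so the majority vote is always right.

```python
import numpy as np

# Hypothetical predictions of 3 classifiers on 6 observations (invented
# to illustrate the voting idea; each classifier errs in a different place).
y_true = np.array([1, 0, 1, 1, 0, 0])
preds = np.array([
    [1, 0, 1, 0, 0, 0],   # classifier 1: wrong on observation 4
    [1, 0, 0, 1, 0, 0],   # classifier 2: wrong on observation 3
    [1, 1, 1, 1, 0, 0],   # classifier 3: wrong on observation 2
])

# Majority vote: average the 0/1 votes and round.
vote = (preds.mean(axis=0) >= 0.5).astype(int)
print(vote.tolist())              # -> [1, 0, 1, 1, 0, 0]
print((vote == y_true).mean())    # -> 1.0: every error is outvoted
```

If all three classifiers were wrong on the same observation, no amount of voting would recover it; that is exactly the failure mode boosting addresses.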
For this reason, the intuitive idea behind the boosting method is:
- Train the models serially, not in parallel.
- Have each model focus on where the previous classifier underperformed.
**Introduction to Boosting**

Concept
The above idea can be interpreted as:
- Train a model h1 on the entire dataset
- Up-weight the data from the regions where h1 performs poorly, and train a model h2 on those data
- Up-weight the data from the regions where h1 ≠ h2, and train a model h3 on these data
- …
Instead of parallel training, we can train these models serially. This is the essence of Boosting!
Boosting methods train a series of low-performing models, called weak learners, by iteratively adjusting the weighted error metric over time. A weak learner is a model whose error rate is slightly below 50%, i.e. slightly better than random guessing, as shown in the figure below:
_Weak classifier with an error rate slightly below 50%_
Weighted Error
How can we implement such a system? We do it by weighting the errors throughout the iterations: we give more weight to the regions where the previous classifier performed poorly.
Think of the data points on a 2D plane. Some points will be classified correctly, some will not. Typically, when computing the error rate, each error is weighted 1/n, where n is the number of data points to be classified.
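In code, the unweighted and weighted error differ only in the weight vector. The labels, predictions, and weights below are made up for illustration:

```python
import numpy as np

# Toy labels and predictions (invented): the classifier is wrong on
# points 1 and 3 (0-indexed).
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])

# Unweighted error: every point carries weight 1/n.
w_uniform = np.full(5, 1 / 5)
err_uniform = np.sum(w_uniform * (y_pred != y_true))
print(err_uniform)    # -> 0.4 (2 mistakes out of 5)

# Weighted error: points 1 and 3 were missed before, so they weigh more.
w = np.array([0.1, 0.3, 0.1, 0.3, 0.2])
err_weighted = np.sum(w * (y_pred != y_true))
print(err_weighted)   # -> 0.6 (the same mistakes now cost more)
```

Note that the weights always sum to 1, so the weighted error is still a number between 0 and 1.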
_Unweighted error_
Now let's weight the error!
_Weighted error_
By now, you may have noticed that we gave higher weights to data points that were not well classified. The weighting process is shown in the following figure:
_Example of the weighting process_
Ultimately, we want to build a strong classifier like the one shown below:
_Strong classifier_
Decision Stump
You may ask: how many classifiers do we need for the whole boosting system to work well? And how is a classifier chosen at each step?
The answer is the so-called "decision stump"! A decision stump is a single-level decision tree. The main idea is that at each step we have to find the best stump (i.e. get the best data partition) that minimizes the overall error. You can think of a stump as a test, where we assume that all data points on one side of the stump belong to class 1 and all data points on the other side belong to class 0.
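To make this concrete, here is a brute-force sketch of a weighted stump search. The `best_stump` helper and the toy data are invented for illustration; it simply tries every feature, threshold, and orientation and keeps the split with the lowest weighted error:

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively search one-feature threshold stumps and return the
    (feature, threshold, polarity, error) with the lowest weighted error.
    y must be in {0, 1} and the weights w must sum to 1."""
    best = (None, None, None, np.inf)
    for j in range(X.shape[1]):                  # every feature
        for thr in np.unique(X[:, j]):           # every candidate threshold
            for polarity in (0, 1):              # which side is labelled 1
                pred = (X[:, j] >= thr).astype(int)
                if polarity == 0:
                    pred = 1 - pred
                err = np.sum(w * (pred != y))    # weighted error of this "test"
                if err < best[3]:
                    best = (j, thr, polarity, err)
    return best

# Toy data (made up): feature 0 separates the classes perfectly.
X = np.array([[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = np.full(4, 0.25)
feature, threshold, polarity, error = best_stump(X, y, w)
print(feature, threshold, error)   # -> 0 8.0 0.0
```

The two orientations per dividing line are exactly the factor of "2" discussed below.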
There are many possible decision stumps. Next, let's count how many there are in this simple example:
_3 data points to be divided_
In fact, there are 12 stump combinations in this example! This may seem surprising, but it's actually quite easy to understand.
_12 decision stumps_
We can make 12 possible "tests" in the situation above. The number "2" beside each dividing line simply reflects the fact that the points on a given side of the line may belong to either class 0 or class 1; each dividing line therefore embeds 2 "tests".
At each iteration t, we select the weak classifier ht that best splits the data, i.e. the one that minimizes the overall error rate. Recall that the error rate here is the weighted error introduced earlier.

Finding the Best Split

As described above, the best split is found by identifying, at each iteration t, the best weak classifier ht, usually a decision stump (a decision tree with 1 node and 2 leaves). Suppose we are trying to predict whether a loan applicant will be a good repayer:
_Finding the best split_
In this case, the best split at step t is the stump on "payment history", since this split has the smallest weighted error.

Just note that, in practice, such tree classifiers may be deeper than a simple stump; the maximum depth is a hyperparameter.
Combining the Classifiers

Naturally, the next step is to combine these classifiers into a sign classifier: a data point is classified as 0 or 1 depending on which side of the dividing line it falls on. This can be achieved as follows:
_Combining the classifiers_
Did you discover a possible way to improve the performance of the classifier?
By weighting each classifier, you can avoid giving different classifiers the same importance.
_AdaBoost_
Summary
Let's summarize what we've covered so far in this article in a little pseudocode.
_Pseudocode_
The key points to remember are:
- Z is a constant that normalizes the weights so that they sum to 1!
- α_t is the weight applied to each classifier
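For reference, the standard AdaBoost update rules can be written out explicitly (this form assumes labels coded as -1/+1, the usual convention):

```latex
\varepsilon_t = \sum_i w_t(i)\,\mathbb{1}\!\left[h_t(x_i) \neq y_i\right]
\qquad \text{(weighted error of } h_t\text{)}

\alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}
\qquad \text{(weight of classifier } h_t\text{)}

w_{t+1}(i) = \frac{w_t(i)\,e^{-\alpha_t\,y_i\,h_t(x_i)}}{Z_t}
\qquad \text{(weight update, } Z_t \text{ normalizes the sum to 1)}

H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T}\alpha_t\,h_t(x)\right)
\qquad \text{(final strong classifier)}
```

Note that α_t grows as the weighted error ε_t shrinks: more accurate weak learners get a bigger say in the final vote.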
You're done! This algorithm is called "AdaBoost". If you want to fully understand all boosting methods, then this is the most important algorithm you need to understand.
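As a sanity check, the pseudocode can be sketched from scratch in a few dozen lines. This is a minimal illustrative implementation, not a replacement for sklearn's `AdaBoostClassifier`; the synthetic dataset and the number of rounds T are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """Minimal AdaBoost with decision stumps; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start with uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # the stump sees the weighted data
        pred = stump.predict(X)
        eps = w[pred != y].sum()             # weighted error of this stump
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))  # clip to avoid /0
        w = w * np.exp(-alpha * y * pred)    # up-weight the mistakes
        w = w / w.sum()                      # this normalization is Z_t
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted vote of all stumps, then take the sign.
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)

# Illustrative run on a synthetic dataset.
X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                                # map {0, 1} labels to {-1, +1}
stumps, alphas = adaboost_fit(X, y, T=20)
acc = (adaboost_predict(stumps, alphas, X) == y).mean()
print(acc)                                   # training accuracy grows with T
```

Each round fits one stump on the reweighted data, exactly as in the pseudocode above.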
Computational Efficiency

Boosting algorithms train remarkably fast, which is great. But we considered every possible stump and computed exponentials at every step, so why is training so fast?

Here comes the magic! If we choose α_t and Z appropriately, the weights that change at each step reduce to the following simple form:
_Weights obtained after choosing appropriate α and Z_
This is a very strong result, and it does not contradict the claim that the weights change across iterations: although the number of misclassified training samples decreases, their total weight stays at 0.5! As a consequence:

- No need to compute Z
- No need to compute α
- No need to compute exponentials
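To see why, substitute α_t = ½ ln((1-ε_t)/ε_t) into the weight update; the exponential collapses into a plain multiplicative factor:

```latex
w_{t+1}(i) =
\begin{cases}
\dfrac{w_t(i)}{2\,(1-\varepsilon_t)} & \text{if } h_t(x_i) = y_i \\[8pt]
\dfrac{w_t(i)}{2\,\varepsilon_t} & \text{if } h_t(x_i) \neq y_i
\end{cases}
```

Since the misclassified points carried total weight ε_t, multiplying them by 1/(2ε_t) gives them total weight 1/2, and likewise the correctly classified points end up with total weight 1/2.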
Another little trick: any classifier that tries to split two data points that are already correctly classified will not be optimal, so we don't even need to evaluate it.
Let's try programming!
Now let's walk through a quick code example that uses AdaBoost for handwritten digit recognition in Python.
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
```
First, load the data:
```python
dataset = load_digits()
X = dataset['data']
y = dataset['target']
```
X contains arrays of length 64 that represent flattened 8x8 images, and the task is handwritten digit recognition. The image below shows an example of one of the handwritten digits:
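If you want to look at one of the digits yourself, reshape a row of X back into 8x8 and render it (the `Agg` backend and the output filename are arbitrary choices so this runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")                 # file-based backend; no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)              # -> (1797, 64): 1797 flattened 8x8 images

# Reshape one flat 64-vector back into an 8x8 grid to display it.
plt.imshow(digits.data[0].reshape(8, 8), cmap="gray_r")
plt.title(f"label: {digits.target[0]}")
plt.savefig("digit0.png")
```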
If we insist on using a decision tree classifier of depth 1 (decision stump), here is how the AdaBoost classifier is implemented in this case:
```python
reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1))
scores_ada = cross_val_score(reg_ada, X, y, cv=6)
scores_ada.mean()
```
The classification accuracy obtained this way should be about 26%, which leaves plenty of room for improvement. One of the key parameters is the depth of the base decision tree classifier. So how does classification accuracy change as the depth of the trees changes?
```python
score = []
for depth in [1, 2, 10]:
    reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
    scores_ada = cross_val_score(reg_ada, X, y, cv=6)
    score.append(scores_ada.mean())
```
In this simple example, the classifier achieved the highest classification accuracy of 95.8% when the depth of the decision tree was 10.
Epilogue
Researchers have studied whether AdaBoost overfits. It has been shown that AdaBoost can overfit after enough iterations, so users should be aware of this. AdaBoost can also be used as a regression algorithm.
Reference link: https://towardsdatascience.com/boosting-and-adaboost-clearly-explained-856e21152d3e