Machine Learning Practical Tutorial (13): Ensemble Learning

Introduction

Ensemble learning is a machine learning method that aims to improve overall prediction accuracy by combining the predictions of multiple individual learners (called base classifiers or base learners).

Ensemble learning can be seen as "many people working on a task together". Each base classifier is an independent learner that is trained on training data and produces a prediction. These base classifiers can use different algorithms, different parameter settings, or different training data. Finally, the predictions of all base classifiers are collected, and the final prediction is obtained through some combination method (such as voting, weighted voting, etc.).

Compared with a single classifier, ensemble learning can significantly improve the accuracy and generalization ability of the classifier. This is because ensemble learning can effectively reduce the bias and variance of the classifier, thereby avoiding overfitting and underfitting problems. In addition, ensemble learning can also increase the robustness of the classifier, making it more tolerant to noise and outliers.

At present, ensemble learning has been widely used in various fields, such as image recognition, natural language processing, financial risk assessment, etc. Common ensemble learning methods include Bagging, Boosting, Stacking, etc.

In ensemble learning, several concepts are usually involved, including:

  1. Base Classifier: a separate, independent learner whose predictions are combined to produce the final prediction. Commonly used base classifiers in ensemble learning include decision trees, support vector machines, logistic regression, naive Bayes, and neural networks. Different base classifiers may perform differently on different data sets and tasks, so in practice an appropriate base classifier should be chosen for the specific situation.

  2. Ensemble Classifier: refers to a classifier composed of multiple base classifiers. The ensemble classifier can be regarded as a "meta-classifier", which can combine the prediction results of multiple base classifiers to obtain more accurate prediction results.

  3. Bagging (Bootstrap Aggregating): an ensemble learning method based on bootstrap sampling. It generates multiple training sets by repeatedly sampling the original training set with replacement and uses each training set to train a base classifier. Finally, the predictions of all base classifiers are combined by voting or averaging to obtain the final prediction.

  4. Boosting: an iterative ensemble learning method that gradually improves the performance of the base classifiers. It re-weights the training set so that the base classifiers pay more attention to previously misclassified samples, thereby improving overall accuracy. There are many Boosting methods, such as AdaBoost and Gradient Boosting.

  5. Stacking: an ensemble learning method that takes the predictions of multiple base classifiers as input and then trains a "meta-classifier" on them. Stacking can be viewed as a two-level learning method: the predictions of the base classifiers are used as new features, and a second-level model is trained on these features to obtain more accurate predictions (a short code sketch appears below).

These concepts are fundamental to ensemble learning; understanding them helps us better understand and apply ensemble learning algorithms.
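For instance, the Stacking idea above can be sketched with scikit-learn's StackingClassifier. This is only a minimal illustration; the particular base learners and meta-classifier chosen here are illustrative assumptions, not a prescribed recipe.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Level-0 base classifiers; their predictions become the meta-classifier's inputs
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Level-1 "meta-classifier" trained on the base classifiers' outputs
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))

print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())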

Ensemble classification methods

Commonly used ensemble classification methods include the following:

  1. Bagging: a method based on bootstrap sampling that trains multiple independent base classifiers and then votes on or averages their outputs.

  2. Boosting: trains a series of weak classifiers sequentially; each round adjusts the sample weights according to the errors of the previous round's classifier so that misclassified samples receive more attention, thereby improving the performance of the classifier.

The following are common ensemble classifiers built on these two approaches:

  1. Random Forest: a Bagging ensemble method based on decision trees. It generates multiple decision trees by randomly selecting features and samples and then votes on their outputs.

  2. AdaBoost: a Boosting-based ensemble method. It trains a series of weak classifiers sequentially; each round adjusts the sample weights according to the errors of the previous round's classifier, assigns each classifier a weight during training, and finally combines the classifiers' outputs as a weighted sum.

  3. Gradient Boosting Decision Tree (GBDT): a Boosting-based ensemble method that improves performance by training a series of decision trees, each fitted to the residuals of the previous tree. The outputs of all decision trees are then combined as a weighted sum.

These ensemble classifiers may show different performances in different data sets and tasks, so in practical applications, it is necessary to select an appropriate ensemble classifier according to the specific situation.

Bagging

Bootstrap aggregating, commonly shortened to bagging, applies bootstrap sampling to the training data, that is, it samples the data with replacement. The main idea is as follows (a short code sketch follows the list):

  • In each round, n training samples are drawn from the original sample set using bootstrap sampling (some samples may be drawn multiple times, while others may never be selected). A total of k rounds are performed, yielding k training sets that are independent of each other.
  • Each training set is used to train one model, so the k training sets yield k models in total. (Note: no specific classification or regression algorithm is prescribed here; we can use different methods according to the problem, such as decision trees or perceptrons.)
  • For classification problems: use k models obtained in the previous step to vote to obtain the classification results; for regression problems, calculate the mean of the above models as the final result. (all models have the same importance)
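As a concrete illustration, this procedure can be sketched with scikit-learn's BaggingClassifier. The settings below (10 bootstrapped decision trees on the iris data) are assumptions for demonstration; note that scikit-learn versions before 1.2 name the first argument base_estimator instead of estimator.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 10 bootstrap training sets, one decision tree per set, majority vote at prediction time
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `base_estimator=` in scikit-learn < 1.2
    n_estimators=10,
    bootstrap=True,  # sample the training set with replacement
    random_state=0,
)

print("CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())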

Boosting

Boosting is a technique closely related to Bagging. The idea of Boosting is to iteratively train the base classifiers using a re-weighting method. The main idea is as follows (a toy sketch of the weight update follows the list):

  • Each training sample is assigned a weight, and the weight distribution in each round depends on the classification results of the previous round: samples that were misclassified receive higher weights, and an exponential function is used to amplify these weights.
  • The base classifiers are combined using a sequential linear weighting method.

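The re-weighting step can be illustrated with a toy NumPy calculation. The labels and predictions below are made up; this is not a library API, just the exponential update described above.

import numpy as np

y_true = np.array([1, -1, 1, 1, -1])       # true labels in {-1, +1}
y_pred = np.array([1, 1, 1, -1, -1])       # predictions of the current base classifier
w = np.full(len(y_true), 1 / len(y_true))  # current sample weights (uniform at the start)

eps = np.sum(w[y_pred != y_true])         # weighted error rate of this round
alpha = 0.5 * np.log((1 - eps) / eps)     # weight given to this base classifier
w = w * np.exp(-alpha * y_true * y_pred)  # exponentially amplify misclassified samples
w = w / w.sum()                           # renormalize to a probability distribution

print(w)  # the two misclassified samples now carry larger weights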

The difference between Bagging and Boosting

Sample selection:

  • Bagging: The training set is selected with replacement from the original set, and each training set selected from the original set is independent.
  • Boosting: The training set in each round remains unchanged, but the weight of each sample in the training set in the classifier changes. The weights are adjusted based on the classification results of the previous round.

Sample weight:

  • Bagging: Use uniform sampling, with equal weight for each sample.
  • Boosting: the sample weights are continuously adjusted according to the error rate; the more often a sample is misclassified, the larger its weight becomes and the more attention it receives in subsequent rounds.

Prediction function:
In ensemble learning, we usually assign a weight to each base classifier, and this weight depends on the performance of the classifier on the training set. For well-performing classifiers, we give higher weights so that they play a more important role in the voting decision. Conversely, poorly performing classifiers are assigned lower weights to reduce their impact on the final results.

  • Bagging: All prediction functions are equally weighted.
  • Boosting: Each weak classifier has a corresponding weight, and classifiers with small classification errors will have greater weights.

Parallel computing:

  • Bagging: Individual prediction functions can be generated in parallel.
  • Boosting: Each prediction function can only be generated sequentially, because each model's parameters depend on the results of the previous round.

Combining decision trees with these algorithm frameworks yields the following well-known algorithms (a short sketch of each follows the list):

  • Bagging + Decision Tree = Random Forest
  • AdaBoost + Decision Tree = Boosted Tree
  • Gradient Boosting + Decision Tree = GBDT
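These combinations are available directly in scikit-learn. The instantiations below are only illustrative; the hyperparameter values are assumptions, not recommendations.

from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Bagging + decision trees = random forest
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# AdaBoost + decision trees = boosted trees (scikit-learn uses depth-1 stumps by default)
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

# Gradient boosting + decision trees = GBDT
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)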

Random Forest

Bootstrap sampling

Random forest is an algorithm based on the Bagging idea of ensemble learning. It combines multiple decision trees into one ensemble and increases the randomness of the model, and thus its generalization ability, by introducing random feature selection and random sampling of the training examples. Here is a simple example to illustrate how a random forest is built.

Suppose we have a dataset with 1000 samples, each sample contains 5 features. We want to classify this dataset using Random Forest.

First, we randomly sample the data set: we draw 1,000 samples from the original data set with replacement. Sampling with replacement means the same sample may be drawn multiple times. For example, sample 1 may be drawn, a copy of it recorded into the new set, and the sample returned to the original data set, so the next draw from the 1,000 originals may again be sample 1, or it may be sample 100. Every draw is made from all 1,000 samples, and 1,000 draws are made in total; in the (practically impossible) extreme, every single draw could be sample 1. Note that drawing a sample does not remove it from the source data set; the new set only holds copies. The resulting data set is called a bootstrap sample. Some samples may appear in it several times, while others may not be sampled at all. This procedure is called bootstrapping.

Next, we train a decision tree on this bootstrap sample. When training the decision tree, we randomly select features at each node: a fixed number of features is drawn at random from the original features, and the best of these is chosen for the split. This process is called random feature selection.

After training the first decision tree, we can make predictions for the remaining (out-of-bag) samples and record the prediction for each of them. Then we again draw 1,000 samples from the original data set with replacement to form a new bootstrap sample, train a second decision tree in the same way, and record its predictions.

This process is repeated until the specified number of decision trees are trained. Finally, we can vote on the prediction results of each sample to get the final prediction result of the random forest.

It should be noted that decision trees in random forests are usually trained in parallel, that is, each decision tree can be trained on an independent CPU core, thus accelerating the model training process. In addition, the decision tree in the random forest can use some pruning strategies to prevent overfitting, such as minimum sample number limit, maximum depth limit, etc.
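The bootstrap sampling step itself can be sketched in a few lines of NumPy. The sample count of 1,000 follows the example above; the fraction of distinct samples drawn typically comes out around 63%.

import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Draw 1,000 indices with replacement: some indices repeat, others never appear
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
out_of_bag = np.setdiff1d(np.arange(n_samples), bootstrap_idx)

print("unique samples in the bootstrap set:", len(np.unique(bootstrap_idx)))  # roughly 630
print("out-of-bag samples:", len(out_of_bag))                                 # roughly 370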

Making predictions

Note that an ensemble learning algorithm does not pick a single best model out of several trained models; instead, all models make predictions, and the final result is obtained by combining them through voting, averaging, or similar methods.

Random forest is an ensemble learning method built on decision trees. The prediction process of a random forest is as follows (a small sketch of the final voting step appears after the list):

  1. n samples are randomly selected from the training set with replacement as a subset, and the size of this subset is the same as the size of the training set.

  2. For this subset, k features are randomly selected, where k is a fixed hyperparameter, which is generally smaller than the total number of features.

  3. Based on this subset and k features, a decision tree model is trained.

  4. Repeat steps 1-3 m times to obtain m decision tree models.

  5. For a new data point, input it into each decision tree model to obtain m prediction results.

  6. For classification problems, voting is used, and the category with the most votes among the m prediction results is used as the final prediction result. For regression problems, the average value is used to average m prediction results as the final prediction result.

It should be noted that in random forest, each decision tree model is trained independently, so training and prediction can be performed in parallel, thereby improving the efficiency of the model.
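The final combination step (step 6) amounts to a simple vote or average over the m per-tree predictions. A toy sketch with made-up predictions:

import numpy as np

# Made-up predictions from m = 3 trees for 3 data points (class labels 0/1/2)
tree_preds = np.array([[0, 1, 2],
                       [0, 2, 2],
                       [0, 1, 2]])

# Classification: majority vote per data point (per column)
votes = [np.bincount(col).argmax() for col in tree_preds.T]
print(votes)  # [0, 1, 2]

# Regression would instead average the per-tree outputs
print(tree_preds.mean(axis=0))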

Iris Prediction

The iris dataset is a set of labeled multivariate data that contains measurements of three different species of iris (Iris setosa, Iris versicolor, and Iris virginica). These measurements (features) are sepal length, sepal width, petal length, and petal width. Each sample contains these four measurements, for a total of 150 samples.

The purpose of this dataset is to differentiate between different varieties of iris flowers using these measurements. This is a very common machine learning problem and is widely used to benchmark classification and clustering algorithms.

The Iris dataset is one of the most commonly used datasets in the field of machine learning. It is widely used in data visualization, model evaluation, feature selection, and algorithm comparison.

We first load the iris dataset using the load_iris function from the sklearn library.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the iris dataset
iris = load_iris()
# Print the number of features and the size of the dataset
print("Number of features: ", len(iris.feature_names))
print("Number of samples: ", len(iris.data))
print(iris.target)

Output:

Number of features:  4
Number of samples:  150
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Then we use the train_test_split function to split the dataset into training and test sets, holding out 30% of the data for testing.

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3,random_state=2)

For comparison, first use a single decision tree for prediction.

from sklearn.tree import DecisionTreeClassifier
# Define the decision tree classifier
clf = DecisionTreeClassifier(random_state=1)

# Train the model
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Output: Accuracy: 0.9555555555555556

RandomForestClassifier is a classification model based on the random forest algorithm. Its most important parameters are explained below (an illustrative instantiation follows the list).

  1. n_estimators: the number of decision trees in the forest. The default value is 100. The larger n_estimators is, the better the accuracy and stability of the model tend to be, but training time increases, so there is a trade-off between accuracy and time cost.

  2. criterion: the measure of split quality; you can choose "gini" or "entropy". The default value is "gini", meaning the Gini index is used to measure split quality, while "entropy" means information gain is used. Generally speaking, the Gini index is faster to compute than information gain, but in some cases information gain may perform better.

  3. max_depth: Indicates the maximum depth of the decision tree. The default value is None, which means there is no limit to the depth. If you set max_depth to a smaller value, you can avoid overfitting, but it may affect the accuracy of the model.

  4. min_samples_split: the minimum number of samples a node must contain before it can be split. The default value is 2. Setting min_samples_split to a larger value helps prevent the tree from overfitting to local patterns, but it may cause underfitting.

  5. min_samples_leaf: Indicates the minimum number of samples of leaf nodes, the default value is 1. If you set min_samples_leaf to a smaller value, you can make the model more flexible, but it may cause the decision tree to overfit.

  6. max_features: the number of features to consider when looking for the best split. It can be an integer, a float, a string, or None. An integer means exactly that many features are considered; a float means that fraction of the total number of features; the string "sqrt" means the square root of the total number of features; the string "log2" means the base-2 logarithm of the total number of features; None means all features are considered (there is also a legacy "auto" option whose meaning depends on the scikit-learn version). In practice, random forest performance is fairly stable, so max_features can usually be left at its default value.

  7. random_state: the random seed, which controls the randomness of the model and makes results reproducible. If random_state is set to an integer, that integer is used as the seed; if set to None, the default random number generator from np.random is used.

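To tie these parameters together, here is an illustrative instantiation; the specific values are assumptions for demonstration, not tuned recommendations.

from sklearn.ensemble import RandomForestClassifier

rfc_tuned = RandomForestClassifier(
    n_estimators=200,      # number of decision trees in the forest
    criterion="gini",      # split-quality measure ("gini" or "entropy")
    max_depth=10,          # cap tree depth to limit overfitting
    min_samples_split=2,   # minimum samples needed to split a node
    min_samples_leaf=1,    # minimum samples required at a leaf
    max_features="sqrt",   # features considered at each split
    random_state=42,       # make the randomness reproducible
)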
Using a random forest


# Create a random forest classifier with 100 decision trees; other parameters keep their default values
rfc = RandomForestClassifier(n_estimators=100)

# Train on the training set
rfc.fit(X_train, y_train)

# Predict on the test set
y_pred = rfc.predict(X_test)

# Compute the model's accuracy
accuracy = rfc.score(X_test, y_test)
print("Accuracy:", accuracy)

Output: Accuracy: 0.9777777777777777

AdaBoost

AdaBoost (Adaptive Boosting) is an ensemble learning method whose purpose is to combine multiple weak classifiers into a strong classifier. Its core idea is that each training session strengthens the weight of samples that were misclassified in the previous training and reduces the weight of those samples that were correctly classified. In this way, in each training session, the model will pay more attention to the samples with poor previous classification results, so that the entire model can better adapt to the data set.

Algorithm process and formulas

The process and formula of the AdaBoost algorithm are as follows:

  1. Initialize the weight of the training data: For a training set D with N samples, the weight of each sample is initialized to 1/N.

  2. For t=1,2,…T, do the following:

    a. Based on the current training data weight distribution, use a base classifier (such as a decision tree) for training.

    b. Calculate the weighted error rate of the base classifier under the current sample-weight distribution.

    c. Calculate the weight of the base classifier: The weight of the base classifier is related to its error rate. The smaller the error rate of the base classifier, the greater its weight.

    d. Update the weight of the training data: According to the weight of the base classifier, update the weight distribution of the training data so that the weight of samples with a large error rate of the base classifier increases and the weight of samples with a small error rate decreases.

  3. The final classifier is the weighted sum of the base classifiers, with the weights being the weights of each base classifier.

The formula of the AdaBoost algorithm is as follows:

Step 1: Initialize weights

$D_1(i) = \frac{1}{N}, \quad i = 1, 2, \dots, N$

Step 2: For t=1,2,...T, perform the following operations:

a. Training base classifier

$G_t(x): \mathcal{X} \rightarrow \{-1, 1\}$

b. Calculate error rate

$\epsilon_t = P(G_t(x_i) \ne y_i) = \sum_{i=1}^N D_t(i)\,[G_t(x_i) \ne y_i]$

c. Calculate the weight of the base classifier

$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$

d. Update weights

$D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i G_t(x_i))}{Z_t}, \quad i = 1, 2, \dots, N$
where $Z_t$ is the normalization factor that makes $D_{t+1}$ a probability distribution.

Step 3: Final classifier

$f(x) = \operatorname{sign}\left(\sum_{t=1}^T \alpha_t G_t(x)\right)$

where $\operatorname{sign}(x)$ is the sign function: $\operatorname{sign}(x) = 1$ if $x \ge 0$, and $\operatorname{sign}(x) = -1$ otherwise.
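The steps above can be translated almost line by line into code. The sketch below is a minimal, illustrative implementation (not scikit-learn's AdaBoost), assuming binary labels in {-1, +1} and depth-1 decision trees (stumps) as the base classifier G_t.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """Train T weighted stumps following the AdaBoost formulas above (y in {-1, +1})."""
    N = len(y)
    D = np.full(N, 1.0 / N)                       # Step 1: D_1(i) = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)          # a. train base classifier G_t
        pred = stump.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # b. weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)     # c. classifier weight alpha_t
        D = D * np.exp(-alpha * y * pred)         # d. re-weight the samples
        D = D / D.sum()                           #    Z_t: normalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final classifier: sign of the weighted sum of the base classifiers' outputs."""
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)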

Iris Prediction

The sklearn.ensemble module provides many ensemble methods, such as AdaBoost, Bagging, and Random Forest. This time we use AdaBoostClassifier.
Let's first take a look at the AdaBoostClassifier class, which has 5 main parameters.
The parameter description is as follows:

  • base_estimator: optional parameter; the default is DecisionTreeClassifier. In theory any classification or regression learner can be used, but it must support sample weights. Commonly used choices are CART decision trees or an MLP neural network. The default is a decision tree: AdaBoostClassifier uses the CART classification tree DecisionTreeClassifier by default, and AdaBoostRegressor uses the CART regression tree DecisionTreeRegressor by default. Another point to note is that if we choose the SAMME.R algorithm, the weak learner must also support probability prediction, i.e. in scikit-learn it must provide predict_proba in addition to predict.
  • algorithm: optional parameter; the default is SAMME.R. scikit-learn implements two AdaBoost classification algorithms, SAMME and SAMME.R. The main difference between the two is how the weak learner weights are measured: SAMME uses the classification performance on the sample set as the weak learner weight, while SAMME.R uses the predicted class probabilities. Because SAMME.R uses continuous probability estimates, it generally converges in fewer iterations than SAMME, which is why it is the default. We usually keep the default SAMME.R, but note that when using SAMME.R the weak learner given in base_estimator must support probability prediction; the SAMME algorithm has no such restriction.
  • n_estimators: integer, optional parameter; the default is 50. This is the maximum number of weak learner iterations, i.e. the maximum number of weak learners. Generally, if n_estimators is too small the model underfits, and if it is too large the model overfits, so a moderate value is usually chosen. In practice, n_estimators is often tuned together with the learning_rate parameter introduced below.
  • learning_rate: float, optional parameter; the default is 1.0. This is the weight-shrinkage coefficient applied to each weak learner and typically lies between 0 and 1. For the same fit on the training set, a smaller learning_rate means more weak learner iterations are needed, so these two parameters, n_estimators and learning_rate, must be tuned together. Generally, you can start tuning from a smaller learning_rate; the default is 1.
  • random_state: integer type, optional parameter, default is None. If an instance of RandomState, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Output: Accuracy: 1.0
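As a follow-up, the base learner can also be set explicitly instead of relying on the default stump. The values below are illustrative assumptions, and the split from above (X_train, X_test, y_train, y_test) is reused; note that scikit-learn >= 1.2 names this parameter estimator, while older versions use base_estimator.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Slightly deeper base trees, more rounds, smaller learning rate
clf2 = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # `base_estimator=` before scikit-learn 1.2
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
)
clf2.fit(X_train, y_train)
print("Accuracy:", clf2.score(X_test, y_test))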

Choosing an ensemble method

AdaBoost and Random Forest are both ensemble learning methods that build a strong classifier by combining multiple weak classifiers (or decision trees). Although they can all improve classification accuracy, their performance may differ under different data sets and scenarios.

Generally speaking, random forest is suitable for processing high-dimensional data and noisy data sets because it can randomly select features and samples to build multiple decision trees, thereby reducing the risk of overfitting. AdaBoost is suitable for processing low-dimensional data and complex classification problems because it can train multiple weak classifiers by adjusting weights and resampling, and combine them to get a stronger classifier.

Therefore, AdaBoost is usually able to show better classification accuracy when dealing with complex low-dimensional data sets, while random forests may be more suitable when dealing with high-dimensional data sets. However, this is only a general situation. In specific applications, appropriate algorithms need to be selected based on the characteristics and needs of the data set.


Origin blog.csdn.net/liaomin416100569/article/details/130501843