[Machine Learning] Ensemble Algorithms: The Bagging Strategy, with Detailed Examples


Foreword

Bagging is a widely used ensemble learning technique. Its full name is Bootstrap Aggregating, and its core idea is to improve the generalization ability of the overall model by combining the results of many independently trained learners. This post introduces how the Bagging algorithm works, its advantages and disadvantages, and how to implement it in Python.

1. Working principle

The working principle of the Bagging algorithm is very simple. First, it draws multiple subsets from the original dataset by random sampling with replacement, each typically the same size as the original dataset. It then trains a learner independently on each subset. Finally, at prediction time, Bagging combines the predictions of all learners (by voting for classification or averaging for regression) to obtain the final result.

Procedure

    1. Randomly sample, with replacement, m subsets D1, D2, ..., Dm from the original dataset D
    2. For each subset Di, train a base learner Hi
    3. For each test sample x, feed it to all base learners Hi and collect the corresponding predictions {y1, y2, ..., ym}
    4. Combine all predictions (e.g., by majority vote) to obtain the final prediction y (a minimal sketch of this procedure follows the list)
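To make these four steps concrete, here is a minimal sketch of manual bagging with a decision-tree base learner. The helper simple_bagging_fit_predict is a name introduced here purely for illustration, not a library function; scikit-learn's BaggingClassifier, used later in this post, implements the same idea.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_bagging_fit_predict(X_train, y_train, X_test, m=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    learners = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                     # step 1: sample with replacement
        h = DecisionTreeClassifier(max_depth=2)
        learners.append(h.fit(X_train[idx], y_train[idx]))   # step 2: train a base learner
    preds = np.array([h.predict(X_test) for h in learners])  # step 3: collect all predictions
    # step 4: majority vote over the m base learners for each test sample
    return np.array([np.bincount(col.astype(int)).argmax() for col in preds.T])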

2. Advantages and disadvantages

The Bagging algorithm has the following advantages:

  • The Bagging algorithm can significantly reduce the variance of the model and improve its stability. By using multiple independent learners, bagging reduces the model's sensitivity to the training data, so it adapts better to unseen test data.
  • The Bagging algorithm can be parallelized to speed up training. Since each base learner is trained independently, Bagging can process multiple subsets in parallel, speeding up the training process (a small snippet follows this list).
  • The Bagging algorithm is less prone to overfitting. By using random sampling to generate multiple subsets, bagging reduces the risk of the model overfitting the training data.
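As a small illustration of the parallelism point above, scikit-learn's BaggingClassifier accepts an n_jobs parameter. The values below are arbitrary and the variable name parallel_bag is introduced only for this sketch.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# n_jobs=-1 fits the independent base learners on all available CPU cores in parallel
parallel_bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    n_jobs=-1,
    random_state=42,
)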

The disadvantages of the Bagging algorithm include:

  • The Bagging algorithm is sensitive to noisy data. Since Bagging trains a base learner on each subset, a subset that contains a large amount of noise can degrade the performance of the corresponding base learner and, in turn, of the whole ensemble.
  • The Bagging algorithm is prone to excessive consistency in classification problems. If the different base learners all make the same wrong prediction for a test sample, Bagging has no way to correct that error.

3. Practical cases

  • Construct the training and test sets; here we use only the last two features (petal length and petal width) for training
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the iris dataset and keep only the last two features (petal length and petal width)
iris = load_iris()
X = iris.data[:, 2:4]
y = iris.target

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • Decision-tree bagging model
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),   # weak base learner: a shallow decision tree
    n_estimators=50,                       # number of base learners
    max_samples=50,                        # bootstrap sample size per learner
    bootstrap=True,                        # sample with replacement
    random_state=42
)

bag_clf.fit(X_train, y_train)

This builds 50 decision trees as base learners, each trained on a bootstrap sample of 50 instances; choose these values according to your own data volume (max_samples can also be given as a fraction of the training set, as shown below).
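As an aside, here is an illustrative variant that passes max_samples as a float, which BaggingClassifier interprets as a fraction of the training set; bag_clf_frac is not used later in the post.

# Illustrative variant: each base learner sees a bootstrap sample of 80% of the training set
bag_clf_frac = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)
bag_clf_frac.fit(X_train, y_train)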

  • Prediction results
y_pred = bag_clf.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy:', accuracy_score(y_test, y_pred))


  • Comparative experiments
    We compare the bagging model above with a single decision tree
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

y_pred_tree = tree_clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_tree))


The accuracy is lower than with bagging. For a better look, we visualize the decision boundaries:

from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt
import mplcyberpunk

plt.style.use('cyberpunk')

def plot_decision_boundary(model, ax):
    # Draw the model's decision regions over the two petal features
    DecisionBoundaryDisplay.from_estimator(
        model,
        X,
        ax=ax,
        response_method="auto",
        alpha=0.5,
    )
    # Overlay the samples, colored by class
    for i, target in enumerate(iris.target_names):
        ax.scatter(
            X[:, 0][y == i],
            X[:, 1][y == i],
            edgecolors='black',
            label=target,
        )

ax = plt.subplot(121)
plot_decision_boundary(tree_clf, ax)
plt.title('Decision Tree')

ax = plt.subplot(122)
plot_decision_boundary(bag_clf, ax)
plt.title('Decision Tree With Bagging')
plt.show()


4. OOB strategy

The OOB (Out-Of-Bag) strategy is a natural by-product of the Bagging algorithm. In Bagging, to train multiple base learners, we draw multiple sub-samples from the original dataset with replacement and use each sub-sample to train a base learner. However, each bootstrap sample contains only about 63.2% of the distinct original samples, so the remaining roughly 36.8% of samples can be used for model evaluation; these are called OOB samples.
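The 63.2% figure comes from the probability that a given sample appears at least once in a bootstrap sample of size n, which is 1 - (1 - 1/n)^n and tends to 1 - 1/e ≈ 0.632 for large n. A quick numerical check:

n = 1000                              # size of the bootstrap sample (and of the original dataset)
prob_in_bag = 1 - (1 - 1 / n) ** n    # probability a given sample is drawn at least once
print(prob_in_bag)                    # ≈ 0.632, so ≈ 36.8% of samples end up out-of-bag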

The OOB strategy can be used to evaluate the performance of a Bagging model without a separate validation set, and it also helps estimate feature importance. In scikit-learn, pass oob_score=True when constructing the BaggingClassifier, then read the oob_score_ attribute to get the model's OOB evaluation result. Using the OOB strategy makes better use of the data and helps guard against overfitting to some extent. At the same time, by examining how the model behaves on OOB samples feature by feature, we can evaluate the importance of features and further optimize the model (one way to approximate this is sketched after this paragraph).
Note that the OOB strategy relies on each base learner being trained on a different bootstrap sample; if the base learners were all trained on the same sub-sample, the OOB evaluation would not be meaningful.
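BaggingClassifier itself does not expose an OOB-based feature importance attribute; one common way to approximate the idea described above is permutation importance on held-out data. This is an analogous technique rather than the exact OOB computation, sketched here with the bagging model fit earlier:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the held-out accuracy drops
result = permutation_importance(bag_clf, X_test, y_test, n_repeats=10, random_state=42)
for name, score in zip(iris.feature_names[2:4], result.importances_mean):
    print(name, round(score, 3))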

  • Rebuild the model with OOB evaluation enabled
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=50,
    random_state=42,
    bootstrap=True,
    oob_score=True        # evaluate each base learner on its out-of-bag samples
)

bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)

The OOB validation score is 0.94. Next, we score the model on the test set:

y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

We can also print the OOB class-probability estimates for each training sample:

bag_clf.oob_decision_function_


Each row gives the estimated probability that the corresponding training sample belongs to each class, computed from the base learners for which that sample was out-of-bag (a short check follows).
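Since each row of oob_decision_function_ corresponds to a training sample, taking the argmax of each row recovers the OOB class predictions, and their accuracy against y_train should match oob_score_:

import numpy as np

oob_pred = np.argmax(bag_clf.oob_decision_function_, axis=1)   # OOB-predicted class per training sample
print(np.mean(oob_pred == y_train))                            # should equal bag_clf.oob_score_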

5. Summary

The Bagging algorithm is an ensemble learning method that can significantly reduce the variance of a model and improve its stability. Its advantages include reduced variance, parallel training, and resistance to overfitting; its disadvantages include sensitivity to noisy data and a tendency toward excessive consistency among base learners.

I hope you'll continue to show your support; there are plenty more interesting things to share in the future.


Reprinted from: blog.csdn.net/qq_61260911/article/details/130249924