[Machine Learning] Ensemble Algorithms and Practical Cases


1. What is an ensemble algorithm?

An ensemble algorithm is a technique for improving machine learning performance by combining the predictions of multiple individual learners (classifiers, regressors, etc.). Each individual learner is trained on the same dataset, but the learners may use different parameter settings, algorithms, or feature subsets, with the goal of improving prediction accuracy and stability.

Ensemble algorithms are usually divided into three types: Bagging, Boosting, and Stacking.

  • Bagging draws random samples from the data with replacement and uses them to train multiple independent classifiers; the results of these classifiers are combined to produce the final prediction.
  • Boosting produces a strong classifier by training a sequence of weak classifiers, where each new classifier is designed to correct the errors of the previous ones.
  • Stacking aggregates multiple classification or regression models and trains them in stages: the base models are trained first, and a meta-model is then trained on their predictions (a minimal sketch follows this list).
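
Stacking is not covered by the case below, so here is a minimal sketch of the idea using scikit-learn's StackingClassifier on the Iris data; the choice of base and meta learners is illustrative, not taken from this article's experiments:

# Minimal Stacking sketch: the base learners' predictions feed a final meta-learner
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
stack_clf = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(random_state=42))],
    final_estimator=LogisticRegression()  # meta-learner trained on the base models' predictions
)
print("Stacking CV accuracy:", cross_val_score(stack_clf, X, y, cv=5).mean())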

Ensemble methods therefore have several advantages:

  • Ensemble algorithms usually have better generalization ability than single learners, which can reduce the risk of overfitting.
  • Since an ensemble combines the predictions of multiple learners, its predictions are usually more accurate than those of a single learner.
  • Ensembling algorithms can be applied to a variety of machine learning problems, including classification, regression, and clustering.

2. Bagging and Boosting case comparison

  • Dataset loading
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We will use three different ensemble algorithms to train our model:

  • Bagging-based decision tree algorithm
# Bagging ensemble of decision trees
dt = DecisionTreeClassifier()
bag_clf = BaggingClassifier(dt, n_estimators=500, max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
  • Random forest-based algorithm
# Random forest algorithm
rf_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
  • AdaBoost-based algorithm
# AdaBoost algorithm
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

Let's print their accuracy scores:

print("Bagging准确率:", accuracy_score(y_test, y_pred_bag))
print("随机森林准确率:", accuracy_score(y_test, y_pred_rf))
print("Adaboost准确率:", accuracy_score(y_test, y_pred_ada))

From the output, Bagging and random forest perform better than AdaBoost here, but different datasets behave differently, so it is worth experimenting; a quick cross-validation check is sketched below.
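
Since a single train/test split on such a small dataset can be noisy, here is a hedged sketch of the same comparison with 5-fold cross-validation on the full Iris data, assuming the bag_clf, rf_clf, and ada_clf objects defined above:

from sklearn.model_selection import cross_val_score

# Compare the same three ensembles with 5-fold cross-validation on the full Iris data
for name, clf in [("Bagging", bag_clf), ("Random forest", rf_clf), ("AdaBoost", ada_clf)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "CV accuracy:", scores.mean())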

3. Soft voting and hard voting

Ensemble methods combine the predictions of multiple individual classifiers to obtain a more accurate and robust classifier. Two commonly used combination schemes are soft voting and hard voting.

  • Hard voting means that the predictions of multiple classifiers are combined by majority rule: the class that receives the most votes is selected as the final prediction. This method works directly on the discrete class labels output by each classifier.
    For example, if three classifiers predict the label of a sample as A, B, and B respectively, hard voting determines the final prediction to be B.
  • Soft voting takes a weighted average of the class probabilities predicted by the classifiers and selects the class with the highest average score as the final prediction. The weight of each classifier can be set according to its performance on the training set. This approach requires each classifier to output class probabilities. A small numeric sketch of both rules follows this list.
    For example, suppose three classifiers predict the probability that a sample belongs to class B as 0.4, 0.5, and 0.6 respectively. Soft voting computes B's score as (0.4×w1 + 0.5×w2 + 0.6×w3)/(w1 + w2 + w3), where w1, w2, and w3 are the classifier weights, and the class with the highest averaged score is chosen as the final prediction.
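
As a small numeric sketch of both rules (the probabilities and equal weights below are illustrative, not taken from a real model):

import numpy as np

# Per-classifier probabilities for one sample over two classes (0 = A, 1 = B); illustrative numbers
probs = np.array([[0.70, 0.30],   # classifier 1 predicts A
                  [0.40, 0.60],   # classifier 2 predicts B
                  [0.45, 0.55]])  # classifier 3 predicts B
weights = np.array([1.0, 1.0, 1.0])  # illustrative equal weights

# Hard voting: each classifier casts one vote for its most probable class, majority wins
hard_votes = probs.argmax(axis=1)             # [0, 1, 1]
hard_pred = np.bincount(hard_votes).argmax()  # -> 1 (class B)

# Soft voting: weighted average of the probabilities, then take the highest-scoring class
soft_scores = np.average(probs, axis=0, weights=weights)  # [0.517, 0.483]
soft_pred = soft_scores.argmax()              # -> 0 (class A)

print("hard:", hard_pred, "soft:", soft_pred, soft_scores)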

Dataset construction

This time we again use the Iris classification data.

import matplotlib
import matplotlib.pyplot as plt
import mplcyberpunk
import warnings
warnings.filterwarnings('ignore')
plt.style.use('cyberpunk')
import numpy as np
np.random.seed(24)
#%%
from sklearn.datasets import load_iris
from sklearn.model_selection  import train_test_split

iris=load_iris()
iris

X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target)

Model building

We will use logistic regression, random forest, and SVM as our base classifiers, and VotingClassifier as the voting ensemble:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,VotingClassifier


log_clf=LogisticRegression(random_state=42)
rnd_clf=RandomForestClassifier(random_state=42)
svm_clf=SVC(random_state=42)

Hard voting

We test the accuracy of the individual classifiers and of the ensemble:

voting_vlf=VotingClassifier(estimators=[
    ('lr',log_clf),('rg',rnd_clf),('svs',svm_clf)],
    voting='hard'  # 'hard' for hard voting, 'soft' for soft voting
    )

voting_vlf.fit(X_train,y_train)
#%%
from sklearn.metrics import accuracy_score


for clf in (log_clf,rnd_clf,svm_clf,voting_vlf):
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    print(clf.__class__.__name__,accuracy_score(y_test,y_pred))

Because this dataset is easy and all the classifiers reach high accuracy, it is hard to see a difference between the individual models and the ensemble.

Soft voting

svm_clf=SVC(random_state=24,probability=True)
voting_vlf_soft=VotingClassifier(estimators=[
    ('lr',log_clf),('rg',rnd_clf),('svs',svm_clf)],
    voting='soft',  # 'hard' for hard voting, 'soft' for soft voting
    )

voting_vlf_soft.fit(X_train,y_train)

for clf in (log_clf,rnd_clf,svm_clf,voting_vlf_soft):
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    print(clf.__class__.__name__,accuracy_score(y_test,y_pred))


The results are the same as above, but on harder datasets soft voting is often better than hard voting, because it uses the full class probabilities rather than discrete votes.
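
One way to see what the soft voter actually averages is to inspect its class probabilities; this is a sketch using the voting_vlf_soft fitted above, and predict_proba is only available here because voting='soft' and SVC was created with probability=True:

# Inspect the averaged class probabilities behind the soft vote for a few test samples
proba = voting_vlf_soft.predict_proba(X_test[:3])
print(proba)                                 # averaged probabilities per class
print(proba.argmax(axis=1))                  # index of the highest-scoring class
print(voting_vlf_soft.predict(X_test[:3]))   # the soft-voted labels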

Summary

This article briefly introduced ensemble algorithms, compared several common ensemble methods, and covered soft voting and hard voting. The next article will introduce Bagging, Boosting, and Stacking in more detail.

My knowledge is limited, so if there is any error in the above content, please correct me.

I will keep working hard, and I hope for your continued support.


Origin blog.csdn.net/qq_61260911/article/details/130238057