Ensemble learning combines the predictions of two or more models. It is a branch of machine learning that is often used in pursuit of stronger predictive performance.
Modern machine learning libraries such as scikit-learn and XGBoost already implement common ensemble learning methods internally, and ensemble learning is widely used by top and winning participants in machine learning competitions.
Introduction to Ensemble Learning
Ensemble learning trains multiple different models and then combines their outputs into a single prediction. An ensemble often performs better than any single model.
Common ensemble learning techniques fall into three categories:
- Bagging, e.g. Bagged Decision Trees and Random Forest.
- Boosting, e.g. AdaBoost and Gradient Boosting.
- Stacking, e.g. Voting and using a meta-model.
Ensemble learning can reduce the variance of predictions while also achieving better performance than a single model.
Bagging
Bagging trains multiple models on different samples of the training dataset and then combines their predictions. The individual models' predictions can be combined by voting (classification) or averaging (regression).
The key to bagging is how the dataset is sampled. The common approach is to sample along the row (sample) dimension, with replacement (bootstrap sampling), as in the sketch below.
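To illustrate the idea (a minimal sketch of bootstrap sampling, separate from how scikit-learn implements it internally), row indices can be drawn with replacement using numpy:
import numpy as np
# a toy dataset of 10 rows and 1 feature
rng = np.random.default_rng(1)
X = np.arange(10).reshape(10, 1)
# draw row indices with replacement: some rows repeat, others are left out
idx = rng.integers(0, len(X), size=len(X))
X_boot = X[idx]
print(idx)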
Bagging is available via BaggingClassifier and BaggingRegressor. By default they use a decision tree as the base model, and the n_estimators parameter specifies the number of trees to create.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the bagging model
model = BaggingClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
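cross_val_score only estimates accuracy; to actually use the ensemble, fit it and call predict as with any scikit-learn model (a minimal sketch reusing X, y, and model from above):
# fit on the full dataset and predict for the first few rows
model.fit(X, y)
print(model.predict(X[:5]))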
Random Forest
Random Forest combines Bagging with decision trees:
- A random forest fits a decision tree on each of several bootstrap samples of the training dataset.
- A random forest also samples the features (columns) of the dataset: instead of considering all features when choosing split points, each decision tree is restricted to a random subset of features.
Random forest ensembles are available in scikit-learn via the RandomForestClassifier and RandomForestRegressor classes. The n_estimators parameter specifies the number of trees to create, and the max_features parameter sets the number of randomly selected features to consider at each split point (see the sketch after the example below).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the random forest model
model = RandomForestClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
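The example above keeps the default feature sampling; max_features can also be set explicitly, for example to the square root of the number of features, a common choice for classification (a minimal sketch):
# consider a random subset of sqrt(n_features) features at each split
model = RandomForestClassifier(n_estimators=50, max_features='sqrt')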
AdaBoost
Boosting iteratively tries to correct the errors made by previous models: the more iterations, the fewer errors the ensemble makes, at least within the limits supported by the data and before overfitting the training dataset.
Boosting was originally developed as a theoretical idea, and AdaBoost was the first algorithm to successfully implement a boosting ensemble.
AdaBoost fits decision trees on weighted versions of the training dataset, so that each new tree pays more attention to the examples that previous members got wrong. AdaBoost does not use full decision trees, but very simple trees that make a single decision on one input variable before making a prediction. These short trees are called decision stumps.
AdaBoost is available via AdaBoostClassifier and AdaBoostRegressor. By default they use a decision stump as the base model, and the n_estimators parameter specifies the number of trees to create (a deeper base tree can be substituted, as sketched after the example below).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the adaboost model
model = AdaBoostClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
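A deeper base tree can be substituted for the default stump. This is a sketch, and the parameter name depends on the scikit-learn version (estimator from 1.2 onward, base_estimator in older releases):
from sklearn.tree import DecisionTreeClassifier
# replace the default stump with depth-2 trees (scikit-learn >= 1.2 API)
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=50)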
Gradient Boosting
Gradient Boosting is a framework for boosting ensemble algorithms and an extension of AdaBoost. It is defined as an additive model under a statistical framework, which allows the use of arbitrary loss functions (making it more flexible) and of loss penalties (shrinkage) to reduce overfitting.
Gradient Boosting can also incorporate Bagging-style operations, such as sampling the rows and columns of the training dataset; this variant is called stochastic gradient boosting.
Gradient Boosting is a very successful ensemble technique for structured (tabular) data, although fitting the model can be slow since the trees are added sequentially. More efficient implementations have been developed, such as XGBoost and LightGBM.
Gradient Boosting is available via GradientBoostingClassifier and GradientBoostingRegressor, which use a decision tree as the base model by default. The n_estimators parameter specifies the number of trees to create, and the learning_rate parameter controls each tree's contribution (see the sketch after the example below).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the gradient boosting model
model = GradientBoostingClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
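Both the learning rate and the row sampling that turns the method into stochastic gradient boosting can be set explicitly (a minimal sketch):
# shrink each tree's contribution and fit each tree on 80% of the rows
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, subsample=0.8)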
Voting
Voting uses simple statistics to combine the predictions from multiple models:
- Hard voting: take a majority vote over the predicted classes;
- Soft voting: average the predicted probabilities.
Voting is available via VotingClassifier and VotingRegressor, which take a list of base models as an argument. Each model in the list must be a tuple containing a name and the model itself.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# list of base models
models = [('lr', LogisticRegression()), ('nb', GaussianNB())]
# create the voting model
model = VotingClassifier(models, voting='soft')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
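The example uses soft voting; switching to hard voting, which takes a majority vote over the predicted classes, only requires changing the voting argument:
# majority vote over predicted class labels instead of averaged probabilities
model = VotingClassifier(models, voting='hard')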
Stacking
Stacking combines the predictions from many different types of base models, much like Voting. Unlike Voting, however, Stacking learns how to weight each model's predictions using held-out (validation) data.
Stacking needs to be used in conjunction with cross-validation. It is available via StackingClassifier and StackingRegressor, and the base models are provided as a parameter, together with a meta-model (final_estimator) that combines their predictions.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# list of base models
models = [('knn', KNeighborsClassifier()), ('tree', DecisionTreeClassifier())]
# create the stacking model with a logistic regression meta-model
model = StackingClassifier(estimators=models, final_estimator=LogisticRegression())
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
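After fitting, the trained meta-model is exposed as final_estimator_; with a LogisticRegression meta-model, its coefficients show how heavily each base model's predictions are weighted (a quick sketch reusing the objects above):
# fit the stacking ensemble and inspect the meta-model's learned weights
model.fit(X, y)
print(model.final_estimator_.coef_)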