Ensemble learning combines the predictions of two or more models. It is a branch of machine learning that is often used in pursuit of stronger predictive performance.
Modern machine learning libraries such as scikit-learn and XGBoost already implement common ensemble learning methods internally, and ensemble learning is widely used by top and winning participants in machine learning competitions.
Introduction to Ensemble Learning
Ensemble learning trains multiple different models and then combines their outputs into a single prediction. An ensemble often performs better than any single model.
Common ensemble learning techniques fall into three categories:
- Bagging, e.g. Bagged Decision Trees and Random Forest.
- Boosting, e.g. AdaBoost and Gradient Boosting.
- Stacking, e.g. Voting and using a meta-model.
Ensemble learning can reduce the variance of predictions while also achieving better performance than a single model.
Bagging
Bagging trains multiple models on different samples of the training dataset and then combines their predictions. The individual models' predictions can be combined by voting (classification) or averaging (regression).
The key to bagging is how the dataset is sampled. The common approach is to sample along the row (sample) dimension, with replacement (bootstrap sampling), as in the sketch below.
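To illustrate the idea (a minimal sketch of bootstrap sampling, separate from how scikit-learn implements it internally), row indices can be drawn with replacement using numpy:
import numpy as np
# a toy dataset of 10 rows and 1 feature
rng = np.random.default_rng(1)
X = np.arange(10).reshape(10, 1)
# draw row indices with replacement: some rows repeat, others are left out
idx = rng.integers(0, len(X), size=len(X))
X_boot = X[idx]
print(idx)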
Bagging is available via BaggingClassifier and BaggingRegressor. By default they use a decision tree as the base model, and the n_estimators parameter specifies the number of trees to create.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the bagging model
model = BaggingClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
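cross_val_score only estimates accuracy; to actually use the ensemble, fit it and call predict as with any scikit-learn model (a minimal sketch reusing X, y, and model from above):
# fit on the full dataset and predict for the first few rows
model.fit(X, y)
print(model.predict(X[:5]))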
Random Forest
Random Forest combines Bagging with decision trees:
- A random forest fits a decision tree on each of several bootstrap samples of the training dataset.
- A random forest also samples the features (columns) of the dataset: instead of considering all features when choosing split points, each decision tree is restricted to a random subset of features.
Random forest ensembles are available in scikit-learn via the RandomForestClassifier and RandomForestRegressor classes. The n_estimators parameter specifies the number of trees to create, and the max_features parameter sets the number of randomly selected features to consider at each split point (see the sketch after the example below).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the random forest model
model = RandomForestClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
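The example above keeps the default feature sampling; max_features can also be set explicitly, for example to the square root of the number of features, a common choice for classification (a minimal sketch):
# consider a random subset of sqrt(n_features) features at each split
model = RandomForestClassifier(n_estimators=50, max_features='sqrt')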
AdaBoost
Boosting iteratively tries to correct the errors made by previous models: the more iterations, the fewer errors the ensemble makes, at least within the limits supported by the data and before overfitting the training dataset.
Boosting was originally developed as a theoretical idea, and AdaBoost was the first algorithm to successfully implement a boosting ensemble.
AdaBoost fits decision trees on weighted versions of the training dataset, so that each new tree pays more attention to the examples that previous members got wrong. AdaBoost does not use full decision trees, but very simple trees that make a single decision on one input variable before making a prediction. These short trees are called decision stumps.
AdaBoost is available via AdaBoostClassifier and AdaBoostRegressor. By default they use a decision stump as the base model, and the n_estimators parameter specifies the number of trees to create (a deeper base tree can be substituted, as sketched after the example below).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the adaboost model
model = AdaBoostClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
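A deeper base tree can be substituted for the default stump. This is a sketch, and the parameter name depends on the scikit-learn version (estimator from 1.2 onward, base_estimator in older releases):
from sklearn.tree import DecisionTreeClassifier
# replace the default stump with depth-2 trees (scikit-learn >= 1.2 API)
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=50)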
Gradient Boosting
Gradient Boosting is a framework for boosting ensemble algorithms and an extension of AdaBoost. It is defined as an additive model under a statistical framework, which allows the use of arbitrary loss functions (making it more flexible) and of loss penalties (shrinkage) to reduce overfitting.
Gradient Boosting can also incorporate Bagging-style operations, such as sampling the rows and columns of the training dataset; this variant is called stochastic gradient boosting.
Gradient Boosting is a very successful ensemble technique for structured (tabular) data, although fitting the model can be slow since the trees are added sequentially. More efficient implementations have been developed, such as XGBoost and LightGBM.
Gradient Boosting is available via GradientBoostingClassifier and GradientBoostingRegressor, which use a decision tree as the base model by default. The n_estimators parameter specifies the number of trees to create, and the learning_rate parameter controls each tree's contribution (see the sketch after the example below).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# create the gradient boosting model
model = GradientBoostingClassifier(n_estimators=50)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
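Both the learning rate and the row sampling that turns the method into stochastic gradient boosting can be set explicitly (a minimal sketch):
# shrink each tree's contribution and fit each tree on 80% of the rows
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, subsample=0.8)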
Voting
Voting uses simple statistics to combine the predictions from multiple models:
- Hard voting: take a majority vote over the predicted classes;
- Soft voting: average the predicted probabilities.
Voting is available via VotingClassifier and VotingRegressor, which take a list of base models as an argument. Each model in the list must be a tuple containing a name and the model itself.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# list of base models
models = [('lr', LogisticRegression()), ('nb', GaussianNB())]
# create the voting model
model = VotingClassifier(models, voting='soft')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
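The example uses soft voting; switching to hard voting, which takes a majority vote over the predicted classes, only requires changing the voting argument:
# majority vote over predicted class labels instead of averaged probabilities
model = VotingClassifier(models, voting='hard')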
Stacking
Stacking combines the predictions from many different types of base models, much like Voting. Unlike Voting, however, Stacking learns how to weight each model's predictions using held-out (validation) data.
Stacking needs to be used in conjunction with cross-validation. It is available via StackingClassifier and StackingRegressor, and the base models are provided as a parameter, together with a meta-model (final_estimator) that combines their predictions.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# create a synthetic classification dataset
X, y = make_classification(random_state=1)
# list of base models
models = [('knn', KNeighborsClassifier()), ('tree', DecisionTreeClassifier())]
# create the stacking model with a logistic regression meta-model
model = StackingClassifier(estimators=models, final_estimator=LogisticRegression())
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model accuracy
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
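After fitting, the trained meta-model is exposed as final_estimator_; with a LogisticRegression meta-model, its coefficients show how heavily each base model's predictions are weighted (a quick sketch reusing the objects above):
# fit the stacking ensemble and inspect the meta-model's learned weights
model.fit(X, y)
print(model.final_estimator_.coef_)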