Ensemble Methods in Python

Ensemble Methods

Agenda

  1. Introduction to Ensemble Methods
  2. RandomForest
  3. AdaBoost
  4. GradientBoostingTree
  5. VotingClassifier

Introduction to Ensemble Methods

  • The objective of ensemble methods is to combine the predictions of several base estimators (Linear Regression, Decision Tree, etc.) to create a combined effect or a more generalized model.
  • Two types of ensemble method (see the sketch after this list):
    • Averaging Method : build several estimators independently and average their predictions. Example: RandomForest.
    • Boosting Method : base estimators are built sequentially using weighted versions of the data, i.e. later models focus on the data that was misclassified. Example: AdaBoost.
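A minimal sketch of the two families, assuming a synthetic dataset from make_classification purely for illustration (the dataset and parameter values here are not from the sections below):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Averaging method: trees are built independently and their predictions are combined by voting/averaging
averaging = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting method: estimators are built sequentially on re-weighted versions of the data
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print(cross_val_score(averaging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())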

RandomForest

  • Recap - a limitation of a single decision tree is that it overfits & shows high variance.
  • RandomForest is an averaging ensemble method whose prediction is a function of the predictions of ‘n’ decision trees.

Algorithm

  • The data consists of R rows & M features.
  • A bootstrap sample of the training data is taken.
  • A random subset of the features is selected.
  • The configured number of trees is built by repeating the above two steps.
  • The final prediction for classification is the majority vote.
  • The final prediction for regression is the mean/median of the individual tree predictions (a toy sketch of this recipe follows the list).
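A toy, hand-written sketch of this recipe, for illustration only. It simplifies the real algorithm: it draws one random feature subset per tree, whereas RandomForestClassifier re-samples features at every split, and it assumes integer class labels. toy_random_forest is a hypothetical helper, not part of sklearn.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_random_forest(trainX, trainY, testX, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    R, M = trainX.shape
    m = max(1, int(np.sqrt(M)))                        # features used per tree
    votes = []
    for _ in range(n_trees):
        rows = rng.integers(0, R, size=R)              # bootstrap sample of rows
        cols = rng.choice(M, size=m, replace=False)    # random subset of features
        tree = DecisionTreeClassifier().fit(trainX[rows][:, cols], trainY[rows])
        votes.append(tree.predict(testX[:, cols]))
    votes = np.array(votes, dtype=int)
    # classification: majority vote across the n_trees predictions
    return np.array([np.bincount(col).argmax() for col in votes.T])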

Comparing Decision Tree & Random Forest on the digits data

from sklearn.datasets import load_digits
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the 8x8 digits data and split into train/test sets
digits = load_digits()
X = digits.data
y = digits.target
trainX, testX, trainY, testY = train_test_split(X,y)

# Baseline: a single decision tree
dt = DecisionTreeClassifier()
dt.fit(trainX,trainY)
dt.score(testX,testY)
0.8666666666666667
rf = RandomForestClassifier()
rf.fit(trainX,trainY)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
rf.score(testX,testY)
0.9422222222222222

Important Hyper-parameters

  • n_estimators : number of trees to build; larger is generally better, but costs more compute.
  • max_features : maximum number of features considered when splitting a node. For classification the default is sqrt(n_features); for regression, max_features = n_features.
  • n_jobs : set to -1 to make use of all CPU cores (a short configuration sketch follows the list).
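A minimal configuration sketch using these hyper-parameters; the specific values are illustrative assumptions, not tuned choices:

from sklearn.ensemble import RandomForestClassifier

rf_tuned = RandomForestClassifier(
    n_estimators=200,     # more trees is generally better, at higher compute cost
    max_features='sqrt',  # features considered at each split (classification default)
    n_jobs=-1,            # use all available CPU cores
    random_state=0)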

Advantages

  • Requires minimal data cleaning or handling of missing values.
  • Works well with high-dimensional datasets.
  • Reduces variance substantially compared to an individual decision tree.
  • RandomForest can report the importance of each feature. We can find the important features & use them in model training (see feature_importances_ below).
rf.feature_importances_
array([0.        , 0.00543958, 0.02560746, 0.00800673, 0.00895759,
       0.01389844, 0.01580389, 0.00047989, 0.00015724, 0.01136254,
       0.02286808, 0.00644846, 0.01348737, 0.03369562, 0.00533938,
       0.00039563, 0.        , 0.0069135 , 0.0235346 , 0.02540173,
       0.02776872, 0.04000905, 0.01740614, 0.00099215, 0.        ,
       0.01306952, 0.05192474, 0.02896724, 0.03092515, 0.01844088,
       0.03358791, 0.        , 0.        , 0.02351313, 0.01904516,
       0.02066056, 0.03877189, 0.01769953, 0.02996979, 0.        ,
       0.        , 0.01198491, 0.04506319, 0.03494813, 0.02430869,
       0.02114026, 0.01186137, 0.        , 0.        , 0.00192081,
       0.01603356, 0.01685823, 0.01250104, 0.02597298, 0.0210745 ,
       0.00209447, 0.        , 0.0018123 , 0.0201196 , 0.01684666,
       0.02274341, 0.02801594, 0.0221781 , 0.00197252])
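A minimal sketch of using these importances for feature selection, continuing from the variables above (rf, trainX, trainY, testX, testY); keeping the top 20 features is an arbitrary, illustrative cut-off:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Indices of the 20 most important pixels, highest importance first
top = np.argsort(rf.feature_importances_)[::-1][:20]

# Retrain on the reduced feature set and check the test score
rf_small = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf_small.fit(trainX[:, top], trainY)
rf_small.score(testX[:, top], testY)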

AdaBoost

  • Boosting in general is about building a model from the training data, then creating a second model that attempts to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models has been added.
  • AdaBoost was the first practical boosting algorithm.
  • AdaBoost can be used for both classification & regression.

Algorithm

  • The core concept of AdaBoost is to fit weak learners (like shallow decision trees) sequentially on repeatedly re-weighted versions of the data.
  • Initially, each sample is assigned an equal weight.
  • A base estimator is fitted to this weighted data.
  • Weights of misclassified samples are increased & weights of correctly classified samples are decreased.
  • The above two steps are repeated until all samples are correctly classified or the configured maximum number of iterations is reached.
  • Making predictions : the predictions from all estimators are combined through a weighted majority vote (or sum) to produce the final prediction. A toy sketch of this loop follows the list.
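A toy sketch of that re-weighting loop for binary labels in {-1, +1}. It follows the classic discrete AdaBoost recipe rather than the SAMME.R variant sklearn uses by default, and toy_adaboost / toy_adaboost_predict are hypothetical helpers:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_adaboost(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with equal sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()       # weighted error of this stump
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight correct samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def toy_adaboost_predict(stumps, alphas, X):
    # weighted vote: sign of the alpha-weighted sum of stump predictions
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))

The sklearn AdaBoostClassifier used below handles the multi-class case and lets us plug in stronger base estimators, such as depth-8 trees or even a RandomForest.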
from sklearn.ensemble import AdaBoostClassifier

ab = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=8),n_estimators=600)
ab.fit(trainX,trainY)
AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight=None,
                                                         criterion='gini',
                                                         max_depth=8,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=None,
                                                         splitter='best'),
                   learning_rate=1.0, n_estimators=600, random_state=None)
ab.score(testX,testY)
0.9822222222222222
ab = AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=20),n_estimators=600)
ab.fit(trainX,trainY)
AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=RandomForestClassifier(bootstrap=True,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features='auto',
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         n_estimators=20,
                                                         n_jobs=None,
                                                         oob_score=False,
                                                         random_state=None,
                                                         verbose=0,
                                                         warm_start=False),
                   learning_rate=1.0, n_estimators=600, random_state=None)
ab.score(testX,testY)
0.9666666666666667

GradientBoostingTree

  • A machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
  • One of the basic assumptions of linear regression is that the sum of its residuals is 0.
  • These residuals can be seen as the mistakes committed by our prediction model.
  • Tree-based models are not built on any such assumption, but if the sum of residuals is not 0, there is most likely some pattern in the residuals that can be leveraged to make the model better.
  • So, the intuition behind the gradient boosting algorithm is to leverage the pattern in the residuals and strengthen a weak prediction model until the residuals no longer show any pattern.
  • Algorithmically, we are minimizing our loss function so that the test loss reaches its minimum. A toy sketch of this residual-fitting loop follows the list.
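A toy sketch of that residual-fitting loop for squared-error regression; toy_gradient_boosting and toy_gb_predict are hypothetical helpers, and GradientBoostingRegressor used below is the real implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
    base = y.mean()                                # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                       # the pattern left in the errors
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)    # add a shrunken correction
        trees.append(tree)
    return base, trees

def toy_gb_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)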

Problem : House Price Prediction using GradientBoostingTree

from sklearn.datasets import load_boston
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.ensemble import GradientBoostingRegressor
house_data = load_boston()
X = house_data.data
y = house_data.target
gbt = GradientBoostingRegressor()
gbt
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(X,y)
gbt.fit(trainX,trainY)
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
# staged_predict yields the test-set prediction after each of the 100 boosting stages,
# so we can track how the test loss evolves as trees are added
test_score = np.zeros(100, dtype=np.float64)
for i, y_pred in enumerate(gbt.staged_predict(testX)):
    test_score[i] = gbt.loss_(testY, y_pred)
plt.plot(test_score)
plt.xlabel('Iterations')
plt.ylabel('Least squares Loss')
Text(0, 0.5, 'Least squares Loss')

[Figure: test-set least squares loss vs. boosting iterations]

VotingClassifier

  • The core concept of VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or a weighted average of predicted probabilities to predict the class labels.
  • A voting classifier is quite effective when built from good estimators, since it compensates for the limitations of the individual models; ensemble methods can also participate as base estimators.
  • Types of voting
    • Hard voting : the predicted class label is the majority vote of the individual classifiers’ predicted labels; all estimators count equally.
    • Soft voting : the predicted class label comes from the (optionally weighted) average of the predicted class probabilities, so different estimators can be given different weights.

Problem : DIGIT identification using VotingClassifier

from sklearn.ensemble import VotingClassifier,RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
estimators = [ 
    ('rf',RandomForestClassifier(n_estimators=20)),
    ('svc',SVC(kernel='rbf', probability=True)),
    ('knc',KNeighborsClassifier()),
    ('abc',AdaBoostClassifier(base_estimator=DecisionTreeClassifier() ,n_estimators=20)),
    ('lr',LogisticRegression()) 
]
vc = VotingClassifier(estimators=estimators, voting='hard')
digits = load_digits()
X,y = digits.data, digits.target
trainX, testX, trainY, testY = train_test_split(X,y)
vc.fit(trainX,trainY)
vc.score(testX,testY)
0.9777777777777777
# Score each fitted base estimator individually to compare with the ensemble
for est, name in zip(vc.estimators_, vc.estimators):
    print(name[0], est.score(testX,testY))
rf 0.9577777777777777
svc 0.49777777777777776
knc 0.9822222222222222
abc 0.8511111111111112
lr 0.9488888888888889
vc = VotingClassifier(estimators=estimators, voting='soft', weights=[2,.1,3,2,2])
vc.fit(trainX,trainY)
vc.score(testX,testY)
0.98