10. Ensemble Learning and Random Forests

1. What is Ensemble Learning?

What is ensemble learning? Up to now we have always used a single algorithm to make predictions, which inevitably feels a bit "autocratic". Ensemble learning combines multiple algorithms: each algorithm predicts on the same problem, and the final answer is decided by majority vote. That is ensemble learning.

There are many examples of ensemble learning in daily life, such as shopping recommendations: if 10 people recommend product A and only one person recommends product B, we will be more inclined to buy product A.

Let's look at the ensemble learning interface that sklearn provides for us.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier  # ensemble is the module for ensemble-learning related classes

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# pass in a list of (name, classifier) tuples, a bit like Pipeline
voting_clf = VotingClassifier([
    ("log_clf", LogisticRegression()),
    ("svm_clf", SVC()),
    ("dt_clf", DecisionTreeClassifier())
], voting="hard")  # voting="hard" means simple majority rule

voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # 0.888

2. Soft Voting Classifier

What is a Soft Voting Classifier? In the ensemble above we voted by simple majority rule. But the majority is not always reasonable; the so-called "tyranny of the majority" can occur. A more reasonable approach is to give different algorithms different weights when they vote.

Take economic policy as an example: an economist's recommendation should be taken fully into account, i.e. their vote should carry more weight. An ordinary person, even a great space scientist, has nothing to do with economics, so their vote should not be given a large weight. This is easy to understand.

Let's look at a concrete example.

Suppose five models make a prediction: three of them predict B and two predict A, so by 2-versus-3 hard voting the final result is B. But look closer and you can spot a clue: the two models that predict A are very confident, while the three that predict B are not. So this vote is unfair. This may remind you of k-nearest neighbors: at the very beginning we also used a simple vote and did not take distance into account. Ensemble learning is the same: if an algorithm is more certain about its prediction, its vote should be given a larger weight.

With soft voting we instead use the average of the predicted probabilities: averaging the five models' probabilities of A gives 0.616, and of B gives 0.384, so the classification result is A.
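A minimal sketch of that calculation. The five individual probabilities below are assumptions chosen so that they average to the 0.616 / 0.384 above:

import numpy as np

# hypothetical P(A) for each of the five models (illustrative numbers)
p_a = np.array([0.99, 0.49, 0.40, 0.90, 0.30])
p_b = 1 - p_a

# hard voting: only 2 of the 5 models give P(A) > 0.5, so the majority picks B
print((p_a > 0.5).sum(), "votes for A,", (p_b > 0.5).sum(), "votes for B")

# soft voting: average the probabilities instead -> about 0.616 vs 0.384, so pick A
print(p_a.mean(), p_b.mean())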

Therefore, using soft voting requires that every model in the ensemble can estimate probabilities. Logistic regression is itself built on a probability model, so it has predict_proba. kNN can obviously do it too: with k = 3, if two of the neighbors are red and one is blue, the probability of red is 2/3. A decision tree is the same: once a sample reaches a leaf node, the class proportions of the training data in that leaf give the probabilities. SVM can do it as well, although with more difficulty; sklearn handles it for us at the cost of some extra computation.
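A quick sketch of predict_proba in action (the printed numbers depend on the data):

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)

log_clf = LogisticRegression().fit(X, y)
# one row per sample, one column per class; each row sums to 1
print(log_clf.predict_proba(X[:3]))

With that in place, the soft-voting ensemble itself looks like this: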

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier  # ensemble is the module for ensemble-learning related classes

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# pass in a list of (name, classifier) tuples, a bit like Pipeline
voting_clf = VotingClassifier([
    ("log_clf", LogisticRegression()),
    ("svm_clf", SVC(probability=True)),  # SVC does not estimate probabilities by default; probability=True enables it
    ("dt_clf", DecisionTreeClassifier())
], voting="soft")  # voting="soft" averages the predicted probabilities instead of using a simple majority

voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # 0.896

3. Bagging and Pasting

Although there are many machine learning methods, from the point of view of voting they are still not enough. So we want to create more sub-models and aggregate the opinions of many sub-models. Note also that the sub-models must not all be the same; there need to be differences between them.

So how do we create those differences? One approach is to let each sub-model only look at a portion of the data. For example, with 500 samples, each sub-model looks at only 100 of them; even though the algorithm is the same, the different samples create differences. But doesn't that make each sub-model's accuracy lower? Indeed it does, yet although a single sub-model's accuracy is relatively low, combining many sub-models makes the overall accuracy high. That is the power of ensemble learning.

So each sub-model only looks at part of the sample data; how should that part be drawn? Here, too, we have two ways.

  • Bagging: sampling with replacement
  • Pasting: sampling without replacement

Bagging is used more often: because each draw is put back, we can create as many sub-models as we like and are not so dependent on how one particular random split happens to fall. (For comparison, a pasting sketch is shown right after the Bagging example below.)

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=222)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=222)

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # pass in a base classifier; usually we choose a decision tree
    n_estimators=500,  # how many sub-models to create
    max_samples=100,  # how many samples each sub-model looks at
    bootstrap=True  # whether to sample with replacement; True means with replacement, so sklearn has only one Bagging class and uses bootstrap to switch between bagging and pasting
)

bagging_clf.fit(X_train, y_train)
print(bagging_clf.score(X_test, y_test))  # 0.944
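For comparison, a minimal pasting sketch only flips bootstrap to False (sampling without replacement); everything else stays the same:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=222)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=222)

pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False  # sampling without replacement, i.e. pasting
)
pasting_clf.fit(X_train, y_train)
print(pasting_clf.score(X_test, y_test))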

4. OOB (Out-of-Bag) and More Discussion of Bagging

With sampling with replacement, each draw only picks part of the samples, so some samples may never be selected at all. It can be shown mathematically that on average about 37% of the samples are never drawn. These samples are called out-of-bag, meaning they were never taken.
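Where does the 37% come from? If we draw m times with replacement from m samples, the probability that a particular sample is never drawn is (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows. A quick numeric check:

import math

for m in (100, 1000, 10000):
    print(m, (1 - 1 / m) ** m)  # approaches 1/e

print(1 / math.e)  # about 0.368, i.e. roughly 37%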

That being the case, there is no need to use train_test_split at all: roughly 37% of the data was never seen during training, so we can test directly on those samples. sklearn also encapsulates this for us as oob_score_, which lets us look at the score directly.

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=222)

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True  # record which samples were drawn, so the unused (out-of-bag) samples can be used for scoring afterwards
)

# train directly on all of the data
bagging_clf.fit(X, y)
# the trailing underscore means the attribute is computed during fitting rather than passed in by the user, and can be inspected afterwards
print(bagging_clf.oob_score_)  # 0.902

Bagging also lends itself very well to parallelization; use the n_jobs parameter.
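For example, a sketch that adds n_jobs=-1 to the classifier above so the 500 sub-models are trained on all CPU cores:

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=222)

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1  # use all available CPU cores to train the sub-models in parallel
)
bagging_clf.fit(X, y)
print(bagging_clf.oob_score_)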

Besides sampling the training samples, we can also sample the features, i.e. each sub-model only looks at part of the features. For data with a very large number of features, such as image pixels, random feature sampling is useful; it is controlled with the max_features and bootstrap_features parameters. We can also randomly sample the samples and the features at the same time.
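A sketch of both variants: Random Subspaces samples only the features, Random Patches samples both the samples and the features. make_moons has only two features, so max_features=1 here is purely illustrative:

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=222)

# Random Subspaces: every sub-model sees all samples but only a random subset of features
random_subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=500,          # every sub-model sees all 500 samples
    bootstrap=True,
    oob_score=True,
    max_features=1,           # each sub-model looks at 1 of the 2 features
    bootstrap_features=True   # sample the features with replacement
)
random_subspaces_clf.fit(X, y)
print(random_subspaces_clf.oob_score_)

# Random Patches: sample both the samples and the features
random_patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    max_features=1,
    bootstrap_features=True
)
random_patches_clf.fit(X, y)
print(random_patches_clf.oob_score_)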

5. Extra-Trees and Random Forests

In the ensembles above, the base models were all decision trees. These trees are random, and when there are many of them, what do we get? Right, there is a vivid name for this: a random forest.

And sklearn has already wrapped up the random forest algorithm for us.

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)

rf_clf = RandomForestClassifier(n_estimators=500,  # how many decision trees
                                random_state=666,  # the trees are built randomly, so fix the seed for reproducibility
                                oob_score=True,
                                n_jobs=-1
                                )
# train directly on all of the data
rf_clf.fit(X, y)
# again, the trailing underscore means the attribute is computed during fitting, not passed in by the user
print(rf_clf.oob_score_)  # 0.892

Besides random forests there are also Extra-Trees, short for "extremely randomized trees" (Trees is plural, meaning many trees). When splitting a node, they use random features and random thresholds. This extra randomness suppresses overfitting but increases the bias.

from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)

rf_clf = ExtraTreesClassifier(n_estimators=500,  # how many decision trees
                              random_state=666,  # the trees are built randomly, so fix the seed
                              oob_score=True,
                              bootstrap=True,
                              n_jobs=-1
                              )
rf_clf.fit(X, y)
print(rf_clf.oob_score_)  # 0.892

Of course, besides classification problems, these models can also solve regression problems.
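The regressor counterparts live in the same module. A minimal sketch on a toy regression dataset (the dataset here is an illustrative assumption, not from the original post):

from sklearn.datasets import make_regression  # a toy regression dataset, used only for illustration
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

X, y = make_regression(n_samples=500, noise=10, random_state=666)

rf_reg = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=666, n_jobs=-1)
rf_reg.fit(X, y)
print(rf_reg.oob_score_)  # oob_score_ is the R^2 on the out-of-bag samples

et_reg = ExtraTreesRegressor(n_estimators=500, bootstrap=True, oob_score=True, random_state=666, n_jobs=-1)
et_reg.fit(X, y)
print(et_reg.oob_score_)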

6. Ada Boosting and Gradient Boosting

Boosting also combines multiple models, and each model tries to boost (enhance) the overall performance. Here is an example.

This is how Ada Boosting works. First we train a model and make predictions; some points will inevitably be predicted wrong. The points that were predicted poorly are given larger weights, so they stand out in the new training data, and the next model trains and predicts on it. For the points that this model in turn predicts poorly, the corresponding weights are increased again. Since there will always be mistakes, the same step is simply repeated. In this way every model keeps boosting the overall performance, and that is Ada Boosting.

我们看看sklearn中的Ada Boosting

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                            n_estimators=500
                            )
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
ada_clf.fit(X_train, y_train)

print(ada_clf.score(X_test, y_test))  # 0.864

The principle of Gradient Boosting: train a model m1, which leaves an error e1; train a model m2 on e1, which leaves an error e2; train m3 on e2, and so on. The final prediction is m1 + m2 + m3 + ..., so every round of training compensates for the errors left by the previous rounds.
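A minimal hand-rolled sketch of that idea, fitting regression trees to the residuals on a toy 1-D problem (an illustration of the principle, not sklearn's actual implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# a toy 1-D regression problem (purely illustrative)
np.random.seed(666)
X = np.random.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + np.random.normal(0, 0.5, size=100)

# m1: fit the targets
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
e1 = y - tree1.predict(X)   # error left by m1

# m2: fit the error of m1
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, e1)
e2 = e1 - tree2.predict(X)  # error left by m1 + m2

# m3: fit the error of m1 + m2
tree3 = DecisionTreeRegressor(max_depth=2).fit(X, e2)

# final prediction: m1 + m2 + m3
y_pred = tree1.predict(X) + tree2.predict(X) + tree3.predict(X)
print(np.mean((y - y_pred) ** 2))  # the training error shrinks as trees are added

sklearn's GradientBoostingClassifier packages this process (and much more) for us: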

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)

gb_clf = GradientBoostingClassifier(max_depth=2,  # no base classifier needs to be specified; decision trees are used as the base learners by default
                                     n_estimators=500
                                     )
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
gb_clf.fit(X_train, y_train)

print(gb_clf.score(X_test, y_test))  # 0.896

All of these ensemble algorithms can also solve regression problems.
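The boosting algorithms have regressor versions too; a sketch on the same kind of toy data as above (the dataset is an illustrative assumption):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=500, noise=10, random_state=666)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

ada_reg = AdaBoostRegressor(n_estimators=500).fit(X_train, y_train)
print(ada_reg.score(X_test, y_test))  # R^2 on the test set

gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=500).fit(X_train, y_train)
print(gb_reg.score(X_test, y_test))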

7. Stacking

Remember the Voting Classifier from before? Several algorithms each make a prediction, and those results are combined into a final answer. Stacking is different: instead of combining the three results directly, it uses them as the inputs of another layer, and the model in that layer makes the final prediction. That is Stacking. And Stacking can solve not only classification problems but regression problems as well.

If we want something more complex, we can use three layers. That means splitting the dataset into three parts: the first part trains the first layer, the second part trains the second layer, and the third part trains the third layer. At this point it already looks a bit like a neural network. sklearn did not provide a Stacking model here (newer versions do ship a StackingClassifier); if you are interested, you can implement a simple version yourself, as in the sketch below.
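A minimal two-layer stacking sketch (the split sizes, base models, and meta-model below are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=666)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# split the training data again: one part for the first layer, one part for the second layer
X_layer1, X_layer2, y_layer1, y_layer2 = train_test_split(X_train, y_train, random_state=666)

# first layer: three base models trained on the first part
base_models = [
    LogisticRegression(),
    SVC(probability=True),
    DecisionTreeClassifier(),
]
for model in base_models:
    model.fit(X_layer1, y_layer1)

def layer1_features(X):
    # the base models' predicted probabilities become the input of the second layer
    return np.column_stack([model.predict_proba(X)[:, 1] for model in base_models])

# second layer: a meta-model trained on the first layer's outputs
meta_model = LogisticRegression()
meta_model.fit(layer1_features(X_layer2), y_layer2)

print(meta_model.score(layer1_features(X_test), y_test))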
