Ensemble learning and random forests -- Python

Ensemble learning

If you ask a complex question to thousands of people and then aggregate their answers, the aggregated answer is usually better than an expert's answer. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), the aggregate result is usually better than that of the best single predictor. Such a group of predictors is called an ensemble, the technique is called ensemble learning, and an ensemble learning algorithm is called an ensemble method.

Common ensemble methods include the following: bagging, boosting, and stacking.

Voting classifier

Suppose you have already trained several classifiers, such as a logistic regression classifier, an SVM, a random forest, and a k-nearest neighbors classifier. To get better results, the simplest approach is to aggregate the predictions of each classifier and take the majority vote as the final prediction; this is known as a hard voting classifier. The result obtained this way is generally better than that of any single classifier, turning several weak learners into a strong learner. At the same time, the predictors should be as independent of each other as possible, which gives better results: it ensures that they make different types of errors, thereby improving accuracy.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Build the moons dataset and split it into training and test sets
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
svc_clf = SVC()
rnd_clf = RandomForestClassifier()

# Hard voting: each classifier gets one vote and the majority class wins
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svc_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.888
VotingClassifier 0.88

If all classifiers can estimate class probabilities, you can instead average the probabilities over all classifiers and predict the class with the highest average probability; this method is called soft voting. It generally performs better than hard voting because it gives more weight to highly confident votes. SVC does not expose class probabilities by default, so you need to set its hyperparameter

SVC(probability=True)
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912

These results are noticeably better than hard voting.
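A minimal sketch of the soft-voting setup that produces numbers like those above, reusing the imports and the train/test split from earlier (exact scores vary with library version and random seed):

# Soft voting: average the predicted class probabilities across classifiers.
# SVC needs probability=True so that predict_proba is available.
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svc_clf = SVC(probability=True)

voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svc_clf)],
    voting='soft')
voting_clf_soft.fit(X_train, y_train)
print(accuracy_score(y_test, voting_clf_soft.predict(X_test)))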

Bagging and pasting

One way to get a diverse set of classifiers is to use different training algorithms. Another approach is to use the same algorithm for every predictor but train each one on a different random subset of the training set. When sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating); when sampling is performed without replacement, it is called pasting.

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows the same instance to be sampled several times by the same predictor.

Bagging and pasting in sklearn

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 decision trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

To use pasting instead, simply set bootstrap=False.
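For reference, a sketch of the pasting version of the same ensemble, reusing the classes already imported (exact scores vary between runs):

# Pasting: sample training instances without replacement
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                             max_samples=100, bootstrap=False, n_jobs=-1)
past_clf.fit(X_train, y_train)
print(accuracy_score(y_test, past_clf.predict(X_test)))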

The bagging accuracy here was about 0.92, not much different from the pasting result of 0.912. Now look at how a single decision tree performs:

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

The result is 0.856; the gap between a single decision tree and an ensemble of 500 trees shows the benefit of ensemble learning.

Out-of-bag evaluation

With bagging, some training instances may be sampled several times for a given predictor, while others are never sampled at all. The instances that are never sampled are called out-of-bag (OOB) instances; on average only about 63% of the training instances are sampled for each predictor, and the remaining ~37% are OOB.

BaggingClassifier(oob_score=True)

With this setting, an out-of-bag evaluation is performed automatically after training.

bag_clf=BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

Results: 0.9013333333333333

The result on the test set is shown below:

from sklearn.metrics import  accuracy_score
y_pred=bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

The result is 0.904, very close to the out-of-bag estimate.

Random Forests

Having understood the bagging algorithm, random forests (Random Forest, hereafter RF) are easy to understand. RF is an evolution of the bagging algorithm: its underlying idea is still bagging, but with some unique improvements. Let us look at what improvements the RF algorithm makes.

    First, RF uses CART decision trees as its weak learners, which brings to mind gradient boosted trees (GBDT). Second, RF improves how each decision tree is built: an ordinary decision tree chooses the optimal feature among all n features at a node to split into left and right subtrees, whereas RF randomly selects only a portion of the features at that node, some number nsub smaller than n, and then chooses the optimal feature among these nsub randomly selected features to perform the split. This further improves the generalization ability of the model.

    If nsub = n, then the CART decision trees in RF are no different from ordinary CART decision trees. The smaller nsub is, the more robust the model becomes, but the fit on the training set gets worse; in other words, a smaller nsub lowers the model's variance but raises its bias. In practice a suitable value of nsub is usually found by cross-validation (in sklearn this corresponds to the max_features hyperparameter; a sketch follows the algorithm summary below).

    Apart from these two points, RF is no different from the ordinary bagging algorithm. Below is a brief summary of the RF algorithm.

    Input: a sample set D = {(x1, y1), (x2, y2), ..., (xm, ym)} and the number of weak classifier iterations T.

    Output: the final strong classifier f(x).

    1) For t = 1, 2, ..., T:

      a) Randomly sample the training set m times with replacement to obtain a bootstrap sample set Dt containing m samples.

      b) Train the t-th decision tree model Gt(x) on the sample set Dt. At each node of the tree, randomly select a subset of the features, and among those randomly chosen features pick the optimal one to split the node into left and right subtrees.

    2) For classification, the T weak learners vote and the class with the most votes is the final prediction. For regression, the final output is the arithmetic mean of the T weak learners' regression results.
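In sklearn, the per-node feature subsampling described above (nsub) corresponds to the max_features hyperparameter of RandomForestClassifier. A minimal sketch on the moons data used earlier (with only two features this is mostly illustrative; "sqrt" is just one common choice, and scores vary with the random seed and library version):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# max_features limits how many features are considered at each split (nsub);
# "sqrt" is a common choice for classification, or pass an integer directly.
rf_clf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                n_jobs=-1, random_state=42)
rf_clf.fit(X_train, y_train)
print(accuracy_score(y_test, rf_clf.predict(X_test)))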

Extremely randomized trees (Extra-Trees)

Extra-Trees (extremely randomized trees) are a variant of RF, with almost exactly the same principle; the only differences are:

    1) For the training set of each decision tree, RF uses bootstrap random sampling to select the sample set for each tree, whereas Extra-Trees generally do not use random sampling at all, i.e., each decision tree uses the original training set.

    2) When splitting on the selected features, RF behaves like a traditional decision tree: it picks the optimal split point of a feature according to criteria such as the Gini index or mean squared error. Extra-Trees are more radical: they split the node on a randomly chosen threshold of a feature.

    As the second point shows, because split thresholds are chosen randomly rather than optimally, the trees that are generated are generally larger than those generated by RF. In other words, the model's variance is reduced further relative to RF, while its bias increases further. In some cases the generalization ability of Extra-Trees is better than that of RF.
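sklearn exposes this variant as ExtraTreesClassifier (and ExtraTreesRegressor), with the same API as RandomForestClassifier. A minimal sketch on the same moons data, assuming the split from earlier (the score is illustrative only):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Extra-Trees: random split thresholds instead of searching for the best one,
# and no bootstrap sampling by default (each tree sees the full training set)
ext_clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
print(accuracy_score(y_test, ext_clf.predict(X_test)))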

Feature importance

In general, the more important a feature is, the closer it appears to the root node. In sklearn, the importance of each feature can be inspected through the feature_importances_ attribute. Random forests make it easy to quickly extract feature importances.

from sklearn.datasets import  load_iris
iris=load_iris()
from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
sepal length (cm) 0.10346956848238457
sepal width (cm) 0.025617561677439962
petal length (cm) 0.43341727470590563
petal width (cm) 0.43749559513426994

Boosting

Boosting refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each one trying to correct its predecessor. The two most popular boosting methods are AdaBoost and gradient boosting.

AdaBoost

One way for a new predictor to correct its predecessor is to pay more attention to the training instances that the predecessor underfit. This is the approach used by AdaBoost.

To build an AdaBoost classifier, first train a base classifier and use it to make predictions on the training set. Then increase the relative weight of the misclassified instances, train a second classifier with the updated weights, make predictions again, update the weights again, and so on.

However, this technique has one significant drawback. Compared with bagging and pasting, AdaBoost cannot be parallelized: each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale nearly as well as the other two.

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)


Gradient boosting

Unlike AdaBoost, which adjusts the instance weights at every iteration, gradient boosting fits each new predictor to the residual errors made by the previous predictor.

import numpy as np

# A simple noisy quadratic dataset for regression
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

First, fit a decision tree regressor:

from sklearn.tree import DecisionTreeRegressor

tree_reg1=DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

Then train a second tree on the residual errors made by the first tree:

y2=y-tree_reg1.predict(X)
tree_reg2=DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Then train a third tree on the residual errors made by the second tree:

y3 = y2 - tree_reg2.predict(X)  # residuals left by the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

Now the ensemble can make a prediction on a new instance by adding up the predictions of all the trees (X_new below is just a hypothetical example instance):

X_new = np.array([[0.8]])  # hypothetical new instance; any value in the input range works
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

Of course, there is a simpler way: sklearn's GradientBoostingRegressor can build the same GBRT ensemble directly.

from sklearn.ensemble import GradientBoostingRegressor

grbt=GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
grbt.fit(X, y)

Among these hyperparameters, a learning rate that is too low combined with too few predictors tends to underfit, while a learning rate that is too high or too many predictors can lead to overfitting.
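One common way to choose the number of trees is early stopping: train a larger ensemble, use staged_predict to measure the validation error after each boosting stage, and keep the stage count with the lowest error. A sketch under those assumptions (the split, stage count, and variable names here are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

X_train_g, X_val_g, y_train_g, y_val_g = train_test_split(X, y, random_state=42)

gbrt_full = GradientBoostingRegressor(max_depth=2, n_estimators=120, learning_rate=0.1)
gbrt_full.fit(X_train_g, y_train_g)

# Validation error after each boosting stage; keep the stage count with the lowest error
errors = [mean_squared_error(y_val_g, y_pred)
          for y_pred in gbrt_full.staged_predict(X_val_g)]
best_n = int(np.argmin(errors)) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n, learning_rate=0.1)
gbrt_best.fit(X_train_g, y_train_g)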

Stacking

Unlike the previous methods, stacking is based on a very simple idea: instead of using a trivial function (such as hard voting) to aggregate the predictions of all the predictors, why not train a model to perform the aggregation? For example, three predictors might output different predicted values (3.1, 2.7, 2.9); a final predictor then takes these predictions as input and produces the final prediction.

The process is as follows:

First, split the training set into two subsets; the first subset is used to train the predictors in the first layer.

Then the first-layer predictors make predictions on the second subset. Because these predictors never saw those instances during training, the predictions are guaranteed to be "clean". The predicted values are used as input features to create a new training set. The blender is trained on this new training set, learning to predict the target value from the first-layer predictions. It is possible to train several different blenders in this way.
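Recent versions of sklearn ship a StackingClassifier (and StackingRegressor) that implements this idea, using cross-validated predictions from the first-layer estimators to train the final blender. A minimal sketch on the moons data from earlier, assuming an sklearn version that includes it (0.22 or later); the choice of estimators and blender here is only illustrative:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# First-layer predictors plus a logistic-regression blender as final_estimator
stack_clf = StackingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(),
    cv=5)
stack_clf.fit(X_train, y_train)
print(accuracy_score(y_test, stack_clf.predict(X_test)))

Here a logistic regression plays the role of the blender, but any estimator can be used as the final_estimator.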

 


Origin blog.csdn.net/weixin_42307828/article/details/86360622