Machine Learning Getting Started (1): Ensemble Learning

Author: chen_h
WeChat ID & QQ: 862251340
WeChat official account: coderpai


Using a variety of different models, rather than relying on a single model, is more reliable. A group of models working together is called an ensemble, and this technique is called ensemble learning.

Voting Model

You can train models with different algorithms and then combine their predictions to get the final output. For example, you can use a random forest classifier, an SVM classifier, and a logistic regression classifier, fit each of them, and combine their votes with the VotingClassifier class from sklearn.ensemble to get the best classification. Hard voting picks the final prediction by a simple majority vote among the models. Soft voting is only possible when every classifier can estimate class probabilities; it averages the predicted probabilities across the algorithms and chooses the class with the highest average probability.

The Python code is as follows:

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

A VotingClassifier is generally more accurate than any single classifier in the ensemble. Be sure to include diverse classifiers, so that they do not all fit the data in similar ways and make similar errors.
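To use soft voting as described above, every classifier must be able to estimate class probabilities. A minimal sketch (note that SVC needs probability=True, which the original example does not set):

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)  # enables predict_proba(), required for soft voting

voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf_soft.fit(X_train, y_train)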

Bagging and Pasting

Instead of running many different models on a single data set, you can train a single model on many random subsets of the data set. Sampling with replacement is called bagging. If that is hard to picture, just imagine randomly leaving out a few instances and building the model on the rest of the data. Pasting follows the same procedure; the only difference is that pasting samples without replacement, so the same training instance cannot be drawn more than once for the same predictor.

The Python code is as follows:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

y_pred = bag_clf.predict(X_test)

The bootstrap=True parameter specifies bagging. For pasting, simply set bootstrap=False instead.

If the base classifier can estimate class probabilities for its predictions, BaggingClassifier automatically performs soft voting. It checks this by verifying whether your classifier has a predict_proba() method.

In practice, bagging usually gives better results than pasting.

Out-of-Bag Evaluation

When bagging is applied to the training set, only about 63% of the instances are sampled for each predictor, which means the classifier never sees the remaining roughly 37% of instances during training. These out-of-bag instances can be used for evaluation, much like cross-validation.

To use this feature, simply add the oob_score=True parameter to the BaggingClassifier from the previous example.

Python code is as follows:

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42,
    oob_score=True)
bag_clf.fit(X_train, y_train)
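After fitting, the out-of-bag estimate is exposed as an attribute. A minimal usage sketch (assuming the same X_test as above and a corresponding y_test, which the original snippets do not define):

from sklearn.metrics import accuracy_score

# OOB accuracy, estimated on instances each predictor never saw during training.
print(bag_clf.oob_score_)

# It is usually close to the accuracy measured on a held-out test set.
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))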

So far, only the training instances have been sampled. For data sets with a large number of features, there are also techniques that sample the features themselves, as sketched below.
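One such technique is supported directly by BaggingClassifier through its max_features and bootstrap_features parameters. A hedged sketch (the parameter values are illustrative, not from the original article):

# Sample a subset of the features (with replacement) for each predictor,
# in addition to sampling the training instances.
bag_clf_features = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    n_jobs=-1, random_state=42)
bag_clf_features.fit(X_train, y_train)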

Random Forests

A random forest is a collection of decision trees. Random forests perform bagging internally. They build many trees, sometimes thousands, and combine them into the best model for a given data set. When splitting a node, the random forest algorithm does not consider all of the features; instead it searches for the best feature within a random subset of the features. This trades a slightly higher bias for a lower variance, which generally results in a better model.

Python code is as follows:

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

Parameters: n_estimators limits the number of trees in the forest. max_leaf_nodes sets the maximum number of leaf nodes, so that the algorithm does not keep drilling down on a single feature and overfit the model (see the decision tree article for details). n_jobs specifies how many CPU cores to use; a value of -1 means all available cores.

The model can be improved further by running a grid search over the parameter values.
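For example, a minimal grid-search sketch with GridSearchCV (the parameter grid below is illustrative, not taken from the original article):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_leaf_nodes': [8, 16, 32],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
best_rf = grid_search.best_estimator_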

AdaBoost

Although the mathematics behind the AdaBoost technique can look daunting, the concept is simple. First, a base classifier is chosen and trained to make predictions on the given training set. The misclassified instances are noted, and their weights are increased. A second classifier is then trained on the same training set using the updated weights.

In simple terms: run a classifier and make predictions; then run another classifier that tries to fix the errors of the previous one; repeat until all or most of the training instances are fit well.

Python code is as follows:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

Scikit-learn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function). If the predictors can estimate class probabilities (i.e., they have a predict_proba() method), Scikit-learn can use SAMME.R (the R stands for "Real"), which relies on class probabilities but can overfit more easily. If the model overfits, try regularizing the base estimator.

Gradient Boosting

Similar to AdaBoost, gradient boosting also works by sequentially adding predictors to the ensemble. However, instead of updating the weights of the training instances as AdaBoost does, gradient boosting fits each new model to the residual errors of the previous one.

In short: fit a model to the given training set; compute the residuals and use them as the new training targets; train a new model on them, and so on. The final prediction is the sum of the predictions of all the models.
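To make the residual-fitting idea concrete, here is a minimal sketch that chains three regression trees by hand (illustrative only; it assumes the X_train, y_train, X_test variables used throughout, with a regression target):

from sklearn.tree import DecisionTreeRegressor

# First tree fits the original targets.
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X_train, y_train)

# Second tree fits the residual errors of the first.
y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X_train, y2)

# Third tree fits the residual errors of the second.
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X_train, y3)

# The ensemble prediction is the sum of all the trees' predictions.
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))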

The equivalent Python code using scikit-learn's GradientBoostingRegressor is as follows:

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X_train, y_train)

The learning rate scales (shrinks) the contribution of each tree. There is a trade-off between learning_rate and n_estimators: reducing learning_rate means the ensemble needs more trees overall. This is called shrinkage. Conversely, increasing the number of estimators to too large a value may cause the model to overfit.
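One way to manage this trade-off is to monitor validation error as trees are added and keep only the best number of them. A hedged sketch using staged_predict (the validation split and parameter values are illustrative, not from the original article):

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                 learning_rate=0.1, random_state=42)
gbrt.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each additional tree,
# so the validation error can be tracked as the ensemble grows.
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n = int(np.argmin(errors)) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n,
                                      learning_rate=0.1, random_state=42)
gbrt_best.fit(X_tr, y_tr)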

XGBoost

XGBoost is one of the newest and most powerful gradient boosting implementations. Instead of making a hard "yes" or "no" decision at each leaf node, XGBoost assigns a positive or negative score to every decision. Each individual tree is a weak learner that does only slightly better than random guessing, but taken together they perform very well.

Python code is as follows:

from xgboost import XGBClassifier
xgb_clf = XGBClassifier()
xgb_clf.fit(X, y)

XGBoost can be used with both tree-based and linear models, and it has been one of the most successful models in Kaggle competitions. It is a powerful tool to have in your data science toolbox.
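For completeness, a hedged sketch of the two booster types (the parameter values are illustrative, and y_test is assumed to exist alongside X_test):

from xgboost import XGBClassifier

# Default tree booster.
xgb_tree = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
xgb_tree.fit(X_train, y_train)
print(xgb_tree.score(X_test, y_test))

# A linear booster can be used instead of trees.
xgb_linear = XGBClassifier(booster='gblinear')
xgb_linear.fit(X_train, y_train)
print(xgb_linear.score(X_test, y_test))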

Source: blog.csdn.net/CoderPai/article/details/90609944