Random Forest algorithm (RandomForest) + sklearn library

Random Forest algorithm study
I have recently been doing Kaggle competitions and found that Random Forest performs very well on classification tasks; in most cases it clearly beats SVM, logistic regression, and KNN. So I decided to dig into how the algorithm works.

To understand Random Forests, we first need a brief look at ensemble learning methods and at decision tree algorithms. Only a short overview of each is given below (for a detailed treatment, see Chapters 5 and 8 of *Statistical Learning Methods*).

The difference between the concepts of Bagging and Boosting
(This part is adapted mainly from: http://www.cnblogs.com/liuwu265/p/4690486.html)

Random Forest is a bagging algorithm within ensemble learning (Ensemble Learning). Ensemble methods are generally divided into bagging and boosting algorithms; let's look at the characteristics of and differences between the two.

Bagging (bagging method)
The bagging procedure is as follows:

1. From the original sample set, use the Bootstrapping method to randomly draw n training samples with replacement; repeat for k rounds to obtain k training sets. (The k training sets are independent of one another, and elements may be repeated within a set.)
2. For the k training sets, train k models. (Which kind of model depends on the specific problem: decision trees, KNN, etc.)
3. For classification problems, the final result is produced by voting among the k models; for regression problems, the mean of the k models' predictions is taken as the final result. (All models carry equal weight.)
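A minimal sketch of this procedure (my own illustration, not part of the original post), using NumPy and a scikit-learn decision tree on the iris toy dataset:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
k = 10                                           # number of bagging rounds / models
models = []

# steps 1 and 2: draw k bootstrap training sets and train one model per set
for _ in range(k):
    idx = rng.randint(0, len(X), size=len(X))    # sampling with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# step 3 (classification): majority vote over the k models, all with equal weight
preds = np.array([m.predict(X) for m in models])                      # shape (k, n_samples)
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print("training accuracy of the bagged vote:", (vote == y).mean())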
Boosting (boosting method)
The boosting procedure is as follows:

1. Assign a weight wi to each sample in the training set, expressing how much attention is paid to that sample. When a sample has a high probability of being misclassified, its weight is increased.
2. Each iteration of the process trains one weak classifier; a combination strategy is then needed to merge the weak classifiers into the final model. (For example, AdaBoost gives each weak classifier a weight and takes a linear combination of them as the final classifier: the smaller a weak classifier's error, the larger its weight.)
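As a quick illustration of this scheme (my own addition), scikit-learn's AdaBoostClassifier implements exactly this reweighting and weighted combination of weak classifiers:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# 50 weak learners (shallow decision trees by default), combined with
# per-classifier weights that grow as the classifier's error shrinks
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", boost.score(X, y))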
Main differences between Bagging and Boosting:
1. Sample selection: Bagging uses Bootstrap sampling with replacement, whereas in Boosting the training set is the same in every round and only the weight of each sample changes.
2. Sample weights: Bagging uses uniform sampling, so every sample has equal weight; Boosting adjusts the sample weights according to the error rate, and the larger the error, the larger the sample's weight.
3. Prediction functions: in Bagging all prediction functions have equal weight; in Boosting, the smaller a prediction function's error, the larger its weight.
4. Parallel computation: in Bagging the individual prediction functions can be generated in parallel; in Boosting each prediction function must be generated sequentially, by iteration.
Combining decision trees with these frameworks yields the following algorithms:

1) Bagging + decision tree = Random Forest

2) AdaBoost + decision tree = boosting tree

3) Gradient Boosting + decision tree = GBDT
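All three combinations have ready-made implementations in sklearn.ensemble; a minimal sketch of how they are instantiated (my own illustration, hyperparameters chosen arbitrarily):

from sklearn.ensemble import (RandomForestClassifier,
                              AdaBoostClassifier,
                              GradientBoostingClassifier)

# Bagging + tree, AdaBoost + tree, and Gradient Boosting + tree, respectively
rf   = RandomForestClassifier(n_estimators=100, random_state=0)
ada  = AdaBoostClassifier(n_estimators=100, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)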

Decision trees
The common decision tree algorithms are ID3, C4.5, and CART. The ideas behind building the model are very similar in all three; they just use different splitting criteria. The decision tree model is built as follows:

ID3 / C4.5 decision tree
Input: training set D, feature set A, threshold eps. Output: decision tree T.

1. If all samples in D belong to the same class Ck, then T is a single-node tree; take Ck as the class label of that node and return T.
2. If A is the empty set, i.e. there are no features left to split on, then T is a single-node tree; take the class Ck with the most samples in D as the class label of that node and return T.
3. Otherwise, compute the information gain (ID3) / information gain ratio (C4.5) of each feature on D, and select the feature Ag with the largest gain.
4. If the information gain (ratio) of Ag is smaller than the threshold eps, then T is a single-node tree; take the class Ck with the most samples in D as the class label of that node and return T.
5. Otherwise, split D into several non-empty subsets Di according to the values of feature Ag, take the class with the most samples in each Di as the label to construct a child node; the node and its child nodes form the tree T, return T.
6. For the i-th child node, with Di as the training set and A - {Ag} as the feature set, call steps 1-5 recursively to obtain the subtree Ti, return Ti.
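The core of step 3 is computing entropy and information gain. A minimal sketch of those two quantities for a single discrete feature (my own illustration; the toy arrays are made up):

import numpy as np

def entropy(labels):
    # empirical entropy H(D) of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature_values, labels):
    # g(D, A) = H(D) - H(D | A) for one discrete feature
    h_d = entropy(labels)
    h_d_given_a = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        h_d_given_a += mask.mean() * entropy(labels[mask])
    return h_d - h_d_given_a

# toy example: a feature that perfectly separates the two classes
x = np.array([0, 0, 1, 1])
y = np.array(["no", "no", "yes", "yes"])
print(information_gain(x, y))   # equals H(D) = 1.0 here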
CART decision tree generation
Here we only briefly describe the differences between CART and ID3 / C4.5:

1. CART trees are binary, while ID3 and C4.5 trees can have multi-way splits.
2. When generating subtrees, CART selects one feature and one value of that feature as the cut point, producing two subtrees.
3. The feature and cut point are selected using the Gini index: the feature and cut point with the smallest Gini index are chosen to generate the subtrees.
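For reference, a minimal sketch of the Gini index computation behind that selection (my own illustration; the toy arrays are made up):

import numpy as np

def gini(labels):
    # Gini index Gini(D) = 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_split(feature_values, labels, cut_point):
    # weighted Gini index of D split into two parts by one cut point
    left = feature_values <= cut_point
    right = ~left
    return (left.mean() * gini(labels[left])
            + right.mean() * gini(labels[right]))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(gini_split(x, y, 2.0))   # 0.0: this cut point separates the classes perfectly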
Decision tree pruning
Pruning a decision tree is mainly done to prevent overfitting; the process is not described in detail here.

The main idea is to work upward from the leaf nodes, trying to prune each node and comparing the value of the decision tree's loss function before and after pruning. Using dynamic programming on the tree (tree DP, which ACM competitors will recognize), the globally optimal pruning scheme can be obtained.
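In practice (an aside of mine, not from the original post), scikit-learn's decision trees expose cost-complexity pruning through the ccp_alpha parameter, which trades tree size against loss in the same spirit:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# a larger ccp_alpha prunes more aggressively and yields a smaller tree
for alpha in (0.0, 0.01, 0.05):
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(alpha, "-> number of leaves:", tree.get_n_leaves())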

Random Forest (Random Forests)
Random Forest is an important Bagging-based ensemble learning method; it can be used for classification, regression, and other problems.

Random Forest has many advantages:

1. High accuracy.
2. The randomness it introduces makes a random forest unlikely to overfit.
3. The same randomness also gives it good tolerance to noise.
4. It can handle high-dimensional data without feature selection.
5. It can handle both discrete and continuous data, and the dataset does not need to be normalized.
6. Training is fast, and a ranking of variable importance can be obtained.
7. It is easy to parallelize.

Disadvantages of Random Forest:

1. When the number of trees in the forest is large, training requires a lot of time and space.
2. Many aspects of the model are hard to interpret; it is somewhat of a black-box model.
Similar to the Bagging procedure described above, a random forest is constructed as follows:

1. From the original training set, use Bootstrapping to randomly draw m samples with replacement; repeat n_tree times to generate n_tree training sets.
2. For the n_tree training sets, train n_tree decision tree models respectively.
3. For a single decision tree model, if the number of features in the training samples is n, then at every split the best feature is selected according to information gain / information gain ratio / Gini index.
4. Each tree keeps splitting in this way until all training samples at a node belong to the same class. The decision trees are not pruned during splitting.
5. The generated decision trees together form the random forest, as sketched below. For classification problems, the final class is decided by a vote among the trees; for regression problems, the final result is the mean of the trees' predicted values.
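As an aside (my own mapping, not from the original post), the construction steps above correspond directly to parameters of sklearn's RandomForestClassifier: n_estimators is the number of trees, bootstrap controls the resampling with replacement, and max_features controls how many randomly chosen features each split considers:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=100,      # n_tree: number of bootstrap training sets / trees
    criterion="gini",      # split criterion ("entropy" for information gain)
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # sampling with replacement from the original training set
    random_state=0,
).fit(X, y)

# the variable importance ranking mentioned among the advantages above
print(model.feature_importances_)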

 

Use Cases

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# getFuturesDataSet is the original author's own helper: it loads a .npy
# dataset from npyPath and splits it 67% / 33% into train and test sets
trainSet, trainLabel, testSet, testLabel = getFuturesDataSet(npyPath, 0.67)

model = RandomForestClassifier(bootstrap=True, random_state=0)
model.fit(trainSet, trainLabel)
# dimensionality reduction (left commented out in the original):
# x_pca_test = pca.fit_transform(x_test)
result = model.predict(np.array(testSet))
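As a follow-up (my own addition, not in the original post), the held-out labels returned by the helper could be used to check accuracy, assuming testLabel holds the true test labels:

from sklearn.metrics import accuracy_score

# compare the forest's predictions against the held-out test labels
print("test accuracy:", accuracy_score(testLabel, result))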

  

Related documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


----------------

Original link: https://blog.csdn.net/qq547276542/article/details/78304454
