Principles and use of random forests (an ensemble algorithm)

1. What is a Random Forest?

A random forest is, in fact, an ensemble of multiple decision trees.

This is the bagging method: the original samples are resampled (with replacement) to produce several different training sets, a learner is trained on each new training set, and the outputs of all the learners are finally merged into the overall result, with each learner carrying equal weight.

In this method, the learners are trained independently of one another, which makes bagging easy to parallelize.
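As a concrete illustration, here is a minimal sketch of the bagging idea just described (the helper name bagging_predict and the choice of decision trees as base learners are my own for illustration, not part of the original post):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_learners=10):
    votes = []
    for _ in range(n_learners):
        # bootstrap: resample the training set with replacement
        idx = np.random.randint(0, len(X_train), size=len(X_train))
        learner = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(learner.predict(X_test))
    # merge the independent learners by majority vote (integer labels assumed)
    votes = np.array(votes)
    return np.array([np.bincount(col).argmax() for col in votes.T])

Because each learner's loop iteration depends on nothing but the shared training data, the iterations could run in parallel without any coordination.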

2. The principle of random forests

# import packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# RandomForestClassifier: random forest; ExtraTreesClassifier: extremely randomized forest
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn import datasets
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Random forest: built from multiple decision trees, each working on the principle just described.
Multiple trees operating together ------------> ensemble algorithm

# load the wine dataset
wine = datasets.load_wine()
wine
# extract the data and target
X = wine['data']
y = wine['target']
X.shape

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# train a random forest, predict, and compute the accuracy
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_)

 

# decision tree prediction accuracy
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
dt_clf.score(X_test, y_test)

 

# decision tree accuracy averaged over multiple runs
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    dt_clf = DecisionTreeClassifier()
    dt_clf.fit(X_train, y_train)
    score += dt_clf.score(X_test, y_test) / 100
print('Decision tree accuracy over multiple runs:', score)

# random forest accuracy averaged over multiple runs
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # n_estimators is the number of trees in the forest
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    score += clf.score(X_test, y_test) / 100
print('Random forest accuracy over multiple runs:', score)

3. Extremely randomized forest (extra trees)

Compared with a single decision tree, what a random forest randomizes is the sampling of the training data.

Compared with a random forest, an extremely randomized forest goes one step further: the split conditions themselves are also chosen at random, even though a random condition is not the best possible split. (A plain decision tree splits on the condition with the largest information gain.)

As in a random forest, a random subset of candidate features is used, but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature, and the best of these randomly generated thresholds is then selected as the splitting rule.
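Scikit-learn implements this as the ExtraTreesClassifier imported earlier. A minimal sketch on the same wine data (n_estimators chosen only for illustration):

# extremely randomized forest on the wine data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
et_clf = ExtraTreesClassifier(n_estimators=100)
et_clf.fit(X_train, y_train)
et_clf.score(X_test, y_test)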

4. Summary

How a random forest is generated (a from-scratch sketch follows the list):

1. Draw n samples from the training set by bootstrap resampling (sampling with replacement).

2. Assuming each sample has many features, select k of those features for the n samples and build a decision tree by choosing the best split points among them.

3. Repeat this m times to generate m decision trees.

4. Predict by a majority vote over the trees.
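A minimal from-scratch sketch of these four steps, reusing the imports above (illustrative only: the helper names are mine, and note that scikit-learn's RandomForestClassifier samples features per split via max_features, whereas this sketch samples k features once per tree for simplicity):

def grow_forest(X, y, m=25, k=5):
    trees = []
    for _ in range(m):
        # step 1: bootstrap-resample n samples
        idx = np.random.randint(0, len(X), size=len(X))
        # step 2: pick k features and fit a best-split decision tree on them
        feats = np.random.choice(X.shape[1], k, replace=False)
        tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
        trees.append((tree, feats))   # step 3: repeat until m trees are grown
    return trees

def forest_predict(trees, X):
    # step 4: majority vote over the m trees (integer labels assumed)
    votes = np.array([tree.predict(X[:, feats]) for tree, feats in trees])
    return np.array([np.bincount(col).argmax() for col in votes.T])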

Advantages and disadvantages:

From my experience using it in projects, the random forest is a fairly good model: it is very efficient at classifying high-dimensional datasets, and it can also be used for feature selection via its feature importances.

It runs efficiently, achieves high accuracy, and is relatively simple to implement. However, on data with a lot of noise it can still overfit, and overfitting remains the most serious weakness of random forests.
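For instance, the feature importances mentioned above can be read directly off a fitted forest (assuming the clf and wine objects from the earlier cells):

# rank the wine features by how much the forest relied on them
pd.Series(clf.feature_importances_, index=wine['feature_names']).sort_values(ascending=False)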

Source: www.cnblogs.com/xiuercui/p/11962676.html