Introduction
Random forest is a classifier that trains and predicts with an ensemble of trees, each of which is a CART (Classification And Regression Tree). The training set for each tree is sampled with replacement from the full training set, so a given sample may appear several times in one tree's training set, or not at all. When training each node of a tree, a fixed number of candidate features is drawn without replacement from the full feature set. If the total number of features is M, typical choices for the number of candidate features are √M, (1/2)√M, or 2√M.
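The two sampling schemes above (rows with replacement, features without) can be sketched in NumPy; the sample count, feature count, and seed below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, M = 100, 16    # hypothetical sample count and total feature count
f = int(np.sqrt(M))       # candidate features per node, here sqrt(M) = 4

# Bootstrap: draw row indices with replacement, so some samples repeat
# and some never appear in this tree's training set
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Per-node feature subset: f features drawn without replacement from all M
feature_subset = rng.choice(M, size=f, replace=False)
```

Because the bootstrap draws with replacement, roughly a third of the rows are left out of each tree's sample; these become the out-of-bag data discussed later.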
Training process
The training process of random forest can be summarized as follows:
(1) Given a training set S, a test set T, and feature dimension F, fix the parameters: the number of CART trees t, the maximum depth d of each tree, the number of candidate features f used at each node, and the termination conditions: the minimum number of samples s on a node and the minimum information gain m on a node.
For each tree i = 1, …, t:
(2) Draw a training set S(i) of the same size as S by sampling with replacement from S, use it as the sample set of the root node, and start training from the root node.
(3) If the current node meets a termination condition, mark it as a leaf node. For a classification problem, the leaf's predicted output is the class c(j) with the most samples in the current node's sample set, and its probability p is the proportion of c(j) in that set; for a regression problem, the predicted output is the average of the sample values in the current node's sample set. Then continue training the other nodes. If the current node does not meet a termination condition, randomly select f of the F features without replacement, and among these f features find the single feature k and threshold th with the best splitting effect. Samples at the current node whose value on feature k is smaller than th go to the left child node, and the rest go to the right child node. Continue training the other nodes. The criterion for judging the splitting effect is discussed below.
(4) Repeat (2) and (3) until every node has been trained or marked as a leaf node.
(5) Repeat (2), (3), and (4) until all t CART trees have been trained.
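Steps (2)–(5) can be sketched as a small, self-contained NumPy program. This is a minimal illustration, not a reference implementation: `build_tree` and `train_forest` are hypothetical names, the termination conditions are reduced to node purity, minimum samples, and maximum depth, and the split criterion is the Gini impurity defined in the next section:

```python
import numpy as np

rng = np.random.default_rng(42)

def gini(y):
    # Gini impurity of a label array: 1 - sum_j p_j^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, f, min_samples=2, depth=0, max_depth=5):
    # Termination: pure node, too few samples, or maximum depth -> leaf node
    if len(np.unique(y)) == 1 or len(y) < min_samples or depth >= max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}   # majority class
    # Step (3): among f randomly chosen features, keep the best (feature k, threshold th)
    best = None
    for k in rng.choice(X.shape[1], size=f, replace=False):
        for th in np.unique(X[:, k]):
            left = X[:, k] < th
            if left.sum() == 0 or (~left).sum() == 0:
                continue   # skip degenerate splits
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, k, th, left)
    if best is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    _, k, th, left = best
    return {"k": int(k), "th": float(th),
            "L": build_tree(X[left], y[left], f, min_samples, depth + 1, max_depth),
            "R": build_tree(X[~left], y[~left], f, min_samples, depth + 1, max_depth)}

def train_forest(X, y, t=10, f=None):
    f = f or max(1, int(np.sqrt(X.shape[1])))
    forest = []
    for _ in range(t):                            # step (5): t trees
        idx = rng.integers(0, len(y), len(y))     # step (2): bootstrap sample
        forest.append(build_tree(X[idx], y[idx], f))
    return forest

# Toy two-class data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
forest = train_forest(X, y, t=5)
```

Each tree is stored as a nested dict of `(k, th)` split nodes and `leaf` nodes, matching the left/right routing rule of step (3).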
Prediction process
The prediction process is as follows:
For each tree i = 1, …, t:
(1) Starting from the root node of the current tree, compare the sample against the current node's threshold th to decide whether to enter the left child (< th) or the right child (>= th), until a leaf node is reached; output that leaf's predicted value.
(2) Repeat (1) until all t trees have output predicted values. For a classification problem, the output is the class with the largest summed predicted probability over all trees, i.e., the accumulation of each tree's p for each class c(j); for a regression problem, the output is the average of all trees' outputs.
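The two aggregation rules in step (2) can be shown directly; the per-tree outputs below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical per-tree class-probability outputs for one test sample,
# shape (t trees, C classes); each row is a leaf's class-proportion vector p
tree_probs = np.array([[0.8, 0.2],
                       [0.4, 0.6],
                       [0.9, 0.1]])

# Classification: accumulate p for each class c(j) across trees, take the argmax
summed = tree_probs.sum(axis=0)
predicted_class = int(np.argmax(summed))

# Regression: each tree outputs one value; the forest predicts the average
tree_values = np.array([2.0, 3.0, 2.5])
predicted_value = float(tree_values.mean())
```

Here the summed probabilities are [2.1, 0.9], so class 0 wins even though one individual tree voted for class 1.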
Regarding the criterion for evaluating split quality: since the trees are CARTs, the CART criteria are used, which differ from those of ID3 and C4.5.
For classification problems (assigning a sample to one of several discrete classes), CART uses the Gini impurity as the criterion, defined as Gini = 1 − Σ_j p_j², where p_j is the proportion of class j at the node.
For example, with 2 classes and 100 samples on the current node, 70 in the first class and 30 in the second, Gini = 1 − 0.7² − 0.3² = 0.42.
For regression problems the criterion is simpler: the sum of squared errors Σ_i (y_i − μ)² is used directly, where μ is the mean of the sample values at the node.
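Both criteria, including the worked 70/30 example above, fit in a few lines:

```python
import numpy as np

def gini(counts):
    # Gini = 1 - sum_j p_j^2, computed from per-class sample counts
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return 1.0 - float(np.sum(p ** 2))

def sse(values):
    # Regression criterion: sum of squared deviations from the node mean
    v = np.asarray(values, dtype=float)
    return float(np.sum((v - v.mean()) ** 2))

# The worked example from the text: 100 samples, 70 vs 30
# gini([70, 30]) = 1 - 0.7^2 - 0.3^2 = 0.42
example = gini([70, 30])
```

A pure node gives Gini 0 (and SSE 0 for regression), which is why lower is better when comparing candidate splits.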
Feature importance measure
When calculating the importance of a feature X, the specific steps are as follows:
- For each decision tree, select its out-of-bag (OOB) data and compute the out-of-bag error, denoted errOOB1.
The so-called out-of-bag data: each time a decision tree is built, its training set is drawn by sampling with replacement, so about 1/3 of the samples are left out and do not participate in building that tree. This left-out portion can be used to evaluate the tree's performance and to compute the model's prediction error rate, called the out-of-bag error.
This has been shown to be an unbiased estimate, so no cross-validation or a separate test set is required in the random forest algorithm to obtain an unbiased estimate of the test set error.
- Randomly add noise to feature X over all out-of-bag samples (i.e., randomly change each sample's value on feature X), and compute the out-of-bag error again, denoted errOOB2.
- Assuming there are N trees in the forest, the importance of feature X = Σ(errOOB2 − errOOB1) / N. This value reflects importance because if, after random noise is added, the out-of-bag accuracy drops significantly (i.e., errOOB2 increases), the feature has a large influence on the predictions of the samples, which indicates that its importance is relatively high.
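The importance formula above is a simple average of per-tree error differences; the error rates below are made-up numbers for a hypothetical forest of N = 4 trees, and the permutation shows one common way to "add noise" to a feature column without changing its distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-tree OOB error rates before (errOOB1) and after (errOOB2)
# perturbing feature X in each tree's out-of-bag samples; N = 4 trees
errOOB1 = np.array([0.10, 0.12, 0.08, 0.11])
errOOB2 = np.array([0.25, 0.30, 0.22, 0.27])

# Importance of X = sum(errOOB2 - errOOB1) / N
importance = float(np.sum(errOOB2 - errOOB1) / len(errOOB1))

# Shuffling a feature column destroys its relationship to the target
# while preserving its marginal distribution
X_oob = rng.normal(size=(8, 3))
X_perm = X_oob.copy()
X_perm[:, 0] = rng.permutation(X_perm[:, 0])
```
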
Feature selection
On the basis of feature importance, the steps of feature selection are as follows:
- Calculate the importance of each feature and sort in descending order.
- Determine the proportion to eliminate, drop that proportion of the least important features, and obtain a new feature set.
- Repeat the above process with the new feature set until m features remain (m is a value set in advance).
- Among the feature sets produced above, choose the one whose out-of-bag error rate is lowest.
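The elimination loop can be sketched as follows. Everything here is a stand-in: the importances are random placeholders and `oob_error` is a hypothetical oracle; in practice both would come from retraining a forest on each candidate feature set:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical oob_error() stand-in: pretend only features {0, 1, 2} are
# informative, so dropping any of them raises the out-of-bag error
def oob_error(features):
    informative = {0, 1, 2}
    return 0.1 + 0.05 * len(informative - set(features))

features = list(range(8))   # start from all 8 (hypothetical) features
drop_ratio, m = 0.25, 2     # eliminate 25% per round, stop at m = 2 features
history = [(list(features), oob_error(features))]

while len(features) > m:
    importances = rng.random(len(features))   # stand-in for forest importances
    order = np.argsort(importances)[::-1]     # sort descending by importance
    keep = max(m, int(len(features) * (1 - drop_ratio)))
    features = [features[i] for i in order[:keep]]
    history.append((list(features), oob_error(features)))

# Final step: keep the feature set with the lowest out-of-bag error rate
best_set, best_err = min(history, key=lambda h: h[1])
```

With 8 features and a 25% drop ratio the candidate set sizes shrink as 8 → 6 → 4 → 3 → 2, and the selection at the end compares all of them by OOB error.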
Advantages
- Performs well on many datasets
- Often competitive with or better than other algorithms on a wide range of current datasets
- Can handle very high-dimensional data (many features) without separate feature selection
- After training, it can report which features are more important
- Building the forest yields an unbiased estimate of the generalization error (via the out-of-bag data)
- fast training
- During the training process, the interaction between features can be detected
- easy to parallelize
- Implementation is relatively simple
Code
An example of simply using the random forest algorithm in sklearn:
# Import library
from sklearn.ensemble import RandomForestClassifier

# Assumes you have X (predictors) and y (target) for the training set,
# and x_test (predictors) for the test set

# Create Random Forest object
model = RandomForestClassifier()

# Train the model using the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
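A self-contained variant of the snippet above, using a synthetic dataset so it runs as-is (the dataset parameters are illustrative); `feature_importances_` ties back to the feature importance section, and `predict_proba` exposes the per-class probability aggregation from the prediction process:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 2-class problem: 200 samples, 10 features, 4 informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(X_test)
proba = model.predict_proba(X_test)          # averaged per-class probabilities
importances = model.feature_importances_     # impurity-based, sums to 1
```

Note that sklearn's built-in `feature_importances_` is impurity-based rather than the OOB-permutation measure described above; `sklearn.inspection.permutation_importance` provides the permutation variant.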
In addition, the random forest algorithm is also implemented in OpenCV; for a concrete usage example, see the RandomForest Random Forest Summary.