Summary of Machine Learning Algorithms - Random Forest

Introduction

Random forest is a classifier that trains and predicts with multiple decision trees; it is composed of multiple CARTs (Classification And Regression Trees). For each tree, the training set used is sampled with replacement from the full training set, which means that some samples from the full training set may appear several times in a given tree's training set, while others may never appear in it at all. When training the nodes of each tree, the features used are drawn at random, without replacement, from the full feature set in a fixed proportion. Assuming the total number of features is $M$, the number of features drawn is commonly $\sqrt{M}$, $\frac{1}{2}\sqrt{M}$, or $2\sqrt{M}$.
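As a minimal sketch of these two sampling steps (rows drawn with replacement, features drawn without replacement), assuming toy NumPy arrays X and y and the $\sqrt{M}$ choice:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # toy data: 100 samples, M = 16 features
y = rng.integers(0, 2, size=100)
n, M = X.shape

# Bootstrap: draw n row indices WITH replacement for one tree's training set
boot_idx = rng.choice(n, size=n, replace=True)
X_boot, y_boot = X[boot_idx], y[boot_idx]

# At each node, draw f = round(sqrt(M)) feature indices WITHOUT replacement
f = int(round(np.sqrt(M)))
feat_idx = rng.choice(M, size=f, replace=False)
print("features considered at this node:", feat_idx)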

Training process

The training process of random forest can be summarized as follows:

(1) Given the training set S, the test set T, and the feature dimension F, determine the parameters: the number of CARTs t, the maximum depth d of each tree, the number of features f used at each node, and the termination conditions: the minimum number of samples s on a node and the minimum information gain m on a node.

For each tree i = 1, ..., t:

(2) Draw a training set S(i) of the same size as S by sampling with replacement from S, use it as the samples of the root node, and start training from the root node.

(3) If the current node reaches a termination condition, mark it as a leaf node. For a classification problem, the leaf's predicted output is the class c(j) with the most samples in the current node's sample set, and its probability p is the proportion of c(j) in the current sample set; for a regression problem, the predicted output is the average of the sample values at the current node. Then continue training the other nodes. If the current node does not reach a termination condition, randomly select f features without replacement from the F-dimensional feature set. Using these f features, find the single feature k with the best classification effect and its threshold th: samples at the current node whose value on feature k is less than th go to the left child node, and the rest go to the right child node. Continue training the other nodes. The criterion for judging the classification effect is discussed later.

(4) Repeat (2) (3) until all nodes are trained or marked as leaf nodes.

(5) Repeat (2), (3), (4) until all CARTs have been trained.
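A compact sketch of this training loop in Python follows. It is an illustration only, under simplifying assumptions (classification only, the Gini criterion introduced later, no minimum-gain check); names such as grow_tree and train_forest are invented for the example:

import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    # Gini(p) = 1 - sum_k p_k^2 over the classes present at the node
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def grow_tree(X, y, f, max_depth, min_samples, depth=0):
    # Termination conditions: depth limit, too few samples, or a pure node
    if depth >= max_depth or len(y) <= min_samples or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "pred": values[np.argmax(counts)]}
    feats = rng.choice(X.shape[1], size=f, replace=False)   # f features, without replacement
    parent = gini(y)
    best = None
    for k in feats:
        for th in np.unique(X[:, k]):
            left = X[:, k] < th
            if not left.any() or left.all():
                continue
            # Split criterion as described in the text: Gini(node) - Gini(left) - Gini(right)
            gain = parent - gini(y[left]) - gini(y[~left])
            if best is None or gain > best[0]:
                best = (gain, k, th, left)
    if best is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "pred": values[np.argmax(counts)]}
    _, k, th, left = best
    return {"leaf": False, "k": k, "th": th,
            "left": grow_tree(X[left], y[left], f, max_depth, min_samples, depth + 1),
            "right": grow_tree(X[~left], y[~left], f, max_depth, min_samples, depth + 1)}

def train_forest(X, y, t=10, f=None, max_depth=5, min_samples=2):
    n, M = X.shape
    f = f or int(round(np.sqrt(M)))
    forest = []
    for _ in range(t):
        idx = rng.choice(n, size=n, replace=True)            # bootstrap sample S(i)
        forest.append(grow_tree(X[idx], y[idx], f, max_depth, min_samples))
    return forest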

Prediction process

The prediction process is as follows:

For each tree i = 1, ..., t:

(1) Starting from the root node of the current tree, compare against the threshold th of the current node to decide whether to enter the left child node (< th) or the right child node (>= th), until a leaf node is reached, then output its predicted value.

(2) Repeat (1) until all t trees have output predicted values. For a classification problem, the final output is the class with the largest sum of predicted probabilities across all trees, that is, the accumulated p of each class c(j); for a regression problem, the output is the average of the outputs of all trees.
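Continuing the earlier sketch (with its invented tree dictionaries), aggregation across trees could look as follows; since the sketch's leaves store only a class label, class votes are summed, which matches summing probabilities when each leaf's probability mass sits on a single class:

import numpy as np
from collections import Counter

def predict_tree(node, x):
    # Walk down one tree: go left if x[k] < th, otherwise right, until a leaf
    while not node["leaf"]:
        node = node["left"] if x[node["k"]] < node["th"] else node["right"]
    return node["pred"]

def predict_forest_classify(forest, x):
    # Classification: accumulate the per-tree predictions and take the largest total
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]

def predict_forest_regress(forest, x):
    # Regression: average the outputs of all trees
    return np.mean([predict_tree(tree, x) for tree in forest])

For example, predict_forest_classify(train_forest(X, y), X[0]) would return the ensemble's vote for the first sample of the toy data above.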

Regarding the evaluation criterion for the classification effect: because CART is used, the evaluation criterion of CART is used as well, which differs from those of ID3 and C4.5.

For classification problems (assigning a sample to one of several classes), that is, problems with a discrete target, CART uses the Gini value as the criterion, defined as $Gini(p) = 1 - \sum_{k=1}^{K} p_k^2$, where $p_k$ is the proportion of class k in the sample set at the current node.

For example, suppose there are two classes and 100 samples at the current node, 70 belonging to the first class and 30 to the second. Then $Gini = 1 - 0.7 \times 0.7 - 0.3 \times 0.3 = 0.42$. It can be seen that the more evenly the classes are distributed, the larger the Gini value, and the more uneven the class distribution, the smaller the Gini value. When searching for the best split feature and threshold, the criterion is $\arg\max(Gini - Gini_{Left} - Gini_{Right})$, that is, find the feature f and threshold th that maximize the Gini value of the current node minus the Gini values of the left and right child nodes.
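A short check of this numeric example and of the split criterion as stated in the text (the helper names are made up):

import numpy as np

def gini(y):
    # Gini(p) = 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# 70 samples of class 0 and 30 of class 1  ->  1 - 0.7^2 - 0.3^2 = 0.42
y = np.array([0] * 70 + [1] * 30)
print(gini(y))  # 0.42

def gini_gain(y, y_left, y_right):
    # Split criterion from the text: Gini(node) - Gini(left) - Gini(right)
    return gini(y) - gini(y_left) - gini(y_right)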

For regression problems, it is simpler: the criterion is $\arg\max(Var - Var_{Left} - Var_{Right})$, that is, find the split that maximizes the variance Var of the training samples at the current node minus the variance $Var_{Left}$ of the left child node and the variance $Var_{Right}$ of the right child node.
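The corresponding regression criterion, again as a sketch with invented names and toy values:

import numpy as np

def variance_gain(y, y_left, y_right):
    # Split criterion from the text: Var(node) - Var(left) - Var(right)
    return np.var(y) - np.var(y_left) - np.var(y_right)

# Separating the two clusters of target values yields a large variance reduction
y = np.array([1.0, 1.2, 5.0, 5.3])
print(variance_gain(y, y[:2], y[2:]))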

Feature importance measure

When calculating the importance of a feature X, the specific steps are as follows:

  1. For each decision tree, select the corresponding out-of-bag (OOB) data and use it to calculate the out-of-bag data error, denoted errOOB1.

    The so-called out-of-bag data means that each time a decision tree is built, its training data is obtained by sampling with replacement; as a result, about 1/3 of the data is never drawn and does not participate in building that tree. This part of the data can be used to evaluate the performance of the tree and to calculate the model's prediction error rate, which is called the out-of-bag data error.

    This has been shown to be an unbiased estimate, so the random forest algorithm needs neither cross-validation nor a separate test set to obtain an unbiased estimate of the test error.

  2. Randomly add noise to feature X for all samples in the out-of-bag data (that is, randomly perturb each sample's value of feature X), and calculate the out-of-bag data error again, denoted errOOB2.

  3. Assuming there are N trees in the forest, the importance of feature X $= \sum(errOOB2 - errOOB1) / N$. This value reflects the feature's importance because, if the out-of-bag accuracy drops significantly after random noise is added (that is, errOOB2 increases noticeably), the feature has a large impact on the prediction of the samples, which in turn indicates that it is relatively important (see the sketch below).
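A sketch of this permutation-based importance using standard sklearn pieces; the bootstrap bookkeeping, the use of the misclassification rate as the error, and all variable names are assumptions for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n, M = X.shape
N = 50  # number of trees in the forest

trees, oob_sets = [], []
for _ in range(N):
    idx = rng.choice(n, size=n, replace=True)              # bootstrap sample for this tree
    oob = np.setdiff1d(np.arange(n), idx)                  # samples left out of the bag
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0).fit(X[idx], y[idx])
    trees.append(tree)
    oob_sets.append(oob)

def oob_error(tree, oob, data):
    # Misclassification rate of one tree on its out-of-bag samples
    return np.mean(tree.predict(data[oob]) != y[oob])

importance = np.zeros(M)
for tree, oob in zip(trees, oob_sets):
    err1 = oob_error(tree, oob, X)                         # errOOB1
    for j in range(M):
        X_perm = X.copy()
        X_perm[oob, j] = rng.permutation(X_perm[oob, j])   # add noise: shuffle feature j
        importance[j] += oob_error(tree, oob, X_perm) - err1   # errOOB2 - errOOB1

importance /= N                                            # sum(errOOB2 - errOOB1) / N
print(importance)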

Feature selection

On the basis of feature importance, the steps of feature selection are as follows:

  1. Calculate the importance of each feature and sort them in descending order.
  2. Determine the proportion of features to eliminate, remove that proportion of the least important features according to the importance ranking, and obtain a new feature set.
  3. Repeat the above process with the new feature set until m features remain (m is a value set in advance).
  4. Among the feature sets obtained in the above process, together with the out-of-bag error rate corresponding to each one, select the feature set with the lowest out-of-bag error rate (a sketch of this procedure follows below).
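A rough sketch of this elimination loop, using sklearn's built-in impurity-based feature_importances_ and the forest's out-of-bag score as stand-ins; the dataset, the elimination ratio, and m are assumed values:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
features = np.arange(X.shape[1])
ratio, m = 0.2, 5                     # eliminate 20% per round, stop at m features (assumed)
history = []

while len(features) >= m:
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X[:, features], y)
    history.append((features.copy(), 1.0 - rf.oob_score_))   # record out-of-bag error rate
    if len(features) == m:
        break
    order = np.argsort(rf.feature_importances_)[::-1]        # importance, descending
    keep = max(m, int(np.ceil(len(features) * (1 - ratio)))) # drop the least important fraction
    features = features[order[:keep]]

best_features, best_err = min(history, key=lambda t: t[1])   # lowest out-of-bag error rate
print(best_err, best_features)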

Advantages

  • Performs well on many datasets
  • On many current datasets it has a clear advantage over other algorithms
  • It can handle very high-dimensional data (many features) without feature selection
  • After training, it can report which features are more important
  • When building a random forest, an unbiased estimate of the generalization error is obtained
  • Fast to train
  • Interactions between features can be detected during training
  • Easy to parallelize
  • Relatively simple to implement

Code

A simple example of using the random forest algorithm in sklearn:

# Import libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumes you have X (predictors) and y (target) for the training set and x_test for the test set;
# the iris dataset is used here as a stand-in
X, x_test, y, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)

# Create the random forest object
model = RandomForestClassifier(n_estimators=100)

# Train the model using the training set
model.fit(X, y)

# Predict the output
predicted = model.predict(x_test)
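As a follow-up that ties this to the feature-importance section, the fitted model exposes per-feature importance scores (note that sklearn computes them from impurity decrease rather than the OOB permutation method described above):

# Per-feature importance scores from the fitted forest (impurity-based)
print(model.feature_importances_)

# Accuracy on the held-out test samples
print(model.score(x_test, y_test))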

In addition, the random forest algorithm is also implemented in OpenCV; for specific usage examples, see the separate RandomForest summary.
