Random forest case analysis

Before reading about the random forest model, it is recommended to read the decision tree model manual first (click to jump to the help manual page of the decision tree model), because the random forest model is essentially a combination of multiple decision tree models: the decision tree model builds only one classification tree, while the random forest model builds many decision trees, which is equivalent to repeating the decision tree model many times. The random forest model is constructed from random samples, and at each tree node a random subset of features is considered for the split. Therefore, in a general sense, the random forest model performs better than the decision tree model (but not necessarily; in actual research, the data should decide).

As mentioned above, the random forest model is a combination of multiple decision tree models, but there is only one copy of the data itself. How are multiple decision tree models obtained? This relies on the principle of random sampling. For example, if there are 100 samples, then the first time, 100 samples are randomly drawn from them (with replacement) to obtain a new data set, and this new data set is used to build a decision tree model; the second time, 100 samples are randomly drawn again to obtain another new data set, which is again used to build a decision tree model. Repeating these steps yields multiple decision tree models; how many decision trees there should be is determined by a parameter setting. After obtaining multiple decision trees, the voting method (for example, if most of the trees point to category A, then the result is category A) or the mean probability method is used to produce the final classification result.
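
As a minimal sketch of this bootstrap-and-vote idea (assuming numpy and sklearn are available; the data here is randomly generated purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 samples, 4 features (illustrative)
y = rng.integers(0, 3, size=100)         # 3 classes: 0, 1, 2

trees = []
for _ in range(10):                      # 10 = number of decision trees (a parameter)
    idx = rng.integers(0, 100, size=100) # draw 100 row indices with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])               # shape (10, 100)
final = np.array([np.bincount(c).argmax() for c in votes.T])  # majority vote per sample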

In addition, the interpretation of the other parameters and indicators of random forest is basically consistent with that of the decision tree, because in substance the model is a combination of multiple decision tree models.


Random forest model case

1  Background

Random forest again uses the classic 'Iris classification data set' for the case demonstration, with a total of 150 samples, including 4 feature attributes (4 independent variables X), namely sepal length, sepal width, petal length, and petal width. The label item is the iris category, with 3 classes: Iris setosa, Iris versicolor, and Iris virginica (hereinafter referred to as categories A, B, and C).

2  Theory

The principle of the random forest model can be seen in the figure below.

For example, this case has 150 samples, of which 80%, i.e. 120 samples, are used to train the model, and 3 decision trees are trained (3 here is the parameter value for the number of decision trees, which can be set by yourself). The data for each of the 3 decision trees is obtained by random sampling, three decision trees are constructed, and a voting strategy is used to make the final decision. The whole of this is the random forest model. The specific relevant parameter information is as follows:
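
As an illustrative sketch of this structure (x_train and y_train stand for the 120 training samples; n_estimators=3 matches the description above):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=3, bootstrap=True, random_state=0)
forest.fit(x_train, y_train)

# The three fitted trees live in forest.estimators_; predict() aggregates
# their votes (sklearn averages the per-tree class probabilities).
for i, tree in enumerate(forest.estimators_):
    print(i, tree.get_depth(), tree.tree_.node_count)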

Among the above parameters, four parameter values, namely the node splitting criterion, minimum sample size for node splitting, minimum sample size for leaf nodes, and maximum tree depth, are completely consistent with the decision tree model. For details, please refer to the decision tree model manual (click to jump to the decision tree model's help manual page).

Limit on the maximum number of features: Random forest builds multiple decision trees, and each decision tree does not necessarily use all the features (i.e., independent variables X); this parameter limits how many features may be considered. It usually does not need to be set, and the system can determine it automatically.

Number of decision trees: The default value is 100, meaning 100 decision trees are built. This parameter can be set by yourself but usually does not need to be changed. The greater the number of decision trees, the more stable the model construction, but the longer the model takes to run.

Sampling with or without replacement: This concerns the principle of random sampling. For example, 100 samples are drawn out of 100, and sample number 5 is drawn the first time. Can number 5 be drawn again the second time? With replacement sampling it can; without replacement sampling it cannot. Under normal circumstances, sampling with replacement should be used, especially when the sample data set is small.

Out-of-bag data test: For example, when 100 samples are randomly drawn with replacement, some samples are drawn repeatedly while others may never be drawn at all. The latter are called 'out-of-bag data', and this part of the data can be used for testing the model. When this parameter is not selected, the model is not tested on out-of-bag data.
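
A sketch of how this looks in sklearn, where oob_score=True (which requires bootstrap=True) reports accuracy on exactly these never-drawn samples (x_train and y_train are assumed from the training step):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True, random_state=0)
forest.fit(x_train, y_train)
print(forest.oob_score_)   # accuracy measured only on samples each tree never drew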

When using random forests, you can usually set the parameters 'minimum sample size for node splitting', 'minimum sample size for leaf nodes', and 'maximum tree depth'. These three parameter values are completely consistent with the decision tree model; you can refer to the decision tree model manual (click to jump to the help manual page of the decision tree model).

3  Operations

This example operates as follows:

The training set proportion defaults to 0.8, i.e. 80% (150*0.8=120 samples) for training the random forest model, and the remaining 20%, i.e. 30 samples (test data), are used for model verification. It should be noted that in most cases the data is standardized first, generally using z-score (normal) standardization. The purpose of this processing is to keep the data on a consistent scale. Of course, other scaling methods can also be used, such as interval scaling or normalization.
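
A minimal sketch of this preprocessing step with sklearn (the 80/20 split and z-score standardization follow the description above; the iris data is loaded from sklearn for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                 # 150 samples, 4 features, 3 classes
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

scaler = StandardScaler().fit(x_train)            # z-score standardization fitted on the training set
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)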

Then set the parameters as follows:

The node splitting criterion defaults to the gini coefficient and does not need to be set. The maximum number of features is limited automatically by default. The number of decision trees defaults to 100 in SPSSAU and usually does not need to be set. Sampling with replacement and out-of-bag data testing are selected. If you want to tune parameters, set the parameters shared with the decision tree model, including 'minimum sample size for node splitting', 'minimum sample size for leaf nodes', and 'maximum tree depth', then compare the quality of the model under the different parameter settings and select the optimal model. In this case, the default values can be used for the time being.
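
One way to compare settings is a simple loop over candidate parameter values, checking the train/test f1-scores for each (a sketch, reusing x_train, x_test, y_train, y_test from the preprocessing step above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

for depth in (3, 5, 10, None):                    # candidate values for maximum tree depth
    m = RandomForestClassifier(max_depth=depth, random_state=0).fit(x_train, y_train)
    print(depth,
          f1_score(y_train, m.predict(x_train), average='macro'),
          f1_score(y_test, m.predict(x_test), average='macro'))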

4  SPSSAU output results

SPSSAU outputs a total of 6 results: a basic information summary, a feature model diagram and feature weight diagram, training set and test set model evaluation results, the test set confusion matrix, a model summary table, and the model code, as explained below:

Item | Description
Basic information summary | Data distribution of the dependent variable Y (label item), etc.
Feature model diagram and feature weight diagram | Used to analyze the importance of features
Training set or test set model evaluation results | Evaluates the model effect on training set and test set data; very important
Test set confusion matrix | Further evaluation of the test set data; very important
Model summary table | Summary table of model parameters and evaluation
Model code | Core Python code for model construction

In the above table, the basic information summary only shows the classification distribution of the dependent variable Y (label item). The feature model diagram and the feature weight diagram can be used to compare the relative importance of the features. The model evaluation results (covering both training set and test set) are very important for judging the fitting effect of the model, especially the fitting effect on the test set; therefore, SPSSAU provides a separate confusion matrix of the test set results for a further look at the test data. The model summary table gathers the various parameter values, and the core code for building the random forest model is given in the final appendix.

5  Text Analysis

In the comparison of feature weight importance, X3 and X4 carry relatively higher weights while the other features contribute less (such imbalances are more normal when the number of features is small).

Next, the most important model fitting conditions are explained, as shown in the following table:

The above table provides four evaluation indicators for the training set and the test set respectively, namely precision, recall, f1-score and accuracy, together with averaged indicators and sample size indicators, as explained in the following table:

For the specific interpretation of the above indicators, please refer to the decision tree model help manual. Usually, the f1-score value is used for evaluation. Overall, the f1-score of the training data is 0.95, which is very high, and the combined f1-score of the test data is 0.94, which is also very high (note: the value was 0.906 for the decision tree model), meaning that the random forest brings relatively better prediction results.
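
For reference, sklearn's classification_report produces the same style of indicator table (a sketch; model, x_train, x_test, y_train, y_test are assumed from the steps above):

from sklearn.metrics import classification_report

print(classification_report(y_train, model.predict(x_train)))   # training set indicators
print(classification_report(y_test, model.predict(x_test)))     # test set indicators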

Next you can view the 'confusion matrix' of the test data, which is the cross-tabulation of model predictions against actual values, as shown below:
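
The same cross-tabulation can be computed with sklearn (a sketch; model, x_test and y_test as above):

from sklearn.metrics import confusion_matrix

# Rows = true classes, columns = predicted classes; the diagonal counts correct predictions.
print(confusion_matrix(y_test, model.predict(x_test)))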

In the confusion matrix, the larger the values on the diagonal (from top left to bottom right) the better, since these represent samples whose predicted value agrees exactly with the true value. In the figure above, only 2 samples of category B are misjudged as category C and everything else is correct, which means this random forest model performs well on the test data. Finally, SPSSAU outputs the model parameter information values, as shown in the following table:

The above parameter information is only a summary re-output and has no other purpose. Finally, SPSSAU outputs the core code used to build this random forest model with the sklearn package in Python, as follows:

from sklearn.ensemble import RandomForestClassifier

# criterion and max_features are strings ('auto' is the historical default; newer sklearn uses 'sqrt')
model = RandomForestClassifier(criterion='gini', max_depth=40, min_samples_leaf=10, min_samples_split=2, n_estimators=100, bootstrap=True, oob_score=True, max_features='auto')

model.fit(x_train, y_train)

6  Analysis

The following key points are involved:

  • Is standardization required when using a random forest model?
    The general recommendation is to standardize, usually using z-score (normal) standardization.
  • What should be the proportion of the training set?
    If the amount of data is large, such as 10,000 samples, the training set proportion can be higher, such as 0.9. If the amount of data is small, the training set proportion should be lower so that more data is reserved for testing.
  • Saving predicted values
    When saving predicted values, SPSSAU will generate a new title (column) to store the category predicted by the model. The meaning of its numbering is consistent with the numbering of the label item (dependent variable Y) in the model.
  • How to set parameters?
    If you want to tune parameters, it is recommended to increase the minimum sample size for node splitting, increase the minimum sample size for leaf nodes, and consider setting the maximum tree depth to a relatively small value. After setting, compare the training and test fitting effects respectively, adjust the parameters, and find the relatively optimal model. It is also recommended to ensure that the f1-score of both the training set and test set data is above 0.9.

  • When SPSSAU builds a random forest model, how is categorical data included in the independent variables X handled?
    Categorical independent variables should first be converted into dummy variables and then put into the model (see the dummy-coding sketch after this list). You can click to view dummy variables.
    http://spssau.com/front/spssau/helps/otherdocuments/dummy.html
  • What is the criterion for judging whether a random forest model is acceptable in SPSSAU?
    In machine learning models, training data is usually used to train the model first, and then test data is used to check the model's effect. The usual criterion is that both the training model and the test model fit well. Machine learning models are prone to 'overfitting', i.e. deceptively good results, so it is necessary to focus on the fitting effect on the test data. For a single model, the parameters can be varied and optimized; at the same time, multiple machine learning models can be used, such as decision trees, support vector machines (SVM), and neural networks, to compare comprehensively and select the optimal model.
  • More references on random forests?
    More information about random forests can be viewed through the sklearn official manual, click to view.
    https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
  • Does SPSSAU prompt abnormal data quality when running a random forest model?

The current random forest model supports classification tasks; you need to ensure that the label item (dependent variable Y) is categorical data. If it is quantitative continuous data, or the sample size is too small (note: non-members only analyze the first 100 samples), the model may fail to compute, and SPSSAU will prompt abnormal data quality.
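
As mentioned in the list above, a sketch of dummy-coding a categorical independent variable with pandas (the column name 'color' and the data are purely illustrative):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'], 'x1': [1.2, 0.7, 3.1, 2.4]})
# 'color' becomes three 0/1 columns: color_blue, color_green, color_red
X = pd.concat([df[['x1']], pd.get_dummies(df['color'], prefix='color')], axis=1)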

Origin blog.csdn.net/m0_37228052/article/details/132986169