(IV) Implementation and interpretation of random forests in Python

Author: chen_h
WeChat & QQ: 862251340
WeChat public account: coderpai


(I) Getting started with ensemble learning in machine learning

(II) The Bagging method

(III) The random forest algorithm with Python

(IV) Implementation and interpretation of random forests in Python


Using and understanding random forests by building them up from a single decision tree

Fortunately, with libraries such as Scikit-Learn, it is now easy to implement hundreds of machine learning algorithms in Python. It is so easy that we often do not need any knowledge of how a model works in order to use it. While knowing every detail is not required, understanding how a machine learning model works is still very helpful. It lets us adjust the parameters when the model performs poorly and explain how the model makes its decisions, which is essential if we want to convince others to trust our model.

In this article, we will describe how to build and use a random forest in Python. Beyond just looking at the code, we will try to understand how the model works. Because a random forest is composed of many decision trees, we first need to understand how a single decision tree classifies a simple problem. Then we will use a random forest to solve a real-world data science problem.

Understanding decision trees

The building block of a random forest is the decision tree, a very intuitive model. We can think of a decision tree as a series of yes/no questions asked about our data that eventually lead to a predicted class. This is an interpretable model because it classifies much the way we do: before making a decision (in an ideal world), we ask a series of questions about the available data, that is, about a series of features.

In the CART algorithm, a decision tree is built by determining the questions (called node splits), and the split at each node is chosen using the Gini impurity. We will discuss the Gini impurity in more detail later, but first let's build a decision tree so that we can understand it at a high level.

A decision tree on a simple problem

We'll start with a very simple binary classification problem, shown below:

[Figure: a toy dataset of six points with two features and two class labels]

Our data has only two features, x1 and x2, with six data samples divided into two labels, 0 and 1. Although this problem is very simple, it is not linearly separable, which means we cannot draw a single straight line through the data to separate the classes.

However, we can draw a series of lines that divide the data points into boxes, which we call nodes. In fact, this is what a decision tree does during training. Simply put, a decision tree is a nonlinear model built by constructing many linear boundaries.

To create the decision tree and train it on the data, we use sklearn:

from sklearn.tree import DecisionTreeClassifier

# Make a decision tree and train
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(X, y)
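
The snippet above assumes that X, y, and RSEED already exist. As a minimal, hypothetical setup (the coordinates are illustrative and not the exact points from the figure), it might look like:

import numpy as np

RSEED = 50  # fixed random seed for reproducibility (assumed value)

# Six samples, two features, two classes, laid out so no single straight line separates them
X = np.array([[0, 0], [1, 1], [2, 0],
              [0, 1], [1, 0], [2, 1]])
y = np.array([0, 0, 0, 1, 1, 1])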

During training, we give the model both the features and the labels so that it can learn to classify points based on the features. Then we can test the model on the training data. (Note: we have no separate test data here.)

print(f'Model Accuracy: {tree.score(X, y)}')

Model Accuracy: 1.0

We see that it achieves 100% accuracy, which is what we expect, because we gave it the training labels and did not limit the depth of the tree. It turns out that this ability to completely learn the training data can be a drawback of decision trees, because it may lead to overfitting, which we will discuss later.

Visualizing the tree

So what actually happens when we train a decision tree? A helpful way to understand a decision tree is to visualize it, which we can do with a scikit-learn function, as follows:
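
A minimal sketch of that visualization, assuming sklearn.tree.plot_tree (available in recent scikit-learn versions; the original figure may instead have been exported with graphviz):

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Draw the trained tree with its questions, Gini impurity, sample counts, and classes
plt.figure(figsize=(8, 6))
plot_tree(tree, feature_names=['x1', 'x2'], class_names=['0', '1'], filled=True)
plt.show()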

[Figure: the trained decision tree, showing the split question, gini, samples, value, and class of each node]

Except for the leaf nodes, every node has five parts:

  • A question asked about a feature: each split node gives a true/false answer to its question, and based on the answer the data point moves down the tree;
  • gini: the Gini impurity of the node. The weighted average Gini impurity decreases as we move down the tree;
  • samples: the number of observations in the node;
  • value: the number of samples in each class. For example, the top node has 2 samples in class 0 and 4 samples in class 1;
  • class: the majority class of the points in the node. For a leaf node, this is the prediction for all samples in the node;

The leaf nodes should be easy to understand, since that is where the final predictions are made. To classify a new data point, we simply move it down the tree, using its features to answer the questions, until it reaches a leaf node, whose class is the prediction.

To see the tree in a different way, we can draw the splits built by the decision tree on the original data.

[Figure: the decision tree's splits drawn as lines on the original data]

Each line is a split that divides the data points into nodes according to their feature values. Because we did not limit the maximum depth for this simple problem, the partition places each point in a node containing only points of the same class. Later we will see that such a perfect division of the training data may not be what we want, because it may lead to overfitting.

Gini impurity

In this section, we will dig into the concept of Gini impurity. The Gini impurity of a node is the probability that a randomly chosen sample from the node would be labeled incorrectly if it were labeled according to the distribution of samples in the node. For example, the Gini impurity of the top (root) node is 44.4%. We arrive at this value using the following equation:

I_G(n) = 1 − Σᵢ pᵢ²,  where pᵢ is the fraction of samples of class i in node n and the sum runs over all C classes.

The Gini impurity of a node n is 1 minus the sum, over all classes, of the squared fraction of samples in each class. This may sound a bit confusing, so let's work through the root node as an example:

I_G(root) = 1 − ((2/6)² + (4/6)²) = 1 − (1/9 + 4/9) = 4/9 ≈ 0.444

At each node, the tree splits on the feature and threshold that minimize the Gini impurity of the resulting child nodes.
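
As a quick sanity check, here is a minimal sketch (a hypothetical helper, not code from the original article) that computes the Gini impurity from per-class sample counts:

def gini_impurity(class_counts):
    # 1 minus the sum of the squared class fractions
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

# Root node of the toy tree: 2 samples in class 0, 4 samples in class 1
print(gini_impurity([2, 4]))  # ~0.444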

The splitting process is then repeated in a greedy, recursive manner until the maximum depth is reached or every node contains samples from only one class. The total weighted Gini impurity of the tree must decrease at each level. At the second level of the tree, the total weighted Gini impurity is 0.333:

[Figure: the weighted Gini impurity calculation for the second level of the tree]

Finally, the total weighted Gini impurity of the last level is 0, meaning every node is pure: a point randomly selected from a node cannot be misclassified by that node. While this may seem very good, it means the model may be overfitting, because the nodes were built using only the training data.

Overfitting: why a forest is better than one tree

You might ask why we don't just use a single decision tree. It seems like the perfect classifier, since it made no mistakes! Not a single data point was misclassified. But the key point to remember is that the tree made no mistakes on the training data. The goal of a machine learning model is to generalize well to new data it has never seen before.

Overfitting occurs when we have a very flexible model (a model with high capacity) that essentially memorizes the training data by fitting it too closely. The problem is that the model learns not only the actual relationships in the training data, but also any noise that is present. A flexible model has high variance, because the learned parameters (such as the structure of the decision tree) will change significantly with different training data.

On the other hand, an inflexible model is said to have high bias, because it makes assumptions about the training data (it is biased toward preconceived ideas about the data). For example, a linear classifier assumes the data are linear and lacks the flexibility to fit nonlinear relationships. An inflexible model may not even be able to fit the training data. In both cases, high variance and high bias, the model fails to generalize well to new data.

When we do not limit the maximum depth, a decision tree is very prone to overfitting, because it has unlimited flexibility: it can keep growing until each leaf node contains only one class, classifying every training point perfectly. If we go back to the decision tree image and limit the maximum depth to 2 (splitting only once), the classification is no longer 100% accurate. We have reduced the variance of the decision tree, but at the cost of increased bias.
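
A minimal sketch of that depth-limited tree, reusing the assumed X, y, and RSEED from earlier:

# Limit the maximum depth to trade a little bias for lower variance
short_tree = DecisionTreeClassifier(max_depth=2, random_state=RSEED)
short_tree.fit(X, y)
print(f'Limited-depth model accuracy: {short_tree.score(X, y)}')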

As an alternative to limiting the depth of the tree, which reduces variance (good) but increases bias (bad), we can combine many decision trees into a single ensemble model known as the random forest.

Random Forests

A random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could just call a "forest"), this model uses two key concepts that give it the name random:

  • Random sampling of training data points when building each tree
  • Random subsets of features considered when splitting nodes

Random sampling of training data

When training, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, a technique known as bootstrapping, which means some samples are used multiple times in a single tree. The idea is that by training each tree on a different sample, each individual tree may vary a lot with respect to a particular training set, but overall the variance of the entire forest is lower, without paying the price of increased bias.

At test time, predictions are made by averaging the predictions of each decision tree. This procedure of training each individual learner on a different bootstrapped subset of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.
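
A minimal sketch of drawing a single bootstrap sample with NumPy (illustrative only; this is not how scikit-learn implements it internally):

import numpy as np

rng = np.random.default_rng(50)  # assumed seed for the illustration

# Draw len(X) indices with replacement: some points appear several times, others not at all
boot_idx = rng.choice(len(X), size=len(X), replace=True)
X_boot, y_boot = X[boot_idx], y[boot_idx]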

Random subsets of features for splitting nodes

The other main concept of the random forest is that each tree considers only a random subset of all the features when splitting each node. Typically this is set to sqrt(n_features) for classification, which means that if there are 16 features, each tree considers only 4 randomly chosen features at each split. (A random forest can also consider all features at every node, which is common for regression. These options can be controlled in the Scikit-Learn random forest implementation.)

If you understand a single decision tree, the idea of bagging, and random subsets of features, then you have a good understanding of how a random forest works:

A random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of observations, and splits the nodes in each tree considering only a limited number of the features. The final prediction of the random forest is the average of the predictions of the individual trees.

To understand why a random forest is better than a single decision tree, imagine the following scenario: you have to decide whether Tesla stock will go up, and you have access to a dozen analysts who have no prior knowledge of the company. Each analyst has low bias, because they come in without assumptions and are free to learn from a dataset of news reports.

This might seem like an ideal situation, but the problem is that the reports are likely to contain noise in addition to real signals. Because the analysts base their predictions entirely on the data (they have high flexibility), they can be swayed by irrelevant information. The analysts might come up with very different predictions from the same dataset. Moreover, if given a different training set of reports, each individual analyst would come up with drastically different predictions.

The solution is not to rely on any one person, but to pool the votes of all the analysts. Furthermore, as in a random forest, we can give each analyst access to only a section of the reports, hoping that the effects of the noisy information will be cancelled out by the sampling. In real life we rely on multiple sources (never trust a solitary Amazon review), so not only is a decision tree intuitive, but so is the idea of combining trees in a random forest.

Random Forest in practice

Next, we will build a random forest in Python using Scikit-Learn. Instead of a toy problem, we will use a real-world dataset split into a training set and a test set. We use the test set to estimate how the model will perform on new data, which also lets us determine how much the model is overfitting.

data set

The problem we want to solve is a binary classification task where the goal is to predict an individual's health. The features are socioeconomic and lifestyle characteristics of individuals, and the label is 0 for poor health and 1 for good health. The dataset was collected by the Centers for Disease Control and Prevention and is available here ( https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system).

[Figure: a sample of the health dataset]

Typically, 80% of a data science project is spent cleaning, exploring, and making features out of the data. However, for this article we will go straight to modeling. This is an imbalanced classification problem, so accuracy is not an appropriate metric. Instead we use the ROC AUC, a measure from 0 (worst) to 1 (best), where random guessing scores 0.5. We can also plot the ROC curve to evaluate the model.
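
The modeling code below assumes the data has already been split into train, test, train_labels, and test_labels. A minimal, hypothetical sketch of that step (the file name and label column are placeholders, not the actual names in the dataset):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('health_data.csv')   # placeholder file name
labels = df.pop('label')              # placeholder label column (0 = poor health, 1 = good health)

train, test, train_labels, test_labels = train_test_split(
    df, labels, stratify=labels, test_size=0.3, random_state=50)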

Now let's look at the random forest code:

from sklearn.ensemble import RandomForestClassifier

# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100,
                               bootstrap=True,
                               max_features='sqrt')
# Fit on training data
model.fit(train, train_labels)

After a few minutes of training, the model is ready to make predictions on the test data, as follows:

# Actual class predictions
rf_predictions = model.predict(test)
# Probabilities for each class
rf_probs = model.predict_proba(test)[:, 1]

We compute both the class predictions (predict) and the predicted probabilities (predict_proba); the probabilities are what we need for the ROC AUC. Once we have the test predictions, we can calculate the ROC AUC.

from sklearn.metrics import roc_auc_score

# Calculate roc auc
roc_value = roc_auc_score(test_labels, rf_probs)

Results

The final test ROC AUC for the random forest was 0.87, compared with 0.67 for a single decision tree with unlimited maximum depth. If we look at the training scores, both models achieved a ROC AUC of 1.0, which is expected, because we gave these models the training labels and did not limit the maximum depth of each tree.

Although the random forest overfits (it does better on the training data than on the test data), it generalizes much better to the test data than the single tree. The random forest has lower variance (good) while keeping the same low bias as a decision tree (also good).

We can also plot the ROC curves for the single decision tree (top) and the random forest (bottom). A curve closer to the top left indicates a better model:

[Figure: ROC curve for the single decision tree]

The figure above is the ROC curve for the decision tree.

[Figure: ROC curve for the random forest]

The figure above is the ROC curve for the random forest.
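
Curves like these can be produced with roc_curve from sklearn.metrics; a minimal sketch for the random forest (matplotlib assumed):

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# False positive rate and true positive rate at each probability threshold
fpr, tpr, _ = roc_curve(test_labels, rf_probs)

plt.plot(fpr, tpr, label='Random Forest')
plt.plot([0, 1], [0, 1], 'k--', label='Random guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()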

Another way to evaluate the model is the confusion matrix:

[Figure: confusion matrix for the random forest]

This shows the predictions the model got right in the upper left and lower right corners, and the predictions it got wrong in the lower left and upper right corners. We can use plots like these to diagnose our model and decide whether it is doing well enough to put into production.
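
A minimal sketch for computing the matrix itself with scikit-learn (the plotting is left out here):

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(test_labels, rf_predictions)
print(cm)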

The importance of features

The importance of a feature in a random forest is the sum of the reduction in Gini impurity over all the nodes that split on that feature. We can use the feature importances to figure out which features the random forest considers most predictive. The importances can be extracted from a trained random forest and put into a Pandas DataFrame, as follows:

import pandas as pd

# Extract feature importances
fi = pd.DataFrame({'feature': list(train.columns),
                   'importance': model.feature_importances_}).\
                    sort_values('importance', ascending = False)

# Display
fi.head()

    feature	   importance
    DIFFWALK	   0.036200
    QLACTLM2	   0.030694
    EMPLOY1	   0.024156
    DIFFALON	   0.022699
    USEEQUIP	   0.016922

The feature importances tell us which features discriminate most between the classes and give us insight into the problem. For example, DIFFWALK, an indicator of whether the patient has difficulty walking, is the most important feature here.

Building on the feature-importance analysis, we can drop the low-importance features and keep only the high-importance ones for learning.
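
For example, a minimal sketch of keeping only the most important features (the cutoff of 10 is an arbitrary choice for illustration):

# Keep the 10 most important features and drop the rest
top_features = fi['feature'].head(10)
train_reduced = train[top_features]
test_reduced = test[top_features]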

Visualizing a tree in the forest

Finally, we can visualize a single decision tree from the forest. This time we have to limit the depth of the tree that we draw, otherwise it would be too large to convert into an image. To make the figure below, I limited the maximum depth to 6. The tree is still too large for us to parse completely! However, given our earlier deep dive into a single decision tree, we understand how this model works.
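
A minimal sketch of drawing one tree from the trained forest with sklearn.tree.plot_tree and its max_depth drawing option (the original figure may have been produced with graphviz instead):

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Take the first tree in the forest and draw only its top 6 levels
single_tree = model.estimators_[0]
plt.figure(figsize=(12, 8))
plot_tree(single_tree, max_depth=6, feature_names=list(train.columns), filled=True)
plt.show()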

[Figure: a single decision tree from the random forest, drawn to a maximum depth of 6]



Source: blog.csdn.net/CoderPai/article/details/92018343