Machine Learning Series (7): Decision Tree and Random Forest Concepts

Note: This post follows the Bilibili video "Classic Machine Learning Algorithms (2): Decision Trees and Random Forests".

There are three algorithms for decision trees:
[Figure: the three decision tree algorithms: ID3, C4.5, and CART]

1. Entropy and Gini Coefficient

Entropy: a measure of how disordered (impure) a set is.

  1. If a set contains many different categories of items mixed together, it is more disordered and its entropy is higher.
  2. If a set contains only a few categories, it is less disordered and its entropy is lower.

[Figures: the entropy and Gini coefficient formulas]
The Gini coefficient and entropy have different formulas, but they express the same thing: how impure the current set is.

  • The larger the entropy and Gini coefficient are, the worse the current classification effect is.
  • The smaller the entropy and Gini coefficient are, the better the current classification effect is.
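
For reference, the standard definitions (not shown here because the original figures are missing) for a set whose classes occur with probabilities p_i are:

$$H = -\sum_i p_i \log_2 p_i \qquad \text{Gini} = 1 - \sum_i p_i^2$$

Both are 0 for a perfectly pure set (a single class) and are largest when all classes are equally likely.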

2. Decision tree construction example

The basic idea when constructing the tree is that the entropy of the nodes should decrease rapidly as the depth of the tree increases. The faster the entropy decreases, the better, because then we can hope to obtain a decision tree with the smallest height.

The faster the entropy decreases, the better: for example, the tree can be built in three splits instead of five, so that the decision tree has fewer levels.

Take playing football as an example. The initial entropy here is computed simply from the probability of having played or not in the past, without considering other factors such as the weather:
[Figures: the play-football example data and the step-by-step entropy calculations]
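
As a concrete illustration, assuming the classic 14-day weather dataset that this example usually uses (9 days with "play = yes" and 5 days with "play = no"; an assumption, since the original figures are not shown), the entropy before any split is:

$$H = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$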

3. Information gain (ID3 algorithm)

ID3 Algorithm (Information Gain): A more traditional algorithm that uses information gain to build decision trees.

Information gain is the original entropy minus the entropy after splitting on a chosen feature (for example, the feature used as the root node).

In this example it is written as gain(outlook):

[Figure: the information gain calculation for outlook]
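
Written out in the standard ID3 notation (a reconstruction from the description above, not the original figure), the gain of splitting a set S on a feature A whose values v partition S into subsets S_v is:

$$\text{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)$$
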
The larger the information gain, the simpler the resulting decision tree. Information gain here is measured by how much the information entropy changes.

Therefore, the information gain of each of the four features is computed, and the feature with the largest information gain becomes the root node.

The remaining nodes are handled in the same way, like a recursive operation: at each node, the feature with the largest information gain on the remaining data is selected for the split.
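
A minimal Python sketch of this selection step (the toy DataFrame and its column names are hypothetical, only there to make the example runnable):

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label column."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    """Base entropy minus the weighted entropy of the subsets split on `feature`."""
    base = entropy(df[target])
    weighted = sum(
        len(sub) / len(df) * entropy(sub[target])
        for _, sub in df.groupby(feature)
    )
    return base - weighted

# Hypothetical toy data in the spirit of the play-football example.
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "windy":   [False,   True,    False,      False,   True,    True],
    "play":    ["no",    "no",    "yes",      "yes",   "no",    "yes"],
})

features = ["outlook", "windy"]
gains = {f: information_gain(df, f, "play") for f in features}
root = max(gains, key=gains.get)   # the feature with the largest gain becomes the root
print(gains, root)
```

The same selection is then repeated recursively on each child subset, which mirrors the recursion described above.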

4. Information gain ratio (C4.5 algorithm)

Using information gain alone to build the decision tree is often unreliable. If a feature has many distinct values but each value corresponds to very few samples, the information gain will be very large, yet the split does not give us the effect we want.

For example, if the sample ID is treated as a feature, every sample has a different ID, so each branch contains only that one sample. The purity is maximal, the entropy is 0, and the information gain is the largest. But this puts every sample into its own branch, which is not what we expect.

Hence the information gain ratio is introduced.

Information gain ratio = information gain / the feature's own entropy value
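
In the usual C4.5 formulation, the feature's "own entropy value" is its split information, so:

$$\text{GainRatio}(S, A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(S, A)}, \qquad \text{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$

A feature like the sample ID has a very large split information, which pulls its gain ratio back down.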

To measure how good the final decision tree is, an evaluation function can be used:
[Figure: the evaluation function summed over the leaf nodes]
Here H(t) denotes the Gini coefficient or entropy computed at each leaf node. The ultimate goal is for every leaf node to be as pure as possible, i.e. for H(t) to be small.

Therefore, the smaller the evaluation function, the better.
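
A common form of this evaluation function (a reconstruction from the description above, since the figure is not shown) weights each leaf's impurity by the number of samples N_t that reach leaf t:

$$C(T) = \sum_{t \in \text{leaves}} N_t \cdot H(t)$$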

The C4.5 algorithm is an extension of the ID3 algorithm that can handle continuous attributes. A continuous attribute is first discretized: its values are divided into intervals by comparing the gain at each candidate split point.
Handling of missing data: when building the decision tree, missing data can simply be ignored, i.e. only the records that have a value for the attribute are used when computing the gains.


5. Binary split values

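
A minimal Python sketch, in the spirit of the C4.5 discretization described above, of how a binary split value can be chosen for a continuous attribute (the function names and the toy data are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def best_binary_threshold(values, labels):
    """Try the midpoints between consecutive sorted values and keep the
    threshold whose binary split yields the largest information gain."""
    df = pd.DataFrame({"x": values, "y": labels}).sort_values("x")
    base = entropy(df["y"])
    candidates = np.unique((df["x"].values[:-1] + df["x"].values[1:]) / 2)
    best_gain, best_thr = -1.0, None
    for thr in candidates:
        left, right = df[df["x"] <= thr], df[df["x"] > thr]
        gain = base - len(left) / len(df) * entropy(left["y"]) \
                    - len(right) / len(df) * entropy(right["y"])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# Example: humidity values with a play / don't-play label.
print(best_binary_threshold([65, 70, 80, 90, 95], ["yes", "yes", "yes", "no", "no"]))
```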

6. Decision tree pruning

That is, keep the decision tree shallower: fewer levels and a smaller height. If the decision tree grows very deep, it can eventually reach 100% purity on the training set, but on the test set it may make more errors because of overfitting, so the tree should have fewer levels and a smaller height.

  1. Pre-pruning: Stop early in the process of building a decision tree.

That is, specify the depth of the decision tree, e.g. 3; once the third level has been built, no further splitting is done.

Or specify a minimum sample count (min_samples), e.g. 50: when a node has fewer than 50 samples, it stops splitting.
Why pre-prune? To reduce the risk of overfitting.
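
A minimal scikit-learn sketch of these two pre-pruning settings (the depth of 3 and the 50-sample threshold are the illustrative values from the text; X_train and y_train are placeholders):

```python
from sklearn.tree import DecisionTreeClassifier

# Stop growing after 3 levels, and do not split nodes with fewer than 50 samples.
pre_pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=50)
# pre_pruned_tree.fit(X_train, y_train)  # X_train / y_train: your own training data
```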

  2. Post-pruning: pruning begins after the decision tree has been fully constructed.
    [Figure: the post-pruning cost function, which penalizes the number of leaf nodes; a sketch of this formula follows this list.] The more leaf nodes there are, the more fragmented the tree is, and the easier it is to overfit.
    In the end, the value of this penalized cost function should be made as small as possible.
    The value of a in the formula can be specified manually:
    if a is larger, the number of leaf nodes will be smaller;
    if a is smaller, the number of leaf nodes may be larger.
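
The penalized cost referred to above is usually written as follows (a reconstruction, with α standing for the manually chosen a and |T_leaf| for the number of leaf nodes):

$$C_\alpha(T) = C(T) + \alpha \cdot |T_{\text{leaf}}|$$

Minimizing it trades leaf purity against the number of leaves, which is why a larger α leads to fewer leaves.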

7. Random Forest

Forest: Multiple decision trees form a forest.

Random forest: build multiple decision trees, let each decision tree make its prediction independently, and take the mode (majority vote) of all the trees' predictions as the final result.

Random: because the samples may contain outliers, random selection is used. There are two levels of randomness (a usage sketch follows the list below):

  1. Randomness in data selection: a random proportion of samples is drawn from the original training set with replacement (so the same sample can appear more than once).

  2. Randomness of Feature Selection

  3. Bootstrapping: sampling with replacement.

  4. Bagging: draw n samples with replacement and build a classifier on them; repeating this yields the ensemble of classifiers.
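
A minimal scikit-learn sketch of this bagging-of-trees idea (the parameter values are illustrative, and X_train / y_train / X_test are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is trained on a bootstrap sample (drawn with replacement) using a random
# subset of features at each split; the final class is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
# forest.fit(X_train, y_train)
# forest.predict(X_test)
```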

Parameters of the decision tree (these are the constructor parameters of scikit-learn's DecisionTreeClassifier): they are mainly about how to pre-prune and post-prune the decision tree to prevent overfitting.

1、criterion: the feature selection criterion, either gini (Gini coefficient) or entropy (information gain). Usually gini is used, which corresponds to the CART algorithm; choosing entropy corresponds to ID3/C4.5.

2、splitter: how the split point is chosen, either best or random. The former searches all split points of a feature for the optimal one; the latter looks for a locally optimal split point among a randomly chosen subset of split points. Generally choose best when the sample size is not large; when the sample size is very large, random can be used.

3、max_depth: the maximum depth of the tree. If not set, the depth of the tree is not limited. Generally no limit is needed when there are few samples and few features, but when there are many samples or many features an upper limit can be set, typically 10~100.

4、min_samples_split: the minimum number of samples a node must have before it can be split. If the number of samples at a node is already below this value, no split point is searched for and the node becomes a leaf. The default is 2. If there are very many samples, a larger threshold can be set according to the business needs and the data volume.

5、min_samples_leaf: the minimum number of samples required at a leaf node. If a leaf falls below this threshold, it is pruned together with its sibling leaves under the same parent. This parameter helps prevent overfitting. You can give a specific integer, or a number less than 1, which is interpreted as a fraction of the sample size.

6、min_weight_fraction_leaf: the minimum total sample weight required at a leaf node; if a leaf falls below this threshold, it is pruned together with its sibling nodes. The default is 0, i.e. sample weights are not considered. This is mainly worth setting when the classes are heavily imbalanced or there are many missing values.

7、max_features: the maximum number of features considered when looking for a split. By default all features are used. You can choose log2N, sqrt(N), auto, a floating-point number less than 1 (a percentage), or an integer (a specific number of features). If there are many features, say more than 50, auto can be chosen to keep the tree-building time under control.

8、max_leaf_nodes: the maximum number of leaf nodes, used to prevent overfitting. By default it is unlimited. If a threshold is set, the best tree within that limit is built. It can be set when the sample size is very large.

9、min_impurity_decrease / min_impurity_split: the minimum impurity requirement for splitting. For the former, a split is performed only if it decreases the impurity by at least this value; for the latter, if a node's impurity is below this threshold, the node is not split and becomes a leaf.

10、class_weight: class weights, worth setting when the classes are heavily imbalanced or have many missing values, to prevent the decision tree from leaning toward the classes with the most samples. It can be set manually or to balanced; the latter computes the weights automatically from the sample counts, with smaller classes getting higher weights. It works together with min_weight_fraction_leaf.

11、presort: whether to pre-sort the data; this can usually be ignored.
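
Putting a few of these parameters together in a single constructor call (the values are illustrative, not recommendations; X_train and y_train are placeholders):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",       # Gini coefficient, i.e. the CART criterion
    splitter="best",        # search every split point of each feature
    max_depth=10,           # cap the depth of the tree
    min_samples_split=20,   # do not split nodes with fewer than 20 samples
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    max_features=None,      # consider all features at each split
)
# tree.fit(X_train, y_train)  # X_train / y_train: your own training data
```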


Source: blog.csdn.net/wxfighting/article/details/124363436