Machine Learning - Decision Trees

Accumulating small steps leads to a thousand miles; accumulating laziness leads to an abyss.

Note: This article is mainly based on Zhou Zhihua's "Machine Learning".

Main content

The decision tree is a common class of machine learning algorithms. Its core idea is to predict new samples by building a tree-structured model: the leaf nodes of the tree are the decision results, and each non-leaf node corresponds to one step of the decision process.

Basic process

Generally, a decision tree contains a root node, several internal nodes and several leaf nodes. The leaf nodes correspond to the decision results, and every other node (the root node and the internal nodes) corresponds to an attribute test. At each internal node, the decision tree divides the sample set contained in that node into the child nodes according to the result of the attribute test. This is a simple and intuitive "divide-and-conquer" strategy.

The purpose of decision tree learning is to generate a decision tree with strong generalization ability.

(Figure: a decision tree for predicting whether a loan user can repay)

The figure above is a simple decision tree used to predict whether a loan user is able to repay a loan. A loan user has three attributes: whether they own real estate, whether they are married, and their average monthly income. Each internal node tests one attribute, and each leaf node gives the prediction of whether the user can repay. For example, user A owns no real estate, is not married, and has a monthly income of 5K. Starting from the root node, user A follows the branch for "owns real estate: no"; at the next node, user A follows the branch for "married: no"; finally, since the monthly income of 5K is greater than 4K, user A follows that branch and lands on the "can repay" leaf node. The tree therefore predicts that user A is able to repay the loan.

Attribute division selection

So, how should we choose the optimal partitioning attribute? As the division process proceeds, we want the samples contained in each branch node to belong to the same class as much as possible; that is, the "purity" of the nodes should become higher and higher.

To make node purity higher and higher, we need an appropriate indicator to evaluate purity and then choose the attribute division based on that indicator. The mainstream attribute selection criteria used in decision trees are as follows:

Information gain

"Information entropy" is the most commonly used indicator to measure the purity of a sample set. It represents the sum of the entropy of different categories of data in the total data set. When the sample set is more "pure", the information entropy is smaller (the value range of information entropy is 0~log2(k)).

Information gain: Gain(A) = Info(D) - Info_A(D), i.e. the reduction in entropy obtained by partitioning the sample set D on attribute A. The larger the information gain, the greater the purity improvement obtained by splitting on A.
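Written out in full, with m classes whose proportions in D are p1, ..., pm, and with attribute A splitting D into subsets D1, ..., Dv:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$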

Decision tree induction algorithm (ID3):

[1] The tree starts as a single node representing the training samples.

[2] If the samples are all in the same class, the node becomes a leaf and is labeled with that class.

[3] Otherwise, the algorithm uses an entropy-based metric called information gain as heuristic information to select the attribute that best classifies the sample. This attribute is called the "test" or "decision" attribute of the node.

[4] In this version of the algorithm, all attributes are categorical, i.e. discrete-valued. Continuous attributes must be discretized.

[5] For each known value of the test attribute, create a branch and divide the sample accordingly.

[6] The algorithm recursively applies the same process to each partition to form the decision tree. Once an attribute has been used at a node, it does not need to be considered again in any of that node's descendants.

The recursive partitioning step stops only if one of the following conditions holds:

(a) All samples of a given node belong to the same class.

(b) There are no remaining attributes that can be used to further divide the sample. In this case, a majority vote is used.

This involves turning the given node into a leaf and labeling it with the class of the majority of its samples. Alternatively, the class distribution of the node's samples can be stored.

(c) The branch test_attribute = ai contains no samples. In this case, a leaf is created and labeled with the majority class of the samples at the parent node.
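As an illustration of steps [3]–[5] above, here is a minimal Python sketch of how information gain can be computed and used to pick the test attribute. The function names and the list-of-dicts data layout are assumptions made for the example, not part of ID3 itself.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(samples, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute attr."""
    total = len(labels)
    # Group the labels by the value of the attribute.
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[attr], []).append(label)
    # Size-weighted average entropy of the partitions.
    expected = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - expected

def best_attribute(samples, labels, attrs):
    """Step [3] of ID3: pick the attribute with the largest information gain."""
    return max(attrs, key=lambda a: info_gain(samples, labels, a))

# Toy usage: each sample is a dict of categorical attribute values.
samples = [{"real_estate": "yes", "married": "no"},
           {"real_estate": "no",  "married": "yes"},
           {"real_estate": "no",  "married": "no"}]
labels = ["repay", "repay", "default"]
print(best_attribute(samples, labels, ["real_estate", "married"]))
```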

Gain ratio

In fact, the information gain criterion has a preference for attributes with a large number of possible values. To reduce the possible adverse effects of this preference, C4.5 does not use the information gain directly; instead, it uses the "gain ratio" to select the optimal partitioning attribute:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{IV}(A)}, \qquad \mathrm{IV}(A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$$

where the Dj are the subsets produced by splitting D on attribute A, and IV(A) is called the intrinsic value of A.

The gain ratio is the information gain divided by the "intrinsic value" of the chosen partitioning attribute. Generally speaking, the larger the number of possible values of attribute A, the larger its intrinsic value; dividing the information gain by the intrinsic value therefore offsets the information gain's preference for attributes with many values.

The gain ratio criterion, in turn, has a preference for attributes with a small number of possible values. Therefore, the C4.5 algorithm does not simply select the candidate attribute with the largest gain ratio; it uses a heuristic instead: first pick, from the candidate partitioning attributes, those whose information gain is above average, and then select from these the one with the highest gain ratio.

Decision Tree Induction Algorithm (C4.5):
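The full C4.5 listing is not reproduced here. As an illustration of just the selection heuristic described above, here is a minimal Python sketch; the precomputed gain values, partition sizes, and function names are invented for the example.

```python
import math

def intrinsic_value(partition_sizes):
    """IV(A): entropy of the split itself, from the sizes of the partitions D1..Dv."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n)

def c45_select(candidates):
    """candidates maps attribute -> (information gain, partition sizes).
    Heuristic: keep attributes whose gain is above average, then pick the
    one with the highest gain ratio = gain / IV."""
    avg_gain = sum(gain for gain, _ in candidates.values()) / len(candidates)
    above_avg = {a: v for a, v in candidates.items() if v[0] >= avg_gain}
    return max(above_avg,
               key=lambda a: above_avg[a][0] / intrinsic_value(above_avg[a][1]))

# Toy usage: "income" has the largest raw gain but also many values (large IV),
# so the gain ratio favours "married" instead.
candidates = {"income":      (0.40, [2, 3, 5]),
              "married":     (0.35, [6, 4]),
              "real_estate": (0.10, [7, 3])}
print(c45_select(candidates))   # -> married
```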

Gini index

The CART (Classification and Regression Tree) decision tree uses a "Gini index" to select attributes for division. The purity of dataset D can be measured by the Gini value.

Gini value:

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

Gini index:

$$\mathrm{Gini\_index}(D, A) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Gini}(D_j)$$

where pi is the proportion of class i in D and the Dj are the subsets produced by splitting D on attribute A.

The Gini value reflects the probability that two randomly drawn samples from dataset D have inconsistent class labels. Therefore, the smaller the Gini value, the higher the purity of the dataset D.

In the candidate attribute set A, we select the attribute with the smallest Gini index after division as the optimal division attribute.
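As with information gain, these two quantities are straightforward to compute. Below is a minimal Python sketch using the same toy data layout assumed in the earlier sketches.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_index(samples, labels, attr):
    """Gini_index(D, A): size-weighted Gini value of the partitions induced by attr."""
    total = len(labels)
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[attr], []).append(label)
    return sum(len(g) / total * gini(g) for g in groups.values())

# CART picks the attribute whose split gives the smallest Gini index.
samples = [{"real_estate": "yes"}, {"real_estate": "no"}, {"real_estate": "no"}]
labels = ["repay", "repay", "default"]
print(gini_index(samples, labels, "real_estate"))   # 0.333...
```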

Pruning

Pruning is the main method by which decision tree learning algorithms deal with "overfitting". In order to classify the training samples as accurately as possible, nodes are split again and again, which sometimes results in a decision tree with too many branches. The tree may then treat peculiarities of the training set as general properties of the data; in other words, over-learning the training set leads to overfitting.

Therefore, the risk of overfitting can be reduced by actively removing some branches. At present, there are two basic strategies for decision tree pruning:

[1] Pre-pruning

Refers to estimating each node during decision tree generation before it is split: if splitting the current node cannot improve the generalization performance of the decision tree, the split is stopped and the current node is marked as a leaf node;

[2] Post-pruning

After a complete decision tree has been trained, the non-leaf nodes are examined from the bottom up. If replacing the subtree rooted at a node with a leaf node improves generalization performance, the subtree is replaced with a leaf node.

[3] Generalization performance

How do we judge whether the generalization performance of the decision tree has improved? A performance evaluation strategy is used: part of the data is held out as a validation set, and the tree is trained on the remaining training set according to the attribute division criterion. If a division decreases accuracy on the validation set, the pruning strategy forbids that division; if it increases validation accuracy, the division is kept.

In general, post-pruning keeps more branches than pre-pruning, so its risk of underfitting is small and its generalization performance is often better than that of a pre-pruned decision tree. However, post-pruning is carried out only after the complete decision tree has been generated, and all non-leaf nodes have to be examined one by one from the bottom up, so its training time cost is much higher than that of an unpruned or pre-pruned decision tree.
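As a rough illustration of judging generalization on held-out data (not the exact procedure described above), the sketch below uses scikit-learn to compare a fully grown tree with a depth-limited one on a validation set; the dataset and the depth limit are arbitrary choices for the example. scikit-learn also offers cost-complexity post-pruning through the ccp_alpha parameter, which is a different pruning criterion.

```python
# Compare a fully grown tree with a growth-limited tree on a held-out validation set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                  # unpruned
limited = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)  # pre-pruning-style constraint

# Keep the simpler tree if its validation accuracy is no worse.
print("full tree:   ", full.score(X_val, y_val))
print("limited tree:", limited.score(X_val, y_val))
```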

Continuous and missing values

The attributes discussed in the decision tree logic above are all discrete-valued, so we also need to know how to handle attributes that are continuous-valued or have missing values.

[1] Continuous value processing

A reasonable way to handle continuous-valued attributes is to discretize them: sort all values of the attribute that appear in the training samples, and divide the sorted sequence of values into partitions; each partition then corresponds to one discrete value of the attribute.

Without loss of generality, we use a bi-partition (i.e. dividing into two partitions; the values can be divided into N partitions if needed) as the example. Taking height division as an example, the height values (cm) in the training set, in sorted order, are:

{160,163,168,170,171,174,177,179,180,183}

Because it is a bi-partition, we only need to find one division point. A division point can be placed between two adjacent values or on a value itself; here we take the midpoint of each pair of adjacent values. The candidate division points in the example above are then:

{161.5,165.5,169,170.5,172.5,175.5,178,179.5,181.5}

For each candidate division point we can compute the corresponding information gain, and then select the best division.

Note that, unlike a discrete attribute, a continuous attribute can be used again to divide the child nodes. That is, after dividing height into "less than 175" and "greater than or equal to 175", the child node "less than 175" can still be further divided into "less than 160" and "greater than or equal to 160".
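A minimal Python sketch of this bi-partition search on the height example; the entropy helper is the same as in the earlier sketch, and the class labels are invented purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint of adjacent sorted values as a bi-partition threshold
    and return the one with the largest information gain."""
    pairs = sorted(zip(values, labels))
    svals = [v for v, _ in pairs]
    slabels = [l for _, l in pairs]
    base = entropy(slabels)
    best_t, best_gain = None, -1.0
    for i in range(len(svals) - 1):
        if svals[i] == svals[i + 1]:
            continue                              # no split point between equal values
        t = (svals[i] + svals[i + 1]) / 2         # candidate division point (midpoint)
        left, right = slabels[:i + 1], slabels[i + 1:]
        gain = base - (len(left) / len(slabels)) * entropy(left) \
                    - (len(right) / len(slabels)) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

heights = [160, 163, 168, 170, 171, 174, 177, 179, 180, 183]
labels = ["short"] * 5 + ["tall"] * 5            # made-up labels for the example
print(best_threshold(heights, labels))           # -> (172.5, 1.0)
```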

[2] Missing value handling

To avoid wasting the information in samples that have missing attribute values, we need to solve two problems:

How do we select the partitioning attribute when some attribute values are missing?

Samples with missing values are not substituted into the selection criterion (information gain, gain ratio, Gini index); the criterion is computed on the remaining samples, and the result is then multiplied by the proportion of samples that do have a value.

For example, if there are 10 samples in the training set and attribute a has two missing values, then when calculating the information gain for a division on a we ignore those two samples, compute the information gain on the remaining samples, and multiply the result by 8/10.

Given a partitioning attribute, how do we divide a sample whose value on this attribute is missing?

If the value of sample x on the partitioning attribute a is unknown, then x is placed into all of the child nodes, but with a different weight in each child node (the weight in a child node generally reflects the proportion of the parent node's samples that fall into that child node).
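In symbols (following the description above, with $\tilde{D}$ the subset of samples whose value of attribute A is not missing and $\rho = |\tilde{D}| / |D|$ the proportion of such samples, 8/10 in the example):

$$\mathrm{Gain}(A) = \rho \times \mathrm{Gain}_{\tilde{D}}(A)$$

where $\mathrm{Gain}_{\tilde{D}}(A)$ is the information gain computed only on $\tilde{D}$; and a sample with a missing value of A enters the j-th child node with weight $|\tilde{D}_j| / |\tilde{D}|$, the fraction of $\tilde{D}$ that falls into that child node.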

Multivariate decision trees

When the true classification boundary of the learning task is complex, many segment-by-segment divisions must be used to obtain a good approximation. The decision tree then becomes quite complex, and prediction becomes expensive because of the large number of attribute tests.

The decision tree model can be greatly simplified if oblique division boundaries are allowed; this is the multivariate decision tree. In this type of tree, a non-leaf node no longer tests a single attribute but a linear combination of attributes; in other words, each non-leaf node is a linear classifier. An example from the book is a multivariate decision tree for selecting melons based on the continuous attributes density and sugar content.
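In symbols, each non-leaf node of such a tree performs a test of the general form below on the d attribute values a1, ..., ad, where the weights wi and the threshold t are learned for that node:

$$\sum_{i=1}^{d} w_i a_i \le t$$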

Notes

Advantages of decision trees: Intuitive, easy to understand, effective on small datasets.

Disadvantages of decision trees: continuous variables are inconvenient to handle; when there are many classes, the error rate increases rapidly; scalability is mediocre.

Summary

1) The decision tree adopts a simple and intuitive "divide-and-conquer" strategy

2) The decision tree measures the pros and cons of different attribute divisions according to the overall purity of the data set

3) When attribute divisions are carried out too many times, the model overfits; a pruning algorithm is then needed to deal with this

4) The judgment criterion of the pruning algorithm is whether the performance indicators of the model before and after pruning are improved

5) A continuous-valued attribute can be handled by discretizing it, for example by splitting its sorted values into two parts at a division point

6) Attributes with missing values can be handled by incorporating the effect of the missing-value samples into the attribute selection criterion

7) When dealing with learning tasks with complex real classification boundaries, we can use linear combinations of attributes to divide the data
