[Reading the book together] "Machine Learning" Chapter 4: Decision Trees

Chapter 4 Decision Tree


4.1 Basic process

  A decision tree is a common machine learning method that makes decisions based on a tree structure. By analysing the training samples it selects partition attributes, mimicking the sequence of judgements a human makes before reaching a decision.

  In general, a decision tree contains one root node, several internal nodes, and several leaf nodes. Each leaf node corresponds to a decision result, while every other node corresponds to an attribute test: the sample set contained in that node is split into child nodes according to the outcome of the test. The root node contains the complete sample set, and the path from the root to any leaf corresponds to a sequence of decision tests. The goal of decision tree learning is to produce a tree with strong generalization ability, and its basic procedure follows a simple and intuitive "divide and conquer" strategy, as follows:
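  The generation procedure can be summarised as a recursive skeleton. Below is a minimal Python sketch of that divide-and-conquer process, assuming samples are stored as dicts of attribute values and that `choose_best_attribute` is a hypothetical stand-in for one of the selection criteria introduced in Section 4.2.

```python
from collections import Counter

def tree_generate(samples, labels, attributes, choose_best_attribute):
    """Recursive divide-and-conquer skeleton of decision tree generation.

    samples    : list of dicts mapping attribute name -> value
    labels     : list of class labels, aligned with samples
    attributes : list of attribute names still available for splitting
    choose_best_attribute : callable implementing a criterion from Section 4.2
    """
    # Case 1: all samples belong to the same class -> leaf node
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # Case 2: no attributes left, or all samples take identical values on the
    # remaining attributes -> leaf labelled with the majority class
    if not attributes or all(
        all(s[a] == samples[0][a] for a in attributes) for s in samples
    ):
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    # Otherwise pick the best attribute and split the sample set on it
    best = choose_best_attribute(samples, labels, attributes)
    node = {"attribute": best, "children": {}}

    # Only values actually observed in the current sample set produce branches here
    for value in set(s[best] for s in samples):
        subset = [(s, y) for s, y in zip(samples, labels) if s[best] == value]
        sub_samples = [s for s, _ in subset]
        sub_labels = [y for _, y in subset]
        remaining = [a for a in attributes if a != best]
        node["children"][value] = tree_generate(
            sub_samples, sub_labels, remaining, choose_best_attribute
        )
    return node
```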

4.2 Partition selection

  The key to decision tree learning is how to choose the optimal partition attribute. As the partitioning proceeds, we want the samples contained in each branch node to belong to the same class as far as possible, i.e. the "purity" of the nodes should keep increasing. The following introduces several commonly used criteria for selecting the optimal partition attribute.

(1) Information gain

  Before introducing information gain, we first introduce information entropy, the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ ($k = 1, 2, \dots, n$). Then the information entropy of $D$ is defined as follows; the smaller the value of $Ent(D)$, the higher the purity of $D$.
$$Ent(D) = -\sum_{k=1}^{n} p_k \log_2 p_k$$
  Now we introduce information gain. Suppose the discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \dots, a^V\}$. If attribute $a$ is used to partition the sample set $D$, $V$ branch nodes are produced, where the $v$-th branch node contains all samples in $D$ whose value on attribute $a$ is $a^v$, denoted $D^v$; the information entropy of $D^v$ can then be computed. Since different branch nodes contain different numbers of samples, each branch is given a weight $|D^v|/|D|$, so that branches with more samples have a greater influence, which yields the information gain below. In general, the larger the information gain, the greater the improvement in "purity" obtained by partitioning on attribute $a$; the well-known ID3 decision tree learning algorithm selects partition attributes in exactly this way.
$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$
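  As a quick illustration, here is a minimal Python sketch of these two formulas, assuming discrete attribute values and class labels stored in plain lists:

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D): information entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a): entropy of D minus the weighted entropy of each branch D^v.

    values : the value each sample takes on attribute a
    labels : the class label of each sample
    """
    total = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    weighted = sum(len(ys) / total * entropy(ys) for ys in branches.values())
    return entropy(labels) - weighted
```

  With these helpers, ID3-style selection simply computes `information_gain` for every candidate attribute and picks the largest.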
(2) Gain ratio

  In fact, the information gain criterion is biased towards attributes with many possible values. To reduce the potential adverse effect of this preference, the gain ratio is introduced; it is defined as follows:
$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}, \qquad IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$
  $IV(a)$ is called the intrinsic value of attribute $a$: the more possible values attribute $a$ has, the larger $IV(a)$ tends to be. Note that the gain ratio criterion is biased towards attributes with fewer possible values, so the well-known C4.5 decision tree algorithm uses a heuristic rather than choosing the attribute with the largest gain ratio directly: it first picks out the candidate attributes whose information gain is above average, and then selects from them the one with the highest gain ratio.
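  Continuing the sketch above (it reuses `information_gain` from the previous block), the intrinsic value and gain ratio could be computed as follows:

```python
import math
from collections import Counter
# information_gain() is defined in the previous sketch

def intrinsic_value(values):
    """IV(a): intrinsic value of attribute a over the sample set."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(values, labels):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    iv = intrinsic_value(values)
    return information_gain(values, labels) / iv if iv > 0 else 0.0
```

  A C4.5-style selection would first keep only the attributes whose `information_gain` is above the average, and then take the one with the highest `gain_ratio`.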

(3) Gini index

  First we introduce the Gini value. The purity of data set $D$ can be measured by the Gini value $Gini(D)$, which reflects the probability that two samples drawn at random from $D$ have inconsistent class labels; the smaller $Gini(D)$, the higher the purity of $D$. The Gini index of attribute $a$ is defined as below, and the attribute whose partition yields the smallest Gini index is chosen as the optimal partition attribute. This is the criterion used by the well-known CART decision tree algorithm.
$$Gini(D) = \sum_{k=1}^{n} \sum_{k' \ne k} p_k p_{k'} = 1 - \sum_{k=1}^{n} p_k^2, \qquad Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$
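  A corresponding sketch for the Gini value and Gini index, under the same assumptions as the earlier snippets:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Gini_index(D, a): weighted Gini value of each branch D^v."""
    total = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    return sum(len(ys) / total * gini(ys) for ys in branches.values())
```

  A CART-style split then chooses the attribute with the smallest `gini_index`.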

4.3 Pruning

  Pruning is the main way for decision tree learning algorithms to deal with overfitting. In decision tree learning, in order to classify the training samples as correctly as possible, node splitting is repeated again and again, which sometimes produces too many branches; the tree then treats peculiarities of the training set as general properties of all data, which leads to overfitting. The risk of overfitting can therefore be reduced by actively removing some branches. The two common basic strategies are "pre-pruning" and "post-pruning".

(1) Pre-pruning

  Pre-pruning means that during the generation of the decision tree, each node is evaluated before it is split: if splitting the current node would not improve the generalization performance of the tree, the split is abandoned and the node is marked as a leaf; in other words, the growth of certain branches is terminated early. The watermelon data set is used as a running example; its details are as follows.

  Below is the original decision tree constructed.

  Next, consider splitting on the attribute "umbilical part": the validation-set accuracy before the split is 42.9% and rises to 71.4% after the split, so the split on "umbilical part" is kept. The same test is then applied to the subsequent candidate splits on "color" and "root".
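  A minimal sketch of this pre-pruning test at a single node; `validation_accuracy` is a hypothetical helper that scores a (partial) tree on the held-out validation set, so the 42.9% and 71.4% figures from the example would be its outputs before and after the split.

```python
def keep_split(tree_with_node_as_leaf, tree_after_split, validation_accuracy):
    """Pre-pruning check: keep a candidate split only if it raises accuracy
    on the held-out validation set.

    validation_accuracy is a hypothetical helper that evaluates a (partial)
    decision tree on the validation set.
    """
    acc_before = validation_accuracy(tree_with_node_as_leaf)  # e.g. 42.9% in the example
    acc_after = validation_accuracy(tree_after_split)         # e.g. 71.4% after splitting on "umbilical part"
    return acc_after > acc_before
```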

(2) Post-pruning

  Post-pruning first grows a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up: if replacing the subtree rooted at a node with a leaf node would improve the generalization performance of the tree, the subtree is replaced by that leaf.

  In the original decision tree shown above, node 6 is examined first: replacing it with a leaf node raises the validation-set accuracy from 42.9% to 57.1%, so this subtree is pruned. The other nodes are handled in the same way.
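  A sketch of the bottom-up examination, assuming the tree nodes are dicts that store their "children" and the "majority" class of the training samples that reached them, and that `validation_accuracy` is the same hypothetical validation-set helper as above:

```python
def post_prune(node, full_tree, validation_accuracy):
    """Bottom-up post-pruning sketch (modifies the tree in place).

    Assumes each internal node stores "children" and "majority" (the majority
    class of the training samples that reached it), and that
    validation_accuracy scores full_tree on the held-out validation set.
    """
    if "leaf" in node:
        return

    # Prune the children first so the examination proceeds bottom-up
    for child in node["children"].values():
        post_prune(child, full_tree, validation_accuracy)

    # Tentatively replace this subtree with a leaf and compare accuracies
    acc_with_subtree = validation_accuracy(full_tree)
    backup = dict(node)
    node.clear()
    node["leaf"] = backup["majority"]
    acc_as_leaf = validation_accuracy(full_tree)   # e.g. 57.1% vs 42.9% at node 6

    # Keep whichever version scores better on the validation set
    if acc_as_leaf < acc_with_subtree:
        node.clear()
        node.update(backup)
```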

(3) Advantages and disadvantages of pre-pruning and post-pruning

  Pre-pruning stops growth early while the decision tree is being built: at certain nodes no further splits are made and the node is turned directly into a leaf. Its advantage is that it reduces the size of the tree, saves computing resources and time, and helps avoid overfitting. Its disadvantage is that it may oversimplify the tree and cause underfitting, i.e. the tree fits the data poorly and useful information is lost. Pre-pruning also requires stopping conditions to be set, such as the number of samples at a node, a minimum information gain, or a Gini index threshold.

  Post-pruning simplifies the tree from the bottom up after a complete decision tree has been built, replacing some subtrees with leaf nodes or removing them. Its advantage is that it preserves the full structure of the tree first and thus avoids underfitting, while still reducing overfitting by simplifying the tree afterwards. Its disadvantage is that building the complete tree and then pruning it requires extra computing resources and time, which may hurt efficiency. Post-pruning also requires an evaluation criterion to decide whether to prune, such as the accuracy or error rate on a validation set.

4.4 Continuous and missing values

(1) Continuous value processing

  Since a continuous attribute no longer has a finite number of possible values, nodes cannot be split directly according to each possible value. Instead the continuous attribute can be discretized; the simplest approach is bi-partition (the dichotomy method), as shown below:
$$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1 \leqslant i \leqslant n-1 \right\}$$
  These midpoints are taken as candidate split points, and the optimal split point is selected to partition the sample set. Using information gain, the selection is as follows:
$$Gain(D, a) = \max_{t \in T_a} Gain(D, a, t) = \max_{t \in T_a} \left[ Ent(D) - \sum_{\lambda \in \{-, +\}} \frac{|D_t^\lambda|}{|D|} Ent(D_t^\lambda) \right]$$
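  A sketch of bi-partition on one continuous attribute, reusing `entropy` from the earlier information-gain snippet:

```python
# entropy() is defined in the earlier information-gain sketch

def best_split_point(values, labels):
    """Bi-partition for a continuous attribute: evaluate the midpoints of
    consecutive sorted values and return (best_threshold, best_gain)."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    sorted_vals = [v for v, _ in pairs]
    total = len(pairs)
    base = entropy([y for _, y in pairs])

    # Candidate thresholds T_a: midpoints of adjacent sorted values
    candidates = {(sorted_vals[i] + sorted_vals[i + 1]) / 2 for i in range(total - 1)}

    best_t, best_gain = None, float("-inf")
    for t in candidates:
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - (len(left) / total) * entropy(left) \
                    - (len(right) / total) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```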
  Other commonly used methods for discretizing continuous attributes are listed below; a short sketch using scikit-learn follows the list.

  • Equal-width method: divide the value range of the continuous attribute into n intervals of equal width, each interval representing one discrete value. This is simple to implement, but it may ignore the distribution of the data, so that values within an interval differ greatly while values in different intervals differ little.
  • Equal-frequency method: divide the value range into n intervals so that each interval contains an equal (or nearly equal) number of records, each interval representing one discrete value. This keeps the amount of data per interval uniform, but the interval boundaries may be unreasonable, e.g. outliers or boundary values may end up in the same interval as ordinary values.
  • Cluster-based method: cluster the values of the continuous attribute with a clustering algorithm and merge the records of the same cluster into one interval, each interval representing one discrete value. This divides the data according to its inherent structure, but the number of clusters must be decided in advance, and the clustering algorithm itself adds complexity and uncertainty.
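  The sketch below illustrates the three strategies with scikit-learn's `KBinsDiscretizer`, whose `uniform`, `quantile` and `kmeans` strategies correspond to equal-width, equal-frequency and cluster-based binning; the density values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A continuous attribute, e.g. a "density" column (illustrative values)
density = np.array([[0.697], [0.774], [0.634], [0.608], [0.556],
                    [0.403], [0.481], [0.437], [0.666], [0.243]])

# uniform = equal width, quantile = equal frequency, kmeans = cluster-based
for strategy in ("uniform", "quantile", "kmeans"):
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    bins = discretizer.fit_transform(density).ravel()
    print(strategy, bins)
```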

(2) Missing value processing

  Real tasks often involve incomplete samples, i.e. samples with some attribute values missing; especially when the number of attributes is large, many samples may contain missing values. For decision trees this raises two problems that need to be solved:

  • How to select the partition attribute when some attribute values are missing?
  • Given a partition attribute, how should a sample be partitioned if its value on this attribute is missing?

  Here are a few common solutions:

  • Ignoring missing values: delete or ignore the samples containing missing values and use only the complete samples to build and apply the decision tree. This is simple to implement, but it may discard data and introduce bias, reducing the performance and generalization ability of the tree.
  • Using missing values: compute directly with the non-missing data. When selecting the optimal partition attribute, only the samples that are not missing on that attribute are used; later, when child nodes are split, a sample whose value on the partition attribute is missing is assigned to all of the current child nodes at the same time, i.e. the same sample enters different child nodes with different probabilities (weights).
  • Filling missing values: estimate or infer the missing values from the known data and replace them, so that the samples become complete (a short pandas sketch follows this list). This preserves the completeness of the data, but a suitable filling method must be chosen, otherwise errors and noise may be introduced that harm the accuracy of the tree. Common filling methods include the mean, median, mode, nearest neighbour, regression and interpolation.
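  A small pandas sketch of the filling approach, with purely illustrative column names and values:

```python
import pandas as pd

# A toy sample set with missing values (hypothetical columns and data)
df = pd.DataFrame({
    "color":   ["green", "black", None,     "white"],
    "texture": ["clear", "clear", "blurry", None],
    "density": [0.697,   0.774,   None,     0.243],
})

# Filling strategies from the list above:
df["density"] = df["density"].fillna(df["density"].mean())      # mean for a numeric attribute
df["color"] = df["color"].fillna(df["color"].mode().iloc[0])    # mode for a categorical attribute
df["texture"] = df["texture"].fillna(df["texture"].mode().iloc[0])
print(df)
```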

4.5 Multivariate Decision Trees

  If each attribute is regarded as an axis in a coordinate space, then a sample described by $d$ attributes corresponds to a point in $d$-dimensional space, and classifying samples amounts to finding classification boundaries between samples of different classes in this space. For an ordinary decision tree, which tests a single attribute at each node, these boundaries are made up of segments parallel to the coordinate axes. If instead each non-leaf node tests a linear combination of attributes, the boundaries can be oblique; such trees are called "multivariate decision trees". Two examples are shown below.
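  A minimal sketch of what a single multivariate node could look like: instead of thresholding one attribute, the node fits a linear test of the form $\sum_i w_i a_i \leq t$ over all attributes and routes samples by that test (logistic regression is used here only as one convenient way to obtain the weights; it is not the book's specific fitting method).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def multivariate_node_test(X, y):
    """Fit one 'multivariate' node: a linear test w.x <= t over all attributes,
    instead of a threshold on a single attribute. Returns a routing function."""
    clf = LogisticRegression().fit(X, y)
    w, t = clf.coef_[0], -clf.intercept_[0]

    def route(x):
        # Oblique split: compare the linear combination of attributes to t
        return "left" if np.dot(w, x) <= t else "right"

    return route

# Toy usage on two continuous attributes (illustrative values)
X = np.array([[0.697, 0.460], [0.774, 0.376], [0.243, 0.267], [0.245, 0.057]])
y = np.array([1, 1, 0, 0])
route = multivariate_node_test(X, y)
print(route(np.array([0.5, 0.3])))
```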


4.6 Related Papers

  The following uses several recent papers to illustrate applications of decision trees.

(1) NBDT: Neural-Backed Decision Trees
  Neural networks are a powerful machine learning method, but they usually lack interpretability, which makes them hard to apply in fields that require accurate and well-justified predictions, such as finance and healthcare. To address this, the paper proposes a method for embedding decision trees into neural networks, called NBDT, which increases the interpretability of the network without changing the original network structure or reducing accuracy. NBDT keeps the feature-extraction part of the original network and only performs the decision tree embedding on the final fully connected layer, thereby converting the network's output into a series of understandable decision rules. According to the paper, the method can be applied to any pre-trained neural network without retraining or fine-tuning, using only a simple algorithm to generate the decision tree structure and thresholds. Experiments on multiple image classification datasets show that NBDT maintains the accuracy of the original network while providing a visualizable and explainable decision-making process. The paper thus presents an efficient and general approach to improving the interpretability of neural networks.

(2) Decision tree based malware detection using static and dynamic analysis

  This paper proposes a method for malware detection using decision trees, combining static and dynamic analysis features such as APIs, summary information, DLLs, and registry keys changed, and using the Cuckoo sandbox, a customizable tool that provides high accuracy, for dynamic malware analysis. In the experiments, more than 2,300 malware samples and more than 1,500 benign software samples were used, and the decision tree was compared with other machine learning methods such as support vector machines, naive Bayes, and random forests; the decision tree showed advantages on metrics such as accuracy, recall, precision, and F1 score.


Origin blog.csdn.net/qq_44528283/article/details/130164679