Machine Learning Chapter 4: Decision Trees



Foreword

  Decision tree concept: a decision tree is a decision-making mechanism based on a tree structure, similar to the way humans break a decision problem into a sequence of sub-questions. Decision trees are a common class of machine learning methods.
  Taking binary classification as an example, we hope to learn a model from a given training data set that can classify new examples. Classifying a sample can be regarded as a decision or judgment process for the question "does the current sample belong to the positive class?"

  The purpose of decision tree learning: to generate a decision tree with strong generalization ability, that is, a strong ability to handle unseen examples.

1. Basic process

  The final conclusion of the decision process corresponds to the judgment result we want; each question raised during the process is a "test" of some attribute; each test outcome either yields the final conclusion or leads to a further question, whose scope is restricted to the range determined by the previous decision.
  Generally, a decision tree contains one root node, several internal nodes and several leaf nodes. Each leaf node corresponds to a decision result, and every other node corresponds to an attribute test; the sample set contained in a node is partitioned into its child nodes according to the outcome of the attribute test; the root node contains the complete sample set.

  The path from the root node to each leaf node corresponds to a sequence of decision tests.
  Its basic flow follows a simple and intuitive "divide and conquer" strategy, expressed below in C-style pseudocode:
Node TreeGenerate(D, A) {		// input: a sample set D and an attribute set A; output: a node
	Node node;
	if (all samples in D belong to the same class C) {	// recursion exit: e.g., in a tree that judges whether a watermelon is good or bad,
							// if every sample passed in is a good melon, mark node as a "good melon" leaf and return
		mark node as a leaf of class C;
		return node;
	}
	if (A == ∅ || all samples in D take the same values on A) {	// recursion exit: no attribute is left to split on (label the leaf with the
							// majority class, e.g. "good melon" if most remaining samples are good), or all samples
							// take the same values on the remaining attributes, so no further split is useful
		mark node as a leaf, labeled with the majority class in D;
		return node;
	}
	select the optimal partition attribute a* from A;	// how a* is chosen is covered in the next section
	for (each value a*v of a*) {		// e.g., if the optimal attribute is "root shape" with values curled and straight,
						// the loop runs twice and generates two branches
		generate a branch for node;
		let Dv be the subset of samples in D whose value on a* is a*v;	// e.g., the samples whose root shape is curled form Dv, and root shape = curled is a*v
		if (Dv is empty) {		// no sample in D takes this value, so the branch cannot be split or judged from data;
						// label it with the majority class of D, which gives the best chance of being correct
			mark the branch node as a leaf, labeled with the majority class in D; return node;
		}
		else {				// the subset is non-empty: keep splitting until a recursion exit is reached
			TreeGenerate(Dv, A \ {a*});	// A \ {a*} is the attribute set A with a* removed
		}
	}
	return node;
}

The generation of decision trees is a recursive process. In the basic decision tree algorithm, there are three situations that will lead to recursive return:
(1) The samples contained in the current node all belong to the same category, and there is no need to divide
(2) The current attribute set is empty, or all samples have the same value on all attributes, and cannot be divided
(3) The sample set contained in the current node is empty and cannot be divided

  In case (2), we mark the current node as a leaf node and set its category to the class with the most samples in this node; in case (3), we also mark the current node as a leaf node, but set its category to the class with the most samples in its parent node. Note that the two cases have different meanings: case (2) uses the posterior distribution of the current node, while case (3) uses the sample distribution of the parent node as the prior distribution of the current node.
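  For concreteness, the sketch below mirrors the TreeGenerate recursion in runnable Python. The sample format (a list of (attribute-dict, label) pairs), the values map of possible attribute values, and the choose_best_attribute helper are assumptions made for this illustration, not the book's code; how that helper should pick the attribute is exactly the topic of the next section.

from collections import Counter

def tree_generate(D, A, values, choose_best_attribute):
    """A minimal sketch of the TreeGenerate recursion above.
    D      : list of (x, y) pairs, where x is a dict attribute -> value and y is the class label
    A      : set of attribute names still available for splitting
    values : dict attribute -> set of all possible values (needed so case (3) can occur)
    choose_best_attribute(D, A) -> the optimal partition attribute (next section's topic)"""
    labels = [y for _, y in D]
    majority = Counter(labels).most_common(1)[0][0]
    # Case (1): every sample in D belongs to the same class C -> leaf labelled C.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Case (2): no attribute left, or all samples take identical values on A -> majority-class leaf.
    if not A or all(x[a] == D[0][0][a] for x, _ in D for a in A):
        return {"leaf": majority}
    a_star = choose_best_attribute(D, A)
    node = {"attr": a_star, "branches": {}}
    for v in values[a_star]:                       # one branch per possible value of a*
        Dv = [(x, y) for x, y in D if x[a_star] == v]
        if not Dv:                                 # Case (3): empty subset -> majority class of the parent D
            node["branches"][v] = {"leaf": majority}
        else:
            node["branches"][v] = tree_generate(Dv, A - {a_star}, values, choose_best_attribute)
    return node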

2. Partition selection

  The algorithm in the previous section leaves open how the optimal partition attribute is selected; that is what this section describes. Requirement for the optimal partition attribute: as the partitioning proceeds, the samples contained in each branch node should belong to the same class as much as possible, that is, the "purity" of the nodes should become higher and higher.

1. Information gain

Information entropy :
  " Information entropy " is the most commonly used indicator to measure the purity of a sample set .
       $Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k$
  where $p_k$ is the proportion of class-k samples in the sample set D (k = 1, 2, ..., |y|). The smaller the value of the information entropy Ent(D), the higher the purity of D.
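  A small Python helper (an illustration added here, not part of the original derivation) makes the definition concrete:

import math
from collections import Counter

def ent(labels):
    """Information entropy Ent(D) of a list of class labels (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Example: 8 positive and 9 negative samples. The convention 0*log2(0) = 0 is handled
# implicitly, because classes with zero count never appear in the Counter.
print(ent(["good"] * 8 + ["bad"] * 9))   # ~0.998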

Definition :
       $Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$
  where a is a discrete attribute of D, V is the number of possible values of a, splitting D on a produces V branches, and $D^v$ is the subset of samples in D that take the value $a^v$ on attribute a (the samples reaching the v-th branch node).
  Interpretation of the formula: the information gain $Gain(D, a)$ obtained by splitting the sample set D on attribute a equals the information entropy $Ent(D)$ of D minus the sum, over every possible value $a^v$ of attribute a, of the entropy $Ent(D^v)$ of the corresponding sample subset weighted by $\frac{|D^v|}{|D|}$, the fraction of D that falls into that branch.

  Generally speaking, the greater the information gain, the greater the "purity improvement" obtained by using attribute a for division. Therefore, we can use information gain to select the partition attribute of the decision tree, that is, the optimal partition attribute mentioned in the decision tree algorithm.
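  Continuing the illustration, information gain and the corresponding attribute choice can be sketched as follows, reusing the ent() helper above and the same assumed (attribute-dict, label) sample format; this is one concrete choice you could pass to the earlier tree_generate sketch:

from collections import defaultdict

def information_gain(D, attr):
    """Gain(D, a) for a discrete attribute; D is a list of (attribute-dict, label) pairs."""
    labels = [y for _, y in D]
    subsets = defaultdict(list)
    for x, y in D:
        subsets[x[attr]].append(y)               # the labels of each D^v, grouped by the value of the attribute
    weighted = sum(len(ys) / len(D) * ent(ys) for ys in subsets.values())
    return ent(labels) - weighted

def choose_best_attribute(D, A):
    """ID3-style selection: pick the attribute with the largest information gain."""
    return max(A, key=lambda a: information_gain(D, a))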

2. Gain ratio

  Background: information gain has a preference for attributes with many possible values. To reduce the possible adverse effects of this preference, the well-known C4.5 decision tree algorithm does not use information gain directly, but uses the gain ratio to select the optimal partition attribute.
  The gain ratio is defined as:
       $Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}$
  where
       $IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$
IV(a) is called the "intrinsic value" of attribute a. The more possible values attribute a has (that is, the larger V is), the larger the value of IV(a).
  Note that the gain ratio criterion has a preference for attributes with few possible values, so the C4.5 algorithm does not simply take the attribute with the largest gain ratio as the partition attribute; instead it uses a heuristic: first find, among the candidate partition attributes, those whose information gain is above average, and then select the one with the highest gain ratio.
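  A sketch of the gain ratio and the C4.5-style heuristic, reusing the helpers above (an assumed illustration in the same sample format, not C4.5's actual code):

import math
from collections import Counter

def intrinsic_value(D, attr):
    """IV(a): entropy of the split that attribute a induces over the sample set D."""
    counts = Counter(x[attr] for x, _ in D)
    n = len(D)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(D, attr):
    iv = intrinsic_value(D, attr)
    # an attribute with a single observed value has IV = 0; give it ratio 0 rather than dividing by zero
    return information_gain(D, attr) / iv if iv > 0 else 0.0

def c45_choose_attribute(D, A):
    """The heuristic described above: keep attributes whose information gain is above
    average, then pick the one with the highest gain ratio among them."""
    gains = {a: information_gain(D, a) for a in A}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a in A if gains[a] >= avg]
    return max(candidates, key=lambda a: gain_ratio(D, a))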

3. Gini index

  The CART decision tree uses the "Gini index" to select partition attributes. The purity of the data set D can be measured by the Gini value:
       $Gini(D) = \sum_{k=1}^{|y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|y|} p_k^2$
where $p_k$ is the proportion of class-k samples in the sample set D.
  Intuitively, Gini(D) reflects the probability that two samples randomly drawn from the data set D have inconsistent class labels. Therefore, the smaller the Gini(D), the higher the purity of the dataset D.
  The Gini index of attribute a is defined as:
       $Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$
  Therefore, in the candidate attribute set A, select the attribute that makes the Gini index the smallest after division as the optimal division attribute.
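  A matching sketch for the Gini value and the Gini index, in the same assumed sample format:

from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(D, attr):
    """Gini_index(D, a): weighted Gini value of the subsets produced by splitting on a."""
    subsets = defaultdict(list)
    for x, y in D:
        subsets[x[attr]].append(y)
    return sum(len(ys) / len(D) * gini(ys) for ys in subsets.values())

def cart_choose_attribute(D, A):
    """CART-style selection: the attribute whose split gives the smallest Gini index."""
    return min(A, key=lambda a: gini_index(D, a))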

3. Pruning treatment

  Pruning is the main means by which decision tree learning algorithms deal with "overfitting".

The basic strategies of decision tree pruning are "pre-pruning" and "post-pruning".
  1. Pre-pruning: during decision tree generation, each node is evaluated before it is split; if splitting the current node cannot improve the generalization performance of the tree, the split is stopped and the current node is marked as a leaf node.
  2. Post-pruning: first generate a complete decision tree from the training set, then examine the non-leaf nodes from the bottom up; if replacing the subtree rooted at a node with a leaf node improves the generalization performance, the subtree is replaced with that leaf node.

  Generally, the performance evaluation methods introduced in Chapter 2 of this series are used to judge whether the generalization performance of the decision tree improves.

1. Pre-pruning

  Advantages: Pre-pruning makes many branches of the decision tree not "expanded", which not only reduces the risk of overfitting, but also significantly reduces the training time overhead and test time overhead of the decision tree.
  Disadvantages: although splitting some branch may not improve generalization performance at the moment, and may even reduce it temporarily, later splits built on it could improve performance significantly; because pre-pruning is "greedy" and forbids such branches from expanding, it brings a risk of underfitting to the pre-pruned decision tree.
In the extreme case, pre-pruning may produce a decision tree with only a single split, also known as a "decision stump".

2. Post-pruning

  Post-pruned decision trees usually retain more branches than pre-pruned decision trees.
  Advantages: In general, post-pruning decision trees have little risk of underfitting, and generalization performance is often better than pre-pruning decision trees.
  Disadvantages: The post-pruning process is performed after the complete decision tree is generated, and all non-leaf nodes are inspected one by one from the bottom up, so the training time overhead is much larger than that of the unpruned decision tree and the pre-pruned decision tree.
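  As a practical aside (an illustration, not the validation-set procedure described above): if scikit-learn is available, its DecisionTreeClassifier exposes pre-pruning through growth limits such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning (ccp_alpha), which is a different post-pruning criterion from the one in this chapter. A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# "Pre-pruning": stop growth early with max_depth / min_samples_leaf.
pre = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5).fit(X_tr, y_tr)

# "Post-pruning": grow a full tree, compute the cost-complexity pruning path,
# then keep the alpha that scores best on the held-out validation split.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = [max(a, 0.0) for a in path.ccp_alphas]   # guard against tiny negative values from floating point
best = max(alphas, key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
           .fit(X_tr, y_tr).score(X_val, y_val))
post = DecisionTreeClassifier(ccp_alpha=best, random_state=0).fit(X_tr, y_tr)

print(pre.get_depth(), post.get_depth())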

4. Continuous and missing values

1. Continuous value processing

  So far we have discussed decision tree generation based on discrete attributes, but continuous attributes are often encountered in real tasks. Since a continuous attribute has infinitely many possible values, a node cannot be split directly on each possible value; continuous attribute discretization is used to handle this. The simplest strategy is bi-partition (dichotomy), which is the mechanism used by the C4.5 decision tree algorithm.

Processing:
  1. Sort. Given a sample set D and a continuous attribute a, assume a has n different values on D. Sort these values from small to large and record them as $\{a^1, a^2, …, a^n\}$. (For example, attribute a is density ∈ [0,1]; the values 0.996, 0.874, 0.235, 0.238, 0.374 appear in the sample, giving 0.235, 0.238, 0.374, 0.874, 0.996 after sorting.)
  2. Division and candidate division points. A division point t splits D into two subsets $D_t^-$ and $D_t^+$: $D_t^-$ contains the samples whose value on attribute a is not greater than t, and $D_t^+$ contains the samples whose value is greater than t. Obtaining the division points: for each pair of adjacent values in the sorted list of attribute a, take their midpoint as a candidate division point (for example, for the first two values 0.235 and 0.238, the candidate division point is their midpoint 0.2365). With n values there are n-1 adjacent pairs, hence n-1 candidate division points.
  3. Examine the candidate division points. Once the candidate division points are determined, they are examined just like discrete attribute values, and the optimal division point is chosen to split the sample set. In the previous section we selected partitions by information gain; for continuous attributes the information gain formula is adapted slightly:
       $Gain(D, a) = \max_{t \in T_a} Gain(D, a, t) = \max_{t \in T_a} \left( Ent(D) - \sum_{\lambda \in \{-, +\}} \frac{|D_t^\lambda|}{|D|} Ent(D_t^\lambda) \right)$
  where $T_a$ is the set of candidate division points and $Gain(D, a, t)$ is the information gain obtained by bi-partitioning the sample set D at division point t. The division point that maximizes this modified information gain is selected.
  Note: unlike a discrete attribute, a continuous attribute used as the partition attribute at the current node can still serve as a partition attribute at its descendant nodes.
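  A small sketch of the bi-partition step, assuming the same (attribute-dict, label) sample format and the ent() helper from the earlier illustrations; the labels in the example data are made up:

def best_split_point(D, attr):
    """Evaluate the n-1 midpoints of adjacent sorted values of a continuous attribute
    and return (best_t, best_gain), the division point with the largest information gain."""
    labels = [y for _, y in D]
    values = sorted({x[attr] for x, _ in D})
    best_t, best_gain = None, float("-inf")
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                               # candidate division point
        left = [y for x, y in D if x[attr] <= t]        # D_t^-: value not greater than t
        right = [y for x, y in D if x[attr] > t]        # D_t^+: value greater than t
        gain = ent(labels) - (len(left) / len(D)) * ent(left) - (len(right) / len(D)) * ent(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Example with the densities from step 1 (labels invented for illustration):
D = [({"density": d}, y) for d, y in
     [(0.996, "good"), (0.874, "good"), (0.235, "bad"), (0.238, "bad"), (0.374, "good")]]
print(best_split_point(D, "density"))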

2. Missing value processing

  Background:
  In real-world tasks, incomplete samples are often encountered, that is, some attribute values of a sample are missing. Simply discarding the incomplete samples and learning only from the samples without missing values obviously wastes a great deal of data. It is therefore necessary to consider learning from training examples with missing attribute values.
  Two problems need to be solved:
1. How to select the partition attribute when the attribute value is missing?
2. Given a division attribute, if the value of the sample on this attribute is missing, how to divide the sample?

  Solution:
For Question 1:
  Given a training set D and an attribute a, let $\tilde{D}$ denote the subset of samples in D that have no missing value on attribute a. Obviously, we can only use $\tilde{D}$ to judge the quality of attribute a.
  Suppose attribute a has V possible values $\{a^1, a^2, …, a^V\}$ (the values of a appearing in the sample set). Let $\tilde{D}^v$ denote the subset of $\tilde{D}$ whose value on attribute a is $a^v$, and let $\tilde{D}_k$ denote the subset of $\tilde{D}$ belonging to the k-th class (k = 1, 2, …, |y|). Obviously $\tilde{D} = \cup_{k=1}^{|y|} \tilde{D}_k$ (the non-missing-value subset is the union of its per-class subsets) and $\tilde{D} = \cup_{v=1}^{V} \tilde{D}^v$ (the non-missing-value subset is the union of its per-value subsets). Suppose each sample x is assigned a weight $w_x$, and define the following quantities:
  the proportion of samples without missing values:
       $\rho = \frac{\sum_{x \in \tilde{D}} w_x}{\sum_{x \in D} w_x}$
  the proportion of class k among the samples without missing values:
       $\tilde{\rho}_k = \frac{\sum_{x \in \tilde{D}_k} w_x}{\sum_{x \in \tilde{D}} w_x} \quad (1 \leq k \leq |y|)$
  the proportion of samples taking value $a^v$ on attribute a among the samples without missing values:
       $\tilde{r}_v = \frac{\sum_{x \in \tilde{D}^v} w_x}{\sum_{x \in \tilde{D}} w_x} \quad (1 \leq v \leq V)$
  Based on the above three quantities, the information gain formula is generalized as:
       $Gain(D, a) = \rho \times Gain(\tilde{D}, a) = \rho \times \left( Ent(\tilde{D}) - \sum_{v=1}^{V} \tilde{r}_v Ent(\tilde{D}^v) \right)$
  where
       $Ent(\tilde{D}) = -\sum_{k=1}^{|y|} \tilde{\rho}_k \log_2 \tilde{\rho}_k$
  With this generalized information gain formula, the partition attribute can be selected by calculation even when some attribute values are missing, which solves problem 1.

For question 2:
  If the value of the sample x on the partition attribute a is known, x is assigned to the child node corresponding to that value, and its weight in the child node remains $w_x$. If the value of x on the partition attribute a is unknown, x is assigned to all child nodes simultaneously, and its weight in the child node corresponding to value $a^v$ is adjusted to $\tilde{r}_v \cdot w_x$; intuitively, the same sample is distributed to the different child nodes with different probabilities.
  This solves question 2.

  The C4.5 algorithm uses the above solution.
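  The sketch below computes the generalized information gain with sample weights. It assumes the same (attribute-dict, label) sample format as before, represents a missing value as None, and is an illustration of the formula rather than C4.5's actual implementation:

import math

def generalized_gain(D, attr, weights):
    """Generalized Gain(D, a) = rho * ( Ent(D~) - sum_v r~_v * Ent(D~^v) ).
    weights[i] is the weight w_x of the i-th sample in D."""
    def weighted_ent(pairs):                     # entropy computed from weighted class proportions
        total = sum(w for w, _ in pairs)
        by_class = {}
        for w, y in pairs:
            by_class[y] = by_class.get(y, 0.0) + w
        return -sum((c / total) * math.log2(c / total) for c in by_class.values())

    known = [(w, x, y) for w, (x, y) in zip(weights, D) if x[attr] is not None]   # D~
    total_known = sum(w for w, _, _ in known)
    rho = total_known / sum(weights)                                              # fraction of weight with no missing value
    gain = weighted_ent([(w, y) for w, _, y in known])                            # Ent(D~)
    for v in {x[attr] for _, x, _ in known}:
        sub = [(w, y) for w, x, y in known if x[attr] == v]                       # D~^v
        r_v = sum(w for w, _ in sub) / total_known                                # r~_v
        gain -= r_v * weighted_ent(sub)
    return rho * gain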

5. Multivariate Decision Tree

  A multivariate decision tree, as the name implies, is a tree that uses multiple variables at each decision node.
  Take a multivariate decision tree that implements oblique partitions as an example: in this type of tree, a non-leaf node no longer tests a single attribute, but a linear combination of attributes. In other words, each non-leaf node is a linear classifier.
  In the learning process of a multivariate decision tree, instead of finding an optimal partition attribute for each non-leaf node, the algorithm tries to build a suitable linear classifier.
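  As a rough illustration of a single oblique split (assuming scikit-learn's LogisticRegression as the node's linear classifier on made-up data; this is not any particular multivariate decision tree algorithm from the literature):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # a class boundary that is oblique in feature space

# One multivariate (oblique) node: it tests a linear combination w.x + b <= 0
# rather than a single attribute, i.e. the node itself is a linear classifier.
node = LogisticRegression().fit(X, y)
go_left = node.decision_function(X) <= 0
left, right = X[go_left], X[~go_left]
print(node.coef_, node.intercept_, len(left), len(right))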


Summary

The most famous representatives of decision tree learning algorithms are ID3, C4.5 and CART.
This chapter describes the process of decision tree generation, decision tree partition selection, pruning strategy, continuous value and missing value processing, multivariate decision tree.
In the selection of decision tree division, information gain, gain rate, Gini index and other criteria are commonly used for division. These criteria have a large impact on the size of the decision tree, but only a limited impact on generalization performance.
This chapter introduces the basic strategy of decision tree pruning. The pruning method and degree have a significant impact on the generalization performance of decision trees.


Origin blog.csdn.net/G_Shengn/article/details/127687362