[Statistical Learning|Book Reading] Chapter 5 Decision Tree p55-p75

train of thought

The decision tree is a basic classification and regression method; here we mainly discuss classification decision trees. A decision tree model has a tree structure and, in a classification problem, represents the process of classifying instances based on their features. It can be regarded as a set of if-then rules, or as a conditional probability distribution defined on the feature space and the class space.
During learning, a decision tree model is built from the training data according to the principle of minimizing a loss function. During prediction, the decision tree model is used to classify new data.
Decision tree learning usually includes three steps: feature selection, decision tree generation, and decision tree pruning.

Decision Tree Modeling and Learning

A decision tree estimates a conditional probability model from the training data set.
A loss function expresses whether the model fits well.
The learning problem becomes that of finding the optimal decision tree in the sense of the loss function.
The decision tree is generated by recursively selecting the optimal feature.
Overfitting may occur, so the tree is pruned.

The goal of decision tree learning is to build a decision tree model from a given training data set so that it can classify instances correctly. Decision tree learning estimates a conditional probability model from the training data. There are infinitely many conditional probability models based on partitions of the feature space; the selected model should not only fit the training data well but also predict unknown data well.
Decision tree learning expresses this goal with a loss function. The loss function is usually a regularized maximum likelihood function, and the learning strategy is to minimize this loss function. Once the loss function is fixed, decision tree learning becomes the problem of selecting the optimal decision tree in the sense of the loss function. Because selecting the optimal decision tree among all possible decision trees is an NP-complete problem, practical decision tree learning algorithms usually adopt heuristics that approximately solve this optimization problem.

Decision tree learning algorithm:

Feature selection
Decision tree generation
Decision tree pruning

feature selection

Feature selection means selecting the features that have the ability to classify the training data. The feature selection criteria are information gain or information gain ratio.

information gain

Intuitively, a feature with larger information gain has stronger classification ability.
Definition: The information gain $g(D,A)$ of feature $A$ with respect to training data set $D$ is defined as the difference between the empirical entropy $H(D)$ of $D$ and the empirical conditional entropy $H(D|A)$ of $D$ given feature $A$, namely:
$$g(D,A)=H(D)-H(D|A)$$
In general, the difference between the entropy $H(Y)$ and the conditional entropy $H(Y|X)$ is called the mutual information; the information gain in decision tree learning is equivalent to the mutual information between the classes and the features in the training data set.

The feature selection method based on the information gain criterion: for the training data set D, compute the information gain of each feature, compare the values, and select the feature with the largest information gain.
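To make the computation concrete, here is a minimal Python sketch (not from the book; the function names and toy data are made up for illustration, and base-2 logarithms are used) that computes the empirical entropy $H(D)$, the empirical conditional entropy $H(D|A)$, and the information gain $g(D,A)$ for a single categorical feature:

```python
from collections import Counter
from math import log2

def empirical_entropy(labels):
    """H(D): empirical entropy of the class labels in data set D (in bits)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(D|A): empirical conditional entropy of D given feature A."""
    n = len(labels)
    # Partition D by the values of feature A and weight each subset's entropy.
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    return sum((len(s) / n) * empirical_entropy(s) for s in subsets.values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A)."""
    return empirical_entropy(labels) - conditional_entropy(feature_values, labels)

# Toy example (hypothetical data): feature A takes values 'a'/'b', classes are 0/1.
A = ['a', 'a', 'b', 'b', 'b']
y = [1, 1, 0, 0, 1]
print(information_gain(A, y))   # about 0.42 bits
```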

information gain ratio

The magnitude of the information gain is relative to the training data set and has no absolute meaning. When classification is difficult, that is, when the empirical entropy of the training data set is large, the information gain tends to be large; otherwise it tends to be small. The information gain ratio can correct this problem.
Definition: The information gain ratio $g_R(D,A)$ of feature $A$ with respect to training data set $D$ is defined as the ratio of its information gain $g(D,A)$ to the entropy $H_A(D)$ of $D$ with respect to the values of feature $A$:
$$g_R(D,A)=\frac{g(D,A)}{H_A(D)}, \qquad H_A(D)=-\sum_{i=1}^{n}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$
where $n$ is the number of values taken by feature $A$ and $D_i$ is the subset of $D$ on which $A$ takes its $i$-th value.
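Building on the sketch above (it reuses `empirical_entropy` and `information_gain`), the information gain ratio can be computed by dividing the gain by the split information $H_A(D)$; this is an illustrative sketch, not the book's code:

```python
def feature_value_entropy(feature_values):
    """H_A(D): entropy of D with respect to the values of feature A (split information)."""
    # Same formula as empirical_entropy, applied to the feature's own values.
    return empirical_entropy(feature_values)

def information_gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    split_info = feature_value_entropy(feature_values)
    if split_info == 0:   # feature A takes a single value and cannot split D
        return 0.0
    return information_gain(feature_values, labels) / split_info
```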

Decision Tree Generation (Generation Algorithm)

ID3 algorithm

The core of the ID3 algorithm is to apply the information gain criterion to select a feature at each node of the decision tree and to build the tree recursively.
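As a rough illustration of this recursion (reusing the `information_gain` helper from the earlier sketch; the dict-based tree representation and the `epsilon` gain threshold are my own simplifications, not the book's notation):

```python
from collections import Counter

def majority_class(labels):
    """Most frequent class among the labels."""
    return Counter(labels).most_common(1)[0][0]

def id3(rows, labels, features, epsilon=1e-3):
    """Recursively build an ID3 tree.

    rows     : list of dicts mapping feature name -> feature value
    labels   : class label of each row
    features : feature names still available for splitting
    epsilon  : gain threshold below which the node becomes a leaf
    """
    # Stop if all samples share one class, or no features remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not features:
        return majority_class(labels)

    # Choose the feature with the largest information gain.
    gains = {f: information_gain([r[f] for r in rows], labels) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < epsilon:
        return majority_class(labels)

    # Split on the chosen feature and recurse on each subset.
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best],
                                epsilon)
    return tree
```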

C4.5 Generation Algorithm

During generation, C4.5 uses the information gain ratio to select features.
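Under the same simplifications as the `id3` sketch above (and reusing its helpers), the generation step of C4.5 differs only in the selection criterion; real C4.5 also handles continuous features and missing values, which this sketch omits:

```python
def c45(rows, labels, features, epsilon=1e-3):
    """Same recursion as id3 above, but the splitting feature is chosen
    by information gain ratio instead of information gain."""
    if len(set(labels)) == 1:
        return labels[0]
    if not features:
        return majority_class(labels)
    gains = {f: information_gain_ratio([r[f] for r in rows], labels) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < epsilon:
        return majority_class(labels)
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = c45([rows[i] for i in idx], [labels[i] for i in idx],
                                [f for f in features if f != best], epsilon)
    return tree
```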

Pruning of decision trees

Decision tree generation algorithms build the tree recursively until no further split is possible. A tree generated this way is often very accurate on the training data but much less accurate on unknown test data; that is, overfitting easily occurs. To prevent this, the generated tree is simplified by pruning.

The pruning of the decision tree is often achieved by minimizing the overall loss function of the decision tree.
Let the tree $T$ have $|T|$ leaf nodes. Let $t$ be a leaf node of $T$ containing $N_t$ sample points, of which $N_{tk}$ belong to class $k$, $k=1,2,...,K$; let $H_t(T)$ be the empirical entropy at leaf node $t$, and let $\alpha \ge 0$ be a parameter. The decision tree learning loss function can then be defined as:
$$C_{\alpha}(T)=\sum_{t=1}^{|T|}N_tH_t(T)+\alpha|T|$$
where the empirical entropy is:
$$H_t(T)=-\sum_{k}\frac{N_{tk}}{N_t}\log\frac{N_{tk}}{N_t}$$
Denote the first term of the loss function by:
$$C(T)=\sum_{t=1}^{|T|}N_tH_t(T)=-\sum_{t=1}^{|T|}\sum_{k=1}^{K}N_{tk}\log\frac{N_{tk}}{N_t}$$
so that:
$$C_{\alpha}(T)=C(T)+\alpha|T|$$
Here $C(T)$ measures the prediction error of the model on the training data, i.e., how well the model fits the training data, and $|T|$ measures model complexity. The parameter $\alpha \ge 0$ controls the trade-off between the two: a larger $\alpha$ favors a simpler model, and vice versa; $\alpha = 0$ means only the fit to the training data is considered and model complexity is ignored.

Minimizing the loss function defined above is equivalent to regularized maximum likelihood estimation. Therefore, pruning by the principle of minimizing the loss function amounts to model selection via regularized maximum likelihood estimation.
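A small sketch of how this cost could be evaluated and used in a bottom-up pruning decision (the representation of leaves as class-count lists and the example counts are hypothetical, chosen only to illustrate the formula; base-2 logarithms are used):

```python
from math import log2

def leaf_entropy(class_counts):
    """H_t(T): empirical entropy at a leaf, given the class counts N_tk at that leaf."""
    n_t = sum(class_counts)
    return -sum((n_tk / n_t) * log2(n_tk / n_t) for n_tk in class_counts if n_tk > 0)

def tree_cost(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|.

    leaves : list of per-leaf class-count lists, e.g. [[3, 1], [0, 4]]
    alpha  : complexity penalty; a larger alpha favors smaller trees
    """
    fit_term = sum(sum(counts) * leaf_entropy(counts) for counts in leaves)
    return fit_term + alpha * len(leaves)

# Bottom-up pruning decision at one internal node (hypothetical counts):
# compare the cost of keeping the subtree's leaves with the cost of collapsing
# them into a single leaf holding the merged class counts.
subtree_leaves = [[3, 1], [0, 4]]
collapsed_leaf = [[3, 5]]
alpha = 1.0
if tree_cost(collapsed_leaf, alpha) <= tree_cost(subtree_leaves, alpha):
    print("prune: replace subtree with a single leaf")
else:
    print("keep subtree")
```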

CART algorithm

Decision tree generation
Decision tree pruning

The classification and regression tree (CART) can be used for both classification and regression. A regression tree is built with the squared-error minimization criterion, and a classification tree with the Gini index minimization criterion, to perform feature selection and generate a binary tree.

Use the "Gini index" to select the partition attribute, and the purity of the node can be measured by the Gini value:
G ini ( p ) = ∑ k = 1 K pk ( 1 − pk ) = 1 − ∑ k = 1 K pk 2 Gini(p )=\sum_{k=1}^{K}p_k(1-p_k)=1-\sum_{k=1}^{K}p_k^2Gini(p)=k=1Kpk(1pk)=1k=1Kpk2
For a given sample D, its Gini index is: G ini ( D ) = 1 − ∑ k = 1 K ( ∣ C k ∣ ∣ D ∣ ) 2 Gini(D)=1-\sum_{k=1}^ {K} (\frac{|C_k|}{|D|} )^2G ini ( D )=1k=1K(DCk)2
在特征A的条件下,集合D的基尼指数定义为: G i n i ( D , A ) = ∣ D 1 ∣ ∣ D ∣ G i n i ( D 1 ) + ∣ D 2 ∣ ∣ D ∣ G i n i ( D 2 ) Gini(D,A)=\frac{|D_1|}{|D|}Gini(D_1)+ \frac{|D_2|}{|D|}Gini(D_2) Gini(D,A)=DD1Gini(D1)+DD2Gini(D2)
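A minimal sketch of computing these quantities for a categorical feature with the binary split "A = a" versus "A ≠ a" that CART uses (the toy data and function names are illustrative only, not the book's code):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(feature_values, labels, split_value):
    """Gini(D, A): weighted Gini of the binary split 'A == split_value' vs 'A != split_value'."""
    d1 = [y for v, y in zip(feature_values, labels) if v == split_value]
    d2 = [y for v, y in zip(feature_values, labels) if v != split_value]
    n = len(labels)
    return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)

# Toy example (hypothetical data): pick the candidate split with the smallest Gini index.
A = ['a', 'a', 'b', 'b', 'c']
y = [1, 1, 0, 0, 1]
best = min(set(A), key=lambda v: gini_index(A, y, v))
print(best, gini_index(A, y, best))   # 'b', 0.0
```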
