"Statistical learning methods," notes - Decision Tree

A decision tree is a basic method for classification and regression. The tree model describes the process of classifying instances.

Figure 1-1: Watermelon classification tree, from "Machine Learning"

Suppose a training data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ is given, where $x_i$ is an input instance, $A$ denotes a feature, and $y_i \in \{1, 2, \dots, K\}$ is the class label (in what follows, $C_k$ denotes the set of samples of $D$ belonging to class $k$, so $D$ is partitioned into the subsets $C_1, \dots, C_K$ according to the class labels).

Suppose feature $A$ has $n$ different values $\{a_1, a_2, \dots, a_n\}$. According to the values of feature $A$, the data set $D$ is divided into $n$ subsets $D_1, D_2, \dots, D_n$. Denote by $D_{ik}$ the set of samples in subset $D_i$ that belong to class $C_k$. Decision tree learning then builds a tree model from the training data set so that instances can be classified correctly.

Decision tree learning typically includes three steps: feature selection, decision tree generation, and decision tree pruning.

 

1. Feature Selection

The key to feature selection is choosing features that have the ability to classify the data; the commonly used criteria are information gain and information gain ratio.

Before introducing information gain and information gain ratio, we first introduce entropy and conditional entropy.

Entropy measures the uncertainty of a random variable. For a discrete random variable $X$ with probability distribution $P(X = x_i) = p_i$, $i = 1, 2, \dots, n$, its entropy is

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

where the logarithm is taken to base 2 or base $e$.

For conditional entropy, suppose the pair of random variables $(X, Y)$ has joint probability distribution $P(X = x_i, Y = y_j) = p_{ij}$. The conditional entropy $H(Y \mid X)$ is the uncertainty of the random variable $Y$ given that the random variable $X$ is known, defined as

$$H(Y \mid X) = \sum_{i=1}^{n} P(X = x_i)\, H(Y \mid X = x_i)$$
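To make these two quantities concrete, here is a minimal Python sketch of their empirical versions as used below (the function names and the toy data are illustrative choices, not notation from the book):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """Empirical conditional entropy: group the labels by the value of one
    feature, then take the size-weighted average of the group entropies."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# Tiny example with one feature and a binary class label.
feature = ["sunny", "sunny", "rain", "rain", "rain"]
label   = ["yes",   "yes",   "no",   "no",   "yes"]
print(entropy(label))                        # uncertainty of the class
print(conditional_entropy(feature, label))   # uncertainty once the feature is known
```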

With entropy and conditional entropy defined, we can compute the information gain, which expresses how much the uncertainty about the class $Y$ is reduced once the information of feature $X$ is known.

The information gain $g(D, A)$ of feature $A$ with respect to the training data set $D$ is defined as the difference between the entropy $H(D)$ of set $D$ and the conditional entropy $H(D \mid A)$ of $D$ given feature $A$, i.e.,

$$g(D, A) = H(D) - H(D \mid A)$$

The feature selection method under the information gain criterion is: for the training data set $D$, compute the information gain of every feature and select the feature with the maximum information gain for splitting.
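A minimal sketch of this criterion, assuming the entropy and conditional_entropy helpers from the previous sketch and a simple list-of-dicts layout for the data (both assumptions of mine, not the book's notation):

```python
def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A), reusing the helpers sketched above."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

def best_feature_by_gain(rows, labels):
    """rows: list of dicts mapping feature name -> value (assumed layout).
    Returns the name of the feature with the maximum information gain."""
    return max(rows[0].keys(),
               key=lambda a: information_gain([r[a] for r in rows], labels))
```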

 

The information gain ratio $g_R(D, A)$ is the ratio of the information gain $g(D, A)$ to the entropy $H_A(D)$ of the training data set $D$ with respect to the values of feature $A$, i.e.,

$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$$

where $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log \frac{|D_i|}{|D|}$ and $n$ is the number of distinct values of feature $A$. The information gain ratio corrects the tendency of the information gain criterion to favor features with many values.
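Under the same assumptions, the gain ratio can be sketched by noting that $H_A(D)$ is simply the entropy of the distribution of the feature's values:

```python
def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D); H_A(D) is the entropy of D with respect
    to the values of feature A, i.e. entropy() applied to the values."""
    split_info = entropy(feature_values)
    if split_info == 0:          # feature takes a single value on D
        return 0.0
    return information_gain(feature_values, labels) / split_info
```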

 

2. Decision Tree Generation

Depending on the feature selection criterion used, there are two common ways to generate a decision tree: the ID3 algorithm and the C4.5 algorithm.

ID3 algorithm:

Input: training data set D, feature set A, threshold $\varepsilon$;

Output: decision tree T

  1. If all instances in $D$ belong to the same class $C_k$, then $T$ is a single-node tree; take $C_k$ as the class label of that node and return $T$;
  2. If the instances in $D$ do not all belong to the same class and the feature set $A$ is empty, then $T$ is a single-node tree; take the class with the most instances in $D$ as the class label of that node and return $T$;
  3. If the instances in $D$ do not all belong to the same class and the feature set $A$ is not empty, compute the information gain of every feature in $A$ with respect to $D$ and select the feature $A_g$ with the maximum information gain;
  4. If the information gain of $A_g$ is smaller than the threshold $\varepsilon$, then $T$ is a single-node tree; take the class with the most instances in $D$ as the class label of that node and return $T$;
  5. Otherwise, for each possible value $a_i$ of $A_g$, split $D$ into non-empty subsets $D_i$, each forming a child node; for the $i$-th child node, use $D_i$ as the training set and $A \setminus \{A_g\}$ as the feature set, call steps (1) through (5) recursively to obtain the subtree $T_i$, and return $T$.
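A minimal Python sketch of steps (1) through (5), reusing information_gain from the earlier sketches; the dict-based tree representation and the default threshold value are illustrative assumptions, not part of the algorithm's specification:

```python
from collections import Counter

def majority_class(labels):
    """Class that has the most instances in the subset."""
    return Counter(labels).most_common(1)[0][0]

def id3(rows, labels, features, epsilon=1e-3):
    """rows: list of dicts feature name -> value; labels: class labels;
    features: feature names still available for splitting; epsilon: threshold.
    Returns a class label (leaf) or a dict {feature: {value: subtree, ...}}."""
    if len(set(labels)) == 1:                  # step (1): a single class
        return labels[0]
    if not features:                           # step (2): no feature left
        return majority_class(labels)
    # step (3): feature with the maximum information gain
    gains = {a: information_gain([r[a] for r in rows], labels) for a in features}
    best = max(gains, key=gains.get)
    if gains[best] < epsilon:                  # step (4): gain below threshold
        return majority_class(labels)
    # step (5): split on each value of the chosen feature and recurse
    remaining = [a for a in features if a != best]
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [y for _, y in sub]
        tree[best][value] = id3(sub_rows, sub_labels, remaining, epsilon)
    return tree
```

Calling id3(rows, labels, list(rows[0].keys())) on the list-of-dicts data used in the earlier sketches would return a nested dict representing the tree.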

     

C4.5 algorithm:

Input: training data set D, feature set A, threshold $\varepsilon$;

Output: decision tree T

  1. If all instances in $D$ belong to the same class $C_k$, then $T$ is a single-node tree; take $C_k$ as the class label of that node and return $T$;
  2. If the instances in $D$ do not all belong to the same class and the feature set $A$ is empty, then $T$ is a single-node tree; take the class with the most instances in $D$ as the class label of that node and return $T$;
  3. If the instances in $D$ do not all belong to the same class and the feature set $A$ is not empty, compute the information gain ratio of every feature in $A$ with respect to $D$ and select the feature $A_g$ with the maximum information gain ratio;
  4. If the information gain ratio of $A_g$ is smaller than the threshold $\varepsilon$, then $T$ is a single-node tree; take the class with the most instances in $D$ as the class label of that node and return $T$;
  5. Otherwise, for each possible value $a_i$ of $A_g$, split $D$ into non-empty subsets $D_i$, each forming a child node; for the $i$-th child node, use $D_i$ as the training set and $A \setminus \{A_g\}$ as the feature set, call steps (1) through (5) recursively to obtain the subtree $T_i$, and return $T$.

     

The steps of ID3 and C4.5 are essentially the same; the difference is that ID3 computes the information gain while C4.5 computes the information gain ratio.
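In the sketches above, moving from ID3-style to C4.5-style generation would therefore only require swapping the selection criterion used in step (3), for example:

```python
def best_feature_by_gain_ratio(rows, labels):
    """C4.5-style selection: same as best_feature_by_gain, but using the
    gain_ratio helper from the earlier sketch as the splitting criterion."""
    return max(rows[0].keys(),
               key=lambda a: gain_ratio([r[a] for r in rows], labels))
```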

     

3. Decision Tree Pruning

Decision tree pruning is usually implemented by minimizing the overall loss function (or cost function) of the tree.

Suppose the tree $T$ has $|T|$ leaf nodes, and $t$ is a leaf node of $T$ with $N_t$ sample points, of which $N_{tk}$ belong to class $k$, $k = 1, 2, \dots, K$. Let $H_t(T)$ be the entropy of leaf node $t$ and $\alpha \geq 0$ a parameter. The loss function of the decision tree can then be defined as

$$C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|$$

where the entropy is

$$H_t(T) = -\sum_{k=1}^{K} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}$$

Figure 3-1: Tree nodes involved in the calculation of the loss function
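As a hedged illustration of this loss, assume each leaf node is represented simply by the list of class labels of the samples that fall into it, and reuse the entropy helper from the first sketch (its base-2 logarithm is a detail of that sketch, not of the formula above):

```python
def tree_loss(leaf_label_lists, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|.
    leaf_label_lists: one list of class labels per leaf node (assumed layout)."""
    fit = sum(len(labels) * entropy(labels) for labels in leaf_label_lists)
    return fit + alpha * len(leaf_label_lists)
```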

Decision tree pruning algorithm:

Input: the entire tree T produced by the generation algorithm, and the parameter $\alpha$;

Output: the pruned subtree $T_\alpha$

  1. Compute the entropy of every node;
  2. Recursively retract upward from the leaf nodes of the tree. Let the whole tree before and after a group of leaf nodes is retracted into its parent node be $T_B$ and $T_A$ respectively, with corresponding loss functions $C_\alpha(T_B)$ and $C_\alpha(T_A)$. If $C_\alpha(T_A) \leq C_\alpha(T_B)$, prune: the parent node becomes a new leaf node;
  3. Repeat step (2) until no further pruning is possible, which yields the subtree $T_\alpha$ with the minimum loss function.
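Because retracting a group of sibling leaf nodes only changes that subtree's terms in $C_\alpha(T)$, the comparison $C_\alpha(T_A) \leq C_\alpha(T_B)$ reduces to a local test. A minimal sketch of that test, under the same assumed leaf representation as above:

```python
def should_prune(child_label_lists, alpha):
    """Compare the local loss before and after retracting a parent's leaf
    children: before, each child is its own leaf; after, the parent becomes a
    single leaf holding all of their samples.  True when pruning does not
    increase C_alpha."""
    merged = [label for labels in child_label_lists for label in labels]
    before = (sum(len(l) * entropy(l) for l in child_label_lists)
              + alpha * len(child_label_lists))
    after = len(merged) * entropy(merged) + alpha
    return after <= before
```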

     

The pruning algorithm can be incorporated into the decision tree generation algorithm, which reduces the time spent building the model.

     

     

     

     
