Andrew Ng Machine Learning Introductory Notes 6 - Decision Trees (supplemented by the Watermelon Book)

6 Decision Tree

6.1 Structure

  • One root node: contains the complete sample set
  • Several internal nodes: each corresponds to an attribute test
  • Several leaf nodes: each corresponds to a decision result
  • == Each node represents an attribute; the edges leaving a node represent that attribute's values ==
  • After the optimal splitting attribute is chosen at the first layer, the partitioned subsets of D are processed recursively, choosing the optimal splitting attribute at each subsequent layer, until a decision tree is formed (a minimal recursive sketch follows this list)
  • A decision tree that performs only a single split is called a decision stump
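Below is a minimal Python sketch of this recursive construction; it is not code from the original notes. It assumes the caller supplies a `best_split(D, A)` function implementing one of the selection criteria from section 6.3, and represents the tree as nested dicts.

```python
from collections import Counter

def tree_generate(D, A, best_split):
    """Recursive decision-tree construction (sketch).
    D: list of (attribute_dict, label) pairs reaching the current node.
    A: list of candidate attribute names.
    best_split(D, A): assumed helper returning the optimal splitting attribute
    under one of the section 6.3 criteria (information gain, gain ratio, Gini).
    """
    labels = [y for _, y in D]
    if len(set(labels)) == 1:                 # all samples in one class -> leaf
        return labels[0]
    if not A:                                 # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    a = best_split(D, A)                      # internal node: attribute test
    node = {a: {}}
    for v in {x[a] for x, _ in D}:            # one edge per value of attribute a
        Dv = [(x, y) for x, y in D if x[a] == v]
        node[a][v] = tree_generate(Dv, [b for b in A if b != a], best_split)
    return node
```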

6.2 The purpose

To generate a decision tree with strong generalization ability, i.e., a strong ability to handle unseen examples.

6.3 Splitting-attribute selection criteria

As splitting proceeds, we want the samples contained in each branch node to belong to the same class as far as possible; that is, the purity of the nodes should become higher and higher.

6.3.1 ID3 Decision Tree - information gain criterion

  • The information entropy of D is defined as

\[ \text{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2p_k\tag{6.1} \]

Here D is the current sample set and \(p_k\) is the proportion of samples in D belonging to the k-th class (classes are given by the labels). The smaller the information entropy, the higher the purity of D.

  • The information gain obtained by splitting D on attribute a is defined as

\[ \text{Gain}(D,a)=\text{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\text{Ent}(D^v)\tag{6.2} \]

Here \(D^v\) is the subset of samples in D that take the value \(a^v\) on the discrete attribute \(a\), whose possible values are {\(a^1,a^2,\dots,a^V\)}. The larger the information gain, the greater the purity improvement obtained by splitting on attribute a (a small sketch follows the next equation). Thus the optimal splitting attribute is
\[ a_*=\underset{a \in A}{\arg\max}\,\text{Gain}(D,a)\tag{6.3} \]
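A short Python sketch of equations (6.1) and (6.2), assuming the class labels and a single discrete attribute column are given as 1-D arrays; the toy data at the bottom is made up for illustration and is not from the notes.

```python
import numpy as np

def entropy(labels):
    """Ent(D), Eq. (6.1): labels is a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, col):
    """Gain(D, a), Eq. (6.2): col holds the values of one discrete attribute."""
    labels, col = np.asarray(labels), np.asarray(col)
    return entropy(labels) - sum(
        (col == v).mean() * entropy(labels[col == v]) for v in np.unique(col)
    )

# Toy example (made-up values, just to show the call).
y     = ["good", "good", "bad", "bad", "bad"]
color = ["green", "green", "black", "white", "black"]
print(info_gain(y, color))
```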

6.3.2 C4.5 decision tree - gain ratio criterion

The information gain criterion is biased toward attributes with many possible values. To reduce this adverse effect, the C4.5 decision tree algorithm chooses the optimal splitting attribute by the gain ratio, defined as
\[ \text{Gain\_ratio}(D,a)=\frac{\text{Gain}(D,a)}{\text{IV}(a)}\tag{6.4} \]
where
\[ \text{IV}(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}\tag{6.5} \]
is called the intrinsic value of attribute a; the more possible values attribute a has (the larger V is), the larger \(\text{IV}(a)\) tends to be.

  • The gain ratio criterion, in turn, is biased toward attributes with fewer possible values. C4.5 therefore first finds, among the candidate splitting attributes, those whose information gain is above average, and then selects from these the one with the highest gain ratio as the optimal splitting attribute (a small sketch follows this list):

  • \[ a_*=\underset{a \in A}{\arg\max}\,\text{Gain\_ratio}(D,a),\text{ when }\text{Gain}(D,a)>\text{mean}(\text{Gain}(D,a))\tag{6.6} \]
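A hedged Python sketch of equations (6.4)-(6.6). It exploits the fact that IV(a) has the same functional form as Ent(D) applied to the attribute column; the `attrs` dict mapping attribute name to value column is a representation assumed here, not one used in the notes.

```python
import numpy as np

def _freq_entropy(values):
    """-sum p*log2(p) over value frequencies: this is Ent(D) when `values`
    are class labels, and IV(a) (Eq. 6.5) when they are attribute values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, col):
    """Gain(D, a), Eq. (6.2)."""
    labels, col = np.asarray(labels), np.asarray(col)
    return _freq_entropy(labels) - sum(
        (col == v).mean() * _freq_entropy(labels[col == v]) for v in np.unique(col)
    )

def gain_ratio(labels, col):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a), Eq. (6.4)."""
    return info_gain(labels, col) / _freq_entropy(col)

def c45_select(labels, attrs):
    """C4.5 heuristic, Eq. (6.6): among attributes whose information gain is at
    least the average, return the one with the highest gain ratio.
    `attrs` maps attribute name -> value column (assumed representation)."""
    gains = {a: info_gain(labels, c) for a, c in attrs.items()}
    mean_gain = np.mean(list(gains.values()))
    candidates = [a for a, g in gains.items() if g >= mean_gain]
    return max(candidates, key=lambda a: gain_ratio(labels, attrs[a]))
```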

6.3.3 CART decision tree - Gini index criterion

  • The purity of data set D can also be measured by the Gini index

\[ \text{Gini}(D)=\sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\ne k}p_kp_{k'}=1-\sum_{k=1}^{|\mathcal{Y}|}p_k^2\tag{6.7} \]

The Gini index reflects the probability that two samples drawn at random from data set D carry inconsistent class labels. Thus, the smaller the Gini index, the higher the purity of D.

  • The Gini index of attribute a is defined as
    \[ \text{Gini\_index}(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}\text{Gini}(D^v)\tag{6.8} \]
    Accordingly, from the candidate attribute set A, the attribute whose post-split Gini index is smallest is selected as the optimal splitting attribute (a small sketch follows):
    \[ a_*=\underset{a \in A}{\arg\min}\,\text{Gini\_index}(D,a)\tag{6.9} \]
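The same style of sketch for equations (6.7)-(6.9), again with attributes passed as a name -> value-column dict (an assumed representation, not from the notes).

```python
import numpy as np

def gini(labels):
    """Gini(D), Eq. (6.7)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(labels, col):
    """Gini_index(D, a), Eq. (6.8), for one discrete attribute column."""
    labels, col = np.asarray(labels), np.asarray(col)
    return sum((col == v).mean() * gini(labels[col == v]) for v in np.unique(col))

def cart_select(labels, attrs):
    """Eq. (6.9): the attribute with the smallest post-split Gini index.
    `attrs` maps attribute name -> value column (assumed representation)."""
    return min(attrs, key=lambda a: gini_index(labels, attrs[a]))
```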

6.4 Pruning

As node splitting is repeated, the tree sometimes grows too many branches and starts treating peculiarities of the training set as general properties of the underlying data, i.e., it overfits. Pruning therefore actively removes some branches to reduce the risk of overfitting.

6.4.1 Pre-pruning

While the decision tree is being generated, each node is evaluated before it is split; if splitting the current node would not improve the generalization performance of the tree, splitting stops and the current node is marked as a leaf node (a small sketch of this check follows the next bullet).

  • Many branches of the decision tree are never expanded, which reduces the risk of overfitting but increases the risk of underfitting
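A minimal sketch of the pre-pruning check, assuming discrete attribute values and that the training and validation samples reaching the current node are passed in as arrays; the function names are made up for illustration.

```python
from collections import Counter
import numpy as np

def majority(labels):
    """Most common class label."""
    return Counter(labels).most_common(1)[0][0]

def split_improves_validation(train_y, train_a, val_y, val_a):
    """Pre-pruning check for one node and one discrete attribute:
    does splitting beat keeping the node as a majority-class leaf, measured on
    the validation samples that reach this node?"""
    train_y, train_a = np.asarray(train_y), np.asarray(train_a)
    val_y, val_a = np.asarray(val_y), np.asarray(val_a)

    # Accuracy if the node stays a leaf predicting the training majority class.
    leaf_pred = majority(train_y)
    acc_leaf = np.mean(val_y == leaf_pred)

    # Accuracy if we split: each child predicts the majority class of its own
    # training samples; validation values never seen in training fall back to
    # the parent's prediction.
    child_pred = {v: majority(train_y[train_a == v]) for v in np.unique(train_a)}
    preds = np.array([child_pred.get(v, leaf_pred) for v in val_a])
    acc_split = np.mean(val_y == preds)

    return acc_split > acc_leaf   # split only if validation accuracy improves
```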

6.4.2 Post-pruning

A complete decision tree is first generated from the training set; the non-leaf nodes are then examined bottom-up, and if replacing the subtree rooted at a node with a leaf node would improve the generalization performance of the tree, that subtree is replaced by a leaf node (a practical sketch follows the next bullet).

  • Post-pruning usually retains more branches than pre-pruning; its risk of underfitting is small and its generalization performance is often better than that of a pre-pruned tree, but its training time is much longer than that of an unpruned or pre-pruned tree
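The notes describe reduced-error post-pruning with a validation set; as a practical stand-in, the sketch below uses scikit-learn's built-in cost-complexity post-pruning (a related but different criterion) to illustrate the grow-then-prune workflow, selecting the pruned tree by validation accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree on the training set, then enumerate the pruning path
# (candidate ccp_alpha values, from no pruning up to a single leaf).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the one with the best validation accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```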

6.5 Continuous and missing values

6.5.1 Continuous value processing

The midpoints of adjacent attribute values in the sample set are taken as candidate split points t; the information gain is computed for each t, and the t that maximizes it is chosen as the split point (a sketch follows the equation)
\[ \begin{aligned}\text{Gain}(D,a)&=\max_{t \in T_a}\text{Gain}(D,a,t)\\&=\max_{t \in T_a}\text{Ent}(D)-\sum_{\lambda \in \{-,+\}}\frac{|D_t^\lambda|}{|D|}\text{Ent}(D_t^\lambda)\end{aligned}\tag{6.10} \]
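A sketch of the bi-partition procedure in Eq. (6.10): candidate thresholds are the midpoints of adjacent sorted attribute values, and the threshold with the largest information gain is kept. The helper names are assumptions, not from the notes.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(labels, values):
    """Bi-partition of one continuous attribute, Eq. (6.10):
    candidate split points T_a are midpoints of adjacent sorted values;
    returns (best information gain, best threshold t)."""
    labels, values = np.asarray(labels), np.asarray(values)
    vs = np.unique(values)
    candidates = (vs[:-1] + vs[1:]) / 2        # midpoints of adjacent values
    base = entropy(labels)

    best_gain, best_t = -np.inf, None
    for t in candidates:
        left, right = labels[values <= t], labels[values > t]
        gain = base \
            - len(left) / len(labels) * entropy(left) \
            - len(right) / len(labels) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```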

6.5.2 Missing values

  1. How to choose a splitting attribute when some samples' values of that attribute are missing

     The information gain formula is generalized: the gain is computed on the subset of samples whose value of the attribute is present, and is then scaled by the proportion of such samples.

  2. How to partition a sample whose value of the chosen splitting attribute is missing

     The sample is sent into every child node simultaneously, with its weight in each child scaled by the proportion of samples taking that attribute value (a sketch follows this list).
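A rough sketch of the first generalization (weighted information gain on the non-missing subset, scaled by the non-missing fraction), in the spirit of the Watermelon-book treatment. Sample weights default to 1 and `None` marks a missing value; both of these representation choices are assumptions made here.

```python
import numpy as np

def weighted_entropy(labels, w):
    """Ent(D) where each sample carries a weight."""
    p = np.array([w[labels == k].sum() for k in np.unique(labels)]) / w.sum()
    return -np.sum(p * np.log2(p))

def gain_with_missing(labels, col, w=None):
    """Generalized information gain for an attribute with missing values:
    compute the (weighted) gain on the samples whose value is present, then
    scale by rho, the fraction of total weight that is non-missing."""
    labels = np.asarray(labels, dtype=object)
    col = np.asarray(col, dtype=object)
    w = np.ones(len(labels)) if w is None else np.asarray(w, dtype=float)

    present = np.array([v is not None for v in col])
    rho = w[present].sum() / w.sum()
    yp, cp, wp = labels[present], col[present], w[present]

    gain = weighted_entropy(yp, wp) - sum(
        (wp[cp == v].sum() / wp.sum()) * weighted_entropy(yp[cp == v], wp[cp == v])
        for v in np.unique(cp)
    )
    return rho * gain

# A sample whose value of the splitting attribute is missing is then sent into
# every child, its weight multiplied by that child's share of the non-missing
# weight (the second point in the list above).
```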

6.6 Multivariate decision trees

[Figure: 6.6 multivariate decision tree split]

Axis-parallel splits require many single-attribute tests to approximate such a boundary, which makes prediction expensive. Multivariate decision trees instead allow decision boundaries that are not parallel to the axes by using, at each non-leaf node, a linear combination of attributes as the test, i.e., a linear classifier of the form \(\sum_{i=1}^{d}w_ia_i=t\). The training result is shown below, followed by a small sketch of such a node test.

[Figure: 6.6 multivariate decision tree training result]
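A minimal sketch of one oblique (multivariate) node test of the form \(\sum_i w_ia_i=t\). The weights here come from a logistic regression fit on two Iris classes, which is just one convenient way to obtain a linear split, not necessarily how the figures above were produced.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]                 # keep two classes for a single linear test

clf = LogisticRegression().fit(X, y)
w, t = clf.coef_[0], -clf.intercept_[0]   # decision rule: w . x >= t  -> class 1

def oblique_node_test(x):
    """Route a sample to the left/right child of this multivariate node."""
    return "right" if np.dot(w, x) >= t else "left"

print(oblique_node_test(X[0]), oblique_node_test(X[-1]))
```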
