03. Supervised algorithm - decision tree

1. Decision tree algorithm

The decision tree algorithm can be used for both classification and regression.

Training and testing of decision trees:

Training phase: construct a tree from a given training set (starting from the root node, decide which feature to select and how to split on it)

Testing phase: walk a sample from the root down through the constructed tree model until a leaf is reached

Once the decision tree is built, classification or prediction becomes easy. The difficulty lies in how to construct the tree, in other words, how to rank the importance of the data features.

The key to building a decision tree is deciding which attribute to split on in the current state. Depending on the objective function used, there are three main algorithms for building a decision tree:

1.1. ID3 (Iterative Dichotomiser)

Entropy: a measure of the uncertainty of a random variable, which can be understood as the degree of disorder inside a collection of samples.

Information entropy:

$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$

where $K$ is the number of classes, $D$ is the training dataset, and $C_k$ is the subset of samples in $D$ that belong to class $k$.
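
As a minimal illustration (not from the original post), the entropy can be computed directly from the class frequencies. The helper name `entropy` and the toy labels below are hypothetical:

```python
import numpy as np

def entropy(labels):
    """Information entropy H(D) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                  # |C_k| / |D| for each class k
    return float(-np.sum(p * np.log2(p)))

# Toy example: 9 positive and 5 negative samples
print(entropy(["yes"] * 9 + ["no"] * 5))       # ~0.940
```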

Information gain: indicates the degree to which knowing feature X reduces the uncertainty of class Y (it measures purity after a split; ideally, the samples in each branch end up belonging to the same class).

Conditional entropy:

$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$
where $A$ is a feature and $D_i$ is the subset of $D$ on which $A$ takes its $i$-th value.
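
Continuing the sketch above (reusing the `entropy` helper and the NumPy import), conditional entropy is just the size-weighted entropy of the subsets $D_i$ induced by a feature:

```python
def conditional_entropy(feature_values, labels):
    """H(D|A): weighted entropy of the label subsets induced by feature A."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    h = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]   # D_i: samples where A takes value v
        h += len(subset) / len(labels) * entropy(subset)
    return h
```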

Steps to construct a decision tree using information gain (a sketch follows the list):

1. Compute the information gain for each candidate feature and choose the feature with the largest information gain as the current decision node
2. Update the subsets: within each subset, repeat the selection among the remaining features and again pick the one with the largest information gain
3. If a divided subset contains only a single class, it becomes a leaf node of that branch
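
A minimal sketch of steps 1 and 2, built on the helpers above. The dataset layout (a dict mapping feature names to value lists) and the toy weather data are hypothetical:

```python
def information_gain(feature_values, labels):
    return entropy(labels) - conditional_entropy(feature_values, labels)

def best_feature(features, labels):
    """features: dict mapping a feature name to its list of values (one per sample)."""
    return max(features, key=lambda name: information_gain(features[name], labels))

# Toy data: "outlook" separates the classes better than "windy"
features = {
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain"],
    "windy":   [False, False, False, True, True],
}
labels = ["no", "no", "yes", "yes", "no"]
print(best_feature(features, labels))          # -> "outlook"
```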

Drawbacks:

  • ID3 has no pruning strategy and overfits easily;
  • The information gain criterion is biased toward features with many possible values; for an ID-like feature such as a sample "number", the information gain is close to 1;
  • It can only handle discretely distributed features;
  • Missing values are not considered.

With so many shortcomings, is there an improved algorithm? Yes: C4.5.

1.2. C4.5

C4.5 is an improvement on the ID3 algorithm. Whereas ID3 selects splitting attributes by information gain, C4.5 uses the information gain ratio.

In addition, C4.5 offers the following improvements:

  • Pruning during the construction of the decision tree
  • It can also handle non-discrete (continuous) data
  • It can also handle incomplete data

Information gain: $g(D, A) = H(D) - H(D \mid A)$

Information gain ratio: $g_R(D, A) = \frac{g(D, A)}{H_A(D)}$

where $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$ and $n$ is the number of distinct values of feature $A$.
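
A sketch of the gain ratio, reusing the helpers from the ID3 snippets above; the helper `split_information` computes the denominator $H_A(D)$ (names are illustrative):

```python
def split_information(feature_values):
    """H_A(D): entropy of the partition induced by the feature's own values."""
    _, counts = np.unique(feature_values, return_counts=True)
    p = counts / counts.sum()                  # |D_i| / |D|
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(feature_values, labels):
    h_a = split_information(feature_values)
    if h_a == 0.0:                             # feature takes only a single value
        return 0.0
    return information_gain(feature_values, labels) / h_a
```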

Note: decision trees have a high risk of overfitting, since in theory they can separate the training data completely. To prevent overfitting, pruning is necessary:

  • Pre-pruning: the operation of pruning while building a decision tree
  • Post-pruning: pruning after the decision tree is established

Post-pruning uses a certain evaluation criterion. Specifically, C4.5 adopts pessimistic pruning: it works recursively from the bottom up over each non-leaf node and evaluates whether replacing that subtree with its best single leaf node is beneficial. If the error rate after pruning is the same as or lower than before pruning, the subtree is replaced.

Post-pruned decision trees have a low risk of underfitting, and their generalization performance is often better than that of pre-pruned decision trees.
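
To illustrate the bottom-up replacement idea, here is a sketch using a simpler reduced-error check on a held-out validation set (not C4.5's pessimistic estimate); the dict-based tree layout and function names are hypothetical:

```python
def predict(node, sample):
    """Walk one sample (a dict of feature -> value) from the root to a leaf.
    Assumes every feature value in the sample also appears in the tree."""
    while isinstance(node, dict):
        node = node["children"][sample[node["feature"]]]
    return node

def prune(node, X_val, y_val):
    """Replace a subtree with a leaf when the validation error does not increase.
    A node is either a class label (leaf) or {"feature": ..., "children": {value: subtree}}."""
    if not isinstance(node, dict) or not y_val:
        return node
    # 1. Recursively prune the children first (bottom-up)
    for value, child in node["children"].items():
        rows = [(x, y) for x, y in zip(X_val, y_val) if x[node["feature"]] == value]
        node["children"][value] = prune(child, [x for x, _ in rows], [y for _, y in rows])
    # 2. Compare the subtree against a single majority-class leaf
    majority = max(set(y_val), key=y_val.count)
    err_subtree = sum(predict(node, x) != y for x, y in zip(X_val, y_val))
    err_leaf = sum(y != majority for y in y_val)
    return majority if err_leaf <= err_subtree else node
```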

1.3. CART (Classification And Regression Tree)

Classification And Regression Tree (CART) is a kind of decision tree.

This type of decision tree uses the Gini index to select attributes for classification, and the mean squared error to select attributes for regression.

Gini index:

$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k)$
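
A standalone sketch of the Gini index for a set of class labels (the helper name is illustrative):

```python
import numpy as np

def gini(labels):
    """Gini index: sum_k p_k * (1 - p_k), i.e. 1 - sum_k p_k**2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

print(gini(["yes"] * 9 + ["no"] * 5))          # ~0.459
```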

If the target variable is discrete, it is called a classification tree, and the mode (majority class) of the samples in each leaf node is taken as the classification result.

If the target variable is continuous, it is called a regression tree. Continuous targets have no notion of category, so entropy cannot be used, but we can use another measure of the disorder inside a node: the variance.

For any splitting feature A and any split point s, the data are divided into two sets D1 and D2. We look for the feature and split point that minimize the sum of the squared errors of D1 and D2, where each side's error is measured against its own output mean. The expression is:

$\min_{A, s} \left[ \min_{c_1} \sum_{x_i \in D_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in D_2} (y_i - c_2)^2 \right]$
where $c_1$ is the mean of the sample outputs in $D_1$ and $c_2$ is the mean of the sample outputs in $D_2$.
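
A sketch of searching the split point s for a single continuous feature by minimizing the expression above; the function name and toy data are illustrative:

```python
import numpy as np

def best_split_point(x, y):
    """Find the split s minimizing sum_{D1}(y_i - c1)^2 + sum_{D2}(y_i - c2)^2,
    where c1, c2 are the output means of the two sides."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best_s, best_cost = None, np.inf
    for s in np.unique(x)[:-1]:                # candidate split points
        left, right = y[x <= s], y[x > s]
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s, best_cost

# Toy example: the target jumps after x = 3, so the best split is at s = 3
x = [1, 2, 3, 4, 5, 6]
y = [5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
print(best_split_point(x, y))
```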
