Machine learning algorithm ---- decision tree

Decision Tree

1. Decision tree: in the training phase, a decision tree model is built from the training data. In the classification phase, a test sample is passed down the tree layer by layer according to the splitting attributes until a leaf node gives the predicted result.

2. Tree terminology: root node, internal nodes, leaf nodes.

3. Entropy: entropy represents impurity (the opposite of purity), i.e. uncertainty. The larger the entropy, the greater the uncertainty and the smaller the probability of each individual outcome.
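A minimal sketch of this quantity in code (the helper name `entropy` and the toy labels are my own, assuming numpy is available):

```python
import numpy as np

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), computed from the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Two classes in equal proportion give the maximum entropy of 1 bit;
# a pure set (only one class) gives entropy 0.
print(entropy(["yes", "no", "yes", "no"]))  # 1.0
```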

 

4. Gini impurity: the probability that a sample chosen at random from the subset would be misclassified. As the formula shows, the larger the probability of one class, the smaller the Gini index.
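A matching sketch for the Gini impurity (same assumptions as the entropy helper under point 3):

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: the chance that a randomly drawn sample would be
    misclassified if it were labeled at random according to the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini(["yes", "no", "yes", "no"]))    # 0.5 (the most impure two-class case)
print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 (a pure set)
```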

5. ID3 is the basic decision tree construction algorithm; as a classic tree-building algorithm it has a simple structure and is clear and easy to understand.

  ID3 disadvantages: ① it splits by information gain, which may not be as accurate as splitting by the information gain ratio; ② it cannot handle continuous data directly; continuous values must first be discretized; ③ it cannot handle missing (default) values; ④ it has no pruning step, so it is prone to overfitting.

6. C4.5 and CART algorithms

  1. Handling continuous attribute values:

    Sort the attribute values and split the samples into two subsets according to whether the value is below or above a candidate threshold; the split point is chosen by maximizing the information gain. Note that, unlike a discrete attribute, a continuous attribute can still be used for splits in descendant nodes (a small sketch of this split-point search follows after point 6).

   2. Handling missing (default) attribute values:

    When evaluating an attribute (e.g., computing its information gain), use only the samples whose value for that attribute is not missing, and multiply the result by the proportion of non-missing samples as a weight, so that attributes with too many missing values are penalized.

    When a sample with a missing value has to be divided on that attribute, send it into every child node at once, with probabilities given by the distribution of the attribute's values (i.e., the sample's weight is split across the children).
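A hedged sketch of the split-point search referred to above (the function names and toy data are my own; it reuses the same entropy definition as under point 3):

```python
import numpy as np

def entropy(labels):
    # Same entropy helper as under point 3.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split_point(values, labels):
    """Sort the attribute values, try the midpoints between consecutive distinct
    values as thresholds t, and keep the t whose <= t / > t split has the
    largest information gain."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]

    parent_ent = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue
        t = (values[i] + values[i + 1]) / 2.0
        left, right = labels[values <= t], labels[values > t]
        gain = (parent_ent
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy continuous attribute: a threshold of 0.5 separates the classes perfectly.
print(best_split_point([0.2, 0.4, 0.6, 0.8], ["no", "no", "yes", "yes"]))  # (0.5, 1.0)
```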

7. Three criteria for choosing the attribute to split on:

  (1) Information gain Gain(D, a):

    The larger the information gain, the greater the improvement in purity; the simplest method is to choose the attribute with the largest information gain.

    Disadvantage: the more values an attribute can take, the larger its information gain after splitting tends to be, so this criterion is biased toward attributes with many values.

  (2) Information gain ratio Gain_ratio(D, a):

    The attribute's information gain divided by its intrinsic value (treat the attribute itself as a random variable and compute the entropy of the proportions of its different values); the more values the attribute can take, the larger the intrinsic value, which suppresses the above preference to some extent.

    Disadvantage: it prefers attributes with fewer values. In practice, C4.5 first picks the attributes whose information gain is above average and then chooses the one with the highest gain ratio among them (a small sketch contrasting gain and gain ratio follows after point 7).

  (3) Gini index Gini(D):

    The Gini index represents the probability that a randomly chosen sample from the subset is misclassified. When all samples at a node belong to a single class, the Gini impurity is zero.
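A hedged sketch contrasting information gain and the gain ratio on a toy discrete attribute (the data and function names are made up for illustration; the entropy helper is the one from point 3):

```python
import numpy as np

def entropy(labels):
    # Same entropy helper as under point 3.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_and_gain_ratio(attr_values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v);
    Gain_ratio(D, a) = Gain(D, a) / IV(a), where the intrinsic value IV(a) is the
    entropy of the attribute-value distribution itself."""
    attr_values, labels = np.asarray(attr_values), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    iv = entropy(attr_values)  # intrinsic value IV(a)
    return gain, (gain / iv if iv > 0 else 0.0)

# An "id"-like attribute with one value per sample gets the same gain as a
# genuinely useful attribute, but its large intrinsic value pulls the ratio down.
labels = ["yes", "yes", "no", "no"]
print(gain_and_gain_ratio(["a", "a", "b", "b"], labels))  # (1.0, 1.0)
print(gain_and_gain_ratio(["1", "2", "3", "4"], labels))  # (1.0, 0.5)
```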

8. A method for preventing overfitting ---- pruning

  (1) Pre-pruning: pruning is performed while the tree is being built. In all decision tree construction methods, branch creation stops once a split no longer reduces the entropy; to avoid overfitting, a threshold can additionally be set so that when a split reduces the entropy by less than this threshold, branch creation is stopped even though the entropy could still be reduced (a small sklearn sketch follows after this point).

  Advantages: fewer branches, which reduces the risk of overfitting and also lowers the training time and test time overhead.

  Disadvantages: brings a risk of underfitting (even if the current split does not improve generalization, splits built on top of it might have).

  (2) Post-pruning: after the complete tree has been generated, examine its non-leaf nodes bottom-up; if replacing a subtree with a leaf node improves performance, make the replacement.

   Advantages: more branches are kept than with pre-pruning, the risk of underfitting is small, and generalization is often better than with pre-pruning.

   Disadvantages: the training time cost is much larger.
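A small scikit-learn sketch of both ideas, assuming sklearn is installed: `max_depth`, `min_samples_leaf` and `min_impurity_decrease` act as pre-pruning thresholds fixed before training, while `ccp_alpha` enables sklearn's built-in cost-complexity post-pruning. The dataset and parameter values are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing branches early via thresholds set before training.
pre = DecisionTreeClassifier(max_depth=3,
                             min_samples_leaf=5,
                             min_impurity_decrease=0.01,
                             random_state=0).fit(X_train, y_train)

# Post-pruning: grow the full tree, then cut subtrees back with
# cost-complexity pruning (a larger ccp_alpha prunes more aggressively).
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

for name, clf in [("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "depth:", clf.get_depth(),
          "leaves:", clf.get_n_leaves(),
          "test accuracy:", round(clf.score(X_test, y_test), 3))
```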

9. Comparison of decision tree algorithms

Algorithm | Supported models           | Tree structure | Feature selection              | Continuous values | Missing values
ID3       | classification             | multi-way tree | information gain               | not supported     | not supported
C4.5      | classification             | multi-way tree | information gain ratio         | supported         | supported
CART      | classification, regression | binary tree    | Gini index, mean squared error | supported         | supported

10. Improving accuracy with a random forest

  A single decision tree is built from known historical data and probabilities, so its predictions may not be very accurate; the best way to improve accuracy is to build a random forest (Random Forest).

  A random forest is built by drawing several sample tables from the historical data table by random sampling, and generating one decision tree for each sample table. Because the rows drawn for a sample table are returned to the full table each time (sampling with replacement), the decision trees are independent of one another. Together these decision trees form the random forest. When new data arrives, every tree in the forest makes its own prediction, and the class with the most votes is taken as the final result, which raises the probability of a correct prediction.
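A hedged sketch of this bootstrap-and-vote procedure built by hand from individual decision trees (in practice sklearn's RandomForestClassifier does this, plus random feature selection, internally; the dataset, tree count and seeds are only illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Sampling with replacement: each tree gets its own bootstrap "sample table".
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Each tree votes on every test sample; the majority class is the forest's answer.
votes = np.array([t.predict(X_test) for t in trees])
forest_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("forest accuracy:", (forest_pred == y_test).mean())
```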

11. Decision tree advantages and disadvantages:

  Decision tree advantages: ① Simple and intuitive; the generated decision tree is very easy to interpret visually. ② Can handle both discrete and continuous values, whereas many algorithms focus on only one of the two. ③ Can handle multi-dimensional classification output. ④ Compared with black-box models such as neural networks, a decision tree can be explained well in logical terms. ⑤ Cross-validation pruning can be used for model selection, which improves generalization ability. ⑥ Good tolerance of outliers and high robustness.

  Decision tree disadvantages: ① The decision tree algorithm overfits easily; this can be mitigated by limiting the number of samples per node, setting an impurity/entropy threshold, and limiting the tree depth. ② A small change in the samples can cause a drastic change in the tree structure; this can be mitigated with random forest-type algorithms. ③ Finding the optimal decision tree is NP-hard, so heuristics are used, which easily get stuck in local optima; random forests and similar methods help here too. ④ Some more complex relationships, such as XOR, are hard for a decision tree to learn; they can be handled with neural networks and other methods. ⑤ If certain features account for too large a proportion of the samples, the generated tree tends to be biased toward those features; this can be improved by adjusting sample weights.

12. Decision tree implementation with Python libraries
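A minimal example of the usual library route, scikit-learn's DecisionTreeClassifier (the dataset and parameters are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a toy dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="entropy" selects splits by information gain; "gini" gives CART-style splits.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=load_iris().feature_names))
```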

 
