[Machine learning notes] (4) Decision Tree



Basic concepts

A decision tree (Decision Tree) is a decision-analysis method that, given known probabilities of the various possible situations, constructs a tree to evaluate project risk and judge feasibility by checking whether the expected net present value is non-negative. It is a graphical method that applies probabilistic analysis intuitively. Because the decision branches are drawn like the branches of a tree, it is called a decision tree.

In machine learning, a decision tree is a predictive model: it represents a mapping between object attributes and object values. Each node in the tree represents an object, each branching path represents a possible attribute value, and each leaf node corresponds to the value of the object indicated by the path from the root node to that leaf.

A decision tree has only a single output; if multiple outputs are needed, separate decision trees can be built to handle the different outputs. Decision trees are a frequently used data-mining technique that can be used both to analyze data and to make predictions. In short, the decision tree (decision tree) is a basic method for classification and regression.

In classification, instances are classified according to their features. A decision tree can be regarded as a set of if-then rules, or as a conditional probability distribution defined on the feature space and the class space. Its main advantages are that the model is readable and classification is fast. During learning, a decision tree model is built from the training data according to the principle of minimizing a loss function. When predicting on new data, the decision tree model is used to classify it.


The machine learning technique that generates a decision tree from data is called decision tree learning. Decision tree learning typically comprises three steps: feature selection, decision tree generation, and pruning.
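As a small illustration (an assumption of these notes, not part of the quoted material), the sketch below trains a decision tree classifier with scikit-learn; the iris data set, the train/test split, and all parameter values are arbitrary choices made only so the snippet runs.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a labelled data set and hold out part of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() internally performs feature selection and tree generation;
# criterion="entropy" means splits are chosen by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))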


Decision tree classification algorithms

The main decision tree algorithms are described below.

ID3 algorithm

The core of ID3 is to apply information gain as the attribute-selection criterion at every level of the decision tree, determining the attribute that should be used at each node to generate the split.
C4.5 algorithm

C4.5 is a decision tree classification algorithm in machine learning whose core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on ID3 in the following respects:

(1) it uses the information gain ratio to select attributes, overcoming the bias of information gain toward attributes with many values;

(2) it prunes the tree during construction;

(3) it can discretize continuous attributes;

(4) it can handle incomplete data.

The C4.5 algorithm has the following advantages: the generated classification rules are easy to understand and the accuracy is high. Its disadvantage is that the data must be scanned and sorted repeatedly during tree construction, which makes the algorithm inefficient.

CART algorithm

CART (Classification And Regression Tree), i.e., the classification and regression tree algorithm, is another implementation of decision trees.

CART is a binary recursive partitioning technique: at each step the current sample set is split into two subsets, so every non-leaf node has exactly two children, and the decision tree generated by CART is a binary tree with a simple structure. Because the CART tree is binary, each decision can only be "yes" or "no"; even if a feature has multiple values, the data are still split into two parts. The CART algorithm consists of two steps:

(1) decision tree generation: the tree is generated by recursively partitioning the training samples, growing the tree as large as possible;

(2) decision tree pruning: the tree is pruned using validation data, with minimal loss as the pruning criterion.
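Since CART also covers regression, here is a hedged sketch of its regression side using scikit-learn's DecisionTreeRegressor, which implements binary recursive partitioning; the synthetic sine data and the max_depth value are assumptions made purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Every split is a binary "yes"/"no" test on one feature threshold,
# so the fitted model is always a binary tree, even for continuous features.
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))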


 

Advantages and disadvantages of decision trees

Advantages: low computational complexity, output that is easy to understand, insensitivity to missing intermediate values, and the ability to handle data with irrelevant features.

Disadvantages: may produce an over-fitting (over-matching) problem.

Applicable data types: numeric and nominal.

When constructing a decision tree, the first question to address is which feature of the current data set plays the decisive role in classifying the data. To find the decisive feature and obtain the best split, we must evaluate every feature. After this test, the original data set is divided into several subsets, which are distributed along all branches of the first decision node. If the data on a branch all belong to the same class, that branch needs no further splitting. If the data within a subset do not belong to the same class, the splitting process is repeated on that subset, using the same procedure as for the original data set, until all the data within every subset belong to the same class.

The following is a decision tree generated by information gain from the watermelon data set 2.0, quoted from "Machine Learning" (Zhou Zhihua, Tsinghua University Press, p. 78); the figure is not reproduced here.

 

A classification decision tree model is a tree structure that describes the classification of instances. A decision tree is composed of nodes (node) and directed edges (directed edge). There are two types of nodes: internal nodes (internal node) and leaf nodes (leaf node). An internal node represents a feature or attribute, and a leaf node represents a class.

Classification with a decision tree starts from the root node: a certain feature of the instance is tested, and according to the test result the instance is assigned to one of the child nodes; each child node corresponds to one value of that feature. Testing and assignment continue recursively in this way until a leaf node is reached, and the instance is finally assigned to the class of that leaf node.


Decision trees and if-then rules

A decision tree can be viewed as a set of if-then rules. One rule is constructed for each path from the root node to a leaf node: the internal nodes along the path correspond to the conditions of the rule, and the class at the leaf node corresponds to the rule's conclusion.
The set of paths of a decision tree, or equivalently the corresponding set of if-then rules, has an important property: it is mutually exclusive and complete. That is, every instance is covered by exactly one path, or exactly one rule.
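To see the rule view concretely, one option (assuming scikit-learn is available) is export_text, which prints each root-to-leaf path as a nested if-then rule; the iris data and max_depth=2 are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each printed branch is one if-then rule; together they are exclusive and complete.
print(export_text(clf, feature_names=list(iris.feature_names)))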


Decision tree construction

Making predictions with a decision tree involves the following steps:

Collect data: any method can be used. For example, to build a blind-date matching system, we could obtain data from a matchmaker or by interviewing people who have been on blind dates. From the factors they considered and the final choices they made, we can obtain data for our use.
Prepare data: once the data are collected, they must be organized, arranging all the gathered information according to certain rules and a certain layout to facilitate subsequent processing.
Analyze data: any method can be used; after the decision tree is constructed, we can check whether the resulting tree graph matches expectations.
Train the algorithm: this is the process of constructing the decision tree, also called decision tree learning; it builds the tree's data structure.
Test the algorithm: compute the error rate with the test data. When the error rate is within an acceptable range, the decision tree can be put into use.
Use the algorithm: this step applies to any supervised learning algorithm; using a decision tree makes it easier to understand the inner meaning of the data.

 

A decision tree learning algorithm is usually recursive: select the optimal feature and split the training data according to that feature, so that each resulting subset is classified as well as possible. This process corresponds to a partition of the feature space and also to the construction of the decision tree.

1) Start: construct the root node and place all training data at the root. Select an optimal feature and split the training set into subsets according to this feature, so that each subset is classified as well as possible under the current conditions.

2) If a subset can already be classified essentially correctly, construct a leaf node and assign the subset to it.

3) If some subset cannot be classified correctly, select the best new feature for it and continue splitting, constructing the corresponding nodes. Proceed recursively until all training subsets are essentially correctly classified or no suitable features remain.

4) Every subset is then assigned to a leaf node with a definite class, and a decision tree has been generated.

In general, a decision tree contains one root node, several internal nodes, and several leaf nodes. A leaf node corresponds to a decision result, and every other node corresponds to an attribute test; the samples contained in a node are partitioned into its child nodes according to the result of the attribute test. The root node contains the complete sample set, and the path from the root to each leaf node corresponds to a sequence of attribute tests. The purpose of decision tree learning is to produce a tree with strong generalization ability; the basic procedure is simple and follows the "divide and conquer" (divide-and-conquer) strategy, as described below.

Clearly, generating a decision tree is a recursive process. In the basic decision tree algorithm, three situations cause the recursion to return:

  1. all samples at the current node belong to the same class, so no splitting is needed;
  2. the current attribute set is empty, or all samples have the same value on every remaining attribute, so the set cannot be split;
  3. the sample set at the current node is empty, so it cannot be split.

In the second case, we mark the current node as a leaf and set its class to the majority class of the samples it contains; in the third case, we likewise mark the current node as a leaf, but set its class to the majority class of the samples in its parent node. Note that these two cases differ in substance: case two uses the posterior distribution at the current node, while case three uses the class distribution of the parent node as the prior distribution of the current node.
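The following is a minimal, illustrative sketch of this recursive divide-and-conquer procedure, including the three stopping cases; it is not the book's pseudocode, the attribute selection here happens to use information gain, and all function names (entropy, best_attribute, majority, tree_generate) are assumptions of this sketch.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    # Choose the attribute with the largest information gain.
    def gain(a):
        rem = sum(cnt / len(rows) *
                  entropy([labels[i] for i, r in enumerate(rows) if r[a] == v])
                  for v, cnt in Counter(r[a] for r in rows).items())
        return entropy(labels) - rem
    return max(attrs, key=gain)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def tree_generate(rows, labels, attrs, parent_labels=None):
    if not rows:                          # case 3: empty subset -> parent's majority class (prior)
        return majority(parent_labels)
    if len(set(labels)) == 1:             # case 1: all samples share one class
        return labels[0]
    if not attrs or all(all(r[a] == rows[0][a] for a in attrs) for r in rows):
        return majority(labels)           # case 2: nothing left to split on -> node's majority class (posterior)
    a = best_attribute(rows, labels, attrs)
    rest = [x for x in attrs if x != a]
    node = {}
    for v in set(r[a] for r in rows):     # one child per value of the chosen attribute
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node[(a, v)] = tree_generate([rows[i] for i in idx],
                                     [labels[i] for i in idx], rest, labels)
    return node

Here rows is a list of dicts mapping attribute names to values; calling tree_generate(rows, labels, list(rows[0])) returns a nested dict whose leaves are class labels.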


Decision trees and conditional probability distributions

A decision tree defines a conditional probability distribution on a partition of the feature space. The feature space is divided into mutually disjoint units (cells or regions), and a class probability distribution is defined on each unit; together these form the conditional probability distribution.

Each path of the decision tree corresponds to one unit of the partition. The conditional probability distribution represented by the decision tree consists of the conditional class distributions over the individual units.
Suppose X is the random variable representing the features and Y is the random variable representing the class; then the decision tree represents the conditional probability distribution P(Y | X). X takes values on the set of units of the partition, and Y takes values in the set of classes. The conditional probability at each leaf node is usually biased toward one class, i.e., one class has the largest probability.


Decision tree learning

Decision tree learning amounts to estimating a conditional probability model from the training data set. There are infinitely many conditional probability models based on partitions of the feature space into classes. The model we choose should not only fit the training data well but also predict unknown data well.

Decision tree learning expresses this goal with a loss function. The loss function of decision tree learning is usually a regularized maximum likelihood function. The learning strategy is to minimize this loss function as the objective.
Once the loss function is fixed, the learning problem becomes the problem of choosing the optimal decision tree in the sense of that loss function. Because selecting the optimal decision tree from all possible decision trees is an NP-complete problem, practical decision tree learning algorithms usually adopt heuristic (heuristic) methods and solve the optimization problem approximately. The resulting decision tree is therefore usually sub-optimal (sub-optimal).


Decision tree learning algorithms comprise feature selection, decision tree generation, and decision tree pruning. Since a decision tree represents a conditional probability distribution, decision trees of different depths correspond to probability models of different complexity. Decision tree generation corresponds to local selection of the model, and decision tree pruning corresponds to global selection of the model.

The objective of decision tree learning: build a decision tree model from the given training data set so that it can classify instances correctly.

The essence of decision tree learning: induce a set of classification rules from the training set, or equivalently estimate a conditional probability model from the training data.

The loss function of decision tree learning: a regularized maximum likelihood function.

The strategy of decision tree learning: minimize the loss function.


Feature Selection

If the result of classifying with a feature is not much different from random classification, the feature is said to have no classification ability. Empirically, discarding such a feature has little effect on the accuracy of decision tree learning. The criterion for feature selection is usually information gain or the information gain ratio.

Feature selection decides which feature is used to partition the feature space. Information gain (information gain) expresses this intuitive criterion well.


Entropy

Information entropy is a measure of the purity of a sample set. Suppose the proportion of the k-th class in the current sample set D is p_k (k = 1, 2, ..., |Y|); then the entropy of D is defined as:

Ent(D) = -\sum_{k=1}^{|Y|} p_k \log_2 p_k

The smaller the value of Ent(D), the higher the purity of D.
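A small sketch of this definition in Python; the sample labels below are made up purely for illustration.

import math
from collections import Counter

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k) over the classes present in D.
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))   # 1.0: an evenly mixed two-class set, lowest purity
print(entropy(["yes", "yes", "yes", "yes"])) # 0.0: a pure set, highest purity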


Information gain

Information gain is defined as Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v), where attribute a has V possible values and D^v is the subset of D whose samples take the v-th value on a. In general, the greater the information gain, the greater the "purity improvement" obtained by splitting on attribute a. ID3 selects the splitting attribute using information gain as the criterion.

Disadvantage of Information Gain

Information gain favors attributes with many distinct values: for example, the information gain of a primary-key-like attribute is very large, but splitting on it obviously leads to overfitting. Information gain therefore has this defect.
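A sketch of information gain under the same definitions; the toy attribute values and labels are invented for illustration, and the second call mimics a primary-key-like attribute to show the bias just described.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v),
    # where `values` holds attribute a's value for every sample in D.
    n = len(labels)
    remainder = sum(cnt / n * entropy([labels[i] for i in range(n) if values[i] == v])
                    for v, cnt in Counter(values).items())
    return entropy(labels) - remainder

print(information_gain(["a", "a", "b", "b"], ["yes", "no", "yes", "no"]))  # 0.0: uninformative attribute
print(information_gain([1, 2, 3, 4], ["yes", "yes", "no", "no"]))          # 1.0: ID-like attribute gets maximal gain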


Gain ratio

In fact, the information gain criterion has a preference for attributes with many possible values. To reduce the adverse effect this preference may bring, the C4.5 decision tree algorithm does not use information gain directly, but uses the "gain ratio" to select the optimal splitting attribute. The gain ratio is defined as:

Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}, \quad IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}

The more possible values attribute a has (i.e., the larger V is), the larger IV(a) usually is. Note that the gain ratio criterion in turn prefers attributes with fewer values, so C4.5 does not simply choose the candidate attribute with the largest gain ratio; instead it uses a heuristic: first select, from the candidate splitting attributes, those whose information gain is above average, and then choose the one with the highest gain ratio among them.
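Continuing the same toy setup (still an illustration, not the notes' example), the sketch below computes the gain ratio and shows how the large IV(a) of an ID-like attribute pulls its score down.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a), with
    # IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|).
    n = len(labels)
    remainder, iv = 0.0, 0.0
    for v, cnt in Counter(values).items():
        remainder += cnt / n * entropy([labels[i] for i in range(n) if values[i] == v])
        iv -= cnt / n * math.log2(cnt / n)
    return (entropy(labels) - remainder) / iv

print(gain_ratio([1, 2, 3, 4], ["yes", "yes", "no", "no"]))          # 0.5: ID-like attribute is penalised by its large IV
print(gain_ratio(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0: two-valued attribute with the same gain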

 


Gini index (GINI)

The Gini index:
1. is a measure of inequality;
2. is commonly used to measure income inequality, but can measure any non-uniform distribution;
3. is a number between 0 and 1, where 0 means perfect equality and 1 means complete inequality;
4. is larger the more mixed the classes within a set are (the concept is very similar to entropy).

The CART decision tree uses the Gini index to select the splitting attribute. The purity of data set D can be measured by its Gini value:

Gini(D) = \sum_{k=1}^{|Y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|Y|} p_k^2

Thus, the smaller Gini(D) is, the higher the purity of data set D.
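A corresponding sketch of the Gini value and the attribute-wise Gini index (the toy labels are again purely illustrative).

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2; smaller values mean a purer set.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    # Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v); CART picks the attribute minimising this.
    n = len(labels)
    return sum(cnt / n * gini([labels[i] for i in range(n) if values[i] == v])
               for v, cnt in Counter(values).items())

print(gini(["yes", "yes", "no", "no"]))                              # 0.5: maximally impure two-class set
print(gini(["yes", "yes", "yes"]))                                   # 0.0: pure set
print(gini_index(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 0.0: this attribute separates the classes perfectly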

 


Pruning process

Decision trees are prone to overfitting: the tree adapts too well to the training set but performs poorly on the test set. To avoid this, we can either control the termination condition (for example with a threshold) to keep branches from becoming too small, or prune a decision tree that has already been grown. Another means of overcoming overfitting is to build a random forest (Random Forest) based on the Bootstrap idea.

Pruning is the primary means by which decision tree learning deals with "overfitting". During learning, in order to classify the training samples as correctly as possible, node splitting is repeated again and again, sometimes producing too many branches; the learner may then treat peculiarities of the training set as general properties of all data, causing overfitting. Actively removing some branches can therefore reduce the risk of overfitting.

The basic strategies are pre-pruning and post-pruning. Pre-pruning estimates each node before it is split during tree generation: if splitting the current node cannot improve the tree's generalization performance, the split is abandoned and the node is marked as a leaf. Post-pruning first generates a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up: if replacing a node's subtree with a leaf would improve generalization performance, the subtree is replaced by that leaf.

In short:

Pre-pruning - during construction, when a node satisfies the pruning condition, stop growing that branch immediately.

Post-pruning - first construct the complete decision tree, then traverse the tree and prune according to certain conditions.

The pruning criterion is essentially how to determine the appropriate size of the decision tree. Ideas that pruning can draw on include the following:

(1) use a training set (Training Set) and a validation set (Validation Set), and evaluate the usefulness of pruning a node on the validation set;

(2) use all of the data for training, but apply a statistical test to estimate whether pruning or expanding a particular node is likely to improve performance on data outside the training set, for example the Chi-Square test (Quinlan, 1986), to judge whether expanding the node improves performance on the whole data distribution or only on the training data;
(3) use an explicit measure of the complexity of the training examples and the decision tree, and stop growing the tree when this measure is minimized, for example the MDL (Minimum Description Length) criterion.
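For reference, a hedged sketch of both pruning styles with scikit-learn: pre-pruning via early-stopping parameters and post-pruning via cost-complexity pruning (ccp_alpha). The breast-cancer data set, the split, and all parameter values are assumptions for illustration, and scikit-learn's post-pruning uses cost-complexity pruning rather than the validation-set procedure described above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early by limiting depth and leaf size.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the tree, then prune subtrees whose cost-complexity
# improvement is below ccp_alpha.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

print("pre-pruned test accuracy: ", pre.score(X_test, y_test))
print("post-pruned test accuracy:", post.score(X_test, y_test))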

Impact Analysis

Pre-pruning keeps many branches from being expanded, which not only reduces the risk of overfitting but also significantly reduces the training and testing time of the decision tree. However, although splitting some branch may not currently improve generalization, and may even temporarily reduce it, subsequent splits based on it could still lead to a significant improvement; because of its greedy nature, pre-pruning forbids these branches from being expanded and thus exposes the tree to the risk of underfitting.

Comparing the trees produced by post-pruning and pre-pruning, post-pruning usually retains more branches than pre-pruning, so its risk of underfitting is small and its generalization performance is often better than that of a pre-pruned tree. However, post-pruning is carried out bottom-up after the complete tree has been generated, so its training time overhead is much larger than that of pre-pruning.

Reference blog post: https://www.jianshu.com/p/61a93017bb02?from=singlemessage


Decision tree classification boundary

In addition, the classification boundary formed by a decision tree has a distinctive characteristic: it is axis-parallel, i.e., the boundary consists of several segments parallel to the coordinate axes.

 


Graphviz tool

Graphviz is open-source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, visual interfaces, and other technical fields.

Graphviz visualization tool download:

https://graphviz.gitlab.io/_pages/Download/Download_windows.html

The conversion command entered at the command line:

dot -Tpdf src.dot -o des.pdf
src.dot is the .dot source file, given with its path
des.pdf is the .pdf file to be generated, preferably also given with its path

For example:

dot -Tpdf G:\MachineLearning\tree.dot -o G:\MachineLearning\tree.pdf -- converts the dot file into a pdf file

dot -Tpng G:\MachineLearning\tree.dot -o G:\MachineLearning\tree.png -- converts the dot file into a png file.
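One way (assuming scikit-learn) to produce the tree.dot file that the commands above convert; the data set, output file name, and parameters are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Writes a Graphviz .dot description of the fitted tree.
export_graphviz(clf, out_file="tree.dot",
                feature_names=iris.feature_names,
                class_names=iris.target_names,
                filled=True, rounded=True)
# Then at the command line:  dot -Tpdf tree.dot -o tree.pdf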


 

 



Source: blog.csdn.net/seagal890/article/details/105153888