OpenCV Advanced 18: Introduction to Decision Trees Based on OpenCV

1. What is a decision tree?

The decision tree is one of the earliest machine learning algorithms. It originated from the imitation of certain human decision-making processes and belongs to the family of supervised learning algorithms.

The advantage of decision trees is that they are easy to understand. Some decision trees can be used for both classification and regression, and two of the top ten data mining algorithms are decision trees [1]. There are many versions of decision trees. The typical ones are the earliest ID3 algorithm and its improvement, the C4.5 algorithm, both of which are used for classification. Another branch of improvements to ID3 is the Classification And Regression Trees (CART) algorithm, which can be used for either classification or regression. The CART algorithm also provides the basis for important algorithms such as random forests and boosting. In OpenCV, the decision tree implements the CART algorithm.

2. Principle of decision tree

2.1 Basic idea of the decision tree

The earliest decision trees used an if-then structure to split the data, where "if" expresses a condition and "then" expresses a choice or decision.

2.2 Representation method of decision tree

In a decision tree, the features in the sample vector are usually called the attributes of the sample, and the customary name "attribute" will be used below.

The decision tree classifies a sample by sorting it from the root node down to some leaf node: the root node is where the tree splits the data for the first time, and a leaf node gives the classification label assigned to the sample.
Each internal node of the tree represents a test of some attribute of the sample, and each branch descending from that node corresponds to one possible value of the attribute. To classify a sample, start at the root node, test the attribute specified by that node, and then move down the branch corresponding to the sample's value of that attribute. This process is repeated on the subtree rooted at the new node.

The decision tree [1] is illustrated below with an example of customers' willingness to wait for a table when dining at a restaurant. Table 5-1 lists the attributes of the data set samples, and Table 5-2 shows the data set. There are 12 samples, each with 10 attributes, and the task is a two-class classification problem. The structure of a decision tree inferred from this data set is shown in Figure 5-1.

[Table 5-1: Attribute list of the restaurant data set]

[Table 5-2: Data set of restaurant willingness to wait for a table]

[Figure 5-1: Decision tree inferred from the restaurant data set]
As shown in Figure 5-1, the boxes marked with a question mark are decision nodes, which split the data according to a given attribute, and the rest are leaf nodes (labels).

At the beginning, the decision tree divides the data set into three subsets according to the number of diners in the restaurant (the Patrons attribute): the left subset has no diners, the middle subset has some diners, and the right subset is fully occupied. The root nodes of the other subtrees work on the same principle.

When new, unseen data arrives, the tree can simply be traversed according to the sample's attributes until a label is reached, that is, whether the customer is willing to wait for a table. Although the example is simple, it clearly demonstrates the interpretability of decision tree models and their ability to learn simple rules.
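Such a traversal can be written out directly as nested if-then rules. The sketch below (plain Python, my own illustration) follows only the root split on the Patrons attribute described above; the test used for the Full branch is a hypothetical placeholder, since the rest of Figure 5-1 is not reproduced here.

```python
def will_wait(sample):
    """Classify a restaurant sample by walking hand-written if-then rules.

    Only the root split on the Patrons attribute follows the description of
    Figure 5-1; the test applied to the 'Full' branch is a hypothetical stand-in.
    """
    if sample["Patrons"] == "None":
        return False                          # empty restaurant -> do not wait
    elif sample["Patrons"] == "Some":
        return True                           # a few diners -> wait
    else:                                     # "Full": descend into a subtree
        return sample["WaitEstimate"] <= 30   # hypothetical follow-up test

print(will_wait({"Patrons": "Some", "WaitEstimate": 10}))   # True
```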

From the example we can see that the core problem of the decision tree is: which attribute should be selected at each node, from top to bottom, to obtain the best classifier? Choosing the best split attribute is therefore the key to building a decision tree.

2.3 Selection of the Best Split Attribute

The evaluation of the best split attribute is usually based on the idea of impurity reduction (Impurity Reduction) or purity gain (Purity Gain). Sample purity refers to the homogeneity of the sample classes in a set. If a data set contains only one class of samples, it has the highest purity and the lowest impurity; the more classes it contains, the lower the purity and the higher the impurity. Naturally, we hope that after the sample set is split according to some attribute, its purity increases and its impurity decreases; that is, the greater the purity gain after the split, the better.

To compute the purity gain, we must first define an impurity measure (Impurity Measure), that is, a way of calculating impurity; then subtract the impurity after splitting a node from the impurity before splitting it, which gives the impurity reduction (the purity gain); finally, the attribute that reduces the impurity the most is selected for the split.

Different types of decision trees use different measures of sample impurity, such as information entropy, the Gini coefficient, and the squared error. The following introduces several commonly used impurity measures and how they are used to select the optimal split attribute.

1. Information entropy

The ID3 algorithm uses information gain (Information Gain) to select the best attribute for building the decision tree; that is, the attribute that yields the largest information gain is chosen to split the current data set. To understand how information gain is computed, we first need the concept of information entropy.

Information entropy (Information Entropy) was proposed by the American information scientist Shannon in 1948. Information is the elimination of uncertainty, and uncertainty can be measured by probability: the higher the probability of an event, the lower its uncertainty, and vice versa. Information entropy measures the amount of information required to eliminate uncertainty, that is, the amount of information an unknown event may contain. The lower the probability of an event and the greater its uncertainty, the greater the amount of information and the larger the entropy; the higher the probability and the smaller the uncertainty, the smaller the amount of information and the smaller the entropy.

The following introduces information entropy and information gain from the perspective of information theory [1]. Define an uncertainty function I to represent the amount of information of an event; together with the probability p of the event, it should satisfy the following conditions:

◎ I(p) ≥ 0 and I(1) = 0, that is, the information content of any event is non-negative, and an event with probability 1 carries no information.

◎ I(p1·p2) = I(p1) + I(p2), that is, the amount of information produced by two independent events should equal the sum of their individual amounts of information.

◎ I(p) is continuous and is a monotonically decreasing function of the probability p; a small change in probability corresponds to a small change in the amount of information.


It turns out that the negative logarithm satisfies all of the above conditions; therefore, the information content of an event can be expressed by formula (5-1):

$$I(p) = -\log_a p \tag{5-1}$$

In the formula, if a = 2, the unit of information is the familiar bit. For example, the amount of information given by the event "a fair coin is flipped and lands heads" is −log2(0.5) = 1 bit. If a coin is minted with a bias such that the probability of heads is 0.99, then the amount of information given by the event of flipping that coin and getting heads is only −log2(0.99) ≈ 0.0145 bit.

If there are multiple events, how should the average entropy of these events be calculated?

Assume that events v1, …, vJ occur with probabilities p1, …, pJ, where [p1, …, pJ] is a discrete probability distribution. Then the average information content of these events is defined by formula (5-2):

$$H(p) = -\sum_{j=1}^{J} p_j \log_2 p_j \tag{5-2}$$
In the formula, H(p) is the information entropy of the discrete distribution p = [p1, …, pJ]. In all entropy calculations, 0·log 0 is defined to be 0. If there are only two classes of events (the Boolean case), the distribution is p = [p1, 1−p1], and the entropy becomes:

$$H([p_1, 1-p_1]) = -p_1 \log_2 p_1 - (1-p_1)\log_2 (1-p_1)$$

Suppose D is a Boolean set with 14 samples, of which 9 are positive and 5 are negative. According to formula (5-2), the information entropy of D is:

$$H(D) = H\!\left(\left[\tfrac{9}{14}, \tfrac{5}{14}\right]\right) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$
If all members of D belong to the same class, then H(D) = H([1, 0]) = 0. If the positive and negative samples in D each occur with probability 0.5, then H(D) = H([0.5, 0.5]) = 1. It can be seen that in a Boolean set, the information entropy is largest when the positive and negative samples are equally probable.
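For concreteness, here is a minimal Python sketch (the function name is my own) that computes the information entropy of formula (5-2) and reproduces the values discussed above.

```python
import math

def entropy(probs):
    """Information entropy H(p) in bits, with 0*log(0) treated as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # 1.0    -> fair coin, maximum for two classes
print(entropy([1.0, 0.0]))      # 0.0    -> a pure set carries no information
print(entropy([9/14, 5/14]))    # ~0.940 -> the 14-sample example above
```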

Information entropy is a measure of uncertainty. In decision trees it can be used not only to measure the uncertainty of the class labels but also to measure the uncertainty of data samples with respect to different features. The greater the information entropy of a feature column, the greater the uncertainty (the degree of disorder) of that feature, and the more we should consider splitting on it first. Information entropy thus provides the most important basis and criterion for splitting in decision trees.

2. Information gain

The information gain of attribute A with respect to the training data set D, written G(D, A), is the difference between the information entropy H(D) of the set D and the conditional entropy H(D|A) of D given attribute A:

$$G(D, A) = H(D) - H(D|A) = H(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) \tag{5-3}$$

In the formula, n is the number of subsets into which attribute A divides the sample set (that is, the number of values attribute A can take); |Di| is the number of samples in the i-th subset after splitting by attribute A; and |D| is the number of samples in the sample set D.


H(D) measures the uncertainty of D, and the conditional entropy H(D|A) measures the uncertainty that remains in D once attribute A is known. Therefore H(D) − H(D|A) measures how much knowing attribute A reduces the uncertainty of D. This quantity is called mutual information in information theory and information gain in the ID3 decision tree algorithm.
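The following sketch (helper names are my own) computes H(D), the conditional entropy, and the information gain G(D, A) of formula (5-3) from per-class sample counts. The counts in the usage line are hypothetical, chosen only to illustrate a three-way split of a balanced 12-sample set.

```python
import math

def entropy_from_counts(counts):
    """Entropy (bits) of a node described by its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subset_counts):
    """G(D, A) = H(D) - sum_i |Di|/|D| * H(Di).

    parent_counts: per-class counts of D, e.g. [6, 6]
    subset_counts: per-class counts of each subset Di produced by attribute A
    """
    total = sum(parent_counts)
    cond = sum(sum(ci) / total * entropy_from_counts(ci) for ci in subset_counts)
    return entropy_from_counts(parent_counts) - cond

# Hypothetical split of a balanced 12-sample set into three subsets:
print(information_gain([6, 6], [[0, 2], [4, 0], [2, 4]]))   # ~0.54 bit
```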

Returning to the restaurant example introduced above, the original data set D contains 12 samples [x1, …, x12]; each sample's feature vector has 10 attributes, and the label is whether to wait for a table. Six samples [x1, x3, x4, x6, x8, x12] decided to wait, and six samples [x2, x5, x7, x9, x10, x11] decided not to wait. As shown in Figure 5-2(a), splitting by the Patrons attribute produces three subsets, None, Some, and Full, so n = 3 (the three values of the Patrons attribute). The information gain of splitting by the Patrons attribute is computed as follows:

[Formula: information gain of splitting D by the Patrons attribute; see Figure 5-2(a)]
As shown in Figure 5-2(b), splitting by the Type attribute produces four subsets, French, Italian, Thai, and Burger, so n = 4 (the four values of the Type attribute). The information gain of splitting by the Type attribute is computed as follows:

[Formula: information gain of splitting D by the Type attribute; see Figure 5-2(b)]

[Figure 5-2: Splitting the data set by the Patrons attribute (a) and by the Type attribute (b)]

Clearly G(D, Patrons) > G(D, Type). In fact, splitting by the Patrons attribute gives the highest information gain of all the attributes, so Patrons is selected as the attribute of the root node to start the split.

3. Information gain ratio

Using information gain, as the ID3 algorithm does, has several problems. For example, under otherwise identical conditions, attributes with more values (larger n) have a greater information gain than attributes with fewer values; that is, information gain as a criterion is biased towards features with many values. For instance, suppose one attribute has 2 values, each with probability 1/2, and another attribute has 3 values, each with probability 1/3. Both are completely uncertain variables, yet the information gain of the three-valued attribute is greater than that of the two-valued one.

The C4.5 algorithm improves on this by using the information gain ratio (Information Gain Ratio) as the splitting criterion. The information gain ratio is the ratio of the information gain to the feature entropy of the attribute:

$$G_R(D, A) = \frac{G(D, A)}{H_A(D)}$$

In the formula, D is the sample set, A is a sample attribute, and the feature entropy H_A(D) of attribute A is defined as:

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$
According to the above formula, the feature entropy of the Patrons attribute in the example is:

[Formula: feature entropy of the Patrons attribute in the example]

That is, when the optimal split attribute is selected by the information gain ratio, the Patrons attribute is still preferred over the Type attribute.
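As a sketch of the same idea in code (with hypothetical numbers and my own helper names), the feature entropy and the gain ratio can be computed directly from the subset sizes:

```python
import math

def split_info(subset_sizes):
    """Feature entropy H_A(D) of a split, computed from the subset sizes |Di|."""
    total = sum(subset_sizes)
    return sum(-(s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    """Information gain ratio G_R(D, A) = G(D, A) / H_A(D)."""
    return gain / split_info(subset_sizes)

# Hypothetical numbers: a gain of 0.541 bit over subsets of sizes 2, 4 and 6.
print(gain_ratio(0.541, [2, 4, 6]))   # ~0.37
```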

4. Gini coefficient

Both the ID3 and the C4.5 algorithm are based on the information entropy model and involve a large number of logarithm operations. To simplify the computation while keeping the advantages of the entropy model, the CART algorithm uses the Gini coefficient instead of the information gain ratio. The Gini coefficient (Gini Index) represents the impurity of the model: the smaller the Gini coefficient, the lower the impurity and the better the corresponding attribute.

Specifically, in a classification problem with J classes, where the probability of the j-th class is pj, the Gini coefficient is defined as:
$$\mathrm{Gini}(D) = \sum_{j=1}^{J} p_j (1 - p_j) = 1 - \sum_{j=1}^{J} p_j^2 = 1 - \sum_{j=1}^{J} \left(\frac{|D_j|}{|D|}\right)^2$$

In the formula, |Dj| is the number of samples of the jth class, and |D| is the number of samples in the sample set D.

For a two-class problem, if the probability that a sample belongs to the first class is p, then the Gini coefficient is:

$$\mathrm{Gini}(D) = 2p(1 - p)$$
If D is divided into n parts according to attribute A, then the Gini coefficient of D under the condition of attribute A is:

$$\mathrm{Gini}(D, A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{Gini}(D_i)$$
The attribute that yields the smallest Gini coefficient after splitting is selected for the split.
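A minimal Python sketch of the Gini formulas above (function names are my own; the class probabilities in the example calls are made up):

```python
def gini(probs):
    """Gini coefficient of a node: 1 - sum(p_j^2)."""
    return 1.0 - sum(p * p for p in probs)

def gini_after_split(subsets):
    """Weighted Gini coefficient of D after splitting by an attribute A.

    subsets: list of (class-probabilities, subset size) pairs, one per Di.
    """
    total = sum(size for _, size in subsets)
    return sum(size / total * gini(probs) for probs, size in subsets)

print(gini([0.5, 0.5]))                       # 0.5, maximum for two classes
print(gini_after_split([([1.0, 0.0], 4),      # one pure subset
                        ([0.25, 0.75], 8)]))  # one mixed subset -> 0.25
```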

The relationship between the impurity of a binary classification node and the probability p of one of the classes is shown in Figure 5-3. For ease of comparison, the information entropy has been halved so that it and the Gini coefficient curve both pass through the point (0.5, 0.5).

[Figure 5-3: Impurity of a binary classification node versus the class probability p (halved information entropy and Gini coefficient)]

It can be seen from Figure 5-3 that the curves of the Gini coefficient and the information entropy are very close, so the Gini coefficient can be used as an approximate substitute for the entropy model. The CART algorithm uses the Gini coefficient to select the features of the decision tree. In addition, to simplify things further, the CART algorithm splits the values of a feature into only two groups at a time, so the tree it builds is a binary tree, which simplifies the computation even more.

5. Mean squared error

The measures above all apply to classification trees, whose output is a discrete category, such as waiting for a table or not. A regression tree outputs continuous values, such as predicted housing prices. To build a regression tree, an impurity measure suited to regression is needed. The mean squared error (Mean Squared Error, MSE), the mean of the squared differences between the observed values and the predicted values, is the measure mainly used in regression trees.

$$\mathrm{MSE}(D) = \frac{1}{|D|}\sum_{i=1}^{|D|} (y_i - \hat{y}_i)^2$$
In the formula, |D| is the number of samples in the data set D, and yi and ŷi are the observed output value and the predicted value of the i-th sample, respectively. The predicted value can be taken to be the mean of the output values yi:

$$\hat{y} = \frac{1}{|D|}\sum_{i=1}^{|D|} y_i$$
If the set D is divided into n subsets according to attribute A, the mean squared error after the split is:

$$\mathrm{MSE}(D, A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{MSE}(D_i)$$
The CART regression tree uses the mean squared error as its measure. Its goal is, for any candidate split attribute A and split point s, to divide the data into the two sets D1 and D2 on either side of the split, compute the squared error of each, and then choose the attribute and split point that minimize the sum of the squared errors of D1 and D2. This can be expressed as:

$$\min_{A,\, s}\left[\min_{c_1}\sum_{x_i \in D_1}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in D_2}(y_i - c_2)^2\right]$$
where c1 is the mean of the sample outputs in D1 and c2 is the mean of the sample outputs in D2.

The CART classification tree uses the most probable category in a leaf node as the prediction for that node. A regression tree does not output a category; it uses the mean or the median of the samples in the leaf as the predicted output.
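The CART regression criterion above can be sketched for a single attribute as a search over candidate thresholds. The code below is my own illustration with made-up data; it picks the split point s that minimizes the sum of squared errors of the two sides.

```python
def squared_error(values):
    """Sum of squared deviations of the values from their mean (0 if empty)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(x, y):
    """Scan candidate thresholds on a single attribute x and return the one
    that minimizes SE(D1) + SE(D2), mirroring the CART regression criterion."""
    best_s, best_err = None, float("inf")
    for s in sorted(set(x))[:-1]:                 # candidate split points
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        err = squared_error(left) + squared_error(right)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

# Hypothetical 1-D data: the output jumps once x passes 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(best_threshold(x, y))   # (3, ...) -> split between x = 3 and x = 4
```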

2.4 Stopping criteria

As mentioned earlier, decision trees split nodes greedily and recursively. How and when do they stop?

In fact, many strategies can be used to define the stopping criteria (Stopping Criteria). The most common is a minimum number of data points: if a further split would violate this constraint, splitting stops.
Another stopping criterion is the depth of the tree. Stopping criteria, together with other parameters, help us obtain a decision tree model with good generalization ability; trees that are very deep or have too many non-leaf nodes often lead to overfitting.

2.5 Pruning

Since the construction of a decision tree depends entirely on the training samples, the algorithm easily overfits the training set, resulting in poor generalization ability. To address overfitting, the decision tree needs to be pruned (Pruning), that is, some nodes, including leaf nodes and intermediate nodes, are removed to simplify the tree. Pruning plays a role similar to regularization in linear regression and can improve the generalization ability of decision trees.

There are two common methods of pruning: pre-pruning and post-pruning.

Pre-pruning terminates the growth of the decision tree early, during construction, so as to avoid generating too many nodes. The method is simple but not very practical, because it is difficult to judge exactly when growth should stop.

Post-pruning removes some nodes after the decision tree has been built. Common post-pruning methods include pessimistic error pruning, minimum error pruning, cost-complexity pruning, and error-based pruning. The CART implementation in OpenCV uses cost-complexity pruning: it first generates a full decision tree, then generates all possible pruned CART trees, and finally uses cross-validation to evaluate the various prunings and selects the one with the best generalization ability.
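To tie this back to OpenCV, the following is a minimal sketch of training a CART decision tree through OpenCV's Python bindings (the cv2.ml.DTrees class). The data is synthetic and the parameter values are only illustrative; note that cross-validation pruning (CVFolds greater than 1) is not supported by every OpenCV build, so it is left disabled here.

```python
import numpy as np
import cv2

# Synthetic two-class data: two Gaussian blobs in 2-D (purely illustrative).
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
samples = np.vstack([class0, class1]).astype(np.float32)
labels = np.hstack([np.zeros(50), np.ones(50)]).astype(np.int32).reshape(-1, 1)

# Create and configure the CART-based decision tree.
dtree = cv2.ml.DTrees_create()
dtree.setMaxDepth(8)           # stopping criterion: maximum depth of the tree
dtree.setMinSampleCount(5)     # stopping criterion: minimum samples in a node
dtree.setCVFolds(1)            # folds for cost-complexity pruning; kept at 1
                               # (no pruning) since some builds reject values > 1
dtree.setUse1SERule(True)      # if pruning is active, prefer the simpler tree

# Integer responses make OpenCV treat this as a classification problem.
train_data = cv2.ml.TrainData_create(samples, cv2.ml.ROW_SAMPLE, labels)
dtree.train(train_data)

# Predict two new points, one near each blob.
test = np.array([[0.5, 0.5], [2.8, 3.2]], dtype=np.float32)
_, predictions = dtree.predict(test)
print(predictions.ravel())     # expected: [0. 1.]
```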

Origin blog.csdn.net/hai411741962/article/details/132445843