Classification problem study notes: decision trees


Decision tree

Case:

Suppose I want to buy a watermelon and need to judge whether it is good or bad. After picking up the watermelon, I first look at the texture: if the texture is not clear, I pass on it directly; if the texture is clear, I then check the roots, the feel, and other characteristics. If I build a decision tree for this process, I can immediately tell whether any given melon is a good melon or a bad one, as shown below:
[Figure: decision tree for judging whether a watermelon is good or bad]

Principle:

A decision tree is a tree structure (it may be a binary or non-binary tree). Each non-leaf node represents a test on a feature attribute, each branch represents the output of that feature attribute over a certain value range, and each leaf node stores a class. Using a decision tree to make a decision means starting from the root node, testing the corresponding feature attribute of the item to be classified, and choosing the output branch according to its value until a leaf node is reached; the class stored in that leaf node is the decision result.
In summary, the core of a decision tree model consists of the following parts:

  • Nodes and directed edges
  • Nodes come in two types: internal nodes and leaf nodes
  • An internal node represents a feature, and a leaf node represents a class (see the sketch below)
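As a minimal illustration of this structure, the watermelon tree above can be stored as nested dictionaries, with internal nodes holding a feature name and its branches, and leaf nodes holding a class label. This is only a sketch; the feature values here are hypothetical, chosen just to make it runnable:

# A minimal sketch of the decision process described above.
# Internal node: {"feature": name, "branches": {value: subtree}}; leaf: a class label.
watermelon_tree = {
    "feature": "texture",
    "branches": {
        "blurry": "bad melon",                 # leaf node
        "clear": {
            "feature": "root",
            "branches": {
                "curled": "good melon",        # leaf node
                "stiff": "bad melon",          # leaf node
            },
        },
    },
}

def predict(tree, sample):
    # Start at the root, test the feature stored in each internal node,
    # follow the branch matching the sample's value, and stop at a leaf.
    while isinstance(tree, dict):
        tree = tree["branches"][sample[tree["feature"]]]
    return tree  # the class stored in the leaf is the decision result

print(predict(watermelon_tree, {"texture": "clear", "root": "curled"}))  # good melon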

If efficiency were not a concern, cascading all of a sample's features would eventually place every sample into some terminal class block. In practice, only some of the features play a decisive role in classification. Constructing a decision tree means finding these decisive features and building an inverted tree according to how decisive they are: the most decisive feature becomes the root node, and we then recursively find the next most decisive feature within the sub-data set under each branch, until all the data in a sub-data set belong to the same class. Building a decision tree is therefore essentially a recursive process of partitioning the data set according to its features.

The first problem to solve is which feature of the current data set plays the decisive role in classifying the data: how do we choose the feature for the root node, and which feature should be chosen at each subsequent decision node? Decision tree algorithms use a measure called information gain to quantify how important each feature is; the larger the information gain, the more important the feature, and the earlier it is used for a decision. Ideally, every leaf node of the tree we build would be a pure class, that is, all data reaching that leaf along its path belong to the same category, but achieving this would require repeatedly revisiting and modifying the conditions at non-leaf nodes and splitting off more branches. In practice, therefore, decision trees are built with a greedy algorithm that finds a locally optimal solution rather than the global optimum.
The generation process of a decision tree is mainly divided into the following three parts:

  • Feature selection: feature selection means choosing one feature from the many features in the training data as the splitting criterion for the current node. There are many different quantitative evaluation criteria for selecting a feature, and they give rise to different decision tree algorithms.
  • Decision tree generation: according to the chosen feature evaluation criterion, child nodes are generated recursively from top to bottom, and the tree stops growing when the data set can no longer be divided. In terms of structure, the recursive formulation is the easiest to understand (see the sketch below).
  • Pruning: decision trees are prone to overfitting, so pruning is generally needed to reduce the size of the tree and alleviate overfitting. Pruning techniques come in two types: pre-pruning and post-pruning.
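The recursive generation step can be sketched as follows. This is illustrative Python pseudocode, not the author's implementation: choose_best_feature stands for whatever feature-selection criterion is used (e.g. information gain), and pruning is omitted.

from collections import Counter

def build_tree(samples, labels, features, choose_best_feature):
    """Recursively build a decision tree (illustrative sketch).

    samples: list of dicts mapping feature name -> value
    labels:  list of class labels, parallel to samples
    features: feature names still available for splitting
    choose_best_feature: callable(samples, labels, features) -> feature name
    """
    # Stop: all samples in this subset share one class -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no features left -> leaf labelled with the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]

    best = choose_best_feature(samples, labels, features)   # feature selection
    node = {"feature": best, "branches": {}}
    remaining = [f for f in features if f != best]

    # One branch per observed value of the chosen feature.
    for value in set(s[best] for s in samples):
        subset = [(s, y) for s, y in zip(samples, labels) if s[best] == value]
        sub_samples = [s for s, _ in subset]
        sub_labels = [y for _, y in subset]
        node["branches"][value] = build_tree(sub_samples, sub_labels,
                                             remaining, choose_best_feature)
    return node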

Three decision tree algorithms based on information theory:

The biggest principle for dividing a data set is to turn disordered data into ordered data. If a training set has 20 features, which one should be chosen as the basis for division? This must be decided with a quantitative method, and there are several quantitative options; one of them is to measure the quality of a split with information theory. Decision tree algorithms based on information theory include ID3, CART and C4.5, where C4.5 and CART are derived from the ID3 algorithm.

1. ID3 algorithm

Information entropy:

In probability theory, information entropy gives us a way to measure uncertainty: it measures the uncertainty of a random variable and can be seen as the expected value of the information it carries. The more uncertain something is, the greater its entropy. Specifically, for a discrete random variable X the entropy is:
H(X) = -Σ p(xᵢ) · log p(xᵢ), summing over all possible values xᵢ of X
Similarly, for the continuous random variable Y, the entropy can be defined as:
H(Y) = -∫ f(y) · log f(y) dy, where f(y) is the probability density of Y

When the random variable X is given, the entropy of the random variable Y is defined as the conditional entropy H(Y|X):

H(Y|X) = Σ p(x) · H(Y|X = x) = -Σ p(x, y) · log p(y|x)

The so-called information gain is the degree to which the uncertainty about the class Y is reduced once the data has the information of feature X. Assuming the information entropy of data set D is H(D) and the conditional entropy given feature A is H(D|A), the information gain g(D,A) of feature A for the data set can be expressed as: g(D,A) = H(D) - H(D|A)

The greater the information gain, the more the feature contributes to reducing the uncertainty of the data set, i.e. the stronger its ability to classify the data.
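A minimal sketch of these two formulas on a toy data set (the student labels and feature values below are invented, just to show the arithmetic):

import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_k * log2(p_k)) over the classes present in labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # g(D, A) = H(D) - H(D|A), where H(D|A) is the weighted entropy
    # of the subsets produced by each value of feature A.
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        sub = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# Hypothetical example: 8 students, label = excellent (1) or not (0)
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
loves_study = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]   # fully determines the label
plays_games = ["yes", "yes", "no", "no", "yes", "yes", "no", "no"]   # uninformative

print(information_gain(loves_study, labels))   # 1.0, the maximum possible gain
print(information_gain(plays_games, labels))   # 0.0, no gain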

Case:

[Figure: data set D: students described by the attributes loves to study, plays games, plays ball, and the label 'excellent student']
We use the attributes 'loves to study', 'plays games' and 'plays ball' to make a decision on data set D (whether a student is excellent). Using the formula above we can compute the information entropy of data set D, i.e. the entropy of the label 'excellent student':
[Figure: calculation of the information entropy of data set D]

When considering how to split on an attribute, there are several strategies; here we use information gain to judge.
Let d denote the values of the attribute at a node: for example, 'plays games' is divided into d1 = plays games and d2 = does not play games. Generally speaking, the greater the information gain, the greater the 'purity improvement' obtained by splitting on this attribute.
Let's work through the 'plays games' attribute first. D1 means plays games and D2 means does not play games:

[Figure: information gain calculation for the 'plays games' attribute]

The information gain is 0, so this attribute is essentially no help in judging whether a student is excellent.

[Figure: information gain results for the remaining attributes]
From the information gain results we find that 'loves to study' contributes the most to determining whether a student is excellent. Therefore the decision tree first chooses 'loves to study' as the decision attribute, and then recurses on the remaining attributes.

Shortcomings of the ID3 algorithm

a) ID3 does not consider continuous features; attributes such as length or density take continuous values and cannot be used in ID3 directly, which greatly limits its use.
  b) ID3 builds decision tree nodes using the features with the largest information gain. It was soon discovered that, under otherwise equal conditions, features with more values have larger information gains than features with fewer values. For example, one variable has 2 values, each with probability 1/2, and another has 3 values, each with probability 1/3; both are completely uncertain variables, yet the 3-valued one yields the larger information gain (see the sketch after this list).
  c) ID3 does not handle missing values.
  d) ID3 does not address overfitting.
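A quick numeric check of point b): on the same labels, an ID-like feature that takes a distinct value for every sample reaches the maximum possible information gain even though it is useless for predicting new samples. The entropy and information-gain helpers from the earlier sketch are repeated so this snippet runs on its own:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        sub = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

labels     = [1, 1, 0, 0, 1, 0]
two_valued = ["a", "a", "a", "b", "b", "b"]   # feature with 2 distinct values
id_like    = [1, 2, 3, 4, 5, 6]               # one distinct value per sample

print(information_gain(two_valued, labels))   # about 0.08
print(information_gain(id_like, labels))      # 1.0, the maximum possible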

2. C4.5 algorithm

The ID3 algorithm has four main deficiencies: it cannot handle continuous features; using information gain as the criterion favours features with more values; and the remaining two are the handling of missing values and overfitting. Quinlan addressed these four problems in the C4.5 algorithm.

(1) For continuous features, which ID3 cannot handle, the C4.5 idea is to discretize them.

Sort the m values of a continuous feature A from small to large: a1, a2, ..., am. Taking the average of each pair of adjacent values gives m-1 candidate split points. For each of these m-1 points, compute the information gain obtained by using that point as a binary split, and select the point with the largest information gain as the binary split point of the continuous feature. For example, if the point with the largest gain is a_t, then values less than or equal to a_t form category 1 and values greater than a_t form category 2, which discretizes the continuous feature. Note that, unlike with discrete attributes, if the current node splits on a continuous attribute, that attribute may still take part in generating and selecting child nodes further down the tree.
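A sketch of this threshold search (illustrative only; the density values below are made up):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # C4.5-style discretization sketch: sort the continuous values, take the
    # midpoints of adjacent distinct values as candidate split points, and
    # keep the candidate with the largest information gain.
    order = sorted(set(values))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2                       # midpoint candidate a_t
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        cond = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
        gain = base - cond
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical densities with labels good melon = 1 / bad melon = 0
density = [0.24, 0.36, 0.40, 0.60, 0.66, 0.70]
label   = [0,    0,    0,    1,    1,    1   ]
print(best_threshold(density, label))   # expected: threshold 0.5 with gain 1.0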

(2) For the problem that information gain as a criterion tends to favour features with many values, C4.5 introduces the information gain ratio, i.e. the ratio of the information gain to the feature entropy (also called the split information):

g_R(D,A) = g(D,A) / H_A(D)

Here H_A(D) is the entropy of data set D with respect to feature A, obtained by treating feature A itself as a random variable whose values are the feature values of A (intuitively, this is the IV value):

H_A(D) = -Σᵢ (|Dᵢ| / |D|) · log₂(|Dᵢ| / |D|), where Dᵢ is the subset of D on which A takes its i-th value

In essence, the information gain ratio is the information gain multiplied by a penalty parameter: when the feature has many values the penalty parameter is small, and when it has few values the penalty parameter is large (the penalty parameter is the reciprocal of the feature entropy).

Drawback: the gain ratio is biased toward features with fewer values.
Reason: when a feature has fewer values, H_A(D) is smaller, so its reciprocal is larger and the gain ratio becomes larger; therefore features with fewer values are favoured.

Using the information gain ratio in practice: because of this drawback, C4.5 does not directly select the feature with the highest gain ratio. Instead, it first finds the candidate features whose information gain is above average, and then selects, among those, the feature with the highest gain ratio.
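A sketch of the gain ratio and of the heuristic just described (illustrative helper functions and made-up data, not library code):

import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature_values, labels):
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        sub = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def gain_ratio(feature_values, labels):
    # Gain ratio = information gain divided by the split information H_A(D),
    # which is simply the entropy of the feature's own value distribution.
    split_info = entropy(feature_values)
    return info_gain(feature_values, labels) / split_info if split_info > 0 else 0.0

def c45_choose(features, labels):
    # C4.5 heuristic: among the features whose information gain is above the
    # average, pick the one with the highest gain ratio.
    gains = {name: info_gain(vals, labels) for name, vals in features.items()}
    avg = sum(gains.values()) / len(gains)
    candidates = [name for name, g in gains.items() if g >= avg]
    return max(candidates, key=lambda name: gain_ratio(features[name], labels))

features = {
    "loves_study": ["yes", "yes", "no", "no"],
    "student_id":  [1, 2, 3, 4],   # many-valued; penalised by the gain ratio
}
labels = [1, 1, 0, 0]
print(c45_choose(features, labels))   # loves_study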

Shortcomings and improvements of the C4.5 algorithm:

(1) Decision tree algorithms overfit very easily, so the generated tree should be pruned, and C4.5's pruning method still has room for optimization. There are two main ideas: pre-pruning, which decides whether to prune a node while the tree is being generated, and post-pruning, which generates the tree first and then prunes it with the help of cross-validation.
(2) C4.5 generates multiway trees, while a binary tree model is more efficient on a computer than a multiway tree; converting multiway trees to binary trees can improve efficiency.
(3) C4.5 can only be used for classification.
(4) Because C4.5 uses the entropy model, it involves many time-consuming logarithm operations, plus many sorting operations for continuous values. One way to simplify the model and reduce the computational cost without sacrificing too much accuracy is to use the Gini coefficient instead of the entropy model.
The C4.5 algorithm flow is the same as ID3's, except that the feature selection criterion changes from information gain to information gain ratio.

3. CART algorithm

Classification and regression tree algorithm: the CART (Classification And Regression Tree) algorithm uses binary recursive partitioning to split the current sample set into two sub-sample sets, so that every non-leaf node it generates has exactly two branches. The decision tree produced by the CART algorithm is therefore a binary tree with a simple structure.

Classification trees rest on two basic ideas: the first is to build the tree by recursively partitioning the space of the independent variables over the training samples; the second is to prune the tree using validation data.

The CART classification tree algorithm uses the Gini coefficient instead of the information gain ratio. The Gini coefficient represents the impurity of the model: the smaller the Gini coefficient, the lower the impurity and the better the feature, which is the opposite of information gain (ratio). In a classification problem with K classes, if the probability that a sample point belongs to the k-th class is pₖ, the Gini index of the probability distribution is defined as follows (pₖ is the probability that the chosen sample belongs to class k, so the probability that it is misclassified is 1 - pₖ):
Gini(p) = Σₖ pₖ(1 - pₖ) = 1 - Σₖ pₖ²
Here the Gini index Gini(D) represents the uncertainty of the set D, and the Gini index Gini(D,A) represents the uncertainty of the set after it is split by A = a. The larger the Gini index, the greater the uncertainty of the sample set.
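A sketch of the Gini computation and a CART-style binary split search on made-up data:

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum(p_k^2): 0 means pure, larger means more impure.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(feature_values, labels, split_value):
    # Gini(D, A=a): weighted Gini of the two subsets D1 (A == a) and D2 (A != a).
    n = len(labels)
    d1 = [y for x, y in zip(feature_values, labels) if x == split_value]
    d2 = [y for x, y in zip(feature_values, labels) if x != split_value]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_binary_split(feature_values, labels):
    # CART picks the binary split with the smallest weighted Gini.
    return min(set(feature_values),
               key=lambda v: gini_index(feature_values, labels, v))

# Hypothetical data: texture vs. good melon (1) / bad melon (0)
texture = ["clear", "clear", "blurry", "slightly_blurry", "blurry", "clear"]
label   = [1,       1,       0,        0,                 0,        1      ]
split = best_binary_split(texture, label)
print(split, gini_index(texture, label, split))   # clear 0.0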

Advantages:

Very intuitive and highly interpretable. In the generated decision tree, every node has a clear branching condition, so it is easy to see why each decision is made. Compared with the black-box processing of neural network models, such highly interpretable models are very popular in the finance and insurance industries. In the hands-on section below, we can output the trained decision tree directly and display graphically what the judgment condition at each node looks like.
The prediction speed is relatively fast. Since the final model is a tree structure, predicting a new data point only requires checking the condition at each node along the path; in general, the tree structure helps speed up computation.
It can handle discrete values, continuous values, and missing values.

Disadvantages:

Easy to overfit. Imagine the extreme case in which we grow the most 'perfect' tree for the training sample, so that every value appearing in the sample has its own path. If the sample then contains some problematic data, or the sample differs somewhat from the test data, the generalization performance will be poor and overfitting appears.
Sample imbalance must be handled. If the sample is unbalanced and some features account for too large a proportion of the samples, the final model will be biased toward those features.
A change in the samples can cause a huge change in the tree structure.

About pruning:

As mentioned above, one problem is that decision trees overfit easily, so we use pruning to improve the model's generalization ability. Pruning can be understood as simplifying the decision tree and removing unnecessary node paths to improve generalization. Pruning methods fall into pre-pruning and post-pruning.
Pre-pruning: a threshold is set while the decision tree is being built; when the entropy reduction at a splitting node is less than the set value, the node is not split. In practice this method does not work very well, because nobody can guarantee in advance that the threshold we set is exactly the one we want.
Post-pruning: after the decision tree has been fully built, we decide, according to preset conditions, whether to merge some intermediate nodes and replace them with leaf nodes. In practice the post-pruning scheme is usually adopted.
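A hedged sketch of both ideas with scikit-learn on the iris data used below: pre-pruning via growth constraints such as max_depth and min_samples_leaf, and post-pruning via scikit-learn's cost-complexity pruning (ccp_alpha), which is applied after the full tree has been grown. This is only an illustration, not the original author's code:

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Pre-pruning: stop the tree from growing too deep in the first place.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
print(cross_val_score(pre_pruned, X, y, cv=5).mean())

# Post-pruning: grow the full tree, then cut it back with cost-complexity
# pruning; larger ccp_alpha values remove more nodes.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    post_pruned = DecisionTreeClassifier(ccp_alpha=alpha)
    score = cross_val_score(post_pruned, X, y, cv=5).mean()
    print(f"ccp_alpha={alpha:.4f}  cv accuracy={score:.3f}")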

Python iris case:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier #import the decision tree classifier
import numpy as np 
np.random.seed(0)
#Set the random seed. If it is not set, the system time is used by default; fixing it guarantees the same random numbers on every run.

iris=datasets.load_iris() #load the iris data set
iris_x=iris.data #feature data
iris_y=iris.target #class labels
#Use 140 of the 150 samples as the training set and 10 as the test set. permutation takes a number as its argument (here the data set length, 150) and returns a shuffled 1-D array of 0-149.
randomarr= np.random.permutation(len(iris_x))
iris_x_train = iris_x[randomarr[:-10]] #training data
iris_y_train = iris_y[randomarr[:-10]] #training labels
iris_x_test = iris_x[randomarr[-10:]] #test data
iris_y_test = iris_y[randomarr[-10:]] #test labels
# When training the model, we set the maximum depth of the tree to 4
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(iris_x_train, iris_y_train)

#imports for visualising the tree
from IPython.display import Image
from sklearn import tree
#dot is a simple language for generating flowcharts programmatically
import pydotplus
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

iris_y_predict = clf.predict(iris_x_test)
score=clf.score(iris_x_test,iris_y_test,sample_weight=None)
print('iris_y_predict = ')
print(iris_y_predict)
print('iris_y_test = ')
print(iris_y_test)
print('Accuracy:',score)

"""
iris_y_predict = 
[1 2 1 0 0 0 2 1 2 0]
iris_y_test = 
[1 1 1 0 0 0 2 1 2 0]
Accuracy: 0.9
You can see that the second test sample was predicted incorrectly and the rest were predicted correctly, so the accuracy is 90%
"""

After running the above code, the image below is produced. You can see the judgment condition and the Gini coefficient at each split, as well as the number of samples that fall into each decision node and the class assigned to them.
[Figure: the decision tree exported by graphviz, showing the split condition, Gini coefficient, sample count and class at each node]


Origin: blog.csdn.net/Pioo_/article/details/109725012