Machine Learning Practical Tutorial (6): Decision Tree

decision tree

What is a decision tree? A decision tree is a basic classification and regression method. To give an easy-to-understand example, the flow chart shown below is a decision tree. Rectangles represent decision blocks and ovals represent terminating blocks, meaning a conclusion has been reached and the process can stop. The left and right arrows leading out of a decision block are called branches, and each branch leads to another decision block or to a terminating block. We can also understand it this way: a classification decision tree model is a tree structure that describes how instances are classified. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class. Confused? In the decision tree shown below, the rectangles and ovals are nodes: rectangular nodes are internal nodes, oval nodes are leaf nodes, and the arrows drawn from the nodes are directed edges. The topmost node is the root node of the decision tree. In this way, the "node" terminology corresponds to the "block" terminology above, and it is easy to keep them straight.

Most of the text in this article is reproduced from https://cuijiahua.com/blog/2017/11/ml_2_decision_tree_1.html; the code and some passages are original.

Insert image description here
Let's go back to the flow chart. Yes, you read it right: this is an imaginary classification system for blind-date partners. It first checks whether the blind-date partner has a house. If so, the partner is worth further contact. If not, it then checks whether the partner is motivated. If not, it is goodbye; at this point you can say, "You are a very nice person, but we are not suitable." If so, the partner goes onto the candidate list, which is a nice way of putting it; put less charitably, it is the back-up list.

But this is only a simple blind-date classification system with a handful of rules. The real situation may be far more complex, and the considerations may be manifold: Good temper? Can they cook? Willing to do housework? How many children in the family? What do the parents do? Oh my, I will stop there; it is scary just thinking about it.

We can think of the decision tree as a set of if-then rules. Converting a decision tree into if-then rules works as follows: every path from the root node to a leaf node constructs one rule; the features of the internal nodes on the path correspond to the conditions of the rule, and the class of the leaf node corresponds to the conclusion of the rule. The paths of a decision tree, or equivalently its set of if-then rules, have an important property: they are mutually exclusive and complete. That is, every instance is covered by exactly one path, or one rule. Covered here means that the features of the instance are consistent with the features along the path, or that the instance satisfies the conditions of the rule.
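As a minimal sketch (the feature names are invented for illustration), the blind-date tree above is nothing more than nested if-then rules:

def classify_date(has_house, is_motivated):
    #toy version of the blind-date decision tree described above
    if has_house:                       # internal node: has a house?
        return "worth further contact"  # leaf node (class)
    if is_motivated:                    # internal node: motivated?
        return "candidate list"         # leaf node
    return "goodbye"                    # leaf node

print(classify_date(True, False))       # worth further contact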

Using decision trees to make predictions requires the following process:

  • Collect data: any method can be used. For example, to build a blind-date system we could obtain data from matchmakers or by interviewing people who have been on blind dates; from the factors they considered and their final choices we can gather data to work with.
  • Prepare data: after collecting the data, we need to organize it, sort all the collected information according to certain rules, and format it to make later processing easier.
  • Analyze the data: any method can be used. After the decision tree is constructed, we can check whether the decision tree graph meets expectations.
  • Train the algorithm: this step constructs the decision tree; it can also be called decision tree learning, which is building the data structure of a decision tree.
  • Test the algorithm: compute the error rate with the learned tree. When the error rate reaches an acceptable range, the decision tree can be put into use.
  • Use the algorithm: this step applies to any supervised learning algorithm, and using a decision tree helps us better understand the inherent meaning of the data.

Preparations for building decision trees

Every step of using a decision tree to make predictions matters. Inadequate data collection leaves us with too few features to build a low-error decision tree. Even if the data features are sufficient, not knowing which features to use prevents us from building a decision tree model that classifies well. From an algorithmic perspective, the construction of the decision tree is our core topic.

How to construct a decision tree? Generally, this process can be summarized into 3 steps: feature selection, decision tree generation and decision tree pruning.

Feature selection

Feature selection means choosing the features that are able to classify the training data; this improves the efficiency of decision tree learning. If classifying with a feature gives results little different from random classification, the feature is said to have no classification ability. Empirically, discarding such features has little impact on the accuracy of decision tree learning. The usual criterion for feature selection is information gain or the information gain ratio. For simplicity, this article uses information gain as the criterion. So, what is information gain? Before explaining it, let's look at an example: a loan application sample data table.

Insert image description here
It is hoped that the decision tree of a loan application can be learned through the given training data and used to classify future loan applications. That is, when a new customer applies for a loan, the decision tree is used to decide whether to approve the loan application based on the characteristics of the applicant.

Feature selection decides which feature is used to divide the feature space. For example, from the data table above we obtained two possible decision trees, each using a different feature at its root node.
Insert image description here
The root node in Figure (a) uses the feature age, which has three values, with a different child node for each value. The root node in Figure (b) uses the feature work, which has two values, again with a different child node for each value. Both decision trees could be grown further from here. The question is: which feature is better to choose? This requires a criterion for selecting features. Intuitively, if a feature has better classification ability, or in other words, splitting the training data set by this feature gives subsets that are each as well classified as possible under the current conditions, then this feature should be selected. Information gain captures this intuition well.

What is information gain? The change in information after dividing the data set is called information gain. Knowing how to calculate information gain, we can calculate the information gain obtained by dividing the data set for each feature value. Obtaining the feature with the highest information gain is the best choice.
Information gain = uncertainty of the entire data - uncertainty of a certain feature condition = how much certainty this feature enhances

So how do we quantify the uncertainty of the data? This leads to the concept of Shannon entropy.

Shannon entropy

Before we can evaluate which data partition is the best, we must learn how to calculate information gain. The measure of collective information is called Shannon entropy or simply entropy. The name comes from Claude Shannon, the father of information theory.

If you don’t understand what information gain and entropy are, please don’t worry, because they are destined to be very confusing to the world from the day they were born. After Claude Shannon wrote information theory, John von Neumann suggested using the term "entropy" because no one knew what it meant.

If you want to thoroughly understand the principle of information entropy, refer to a graphical derivation (the Lagrange multiplier method is used in the derivation). At the same time, review the properties of the logarithmic function:

  1. If a^x = N (a > 0 and a ≠ 1), then x is called the logarithm of N to base a, written x = log_a N; a is called the base and N the antilogarithm.
  2. If y = log_a x, it means that multiplying y copies of a gives x, i.e. a^y = x.
  3. If the base a is between 0 and 1, the logarithm is monotonically decreasing; if a is greater than 1, it is monotonically increasing.
  4. If a > 1: for 0 < x < 1, y is negative; at x = 1, y = 0; for x > 1, y is positive.
  5. ln is the natural logarithm operator, i.e. the logarithm with base e.
     e is a constant, approximately 2.71828183…
     ln x can be read as ln(x), the logarithm of x to base e, that is, the power to which e must be raised to give x.
     ln x = log_e x, and ln e = log_e e = 1.
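A quick check of these properties with Python's math module:

import math

print(math.log2(8))                  # 3.0  -> 2**3 = 8
print(math.log(math.e))              # 1.0  -> ln e = 1
print(math.log2(0.5))                # -1.0 -> for a > 1, an argument between 0 and 1 gives a negative logarithm
print(math.log2(1))                  # 0.0
print(math.log2(4) < math.log2(8))   # True -> log2 is monotonically increasing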

Insert image description here
Entropy is defined as the expected value of information. In information theory and probability statistics, entropy is a measure of uncertainty in a random variable. If the things to be classified may be divided into multiple categories, the information of the symbol xi is defined as:
l(xi) = -log2 p(xi)
where p(xi) is the probability of that category. For example, a team of 10 people contains two categories by gender: boys (3) and girls (7). The logarithm in the formula above is taken to base 2 (base e, the natural logarithm, can also be used).

Information of the boy class  = -log2(3/10) ≈ 1.7370
Information of the girl class = -log2(7/10) ≈ 0.5146
Entropy of the whole team = (3/10) × 1.7370 + (7/10) × 0.5146 ≈ 0.8813  (the expected information, defined below)


Note that -log2 p(xi) grows as p(xi) shrinks: the boys' probability (3/10) is lower than the girls' (7/10), so the boy class carries more information.

Through the above formula, we can get all categories of information. In order to calculate entropy, we need to calculate the expected value of information (mathematical expectation) contained in all possible values ​​of all categories, which is obtained by the following formula:
H(X) = -Σ p(xi) log2 p(xi), summed over i = 1, …, n
where n is the number of categories. The greater the entropy, the greater the uncertainty of the random variable.
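As a sanity check of this formula (a small helper written just for this example, not part of the later code), the entropy of the 3-boy / 7-girl team works out to about 0.881:

import numpy as np

def entropy(probs):
    # H = -sum(p * log2(p)) over the class probabilities
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log2(probs)))

print(entropy([0.3, 0.7]))   # ≈ 0.8813, the boy/girl team above
print(entropy([0.5, 0.5]))   # 1.0 – maximum uncertainty for two equally likely classes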

When the probabilities in the entropy are estimated from data (in particular by maximum likelihood estimation), the corresponding entropy is called empirical entropy. What does estimating from data mean? Suppose there are 10 samples in two categories, A and B. If 7 samples belong to category A, then the probability of category A is seven tenths; if 3 samples belong to category B, then the probability of category B is three tenths. In short, the probabilities are computed from the data. Define the data in the loan application table as the training data set D, let H(D) be the empirical entropy of D, and let |D| be its sample size (number of samples). Suppose there are K classes Ck, k = 1, 2, …, K, and |Ck| is the number of samples belonging to class Ck. The empirical entropy formula can then be written as:
H(D) = -Σ (|Ck| / |D|) log2(|Ck| / |D|), summed over k = 1, …, K
Now calculate the empirical entropy H(D) with this formula for the loan application data sheet. The final classification has only two classes: lend and do not lend. Among the 15 samples, 9 end in lending and 6 in not lending, so the empirical entropy of data set D is:
H(D) = -(9/15) log2(9/15) - (6/15) log2(6/15) = 0.971
After calculation, it can be seen that the value of empirical entropy H(D) of data set D is 0.971.

Write code to calculate empirical entropy

Before writing the code, we first annotate the attributes of the data set.

  • Age: 0 represents young, 1 represents middle-aged, 2 represents old;
  • Have a job: 0 means no, 1 means yes;
  • Own your own house: 0 means no, 1 means yes;
  • Credit status: 0 means average, 1 means good, 2 means very good;
  • Category (whether to give a loan): no means no, yes means yes.
  • After determining these, we can create the data set and calculate the empirical entropy. The code is written as follows:
#%%

#data set: yes means grant the loan, no means do not grant the loan
'''
See the example figure for details.
 Feature 1 is age: 0 = young, 1 = middle-aged, 2 = old
 Feature 2 is whether the person has a job: 0 = no, 1 = yes
 Feature 3 is whether the person owns a house: 0 = no, 1 = yes
 Feature 4 is the credit situation: 0 = average, 1 = good, 2 = very good
'''
import numpy as np
dataSet = np.array([[0, 0, 0, 0, 'no'],         
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']])
labels = ['不放贷', '放贷']
'''
  Compute the empirical entropy H(D).
  D is the data set passed in; the class label (loan or not) is in column 4.
'''
def ShannonEnt(D):
    #unique class labels of the final decision, e.g. [no, yes]
    kArray=np.unique(D[:,4].reshape(1,len(D)))
    #number of final classes
    k=len(kArray)
    #total number of samples
    D_count=len(D)
    #empirical entropy
    HD=0
    #loop over the classes and accumulate -p*log2(p) for each
    for i in range(k):
        #rows belonging to the current class
        ck=[row for row in D if row[4]==kArray[i]]
        HD-=len(ck)/D_count *np.log2(len(ck)/D_count)
    return HD
HD_=ShannonEnt(dataSet)
print("整个数据经验熵:",HD_)  

Output result:
Entire data empirical entropy: 0.9709505944546686

Conditional entropy

We know what entropy is, but what on earth is conditional entropy? The conditional entropy H(Y|X) represents the uncertainty of random variable Y given random variable X, and is defined as the mathematical expectation, over X, of the entropy of the conditional distribution of Y:
H(Y|X) = Σ p(X = xi) H(Y | X = xi), summed over i = 1, …, n
When the probabilities are estimated from data, this is called the empirical conditional entropy. Suppose feature A divides data set D into n subsets {D1, D2, …, Dn}, where |Di| is the number of samples in Di. Let Dik be the set of samples in Di that belong to class Ck, that is, Dik = Di ∩ Ck, with |Dik| samples. The empirical conditional entropy can then be written as:
H(D|A) = Σ (|Di| / |D|) H(Di) = -Σ (|Di| / |D|) Σ (|Dik| / |Di|) log2(|Dik| / |Di|)
In practice this is just: for each value of the feature, take (the fraction of the whole data that has this value) × (the entropy of the final classification within that subset), and sum.

'''
  Compute the conditional entropy of a feature column.
  For example H(D|0): given the age feature, how uncertain the loan decision still is;
  the larger the value, the more uncertain.
'''
def calcConditionShannon(D,index):
    #unique values of column index
    featureType=np.unique(D[:,index].reshape(1,len(D)))
    featureTypeCount=len(featureType)
    #total number of samples
    D_count=len(D)
    HDA=0
    for i in range(featureTypeCount):
        Di=np.array([row for row in D if row[index]==featureType[i]])
        HDA+=len(Di)/D_count*ShannonEnt(Di)
    return HDA
print("年龄特征条件熵",calcConditionShannon(dataSet,0))

Output: Age feature conditional entropy 0.8879430945988998

Information gain

Information gain = uncertainty of the entire data - uncertainty of a certain feature condition = how much certainty this feature enhances.
Information gain = empirical entropy - current feature condition entropy.
Information gain is relative to a feature: the greater the information gain, the greater the impact of the feature on the final classification result, so we should choose the feature with the greatest impact on the final classification as our splitting feature.

Having clarified the concepts of conditional entropy and empirical conditional entropy, let's now talk about information gain. As mentioned earlier, information gain is relative to a feature. The information gain g(D,A) of feature A on training data set D is defined as the difference between the empirical entropy H(D) of set D and the empirical conditional entropy H(D|A) of D given feature A, that is:
g(D,A) = H(D) - H(D|A)
After so many conceptual statements, it doesn't matter if it hasn't clicked yet. Work through a few examples, come back to the concepts, and you will understand.

Take the loan application sample data sheet as an example. Look at the age column, which is feature A1; it has three values: young, middle-aged and old. Consider only the rows where age is young: there are 5 such rows, so the probability that an instance in the training set is young is 5/15, i.e. one third. Likewise, the probabilities of middle-aged and old instances are each one third. Now, among the young rows only, the probability of finally getting a loan is two fifths, because only 2 of the 5 young rows end in a loan. Similarly, for the middle-aged and old rows the probabilities of finally getting a loan are three fifths and four fifths respectively. Therefore, the information gain of age is calculated as follows:
g(D,A1) = H(D) - [ (5/15) H(D1) + (5/15) H(D2) + (5/15) H(D3) ]
        = 0.971 - [ (5/15)(-(2/5) log2(2/5) - (3/5) log2(3/5))
                  + (5/15)(-(3/5) log2(3/5) - (2/5) log2(2/5))
                  + (5/15)(-(4/5) log2(4/5) - (1/5) log2(1/5)) ]
        = 0.971 - 0.888 = 0.083
In the same way, calculate the information gain of other features g(D,A2), g(D,A3) and g(D,A4). They are:
g(D,A2) = 0.324,  g(D,A3) = 0.420,  g(D,A4) = 0.363
Finally, compare the information gain of the features. Since the information gain value of feature A3 (having your own house) is the largest, A3 is selected as the optimal feature.

We have learned to calculate information gain through formulas, let's write code to calculate information gain.

'''
    Compute the information gain of a feature:
    information gain = uncertainty of the whole data - uncertainty given the feature
                     = how much certainty this feature adds
'''
def calaInfoGrain(D,index):
    #note: use the entropy of D itself, not of the global dataSet
    return ShannonEnt(D)-calcConditionShannon(D,index)
print("年龄的信息增益",HD_-calcConditionShannon(dataSet,0))
print("工作的信息增益",calaInfoGrain(dataSet,1))

feature_count=len(dataSet[0])
for i in range(feature_count-1):
    print("第"+str(i)+"个特征的信息增益",HD_-calcConditionShannon(dataSet,i))

Output:
Information gain of age: 0.08300749985576883
Information gain of work: 0.32365019815155627
Information gain of feature 0: 0.08300749985576883
Information gain of feature 1: 0.32365019815155627
Information gain of feature 2: 0.4199730940219749
Information gain of feature 3: 0.36298956253708536

Comparing the results of our own calculations, we found that the results were completely correct! The index value of the optimal feature is 2, which is feature A3 (has its own house).

Generation of decision trees

We have now built the sub-modules needed to construct a decision tree from a data set, including computing empirical entropy and selecting the optimal feature. The algorithm works as follows: take the original data set and split it on the best feature value; since a feature may have more than two values, the data set may be split into more than two branches. After the first split, each subset is passed down to the next node of that branch of the tree, where we split the data again. We can therefore process the data set recursively.

There are many algorithms for building decision trees, such as C4.5, ID3 and CART. These algorithms do not necessarily consume a feature at every split; because the number of features does not decrease with every split, such algorithms can cause certain problems in practice. We do not need to worry about this for now; we only need to count the number of columns before the algorithm starts to check whether the algorithm has used all attributes.

Decision tree generation algorithms recursively generate decision trees until they can no longer continue. The trees generated in this way often classify the training data very accurately, but the classification of unknown test data is not so accurate, that is, overfitting occurs. The reason for overfitting is that too much consideration is given to how to improve the correct classification of training data during learning, thereby building an overly complex decision tree. The solution to this problem is to consider the complexity of the decision tree and simplify the generated decision tree.

Decision tree construction

The core of the ID3 algorithm is to select features at each node of the decision tree using the information gain criterion, and to build the tree recursively. Concretely: starting from the root node, compute the information gain of every candidate feature for that node, select the feature with the largest information gain as the feature of the node, and create child nodes according to the different values of that feature; then call the same procedure recursively on the child nodes, until the information gain of every feature is very small or there are no features left to choose. The result is a decision tree. ID3 is equivalent to choosing a probabilistic model by maximum likelihood.

ID3 algorithm

Before using ID3 to construct a decision tree, let's analyze the data.
Insert image description here
Data sorted by third column

print(dataSet[np.argsort(dataSet[:,2])])
Output:
[['0' '0' '0' '0' 'no']
 ['0' '0' '0' '1' 'no']
 ['0' '1' '0' '1' 'yes']
 ['0' '0' '0' '0' 'no']
 ['1' '0' '0' '0' 'no']
 ['1' '0' '0' '1' 'no']
 ['2' '1' '0' '1' 'yes']
 ['2' '1' '0' '2' 'yes']
 ['2' '0' '0' '0' 'no']
 ['0' '1' '1' '0' 'yes']
 ['1' '1' '1' '1' 'yes']
 ['1' '0' '1' '2' 'yes']
 ['1' '0' '1' '2' 'yes']
 ['2' '0' '1' '2' 'yes']
 ['2' '0' '1' '1' 'yes']]

Since feature A3 (has its own house) has the largest information gain value, feature A3 is selected as the feature of the root node. It divides the training set D into two subsets D1 (the value of A3 is "yes") and D2 (the value of A3 is "no"). Since D1 only has sample points of the same class, it becomes a leaf node, and the class mark of the node is "Yes".
Among them, D1 is

 ['0' '1' '1' '0' 'yes']
 ['1' '1' '1' '1' 'yes']
 ['1' '0' '1' '2' 'yes']
 ['1' '0' '1' '2' 'yes']
 ['2' '0' '1' '2' 'yes']
 ['2' '0' '1' '1' 'yes']]

Since D1 (A3 = 1) contains only one class label, yes, it is a leaf node and no further split is needed.
D2 is

['0' '0' '0' '0' 'no']
 ['0' '0' '0' '1' 'no']
 ['0' '1' '0' '1' 'yes']
 ['0' '0' '0' '0' 'no']
 ['1' '0' '0' '0' 'no']
 ['1' '0' '0' '1' 'no']
 ['2' '1' '0' '1' 'yes']
 ['2' '1' '0' '2' 'yes']
 ['2' '0' '0' '0' 'no']

For D2, a new feature must be selected from A1 (age), A2 (has a job) and A4 (credit situation). Computing the information gain of each feature gives:
g(D2, A1) = 0.251,  g(D2, A2) = 0.918,  g(D2, A4) = 0.474
The feature with the largest information gain, A2 (has a job), is selected as the feature of this node. Since A2 has two possible values, two child nodes are derived from it: the child corresponding to "yes" (has a job) contains 3 samples that all belong to the same class, so it is a leaf node with class label "yes"; the child corresponding to "no" (no job) contains 6 samples that also all belong to the same class, so it is also a leaf node with class label "no".
The remaining data is sorted according to A2

[['0' '0' '0' '0' 'no']
 ['0' '0' '0' '1' 'no']
 ['0' '0' '0' '0' 'no']
 ['1' '0' '0' '0' 'no']
 ['1' '0' '0' '1' 'no']
 ['2' '0' '0' '0' 'no']
 ['0' '1' '0' '1' 'yes']
 ['2' '1' '0' '1' 'yes']
 ['2' '1' '0' '2' 'yes']]

We find that within D2 all rows with A2 = 0 (no job) end in no, and all rows with A2 = 1 (has a job) end in yes, so no further splitting is needed and the tree ends at the "has a job" node. Reading the tree: the leaf nodes are the conclusions (loan or no loan), and the branches are the feature values.
Insert image description here

Write code to build a decision tree

We use a dictionary to store the structure of the decision tree. For example, the decision tree we analyzed in the previous section can be expressed as:

{'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
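The nested dictionary can be read directly: the outer key is the split feature, the keys one level down are that feature's values, and the values are either sub-trees or class labels. For example (assuming the tree shown above):

tree = {'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
print(tree['有自己的房子'][1])               # 'yes' – has a house -> loan
print(tree['有自己的房子'][0]['有工作'][0])  # 'no'  – no house, no job -> no loan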

The code is implemented as follows

#%%

'''
  Group the data rows by the value of the given feature column,
  e.g. rows with house = 1 and rows with house = 0:
  {
     0: [[...]],
     1: [[...]]
  }
'''
colLabels=["年龄","有工作","有自己的房子","信贷情况"]
def splitData(D,index):
    kArray=np.unique(D[:,index].reshape(1,len(D)))
    #loop over the distinct feature values and collect the matching rows
    returnJSon={}
    for i in range(len(kArray)):
        #rows whose feature equals the current value
        ck=[row for row in D if row[index]==kArray[i]]
        #key by the feature value itself (as an int), not by the loop index
        returnJSon[int(kArray[i])]=np.array(ck)
    return returnJSon
    
def createDecisionTree(D):
    buildTree=None
    #if D is empty or column 4 (loan or not) has only one class value, this is a leaf node: return the class value
    resultUniqueArray=np.unique(D[:,4].reshape(1,len(D)))
    print(resultUniqueArray,len(D),len(resultUniqueArray))
    if(len(D)==0 or len(resultUniqueArray)==1):
        return resultUniqueArray[0]
    #number of columns (the last one is the label)
    feature_count=D.shape[1]
    #information gain of every feature
    grain=[calaInfoGrain(D,i)for i in range(feature_count-1)]
    #index of the feature with the largest information gain
    maxFeatureIndex=np.argmax(grain)
    #create a dict keyed by the feature name, e.g. {'有自己的房子': {}}
    buildTree={colLabels[maxFeatureIndex]:{}}
    #recurse into the subset for each distinct value of that feature
    featureGroup=splitData(D,maxFeatureIndex)
    for featureValue in featureGroup:
        buildTree[colLabels[maxFeatureIndex]][featureValue]=createDecisionTree(featureGroup[featureValue])
    return buildTree


decisionTreeJson=createDecisionTree(dataSet)
print(decisionTreeJson)
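The builder above stops only when a subset is pure. The ID3 description earlier also mentions stopping when the best information gain is very small; a sketch of that variant is below (the threshold epsilon and the majority-vote fallback are additions for illustration, not part of the original code):

from collections import Counter

def createDecisionTreeEps(D, epsilon=0.01):
    #same recursion as createDecisionTree, plus an information-gain threshold
    resultUniqueArray = np.unique(D[:, 4].reshape(1, len(D)))
    if len(resultUniqueArray) == 1:
        return resultUniqueArray[0]
    grain = [calaInfoGrain(D, i) for i in range(D.shape[1] - 1)]
    maxFeatureIndex = np.argmax(grain)
    #pre-pruning: if even the best gain is below epsilon, return the majority class
    if grain[maxFeatureIndex] < epsilon:
        return Counter(D[:, 4]).most_common(1)[0][0]
    buildTree = {colLabels[maxFeatureIndex]: {}}
    featureGroup = splitData(D, maxFeatureIndex)
    for featureValue in featureGroup:
        buildTree[colLabels[maxFeatureIndex]][featureValue] = createDecisionTreeEps(featureGroup[featureValue], epsilon)
    return buildTree

print(createDecisionTreeEps(dataSet))   # produces the same tree for this small data set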

Decision tree visualization

Graphviz is simple and easy to use, so we use Graphviz for visualization here.
Download Graphviz and, during installation, choose to add it to the PATH environment variable.
Then install the Python package:

pip install graphviz

Drawing code:

from graphviz import Digraph
import uuid

def graphDecisionTree(dot,treeNode,parentName,lineName):
    for key in treeNode:
        if type(key)==int:
            if type(treeNode[key])==str or type(treeNode[key])==np.str_:
                #the same leaf label (e.g. yes) can appear more than once, so append a uuid so each leaf becomes its own node instead of being merged
                node_name=str(treeNode[key])+str(uuid.uuid1())
                dot.node(name=node_name, label=str(treeNode[key]), color='red',fontname="Microsoft YaHei")
                dot.edge(str(parentName),str(node_name), label=str(key), color='red')
            else:
                graphDecisionTree(dot,treeNode[key],parentName,key)
        elif type(treeNode[key])==dict:
            graphDecisionTree(dot,treeNode[key],key,None)
        if type(key)==str or type(treeNode[key])==str:
            dot.node(name=key, label=str(key), color='red',fontname="Microsoft YaHei")
        if parentName is not None and lineName is not None:
            dot.edge(parentName,key, label=str(lineName), color='red')
dot = Digraph(name="pic", comment="测试", format="png")
graphDecisionTree(dot,decisionTreeJson,None,None)
dot.render(filename='my_pic',
               directory='.',  # current directory
               view=True)

Output flow chart
Insert image description here

Perform classification using decision trees

After constructing the decision tree from the training data, we can use it to classify actual data. To classify, we need the decision tree together with the feature labels used to build it. The program then compares the test data with the values in the decision tree and recurses until it reaches a leaf node; the test data is finally assigned the class of that leaf node. In the classification code below there is a featureLabel parameter. What is it for? It records the split features in order, so that when using the decision tree for prediction we only need to supply the values of those features in the same order. For example, to classify with the tree trained above, I only need to provide two pieces of information: whether the person has a house and whether he has a job, with no redundant information.

The code for classification using decision trees is very simple. The code is written as follows:

#%%
'''
    Walk the decision tree to decide whether the given feature vector gets a loan
'''
def classfiy(decisionTreeJson,featureLabel,vecTest,index):
    if type(decisionTreeJson)==str or type(decisionTreeJson)==np.str_:
        return decisionTreeJson
    elif type(decisionTreeJson[featureLabel[index]])==dict :
        return classfiy(decisionTreeJson[featureLabel[index]][vecTest[index]],featureLabel,vecTest,index+1)
    else :
        return decisionTreeJson

#the split features, in the order they are tested along the tree path
featureLabel=['有自己的房子','有工作']
#[1, 0] means: has a house, has no job
print("是" if classfiy(decisionTreeJson,featureLabel,[1,0],0)=='yes' else "否")
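For example, an applicant with no house and no job (using the same tree and feature order) ends in the 'no' leaf:

print("是" if classfiy(decisionTreeJson,featureLabel,[0,0],0)=='yes' else "否")   # prints 否 – no loan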

Decision tree storage

Just use pickle serialization (filename and inputTree below are placeholders):

import pickle

#write
with open(filename, 'wb') as fw:
    pickle.dump(inputTree, fw)

#read
with open(filename, 'rb') as fr:
    loadedTree = pickle.load(fr)
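A quick round trip with the loan-tree dictionary built earlier (the file name decision_tree.pkl is arbitrary):

import pickle

with open('decision_tree.pkl', 'wb') as fw:
    pickle.dump(decisionTreeJson, fw)        # serialize the tree dict

with open('decision_tree.pkl', 'rb') as fr:
    restored = pickle.load(fr)               # restore it

print(restored == decisionTreeJson)          # True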

Using Sklearn decision trees to predict contact lens type

Practical background

Let’s get to the point of this article: How do ophthalmologists determine the type of contact lenses a patient needs to wear? Once we understand how decision trees work, we can even help people determine the type of lenses they need to wear.

The contact lens dataset is a well-known dataset that contains many observations of patients' eye conditions together with the type of contact lens the doctor recommends. Contact lens types include hard, soft and no lenses. The data come from the UCI database; the data set can be downloaded from: https://github.com/lzeqian/machinelearntry/blob/master/sklearn_decisiontree/lenses.txt

There are 24 samples in total. The labels of the data are age, prescript, astigmatic, tearRate and class: the first column is age, the second is the prescription, the third is whether the patient is astigmatic, the fourth is the tear production rate, and the fifth is the final class label. The data is shown in the figure below:
Insert image description here
We could build the decision tree with the Python program we have already written, but to keep learning new things, this article uses Sklearn instead.

Build a decision tree using Sklearn

Official English document address: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

The sklearn.tree module provides decision tree models for solving classification and regression problems. The method is shown in the figure below:
Insert image description here
This practical content uses DecisionTreeClassifier and export_graphviz. The former is used for decision tree construction and the latter is used for decision tree visualization.

DecisionTreeClassifier builds a decision tree:

Let us first take a look at the DecisionTreeClassifier function, which has a total of 12 parameters:

Insert image description here

Parameter description is as follows:

  • criterion: feature selection criterion, optional parameter, default is gini; can be set to entropy. Gini impurity is the expected error rate when a randomly chosen result from the set is applied to a data item; it is a statistics-based idea. Entropy is the Shannon entropy discussed earlier in this article, an idea from information theory. Sklearn makes gini the default presumably after due consideration; perhaps the accuracy is higher? The ID3 algorithm uses entropy, and the CART algorithm uses gini.
  • splitter: Feature division point selection standard, optional parameter, the default is best, and can be set to random. Selection strategy for each node. The best parameter selects the best segmentation features based on the algorithm, such as gini and entropy. Random randomly finds the local optimal dividing point among some dividing points. The default "best" is suitable when the sample size is small, and if the sample data size is very large, "random" is recommended for decision tree construction.
  • max_features: the maximum number of features considered when splitting, optional parameter, default is None. This is the maximum number of features considered when looking for the best split (n_features is the total number of features); there are the following 6 cases:
  1. If max_features is an integer number, max_features features are considered;
  2. If max_features is a floating-point number, consider int(max_features * n_features) features;
  3. If max_features is set to auto, then max_features = sqrt(n_features);
  4. If max_features is set to sqrt, then max_featrues = sqrt(n_features), the same as auto;
  5. If max_features is set to log2, then max_features = log2(n_features);
  6. If max_features is set to None, then max_features = n_features, that is, all features are used.
  7. Generally speaking, if the number of features is not large, say fewer than 50, we can use the default None; if the number of features is very large, we can flexibly use the other values just described to limit the maximum number of features considered at each split and so control the time needed to build the decision tree.
  • max_depth: the maximum depth of the decision tree, optional parameter, default is None. This parameter is the number of levels of the tree; for example, in the loan example the decision tree has 2 levels. If set to None, the decision tree does not limit the depth of its subtrees; alternatively, if the min_samples_split parameter is set, nodes stop splitting once they contain fewer than min_samples_split samples. Generally you can ignore this value when there is little data or few features. If the model has many samples and many features, it is recommended to limit the maximum depth; the specific value depends on the distribution of the data, and commonly used values range from 10 to 100.
  • min_samples_split: The minimum number of samples required for internal node re-division, optional parameter, default is 2. This value limits the conditions under which the subtree can continue to be divided. If min_samples_split is an integer, then min_samples_split is used as the minimum number of samples when splitting internal nodes. That is to say, if the samples are less than min_samples_split samples, stop splitting. If min_samples_split is a floating point number, then min_samples_split is a percentage, ceil(min_samples_split * n_samples), and the number is rounded up. If the sample size is not large, there is no need to worry about this value. If the sample size is of very large magnitude, it is recommended to increase this value.
  • min_samples_leaf: the minimum number of samples at a leaf node, optional parameter, default is 1. This value limits the minimum number of samples a leaf may hold; if a leaf ends up with fewer samples, it is pruned together with its sibling nodes. In other words, it specifies how many samples must reach a leaf for it to count as a leaf. If set to 1, a leaf is built even if only 1 sample of that class remains. If min_samples_leaf is an integer, it is used directly as the minimum number of samples; if it is a floating-point number, it is a percentage and the minimum is ceil(min_samples_leaf * n_samples), rounded up. If the sample size is not large, you need not worry about this value; if the sample size is very large, it is recommended to increase it.
  • min_weight_fraction_leaf: The minimum sample weight sum of leaf nodes, optional parameter, default is 0. This value limits the minimum value of the weight sum of all samples of a leaf node. If it is less than this value, it will be pruned together with its sibling nodes. Generally speaking, if we have many samples with missing values, or the distribution category deviation of the classification tree samples is large, sample weights will be introduced, and we should pay attention to this value.
  • max_leaf_nodes: Maximum number of leaf nodes, optional parameter, default is None. By limiting the maximum number of leaf nodes, overfitting can be prevented. If restrictions are added, the algorithm will build an optimal decision tree within the maximum number of leaf nodes. If there are not many features, this value does not need to be considered, but if there are many features, it can be restricted, and the specific value can be obtained through cross-validation.
  • class_weight: category weight, optional parameter, the default is None, it can also be dictionary, dictionary list, or balanced. Specifying the weight of each category of samples is mainly to prevent the training set from having too many samples of certain categories, causing the trained decision tree to be too biased towards these categories. The weight of the category can be given in the format {class_label:weight}. Here you can specify the weight of each sample yourself, or use balanced. If you use balanced, the algorithm will calculate the weight by itself. The sample weight corresponding to the category with a small sample size will be high. Of course, if your sample category distribution does not have obvious bias, you can ignore this parameter and select the default None.
  • random_state: optional parameter, default is None; the random number seed. If it is an integer, random_state is used as the seed of the random number generator. If no seed is set, the random numbers depend on the current system time and differ at every moment; with a fixed seed, the same seed produces the same random numbers at different times. If it is a RandomState instance, random_state is the random number generator itself. If None, the random number generator is np.random.
  • min_impurity_split: the minimum impurity for a split, optional parameter, default is 1e-7. This threshold limits the growth of the decision tree: if the impurity of a node (Gini coefficient, information gain, mean squared error or mean absolute error) is below this threshold, the node is no longer split and becomes a leaf node.
  • presort: whether the data is pre-sorted, optional parameter, default is False; a Boolean value, and by default no sorting is done. Generally, if the sample size is small or the tree depth is limited to a small value, setting it to True can make the choice of split points faster and the tree build faster; with a very large sample size there is no benefit. And when the sample size is small, the build is not slow anyway, so this value can usually be ignored.
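As a quick illustration of a few of these knobs (the values here are arbitrary, chosen only to show the syntax):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion='entropy',    # use Shannon entropy instead of the default gini
    max_depth=4,            # pre-prune: limit the depth of the tree
    min_samples_leaf=2,     # every leaf must contain at least 2 samples
    random_state=0)         # fixed seed so repeated runs build the same tree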

In addition to paying attention to these parameters, other points to pay attention to when adjusting parameters include:

  • When the number of samples is small but the sample features are very large, the decision tree is easy to overfit. Generally speaking, it is easier to build a robust model when the number of samples is larger than the number of features.
  • If the number of samples is small but the number of features is very large, it is recommended to perform dimensionality reduction before fitting the decision tree model, for example principal component analysis (PCA), feature selection (Lasso) or independent component analysis (ICA). With the feature dimension greatly reduced, the decision tree model will fit better.
  • It is recommended to use the visualization of decision trees, and at the same time limit the depth of the decision tree first, so that you can first observe the preliminary fitting of the data in the generated decision tree, and then decide whether to increase the depth.
  • When training the model, pay attention to the category status of the samples (mainly referring to the classification tree). If the category distribution is very uneven, consider using class_weight to limit the model from being too biased towards categories with many samples.
  • The array of the decision tree uses numpy's float32 type. If the training data is not in this format, the algorithm will copy it first and then run it.
  • If the input sample matrix is sparse, it is recommended to convert it to a csc_matrix before fitting and to a csr_matrix before prediction.

sklearn.tree.DecisionTreeClassifier() provides some methods for us to use, as shown in the figure below:
Insert image description here
Knowing this, we can write code.
Note: the fit() function cannot accept string data, and from the printed information we can see that the data are all strings, so before calling fit() we need to encode the data set. Two methods can be used here:

  • LabelEncoder: Convert string to incremental value
  • OneHotEncoder: Convert string to integer using One-of-K algorithm
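This article uses LabelEncoder; a tiny example of what it does (classes are numbered in alphabetical order):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['young', 'pre', 'presbyopic', 'young']))  # [2 0 1 2]
print(le.classes_)                                                # ['pre' 'presbyopic' 'young']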

In order to encode the string data, we first build a pandas DataFrame, which makes the encoding work easier. The approach used here is: original data -> dictionary -> pandas DataFrame. The code is as follows:

#%%

import numpy as np
import pandas as pd
fr = open('lenses.txt')
lenses = np.array([inst.strip().split('\t') for inst in fr.readlines()])
print(lenses)
#four features: column 1 is age, column 2 is the prescription, column 3 is whether the patient is astigmatic, column 4 is the tear production rate
#column 5 is the final class label; lens types are hard, soft and no lenses (not suitable for contact lenses)
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
#the final class is in the last column
lenses_target = [each[-1] for each in lenses]
print(lenses_target)

#%%

#assemble the data with a header row
lensesDataFrame=np.concatenate((np.array([lensesLabels]),lenses[:,0:4]))
'''
Notes on DataFrame usage:
df['a']        #select column a
df[['a','b']]  #select columns a and b
The default header is just the index 0, 1, 2, ...; to use custom column names,
build a dict such as
{
   "age": [young, pre],
   "prescript": ["myope", "myope"]
}
'''
jsonData= {l:lenses[:,i]for i,l in enumerate(lensesLabels)}
lenses_pd = pd.DataFrame(jsonData)                                    #build the pandas.DataFrame
print(lenses_pd)


#%%

from sklearn.preprocessing import LabelEncoder
# convert every label, e.g. young -> 0, pre -> 1, into numeric codes
le = LabelEncoder()
#fit_transform takes a 1-D array; identical strings are mapped to the same number
for i in lenses_pd.columns:
    lenses_pd[i]=le.fit_transform(lenses_pd[i])
print(lenses_pd)

Insert image description here

Visualizing decision trees using Graphviz

Graphviz was installed earlier; now install the pydotplus library:

pip3 install pydotplus

Write code


#use the sklearn decision tree
from sklearn import tree
import pydotplus
from io import StringIO
clf = tree.DecisionTreeClassifier(max_depth = 4)                        #create the DecisionTreeClassifier
clf = clf.fit(lenses_pd.values.tolist(), lenses_target)                    #fit the data and build the decision tree
dot_data = StringIO()
tree.export_graphviz(clf, out_file = dot_data,                            #export the tree in dot format
                    feature_names = lenses_pd.keys(),
                    class_names = clf.classes_,
                    filled=True, rounded=True,
                    special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("tree.pdf")     

Run the code and a PDF file named tree will be generated in the same directory where the python file is saved. Open the file and we can see the visualization of the decision tree.
Insert image description here
After determining the decision tree, we can make predictions. You can take a look at what kind of contact lens material is suitable for you based on your eye condition, age and other characteristics. Use the following code to see the prediction results:

print(clf.predict([[1,1,1,0]]))   

Summary

Some advantages of decision trees:

  • Easy to understand and explain. Decision trees can be visualized.
  • Little data preprocessing is required. Other methods often require data normalization, creating dummy variables and removing missing values. (Note that sklearn's decision trees do not yet support missing values.)
  • The cost of using the tree (e.g. for prediction) is logarithmic in the number of training data points.
  • Can handle both numerical and categorical variables. Most other techniques are specialized for data sets with only one type of variable.
  • Can handle multi-output problems.
  • Uses a white-box model: if a given situation is observed in the model, the explanation is easily expressed with simple logical rules. By contrast, results from a black-box model (such as an artificial neural network) can be very hard to interpret.
  • Performs well even when the assumptions of the model are somewhat violated by the true process that generated the data.

Some disadvantages of decision trees:

  • Decision tree learners can create over-complex trees that do not generalize the data well, i.e. overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are needed to avoid this problem.
  • Decision trees can be unstable because small variations in the data can produce a completely different tree. This problem is mitigated by using decision trees within an ensemble.
  • There are concepts that decision trees do not express easily and are therefore hard to learn, for example XOR, parity or multiplexer problems.
  • If some classes dominate, the decision tree learner will create a biased tree; it is therefore recommended to balance the data set by sampling before training.

Origin: blog.csdn.net/liaomin416100569/article/details/129089200