[Machine Learning Combat] 3. Decision Tree


This blog also has multiple super-detailed overviews, and interested friends can move to:

Convolutional Neural Networks: A Super Detailed Introduction to Convolutional Neural Networks

Object Detection: Object Detection Super Detailed Introduction

Semantic Segmentation: A Super Detailed Introduction to Semantic Segmentation

NMS: Let you understand and see the whole NMS and its variants

Data Augmentation: An article to understand data augmentation in computer vision

Loss function: loss function and evaluation index in classification detection segmentation

Transformer: A Survey of Visual Transformers

Machine Learning Practical Series: Decision Trees

YOLO series: v1, v2, v3, v4, Scaled-v4, v5, v6, v7, YOLOF, YOLOX, YOLOS, YOLOP


Decision Tree

(Disclaimer: the content of this article comes from Machine Learning in Action and Statistical Learning Methods; it is an integration of the two, not taken from a single book.)

Decision tree: a basic method for classification and regression; this chapter mainly discusses decision trees for classification.

In a classification problem, the process of classifying instances based on their features can be viewed either as a set of if-then rules, or as a conditional probability distribution defined on the feature space and the class space.

Decision tree learning usually involves three steps: feature selection, decision tree generation, and decision tree pruning.

Classifying with a decision tree: start from the root node and test a certain feature of the instance; according to the test result, assign the instance to one of the child nodes, each of which corresponds to one value of that feature. Recursively test and assign the instance in this way until a leaf node is reached, and finally assign the instance to the class of that leaf node.

The following figure is a schematic diagram of a decision tree, dots - internal nodes, boxes - leaf nodes

[Figure: schematic of a decision tree]

  • The goal of decision tree learning is to build a decision tree model based on a given training data set so that it can correctly classify instances.

  • The essence of decision tree learning: a set of classification rules is induced from the training set, or the conditional probability model is estimated from the training data set.

  • Loss function of decision tree learning: a regularized maximum likelihood function.

  • Strategy of decision tree learning: minimization of this loss function.

  • Goal of decision tree learning: selecting the decision tree that is optimal in the sense of the loss function.

  • The principle of a decision tree is similar to a question-and-answer guessing game: from a series of questions about the data, it arrives at the final answer.
    [Figure: decision tree flow chart]

The figure above is a decision tree flow chart. Squares represent judgment modules, and ellipses represent terminating modules, indicating that a conclusion has been reached and the process can stop. The arrows between them are called branches.

The k-nearest neighbor algorithm introduced in the previous chapter can handle many classification tasks, but its biggest disadvantage is that it cannot reveal the inherent meaning of the data. The advantage of decision trees is that their form is very easy to understand.

3.1 Construction of decision tree

The decision tree learning algorithm is usually a process of recursively selecting the optimal feature and partitioning the training data according to that feature, so that each sub-dataset gets the best possible classification. This process corresponds both to a partition of the feature space and to the construction of the decision tree.

1) Start: build the root node and place all the training data on it. Select an optimal feature and split the training set into subsets according to this feature, so that each subset is classified as well as possible under the current conditions.

2) If a subset can already be classified essentially correctly, construct a leaf node and assign that subset to it.

3) If some subsets still cannot be classified correctly, select a new optimal feature for each of them, continue splitting, and build the corresponding nodes. Proceed recursively until every subset of the training data is classified essentially correctly, or until no suitable feature remains.

4) Each subset is then assigned to a leaf node, i.e. it has a definite class, and a decision tree has been generated.

Features of decision trees:

  • Advantages: low computational complexity, easy-to-understand output, insensitive to missing intermediate values, able to handle irrelevant features.
  • Disadvantages: prone to overfitting (over-matching).
  • Applicable data types: numeric and nominal.

Process:

First, determine the decisive feature on the current dataset. To find it, every feature must be evaluated. After this test, the original dataset is split into several subsets, which are distributed over all branches of the first decision node. If the data under a branch all belong to the same class, that branch has been classified correctly and no further splitting is needed. If they do not belong to the same class, the subsets must be split again, repeatedly, until all the data within each subset belong to the same class.

The pseudocode for creating a branch, createBranch(), is shown below:

Check whether every item in the dataset belongs to the same class:

If so, return the class label
Else
     find the best feature to split the dataset
     split the dataset
     create a branch node
         for each split subset
             call createBranch() and add the return value to the branch node
         return the branch node

Making predictions using a decision tree requires the following process:

  • Collecting data: any method can be used. For example, to build a match-making system we could obtain data from a matchmaker or by interviewing people who go on blind dates; from the factors they consider and their final choices we get data we can use.
  • Preparing data: after collecting the data, organize all the gathered information according to certain rules and format it so it is easy to process later.
  • Analyzing the data: any method can be used; after the decision tree has been built, we can check whether the tree diagram matches our expectations.
  • Training the algorithm: this step constructs the decision tree (decision tree learning), i.e. builds the tree's data structure.
  • Testing the algorithm: calculate the error rate with the learned tree. When the error rate is acceptable, the decision tree can be put into use.
  • Using the algorithm: this step applies to any supervised learning algorithm; using a decision tree makes the intrinsic meaning of the data easier to understand.

This section uses the ID3 algorithm to divide the dataset, which deals with how to divide the dataset and when to stop dividing the dataset.

3.1.1 Information Gain

The general principle of splitting a dataset is to make disordered data more ordered, and every method has its own pros and cons. Information theory is the branch of science that quantifies and processes information. The change in information before and after splitting a dataset is called the information gain, and the feature with the highest information gain is the best choice, so we must first learn how to compute information gain. The measure of information in a collection is called Shannon entropy, or simply entropy.

We hope to learn a loan-approval decision tree from the given training data to classify future loan applications; that is, when a new customer applies for a loan, the tree decides, based on the applicant's characteristics, whether to approve the application.

Feature selection decides which feature to use to split the feature space. For example, from the data table above we can obtain two possible decision trees, each rooted at a different feature.

[Figure: two possible decision trees, one rooted at the age feature (a) and one rooted at the has-a-job feature (b)]

The root node in figure (a) splits on age, which has 3 values; each value leads to a different child node.

The root node in figure (b) splits on whether the person has a job, which has 2 values; each value leads to a different child node. Both decision trees can continue from here.

The question is: which feature is the better choice? This requires a criterion for selecting features.

Intuitively, if a feature has stronger classification ability, that is, splitting the training set on this feature gives subsets that are classified as well as possible under the current conditions, then this feature should be selected.

Information gain is a good representation of this intuitive criterion.

What is information gain? The change in information before and after splitting the dataset is called the information gain. Once we know how to compute it, we can compute the information gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best choice.

Entropy is defined as the expected value of information. If the items to be classified may fall into multiple classes, the information of the symbol $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of selecting that class.

To compute the entropy, we need the expected value of the information contained in all possible values of all classes, obtained by the formula

$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

where $n$ is the number of classes. The larger the entropy, the greater the uncertainty of the random variable.

When the probability in entropy is obtained by data estimation (especially maximum likelihood estimation), the corresponding entropy is called empirical entropy.

What does "estimated from data" mean? For example, suppose there are 10 data points belonging to two classes, A and B. If 7 of them belong to class A, the probability of class A is 7/10; the remaining 3 belong to class B, so the probability of class B is 3/10. In short, the probabilities are computed from the data.

We denote the data in the loan application sample table as the training dataset $D$. The empirical entropy of $D$ is written $H(D)$, and $|D|$ denotes its sample size (number of samples).

Suppose there are $K$ classes $C_k$, $k = 1, 2, \dots, K$, and $|C_k|$ is the number of samples belonging to class $C_k$. Then the empirical entropy can be written as

$$H(D) = -\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}$$

Using this formula we compute the empirical entropy $H(D)$ for the loan application sample data. The final classification has only two outcomes: lend and do not lend. Of the 15 records, 9 have the outcome "lend" and 6 have the outcome "do not lend", so the empirical entropy of dataset $D$ is

$$H(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} = 0.971$$

So the empirical entropy $H(D)$ of dataset $D$ is 0.971.

Before looking at information gain, we must be clear about conditional entropy.

Information gain measures how much the uncertainty about class Y is reduced once the information of feature X is known.

The conditional entropy $H(Y|X)$ represents the uncertainty of random variable $Y$ given random variable $X$. It is defined as the expectation, over $X$, of the entropy of the conditional probability distribution of $Y$ given $X$:

$$H(Y|X) = \sum_{i=1}^{n} p_i\, H(Y|X=x_i)$$

where $p_i = P(X = x_i)$.

When the probabilities in the entropy and the conditional entropy are estimated from data (in particular by maximum likelihood estimation), the corresponding quantities are called the empirical entropy and the empirical conditional entropy. If a probability is 0, we set $0\log 0 = 0$.
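To make the definition concrete, here is a minimal sketch (written independently of the calcShannonEnt/splitDataSet code below) that computes the empirical entropy H(D) of the class labels and the empirical conditional entropy H(D|A) for one feature column, by weighting the entropy of each subset by its proportion:

from math import log
from collections import Counter

def empirical_entropy(labels):
    """H(D): Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / float(n)) * log(c / float(n), 2) for c in Counter(labels).values())

def conditional_entropy(dataset, feature_index):
    """H(D|A): entropy of each subset defined by feature A, weighted by the subset's proportion."""
    n = len(dataset)
    cond_ent = 0.0
    for value in set(row[feature_index] for row in dataset):
        subset_labels = [row[-1] for row in dataset if row[feature_index] == value]
        cond_ent += (len(subset_labels) / float(n)) * empirical_entropy(subset_labels)
    return cond_ent

With the loan dataset defined later and feature_index = 2 (owns a house), empirical_entropy of the labels minus conditional_entropy gives 0.420, matching the gain printed by the code in the next section.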

Information gain: information gain is defined relative to a feature. The information gain $g(D, A)$ of feature $A$ with respect to training dataset $D$ is defined as the difference between the empirical entropy $H(D)$ of $D$ and the empirical conditional entropy $H(D|A)$ of $D$ given feature $A$:

$$g(D, A) = H(D) - H(D|A)$$

In general, the difference between the entropy $H(D)$ and the conditional entropy $H(D|A)$ is called the mutual information. The information gain in decision tree learning is equal to the mutual information between the classes and the features in the training dataset.

The magnitude of the information gain is relative to the training dataset and has no absolute meaning. When the classification problem is hard, that is, when the empirical entropy of the training data is large, information gain values tend to be large; otherwise they tend to be small. This can be corrected by using the information gain ratio, which is another criterion for feature selection.

Information gain ratio: the information gain ratio $g_R(D, A)$ of feature $A$ with respect to training dataset $D$ is defined as the ratio of its information gain $g(D, A)$ to the empirical entropy of the training dataset $D$:

$$g_R(D, A) = \frac{g(D, A)}{H(D)}$$
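Reusing the two helpers from the previous sketch, the two criteria can be written side by side. Note that the denominator below follows the formula in the text (the empirical entropy H(D) of the dataset); the C4.5 algorithm itself commonly uses the entropy of feature A's own value distribution, H_A(D), instead:

def information_gain(dataset, feature_index):
    """g(D, A) = H(D) - H(D|A), using empirical_entropy and conditional_entropy from above."""
    labels = [row[-1] for row in dataset]
    return empirical_entropy(labels) - conditional_entropy(dataset, feature_index)

def information_gain_ratio(dataset, feature_index):
    """g_R(D, A) = g(D, A) / H(D), following the formula given in the text."""
    labels = [row[-1] for row in dataset]
    h_d = empirical_entropy(labels)
    return information_gain(dataset, feature_index) / h_d if h_d > 0 else 0.0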

3.1.2 Write code to calculate empirical entropy

[Table: loan application sample data]
Before writing the code, we first encode the attributes of the dataset:

  • Age: 0 represents youth, 1 represents middle age, and 2 represents old age;
  • Have a job: 0 means no, 1 means yes;
  • Own house: 0 means no, 1 means yes;
  • Credit status: 0 means average, 1 means good, 2 means very good;
  • Category (whether to give a loan): no means no, yes means yes.

The code to create a data set and calculate the empirical entropy is as follows:

from math import log

"""
函数说明:创建测试数据集
Parameters:无
Returns:
    dataSet:数据集
    labels:分类属性
Modify:
    2018-03-12

"""
def creatDataSet():
    # 数据集
    dataSet=[[0, 0, 0, 0, 'no'],
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    #分类属性
    labels=['年龄','有工作','有自己的房子','信贷情况']
    #返回数据集和分类属性
    return dataSet,labels

"""
函数说明:计算给定数据集的经验熵(香农熵)
Parameters:
    dataSet:数据集
Returns:
    shannonEnt:经验熵
Modify:
    2018-03-12

"""
def calcShannonEnt(dataSet):
    #返回数据集行数
    numEntries=len(dataSet)
    #保存每个标签(label)出现次数的字典
    labelCounts={}
    #对每组特征向量进行统计
    for featVec in dataSet:
        currentLabel=featVec[-1]                     #提取标签信息
        if currentLabel not in labelCounts.keys():   #如果标签没有放入统计次数的字典,添加进去
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1                 #label计数

    shannonEnt=0.0                                   #经验熵
    #计算经验熵
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries      #选择该标签的概率
        shannonEnt-=prob*log(prob,2)                 #利用公式计算
    return shannonEnt                                #返回经验熵

#main函数
if __name__=='__main__':
    dataSet,features=creatDataSet()
    print(dataSet)
    print(calcShannonEnt(dataSet))

result:

[[0, 0, 0, 0, 'no'], [0, 0, 0, 1, 'no'], [0, 1, 0, 1, 'yes'], [0, 1, 1, 0, 'yes'], [0, 0, 0, 0, 'no'], [1, 0, 0, 0, 'no'], [1, 0, 0, 1, 'no'], [1, 1, 1, 1, 'yes'], [1, 0, 1, 2, 'yes'], [1, 0, 1, 2, 'yes'], [2, 0, 1, 2, 'yes'], [2, 0, 1, 1, 'yes'], [2, 1, 0, 1, 'yes'], [2, 1, 0, 2, 'yes'], [2, 0, 0, 0, 'no']]
0.9709505944546686

3.1.3 Using code to calculate information gain


from math import log

"""
函数说明:创建测试数据集
Parameters:无
Returns:
    dataSet:数据集
    labels:分类属性
Modify:
    2018-03-12

"""
def creatDataSet():
    # 数据集
    dataSet=[[0, 0, 0, 0, 'no'],
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    #分类属性
    labels=['年龄','有工作','有自己的房子','信贷情况']
    #返回数据集和分类属性
    return dataSet,labels


"""
函数说明:计算给定数据集的经验熵(香农熵)
Parameters:
    dataSet:数据集
Returns:
    shannonEnt:经验熵
Modify:
    2018-03-12

"""
def calcShannonEnt(dataSet):
    #返回数据集行数
    numEntries=len(dataSet)
    #保存每个标签(label)出现次数的字典
    labelCounts={}
    #对每组特征向量进行统计
    for featVec in dataSet:
        currentLabel=featVec[-1]                     #提取标签信息
        if currentLabel not in labelCounts.keys():   #如果标签没有放入统计次数的字典,添加进去
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1                 #label计数

    shannonEnt=0.0                                   #经验熵
    #计算经验熵
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries      #选择该标签的概率
        shannonEnt-=prob*log(prob,2)                 #利用公式计算
    return shannonEnt                                #返回经验熵


"""
函数说明:选择信息增益最大的特征
Parameters:
    dataSet:数据集
Returns:
    bestFeature:信息增益最大特征的索引值
Modify:
    2018-03-12

"""


def chooseBestFeatureToSplit(dataSet):
    #特征数量
    numFeatures = len(dataSet[0]) - 1
    #计数数据集的香农熵
    baseEntropy = calcShannonEnt(dataSet)
    #信息增益
    bestInfoGain = 0.0
    #最优特征的索引值
    bestFeature = -1
    #遍历所有特征
    for i in range(numFeatures):
        # 获取dataSet的第i个所有特征
        featList = [example[i] for example in dataSet]
        #创建set集合{},元素不可重复
        uniqueVals = set(featList)
        #经验条件熵
        newEntropy = 0.0
        #计算信息增益
        for value in uniqueVals:
            #subDataSet划分后的子集
            subDataSet = splitDataSet(dataSet, i, value)
            #计算子集的概率
            prob = len(subDataSet) / float(len(dataSet))
            #根据公式计算经验条件熵
            newEntropy += prob * calcShannonEnt((subDataSet))
        #信息增益
        infoGain = baseEntropy - newEntropy
        #打印每个特征的信息增益
        print("第%d个特征的增益为%.3f" % (i, infoGain))
        #计算信息增益
        if (infoGain > bestInfoGain):
            #更新信息增益,找到最大的信息增益
            bestInfoGain = infoGain
            #记录信息增益最大的特征的索引值
            bestFeature = i
            #返回信息增益最大特征的索引值
    return bestFeature

"""
函数说明:按照给定特征划分数据集
Parameters:
    dataSet:待划分的数据集
    axis:划分数据集的特征
    value:需要返回的特征的值
Returns:
    retDataSet:划分后的数据集
Modify:
    2018-03-12

"""
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


#main函数
if __name__=='__main__':
    dataSet,features=creatDataSet()
    # print(dataSet)
    # print(calcShannonEnt(dataSet))
    print("最优索引值:"+str(chooseBestFeatureToSplit(dataSet)))

result:

第0个特征的增益为0.083
第1个特征的增益为0.324
第2个特征的增益为0.420
第3个特征的增益为0.363
最优索引值:2

Comparing with the results we calculated by hand, they match exactly! The index of the optimal feature is 2, i.e. feature A3 (owns a house).

3.2 Generation and pruning of decision tree

We have learned the sub-functional modules required to construct a decision tree algorithm from a data set, including the calculation of empirical entropy and the selection of optimal features.

It works as follows:

Take the original dataset and split it on the best feature. Since a feature may have more than two values, the split may produce more than two branches.

After the first split, the data are passed down to the next node on each branch of the tree, where we can split them again.

Therefore, we can process the dataset recursively.

There are many algorithms for building decision trees, such as C4.5, ID3 and CART; these algorithms do not necessarily consume one feature at every split when they run.

Since the number of features is not reduced at every split, such algorithms can cause certain problems in practice.

We do not need to worry about this for now; we only need to count the number of columns before the algorithm starts to check whether all attributes have been used.

The decision tree generation algorithm recursively generates the tree until it can go no further. A tree produced this way is often very accurate on the training data but much less accurate on unseen test data, i.e. it overfits.

The reason for overfitting is that learning focuses too much on classifying the training data correctly, thereby constructing an overly complex tree. The solution is to take the complexity of the tree into account and simplify the generated tree.

3.2.1 Construction of decision tree

1. ID3 Algorithm

The core of the ID3 algorithm is to select features corresponding to the information gain criterion on each node of the decision tree, and construct the decision tree recursively.

The specific method is:

1) Starting from the root node, calculate the information gain of every possible feature for the node, and select the feature with the largest information gain as the feature of that node.

2) Create child nodes for the different values of this feature, then recursively apply the same method to each child node to build the tree, until the information gain of every feature is small or there is no feature left to choose.

3) Finally, a decision tree is obtained.

ID3 is equivalent to selecting a probabilistic model by the maximum likelihood method.

Algorithm steps:

[Figures: ID3 algorithm steps]

Analysis of the data:

We have already found that feature A3 (owns a house) has the largest information gain, so A3 is chosen as the feature of the root node.

It splits the training set D into two subsets, D1 (A3 = "yes") and D2 (A3 = "no").

Since D1 contains sample points of only one class, it becomes a leaf node, and the class of this node is marked "yes".

For D2 we must select a new feature from A1 (age), A2 (has a job) and A4 (credit status), and compute the information gain of each:

$$g(D_2, A_1) = H(D_2) - H(D_2|A_1) = 0.251$$
$$g(D_2, A_2) = H(D_2) - H(D_2|A_2) = 0.918$$
$$g(D_2, A_4) = H(D_2) - H(D_2|A_4) = 0.474$$

By this calculation A2, with the largest information gain, is chosen as the feature of this node. Since A2 has two possible values, it leads to two child nodes:

① the branch for "yes" (has a job) contains three samples, all of the same class, so it is a leaf node with class "yes";

② the branch for "no" (no job) contains six samples, also all of the same class, so it is a leaf node with class "no".

This generates a decision tree that uses only two features (it has two internal nodes), shown in the figure below:

[Figure: the decision tree generated by the ID3 algorithm]
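Expressed as the nested-dictionary structure that the code later in this chapter builds (and prints as its result), this two-feature tree is simply:

# '有自己的房子' = owns a house (root split), '有工作' = has a job (second split)
loan_tree = {'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}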

2. The generation algorithm of C4.5

It is similar to the ID3 algorithm, but improved: it uses the information gain ratio as the criterion for selecting features.
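As a sketch of how only the selection criterion changes (this is not the full C4.5 algorithm, which additionally handles continuous attributes, missing values and pruning), here is a variant of chooseBestFeatureToSplit that ranks features by information gain ratio. It assumes the calcShannonEnt and splitDataSet functions shown earlier, and uses the common C4.5 convention of dividing by the split information H_A(D), the entropy of the feature's own value distribution:

from math import log

def chooseBestFeatureByGainRatio(dataSet):
    """C4.5-style feature selection sketch: pick the feature with the largest information gain ratio."""
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)                     # H(D)
    bestGainRatio, bestFeature = 0.0, -1
    for i in range(numFeatures):
        uniqueVals = set(example[i] for example in dataSet)
        newEntropy, splitInfo = 0.0, 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)   # H(D|A)
            splitInfo -= prob * log(prob, 2)                  # H_A(D), the split information
        infoGain = baseEntropy - newEntropy
        gainRatio = infoGain / splitInfo if splitInfo > 0 else 0.0
        if gainRatio > bestGainRatio:
            bestGainRatio, bestFeature = gainRatio, i
    return bestFeature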

Building the tree recursively:

The sub-modules needed to build a decision tree from a dataset work as follows: take the original dataset and split it on the best feature. Since a feature may have more than two values, the split may create more than two branches. After the first split, the data are passed down to the next node of each branch, where they are split again; so the dataset can be processed recursively.

The recursion ends when:

the program has used up all the features that partition the dataset, or all instances under a branch have the same class. If all instances have the same class, a leaf node (terminating block) is created, and any data that reaches this leaf node belongs to the class of that leaf.

Write code for ID3 algorithm

from math import log
import operator

"""
函数说明:计算给定数据集的经验熵(香农熵)
Parameters:
    dataSet:数据集
Returns:
    shannonEnt:经验熵
Modify:
    2018-03-12

"""
def calcShannonEnt(dataSet):
    #返回数据集行数
    numEntries=len(dataSet)
    #保存每个标签(label)出现次数的字典
    labelCounts={}
    #对每组特征向量进行统计
    for featVec in dataSet:
        currentLabel=featVec[-1]                     #提取标签信息
        if currentLabel not in labelCounts.keys():   #如果标签没有放入统计次数的字典,添加进去
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1                 #label计数

    shannonEnt=0.0                                   #经验熵
    #计算经验熵
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries      #选择该标签的概率
        shannonEnt-=prob*log(prob,2)                 #利用公式计算
    return shannonEnt                                #返回经验熵

"""
函数说明:创建测试数据集
Parameters:无
Returns:
    dataSet:数据集
    labels:分类属性
Modify:
    2018-03-13

"""
def createDataSet():
    # 数据集
    dataSet=[[0, 0, 0, 0, 'no'],
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    #分类属性
    labels=['年龄','有工作','有自己的房子','信贷情况']
    #返回数据集和分类属性
    return dataSet,labels

"""
函数说明:按照给定特征划分数据集

Parameters:
    dataSet:待划分的数据集
    axis:划分数据集的特征
    value:需要返回的特征值
Returns:
    retDataSet:划分后的数据集
Modify:
    2018-03-13

"""
def splitDataSet(dataSet,axis,value):
    #创建返回的数据集列表
    retDataSet=[]
    #遍历数据集
    for featVec in dataSet:
        if featVec[axis]==value:
            #去掉axis特征
            reduceFeatVec=featVec[:axis]
            #将符合条件的添加到返回的数据集
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    #返回划分后的数据集
    return retDataSet

"""
函数说明:选择信息增益最大的特征
Parameters:
    dataSet:数据集
Returns:
    bestFeature:信息增益最大特征的索引值
Modify:
    2018-03-13

"""


def chooseBestFeatureToSplit(dataSet):
    #特征数量
    numFeatures = len(dataSet[0]) - 1
    #计数数据集的香农熵
    baseEntropy = calcShannonEnt(dataSet)
    #信息增益
    bestInfoGain = 0.0
    #最优特征的索引值
    bestFeature = -1
    #遍历所有特征
    for i in range(numFeatures):
        # 获取dataSet的第i个所有特征
        featList = [example[i] for example in dataSet]
        #创建set集合{},元素不可重复
        uniqueVals = set(featList)
        #经验条件熵
        newEntropy = 0.0
        #计算信息增益
        for value in uniqueVals:
            #subDataSet划分后的子集
            subDataSet = splitDataSet(dataSet, i, value)
            #计算子集的概率
            prob = len(subDataSet) / float(len(dataSet))
            #根据公式计算经验条件熵
            newEntropy += prob * calcShannonEnt((subDataSet))
        #信息增益
        infoGain = baseEntropy - newEntropy
        #打印每个特征的信息增益
        print("第%d个特征的增益为%.3f" % (i, infoGain))
        #计算信息增益
        if (infoGain > bestInfoGain):
            #更新信息增益,找到最大的信息增益
            bestInfoGain = infoGain
            #记录信息增益最大的特征的索引值
            bestFeature = i
            #返回信息增益最大特征的索引值
    return bestFeature

"""
函数说明:统计classList中出现次数最多的元素(类标签)
Parameters:
    classList:类标签列表
Returns:
    sortedClassCount[0][0]:出现次数最多的元素(类标签)
Modify:
    2018-03-13

"""
def majorityCnt(classList):
    classCount={}
    #统计classList中每个元素出现的次数
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1                  #计数放在if之外,每个元素都要累加
    #根据字典的值降序排列
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

"""
函数说明:创建决策树

Parameters:
    dataSet:训练数据集
    labels:分类属性标签
    featLabels:存储选择的最优特征标签
Returns:
    myTree:决策树
Modify:
    2018-03-13

"""
def createTree(dataSet,labels,featLabels):
    #取分类标签(是否放贷:yes or no)
    classList=[example[-1] for example in dataSet]
    #如果类别完全相同,则停止继续划分
    if classList.count(classList[0])==len(classList):
        return classList[0]
    #遍历完所有特征时返回出现次数最多的类标签
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    #选择最优特征
    bestFeat=chooseBestFeatureToSplit(dataSet)
    #最优特征的标签
    bestFeatLabel=labels[bestFeat]
    featLabels.append(bestFeatLabel)
    #根据最优特征的标签生成树
    myTree={bestFeatLabel:{}}
    #删除已经使用的特征标签
    del(labels[bestFeat])
    #得到训练集中所有最优特征的属性值
    featValues=[example[bestFeat] for example in dataSet]
    #去掉重复的属性值
    uniqueVls=set(featValues)
    #遍历特征,创建决策树
    for value in uniqueVls:
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),
                                               labels,featLabels)
    return myTree

if __name__=='__main__':
    dataSet,labels=createDataSet()
    featLabels=[]
    myTree=createTree(dataSet,labels,featLabels)
    print(myTree)



result:

第0个特征的增益为0.083
第1个特征的增益为0.324
第2个特征的增益为0.420
第3个特征的增益为0.363
第0个特征的增益为0.252
第1个特征的增益为0.918
第2个特征的增益为0.474
{'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}

3. Pruning of decision tree

The decision tree generation algorithm recursively generates a decision tree until it cannot continue. The generated tree is often very accurate in the classification of the training data, but the classification of the unknown test data is not so accurate, that is, the phenomenon of overfitting will occur.

The reason for overfitting is that too much consideration is given to how to improve the correct classification of training data during learning, thereby constructing an overly complex decision tree. The solution is to consider the complexity of the decision tree and simplify the generated tree.

Pruning : cut some subtrees or leaf nodes from the generated tree, and use its root node or parent node as a new leaf node, thereby simplifying the classification tree model.

Implementation: minimize the overall loss function (cost function) of the decision tree.

The loss function for decision tree learning is defined as:

$$C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha|T|$$

where:

| parameter | meaning |
| --- | --- |
| $t$ | a leaf node of the tree $T$ |
| $H_t(T)$ | the entropy of the $t$-th leaf node |
| $N_t$ | the number of training samples contained in leaf node $t$ |
| $\alpha$ | the penalty factor |
| $\lvert T\rvert$ | the number of leaf nodes of the tree $T$ |

Writing $C(T) = \sum_{t=1}^{|T|} N_t H_t(T)$ for the first term, we get:

$$C_\alpha(T) = C(T) + \alpha|T|$$

where:

| parameter | meaning |
| --- | --- |
| $C(T)$ | the prediction error of the model on the training data, i.e. how well the model fits the training data |
| $\lvert T\rvert$ | the complexity of the model |
| $\alpha$ | a parameter $\alpha \ge 0$ that controls the trade-off between the two: a larger $\alpha$ favors simpler models (trees), a smaller $\alpha$ favors more complex models (trees), and $\alpha = 0$ means only the fit to the training data is considered, not the model complexity |

Pruning means selecting, for a fixed $\alpha$, the model with the smallest loss function, i.e. the subtree with the smallest loss.

  • For a fixed value of $\alpha$, a larger subtree fits the training data better, but the model is more complex;
  • a smaller subtree has lower model complexity, but often fits the training data less well;
  • the loss function expresses exactly the balance between the two.
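To make the trade-off concrete, here is a minimal sketch (illustrative helper names, not from the book's code) that evaluates $C_\alpha(T)$ given the list of class labels that fall into each leaf; each leaf contributes $N_t H_t(T)$, and $\alpha|T|$ penalizes the number of leaves:

from math import log
from collections import Counter

def leaf_cost(leaf_labels):
    """N_t * H_t(T): sample count of the leaf times the entropy of its labels."""
    n = len(leaf_labels)
    ent = -sum((c / float(n)) * log(c / float(n), 2) for c in Counter(leaf_labels).values())
    return n * ent

def tree_loss(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|, with `leaves` a list of per-leaf label lists."""
    return sum(leaf_cost(labels) for labels in leaves) + alpha * len(leaves)

# toy comparison: a pure 2-leaf split versus collapsing everything into one mixed leaf
print(tree_loss([['yes', 'yes'], ['no', 'no', 'no']], alpha=1.0))   # 0 + 2*alpha = 2.0
print(tree_loss([['yes', 'yes', 'no', 'no', 'no']], alpha=1.0))     # 5*H(2/5, 3/5) + alpha ≈ 5.855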

The loss function treats the uncertainty at each terminal point (leaf node) as the classification cost and the number of leaf nodes as the model complexity, used as a penalty term: the first term of the loss function is the training error on the samples, and the second term is the complexity of the model.

If a subtree has a larger loss, that subtree is worse, so we want the loss of every subtree to be as small as possible. Minimizing the loss function is exactly model selection by regularized maximum likelihood estimation.

The pruning (generalization) process of a decision tree recurses upward from the leaf nodes. Denote by $T_B$ the tree obtained after all the children of a node are retracted into their parent (the class of the new leaf is the majority class among its samples), and by $T_A$ the tree before retraction. If $C_\alpha(T_A) \ge C_\alpha(T_B)$, the loss does not increase after the retraction, so the subtree should be retracted; recurse until no further retraction is possible. Pruning greedily in this way reduces the value of the loss function and also improves the generalization of the decision tree.

We can see that tree generation only considers fitting the training data better by increasing information gain, whereas pruning also considers reducing the model's complexity by optimizing the loss function.

Minimizing the loss function defined by $C_\alpha(T) = C(T) + \alpha|T|$ is equivalent to regularized maximum likelihood estimation. A schematic of the pruning process:

[Figure: schematic of the pruning process]

Decision tree algorithms tend to overfit, so pruning is used to prevent overfitting and improve generalization performance.

Pruning is divided into pre-pruning and post-pruning.

Pre-pruning means that during tree generation, each node is evaluated before it is split; if the split does not improve generalization performance, the split is stopped and the node is marked as a leaf.

Post-pruning means first growing a complete tree from the training set and then examining the non-leaf nodes bottom-up; if replacing the subtree rooted at a node by a leaf improves generalization performance, that subtree is replaced by a leaf.

So how do we judge whether generalization performance improves? The simplest way is the hold-out method: reserve part of the data as a validation set for performance evaluation.
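A minimal sketch of that hold-out check (illustrative function names; predict_with_subtree and predict_with_leaf stand for any callables that map a feature vector to a class label, e.g. wrappers around the classify() function defined later in this chapter):

def holdout_accuracy(predict, validation_set):
    """Accuracy of a prediction function on held-out rows whose last column is the true label."""
    correct = sum(1 for row in validation_set if predict(row[:-1]) == row[-1])
    return correct / float(len(validation_set))

def replace_with_leaf(predict_with_subtree, predict_with_leaf, validation_set):
    """Post-pruning test: replace the subtree by a leaf only if the leaf does at least as well."""
    return (holdout_accuracy(predict_with_leaf, validation_set)
            >= holdout_accuracy(predict_with_subtree, validation_set))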

3.2.2 Decision Tree Visualization

The code here relies entirely on Matplotlib. If you are not familiar with Matplotlib, learn it first; its details are not repeated here. The functions needed for the visualization are:

getNumLeafs: Get the number of decision tree leaf nodes

getTreeDepth: Get the number of layers of the decision tree

plotNode: draw nodes

plotMidText: Label the value of the directed edge attribute

plotTree: Draw a decision tree

createPlot: Create a drawing panel

from math import log
import operator
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
"""
函数说明:计算给定数据集的经验熵(香农熵)
Parameters:
    dataSet:数据集
Returns:
    shannonEnt:经验熵
Modify:
    2018-03-12

"""
def calcShannonEnt(dataSet):
    #返回数据集行数
    numEntries=len(dataSet)
    #保存每个标签(label)出现次数的字典
    labelCounts={}
    #对每组特征向量进行统计
    for featVec in dataSet:
        currentLabel=featVec[-1]                     #提取标签信息
        if currentLabel not in labelCounts.keys():   #如果标签没有放入统计次数的字典,添加进去
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1                 #label计数

    shannonEnt=0.0                                   #经验熵
    #计算经验熵
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries      #选择该标签的概率
        shannonEnt-=prob*log(prob,2)                 #利用公式计算
    return shannonEnt                                #返回经验熵

"""
函数说明:创建测试数据集
Parameters:无
Returns:
    dataSet:数据集
    labels:分类属性
Modify:
    2018-03-13

"""
def createDataSet():
    # 数据集
    dataSet=[[0, 0, 0, 0, 'no'],
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    #分类属性
    labels=['年龄','有工作','有自己的房子','信贷情况']
    #返回数据集和分类属性
    return dataSet,labels

"""
函数说明:按照给定特征划分数据集

Parameters:
    dataSet:待划分的数据集
    axis:划分数据集的特征
    value:需要返回的特征值
Returns:
    retDataSet:划分后的数据集
Modify:
    2018-03-13

"""
def splitDataSet(dataSet,axis,value):
    #创建返回的数据集列表
    retDataSet=[]
    #遍历数据集
    for featVec in dataSet:
        if featVec[axis]==value:
            #去掉axis特征
            reduceFeatVec=featVec[:axis]
            #将符合条件的添加到返回的数据集
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    #返回划分后的数据集
    return retDataSet

"""
函数说明:选择信息增益最大的特征
Parameters:
    dataSet:数据集
Returns:
    bestFeature:信息增益最大特征的索引值
Modify:
    2018-03-13

"""


def chooseBestFeatureToSplit(dataSet):
    #特征数量
    numFeatures = len(dataSet[0]) - 1
    #计数数据集的香农熵
    baseEntropy = calcShannonEnt(dataSet)
    #信息增益
    bestInfoGain = 0.0
    #最优特征的索引值
    bestFeature = -1
    #遍历所有特征
    for i in range(numFeatures):
        # 获取dataSet的第i个所有特征
        featList = [example[i] for example in dataSet]
        #创建set集合{},元素不可重复
        uniqueVals = set(featList)
        #经验条件熵
        newEntropy = 0.0
        #计算信息增益
        for value in uniqueVals:
            #subDataSet划分后的子集
            subDataSet = splitDataSet(dataSet, i, value)
            #计算子集的概率
            prob = len(subDataSet) / float(len(dataSet))
            #根据公式计算经验条件熵
            newEntropy += prob * calcShannonEnt((subDataSet))
        #信息增益
        infoGain = baseEntropy - newEntropy
        #打印每个特征的信息增益
        print("第%d个特征的增益为%.3f" % (i, infoGain))
        #计算信息增益
        if (infoGain > bestInfoGain):
            #更新信息增益,找到最大的信息增益
            bestInfoGain = infoGain
            #记录信息增益最大的特征的索引值
            bestFeature = i
            #返回信息增益最大特征的索引值
    return bestFeature

"""
函数说明:统计classList中出现次数最多的元素(类标签)
Parameters:
    classList:类标签列表
Returns:
    sortedClassCount[0][0]:出现次数最多的元素(类标签)
Modify:
    2018-03-13

"""
def majorityCnt(classList):
    classCount={}
    #统计classList中每个元素出现的次数
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1                  #计数放在if之外,每个元素都要累加
    #根据字典的值降序排列
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

"""
函数说明:创建决策树

Parameters:
    dataSet:训练数据集
    labels:分类属性标签
    featLabels:存储选择的最优特征标签
Returns:
    myTree:决策树
Modify:
    2018-03-13

"""
def createTree(dataSet,labels,featLabels):
    #取分类标签(是否放贷:yes or no)
    classList=[example[-1] for example in dataSet]
    #如果类别完全相同,则停止继续划分
    if classList.count(classList[0])==len(classList):
        return classList[0]
    #遍历完所有特征时返回出现次数最多的类标签
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    #选择最优特征
    bestFeat=chooseBestFeatureToSplit(dataSet)
    #最优特征的标签
    bestFeatLabel=labels[bestFeat]
    featLabels.append(bestFeatLabel)
    #根据最优特征的标签生成树
    myTree={bestFeatLabel:{}}
    #删除已经使用的特征标签
    del(labels[bestFeat])
    #得到训练集中所有最优特征的属性值
    featValues=[example[bestFeat] for example in dataSet]
    #去掉重复的属性值
    uniqueVls=set(featValues)
    #遍历特征,创建决策树
    for value in uniqueVls:
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),
                                               labels,featLabels)
    return myTree

"""
函数说明:获取决策树叶子节点的数目

Parameters:
    myTree:决策树
Returns:
    numLeafs:决策树的叶子节点的数目
Modify:
    2018-03-13

"""

def getNumLeafs(myTree):
    numLeafs=0
    firstStr=next(iter(myTree))
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs+=getNumLeafs(secondDict[key])
        else: numLeafs+=1
    return numLeafs

"""
函数说明:获取决策树的层数

Parameters:
    myTree:决策树
Returns:
    maxDepth:决策树的层数

Modify:
    2018-03-13
"""
def getTreeDepth(myTree):
    maxDepth = 0                                                #初始化决策树深度
    firstStr = next(iter(myTree))                                #python3中myTree.keys()返回的是dict_keys,不在是list,所以不能使用myTree.keys()[0]的方法获取结点属性,可以使用list(myTree.keys())[0]
    secondDict = myTree[firstStr]                                #获取下一个字典
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':                #测试该结点是否为字典,如果不是字典,代表此结点为叶子结点
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth            #更新层数
    return maxDepth

"""
函数说明:绘制结点

Parameters:
    nodeTxt - 结点名
    centerPt - 文本位置
    parentPt - 标注的箭头位置
    nodeType - 结点格式
Returns:
    无
Modify:
    2018-03-13
"""
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    arrow_args = dict(arrowstyle="<-")                                            #定义箭头格式
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)        #设置中文字体
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',    #绘制结点
        xytext=centerPt, textcoords='axes fraction',
        va="center", ha="center", bbox=nodeType, arrowprops=arrow_args, FontProperties=font)

"""
函数说明:标注有向边属性值

Parameters:
    cntrPt、parentPt - 用于计算标注位置
    txtString - 标注的内容
Returns:
    无
Modify:
    2018-03-13
"""
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]                                            #计算标注位置
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

"""
函数说明:绘制决策树

Parameters:
    myTree - 决策树(字典)
    parentPt - 标注的内容
    nodeTxt - 结点名
Returns:
    无
Modify:
    2018-03-13
"""
def plotTree(myTree, parentPt, nodeTxt):
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")                                        #设置结点格式
    leafNode = dict(boxstyle="round4", fc="0.8")                                            #设置叶结点格式
    numLeafs = getNumLeafs(myTree)                                                          #获取决策树叶结点数目,决定了树的宽度
    depth = getTreeDepth(myTree)                                                            #获取决策树层数
    firstStr = next(iter(myTree))                                                            #下个字典
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)    #中心位置
    plotMidText(cntrPt, parentPt, nodeTxt)                                                    #标注有向边属性值
    plotNode(firstStr, cntrPt, parentPt, decisionNode)                                        #绘制结点
    secondDict = myTree[firstStr]                                                            #下一个字典,也就是继续绘制子结点
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD                                        #y偏移
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':                                            #测试该结点是否为字典,如果不是字典,代表此结点为叶子结点
            plotTree(secondDict[key],cntrPt,str(key))                                        #不是叶结点,递归调用继续绘制
        else:                                                                                #如果是叶结点,绘制叶结点,并标注有向边属性值
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

"""
函数说明:创建绘制面板

Parameters:
    inTree - 决策树(字典)
Returns:
    无
Modify:
    2018-03-13
"""
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')#创建fig
    fig.clf()#清空fig
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)#去掉x、y轴
    plotTree.totalW = float(getNumLeafs(inTree))#获取决策树叶结点数目
    plotTree.totalD = float(getTreeDepth(inTree))#获取决策树层数
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0#x偏移
    plotTree(inTree, (0.5,1.0), '')#绘制决策树
    plt.show()#显示绘制结果

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    print(myTree)
    createPlot(myTree)


3.2.3 The difference between ID3, C4.5 and CART

These three are very famous decision tree algorithms. To put it simply and roughly, ID3 uses information gain as a criterion for selecting features; C4.5 uses information gain ratio as a criterion for selecting features; CART uses Gini index as a criterion for selecting features.

1. ID3
Entropy represents the amount of information contained in the data. The smaller the entropy, the higher the purity of the data, i.e. the more consistent the data are, which is what we want each child node to look like after a split.

Information gain = entropy before the split - entropy after the split. The larger the information gain, the greater the "purity improvement" obtained by splitting on attribute a; in other words, splitting the training set on attribute a yields purer subsets.

ID3 can only be used for classification problems, and it can only handle discrete attributes.

2. C4.5

C4.5 overcomes ID3's restriction to discrete attributes and the tendency of information gain to favor features with many values, by using the information gain ratio to select features. Information gain ratio = information gain / entropy before the split; the feature with the largest information gain ratio is selected as the optimal feature.

To handle continuous features, C4.5 first sorts the feature values and uses the midpoint of every two consecutive values as a candidate split. It tries each split, computes the corrected information gain, and selects the split point with the largest gain as the split point for that attribute.
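A minimal sketch of that midpoint procedure for a single continuous feature (illustrative names; it computes the plain information gain at each candidate threshold, not the corrected gain mentioned above):

from math import log
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / float(n)) * log(c / float(n), 2) for c in Counter(labels).values())

def best_continuous_split(values, labels):
    """Try every midpoint between consecutive sorted values of a continuous feature
    and return (best_threshold, best_information_gain)."""
    base = _entropy(labels)
    pairs = sorted(zip(values, labels))
    best_gain, best_thr = 0.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                                  # no threshold between equal values
        thr = (v1 + v2) / 2.0
        left = [lab for v, lab in pairs if v <= thr]
        right = [lab for v, lab in pairs if v > thr]
        cond = (len(left) / float(len(pairs))) * _entropy(left) \
             + (len(right) / float(len(pairs))) * _entropy(right)
        if base - cond > best_gain:
            best_gain, best_thr = base - cond, thr
    return best_thr, best_gain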

3. CART

The difference between CART and ID3/C4.5 is that the tree generated by CART must be binary: whether the problem is regression or classification, whether the feature is discrete or continuous, and whether it has two or more values, an internal node can only split the data into two parts by the attribute value.

CART stands for Classification and Regression Tree. As you should know from the name, CART can be used for both classification and regression problems.

In a regression tree, features and split points are selected by the squared-error minimization criterion. The predicted value at each leaf node is the mean of the target values of all samples assigned to that leaf, which minimizes the squared error for the given partition.

To determine the optimal split, we must traverse all attributes and all their values, try each split, compute the minimum squared error under that split, and choose the smallest one as the basis for the split. Because the regression tree is generated with the squared-error minimization criterion, it is also called a least-squares regression tree.
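A minimal sketch of that least-squares search for one feature (illustrative names): each candidate threshold splits the samples into two halves, the prediction in each half is the mean target value, and the score is the total squared error:

def best_regression_split(values, targets):
    """Least-squares split search for one continuous feature:
    returns (best_threshold, best_total_squared_error)."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / float(len(ys))
        return sum((y - mean) ** 2 for y in ys)

    pairs = sorted(zip(values, targets))
    best_err, best_thr = float('inf'), None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                                  # no threshold between equal values
        thr = (v1 + v2) / 2.0
        left = [t for v, t in pairs if v <= thr]
        right = [t for v, t in pairs if v > thr]
        err = sse(left) + sse(right)                  # squared error with per-half mean predictions
        if err < best_err:
            best_err, best_thr = err, thr
    return best_thr, best_err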

In a classification tree, features and split points are selected by the Gini index minimization criterion.

The Gini index measures the uncertainty, or impurity, of a set: the larger the Gini index, the more uncertain and impure the set, similar to entropy. Another way to understand it is that minimizing the Gini index minimizes the probability of misclassification.
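To make this concrete, a small sketch that computes the Gini index of a set of labels, Gini(D) = 1 - Σ_k p_k²:

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: 0 for a pure set, larger for a more mixed (impure) set."""
    n = len(labels)
    return 1.0 - sum((c / float(n)) ** 2 for c in Counter(labels).values())

# the 9 "lend" / 6 "do not lend" loan data: 1 - (0.6^2 + 0.4^2) = 0.48
print(gini(['yes'] * 9 + ['no'] * 6))   # 0.48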

Information Gain vs Information Gain Ratio

The information gain ratio was introduced because of a drawback of information gain: information gain is biased toward attributes with many values. The information gain ratio adds a penalty on top of it to address this.

Gini Index vs Entropy

Both of these measure the uncertainty (impurity) of data. So what is the difference between the two?

  • The calculation of the Gini index does not require logarithmic operations, which is more efficient;
  • The Gini index is more biased towards continuous attributes, and the entropy is more biased towards discrete attributes.

3.3 Classification using decision trees

After constructing a decision tree from the training data, we can use it to classify the actual data.

  • To classify data, we need the decision tree together with the label vector used to build it.
  • The program then compares the test data with the values in the decision tree and repeats the process recursively until it reaches a leaf node.
  • Finally, the test data is assigned the class of that leaf node.

In the code that builds the decision tree you can see a featLabels parameter. What is it for? It records the feature chosen at each split node. When using the decision tree for prediction, we can then supply the attribute values of the required nodes in order.

For example, if I use the trained decision tree above for classification, I only need to provide two pieces of information, whether the person owns a house and whether they have a job; no other information is needed.

from math import log
import operator

"""
函数说明:计算给定数据集的经验熵(香农熵)
Parameters:
    dataSet:数据集
Returns:
    shannonEnt:经验熵
Modify:
    2018-03-12

"""
def calcShannonEnt(dataSet):
    #返回数据集行数
    numEntries=len(dataSet)
    #保存每个标签(label)出现次数的字典
    labelCounts={}
    #对每组特征向量进行统计
    for featVec in dataSet:
        currentLabel=featVec[-1]                     #提取标签信息
        if currentLabel not in labelCounts.keys():   #如果标签没有放入统计次数的字典,添加进去
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1                 #label计数

    shannonEnt=0.0                                   #经验熵
    #计算经验熵
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries      #选择该标签的概率
        shannonEnt-=prob*log(prob,2)                 #利用公式计算
    return shannonEnt                                #返回经验熵

"""
函数说明:创建测试数据集
Parameters:无
Returns:
    dataSet:数据集
    labels:分类属性
Modify:
    2018-03-13

"""
def createDataSet():
    # 数据集
    dataSet=[[0, 0, 0, 0, 'no'],
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    #分类属性
    labels=['年龄','有工作','有自己的房子','信贷情况']
    #返回数据集和分类属性
    return dataSet,labels

"""
函数说明:按照给定特征划分数据集

Parameters:
    dataSet:待划分的数据集
    axis:划分数据集的特征
    value:需要返回的特征值
Returns:
    retDataSet:划分后的数据集
Modify:
    2018-03-13

"""
def splitDataSet(dataSet,axis,value):
    #创建返回的数据集列表
    retDataSet=[]
    #遍历数据集
    for featVec in dataSet:
        if featVec[axis]==value:
            #去掉axis特征
            reduceFeatVec=featVec[:axis]
            #将符合条件的添加到返回的数据集
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    #返回划分后的数据集
    return retDataSet

"""
函数说明:选择信息增益最大的特征
Parameters:
    dataSet:数据集
Returns:
    bestFeature:信息增益最大特征的索引值
Modify:
    2018-03-13

"""


def chooseBestFeatureToSplit(dataSet):
    #特征数量
    numFeatures = len(dataSet[0]) - 1
    #计数数据集的香农熵
    baseEntropy = calcShannonEnt(dataSet)
    #信息增益
    bestInfoGain = 0.0
    #最优特征的索引值
    bestFeature = -1
    #遍历所有特征
    for i in range(numFeatures):
        # 获取dataSet的第i个所有特征
        featList = [example[i] for example in dataSet]
        #创建set集合{},元素不可重复
        uniqueVals = set(featList)
        #经验条件熵
        newEntropy = 0.0
        #计算信息增益
        for value in uniqueVals:
            #subDataSet划分后的子集
            subDataSet = splitDataSet(dataSet, i, value)
            #计算子集的概率
            prob = len(subDataSet) / float(len(dataSet))
            #根据公式计算经验条件熵
            newEntropy += prob * calcShannonEnt((subDataSet))
        #信息增益
        infoGain = baseEntropy - newEntropy
        #打印每个特征的信息增益
        print("第%d个特征的增益为%.3f" % (i, infoGain))
        #计算信息增益
        if (infoGain > bestInfoGain):
            #更新信息增益,找到最大的信息增益
            bestInfoGain = infoGain
            #记录信息增益最大的特征的索引值
            bestFeature = i
            #返回信息增益最大特征的索引值
    return bestFeature

"""
函数说明:统计classList中出现次数最多的元素(类标签)
Parameters:
    classList:类标签列表
Returns:
    sortedClassCount[0][0]:出现次数最多的元素(类标签)
Modify:
    2018-03-13

"""
def majorityCnt(classList):
    classCount={}
    #统计classList中每个元素出现的次数
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1                  #计数放在if之外,每个元素都要累加
    #根据字典的值降序排列
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

"""
函数说明:创建决策树

Parameters:
    dataSet:训练数据集
    labels:分类属性标签
    featLabels:存储选择的最优特征标签
Returns:
    myTree:决策树
Modify:
    2018-03-13

"""
def createTree(dataSet,labels,featLabels):
    #取分类标签(是否放贷:yes or no)
    classList=[example[-1] for example in dataSet]
    #如果类别完全相同,则停止继续划分
    if classList.count(classList[0])==len(classList):
        return classList[0]
    #遍历完所有特征时返回出现次数最多的类标签
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    #选择最优特征
    bestFeat=chooseBestFeatureToSplit(dataSet)
    #最优特征的标签
    bestFeatLabel=labels[bestFeat]
    featLabels.append(bestFeatLabel)
    #根据最优特征的标签生成树
    myTree={bestFeatLabel:{}}
    #删除已经使用的特征标签
    del(labels[bestFeat])
    #得到训练集中所有最优特征的属性值
    featValues=[example[bestFeat] for example in dataSet]
    #去掉重复的属性值
    uniqueVls=set(featValues)
    #遍历特征,创建决策树
    for value in uniqueVls:
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),
                                               labels,featLabels)
    return myTree



"""
使用决策树进行分类
Parameters:
    inputTree;已经生成的决策树
    featLabels:存储选择的最优特征标签
    testVec:测试数据列表,顺序对应最优特征标签
Returns:
    classLabel:分类结果
Modify:2018-03-13

"""
def classify(inputTree,featLabels,testVec):
    # root node of the (sub)tree
    firstStr=next(iter(inputTree))
    # dictionary of subtrees below the root
    secondDict=inputTree[firstStr]
    featIndex=featLabels.index(firstStr)

    for key in secondDict.keys():
        if testVec[featIndex]==key:
            if type(secondDict[key]).__name__=='dict':
                classLabel=classify(secondDict[key],featLabels,testVec)
            else: classLabel=secondDict[key]
    return classLabel

if __name__=='__main__':
    dataSet,labels=createDataSet()
    featLabels=[]
    myTree=createTree(dataSet,labels,featLabels)
    # test sample; the order of its values corresponds to featLabels
    testVec=[0,1]
    result=classify(myTree,featLabels,testVec)

    if result=='yes':
        print('放贷')
    if result=='no':
        print('不放贷')

result:

第0个特征的增益为0.083
第1个特征的增益为0.324
第2个特征的增益为0.420
第3个特征的增益为0.363
第0个特征的增益为0.252
第1个特征的增益为0.918
第2个特征的增益为0.474
放贷

3.4 Storage of decision tree

Constructing a decision tree is a time-consuming task. Even processing a small data set, such as the previous sample data, takes a few seconds. If the data set is large, it will consume a lot of computing time. However, solving classification problems with a well-created decision tree can be done very quickly.

Therefore, to save computing time, it is best to reuse the already constructed decision tree each time classification is performed. To do this, the tree object can be serialized with Python's pickle module: a serialized object can be saved to disk and read back whenever it is needed.

Assuming we have obtained the decision tree {'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}, we can store it with pickle.dump.

import pickle
"""
函数说明:存储决策树
Parameters:
    inputTree:已经生成的决策树
    filename:决策树的存储文件名
Returns:
    无
Modify:
    2018-03-13

"""
def storeTree(inputTree,filename):
    with open(filename,'wb') as fw:
        pickle.dump(inputTree,fw)

if __name__=='__main__':
    myTree={'有自己的房子':{0:{'有工作':{0:'no',1:'yes'}},1:'yes'}}
    storeTree(myTree,'classifierStorage.txt')



After running the code, a classifierStorage.txt file is generated in the same directory as the Python script; it stores our decision tree in binary form.

Loading the tree back with pickle.load is just as simple; the code is as follows:

import pickle

"""
函数说明:读取决策树

Parameters:
    filename:决策树的存储文件名
Returns:
    pickle.load(fr):决策树字典
Modify:
    2018-03-13
"""
def grabTree(filename):
    fr = open(filename, 'rb')
    return pickle.load(fr)

if __name__ == '__main__':
    myTree = grabTree('classifierStorage.txt')
    print(myTree)

3.5 sklearn - use decision tree to predict contact lens type

Dataset download

Steps:

Collect data: use the small dataset provided in the book

Prepare data: preprocess the text data, e.g. parse the data rows

Analyze the data: quickly inspect the data and use the createPlot() function to draw the final tree diagram

Train the decision tree: use the createTree() function

Test the decision tree: write a simple test function to verify the decision tree's output and the plotted result

Use the decision tree: optionally store the trained decision tree so it can be reused at any time

insert image description here

insert image description here
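For reference, a minimal sketch of the training step above, assuming lenses.txt sits in the working directory as a tab-separated file and that calcShannonEnt(), splitDataSet() and createTree() from the earlier sections are defined in the same script:

if __name__ == '__main__':
    with open('lenses.txt', 'r') as fr:
        # each line: four feature values and a class label, separated by tabs
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    featLabels = []
    # build the tree with the hand-written createTree() from section 3.3
    lensesTree = createTree(lenses, lensesLabels, featLabels)
    print(lensesTree)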

3.5.1 Using sklearn to build a decision tree

Official website

sklearn.tree provides decision tree models for solving classification and regression problems.

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

The parameters are described as follows:

criterion: feature selection criterion, optional, default 'gini', can also be set to 'entropy'. Gini is the Gini impurity, the expected error rate of randomly assigning one of the set's outcomes to a data item; it is a statistics-based idea. Entropy is the Shannon entropy discussed earlier in this article, an information-theory-based idea. The ID3 algorithm uses entropy, while the CART algorithm uses the Gini index; sklearn uses gini as the default.

splitter: the strategy used to choose the split at each node, optional, default 'best', can be set to 'random'. 'best' selects the best split according to the chosen criterion (gini or entropy), while 'random' picks the best among a random subset of candidate split points. The default 'best' is suitable when the sample size is not large; when the training set is very large, 'random' is recommended.

max_features: the maximum number of features considered when splitting, optional, default None. The maximum number of features considered when looking for the best split (n_features is the total number of features). There are six cases:

  • If max_features is an integer, consider max_features features;
  • If max_features is a float, consider int(max_features * n_features) features;
  • If max_features is 'auto', then max_features = sqrt(n_features);
  • If max_features is 'sqrt', then max_features = sqrt(n_features), the same as 'auto';
  • If max_features is 'log2', then max_features = log2(n_features);
  • If max_features is None, then max_features = n_features, i.e. all features are used.

Generally, if the number of features is small (say, fewer than 50), the default None is fine. If the number of features is very large, the options above can be used to limit the number of features considered per split and thus control the time needed to build the tree.

max_depth: the maximum depth of the decision tree, optional, default None. This parameter limits the number of levels in the tree; in the loan example, the tree has 2 levels. If set to None, the depth of the subtrees is not limited during construction. This value can usually be ignored when there is little data or few features; alternatively, if min_samples_split is set, a node stops splitting once it contains fewer than min_samples_split samples. For models with many samples and many features, it is recommended to limit the maximum depth; the specific value depends on the data distribution, and values between 10 and 100 are common.

min_samples_split: the minimum number of samples required to split an internal node, optional, default 2. This value limits the conditions under which a subtree keeps splitting.

  • If min_samples_split is an integer, it is the minimum number of samples needed to split an internal node: a node with fewer than min_samples_split samples stops splitting.
  • If min_samples_split is a float, it is interpreted as a fraction, i.e. ceil(min_samples_split * n_samples) samples, rounded up. If the sample size is not large, this value can be left at its default; for very large training sets, increasing it is recommended.

min_weight_fraction_leaf: the minimum fraction of the total sample weights required at a leaf node, optional, default 0. This value limits the minimum of the sum of sample weights at a leaf; if a leaf falls below it, it is pruned together with its sibling nodes. In general this value matters when many samples have missing values, or when the class distribution of the samples is strongly skewed, because sample weights are then introduced.

max_leaf_nodes: The maximum number of leaf nodes, an optional parameter, the default is None. By limiting the maximum number of leaf nodes, overfitting can be prevented. If restrictions are added, the algorithm will build the optimal decision tree within the maximum number of leaf nodes. If there are not many features, this value can be ignored, but if there are many features, it can be limited, and the specific value can be obtained through cross-validation.

class_weight: Category weight, an optional parameter, the default is None, and it can also be a dictionary, a list of dictionaries, or balanced. Specifying the weights of each category of samples is mainly to prevent too many samples of certain categories in the training set, causing the trained decision tree to be too biased towards these categories. The weight of the category can be given in the format {class_label:weight}. Here, you can specify the weight of each sample yourself, or use balanced. If you use balanced, the algorithm will calculate the weight by itself, and the sample weight corresponding to the category with a small sample size will be high. Of course, if your sample category distribution has no obvious bias, you can ignore this parameter and choose the default None.

random_state: optional, default None; the random number seed. If an integer is given, random_state is used as the seed for the random number generator. If no seed is set, the random numbers depend on the current system time and differ at every moment; with a fixed seed, the same random numbers are generated regardless of when the code runs. If a RandomState instance is given, it is used as the random number generator. If None, the random number generator uses np.random.

min_impurity_split: the minimum impurity required to split a node, optional, default 1e-7. This threshold limits the growth of the decision tree: if a node's impurity (Gini index, information gain, mean squared error, or absolute error) is below the threshold, the node is not split further and becomes a leaf.

presort: whether to presort the data, optional boolean, default False (no presorting). Generally, when the sample size is small or the tree depth is limited, presorting can speed up the selection of split points and the construction of the tree; with very large sample sizes it brings no benefit. Since tree building is already fast on small datasets, this parameter can usually be left at its default.

In addition to these parameters, other points to note when tuning are:

When the number of samples is small but the sample features are very large, the decision tree is easy to overfit. Generally speaking, it is easier to build a robust model if the number of samples is more than the number of features.

If the number of samples is small but the number of features is very large, it is recommended to reduce the dimensionality before fitting the decision tree, e.g. with principal component analysis (PCA), feature selection (Lasso) or independent component analysis (ICA). With a greatly reduced feature dimension, the decision tree tends to fit better.

It is recommended to use the visualization of the decision tree more often, and at the same time limit the depth of the decision tree first, so that you can first observe the preliminary fitting of the data in the generated decision tree, and then decide whether to increase the depth.

When training the model, pay attention to the class distribution of the samples (mainly for classification trees). If the classes are very imbalanced, consider using class_weight to keep the model from being biased towards the majority classes.

Decision trees internally use NumPy float32 arrays; if the training data is not in this format, the algorithm copies the data before running.

If the input sample matrix is sparse, it is recommended to convert it to csc_matrix before fitting and to csr_matrix before prediction.
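As a quick illustration of how a few of these parameters are combined, here is a minimal sketch; the parameter values below are arbitrary examples for demonstration, not recommendations:

from sklearn.tree import DecisionTreeClassifier

# example values only; tune max_depth / min_samples_split etc. on your own data
clf = DecisionTreeClassifier(criterion='entropy',      # use Shannon entropy instead of gini
                             max_depth=4,              # limit the depth to reduce overfitting
                             min_samples_split=2,
                             class_weight='balanced',  # compensate for imbalanced classes
                             random_state=0)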

sklearn.tree.DecisionTreeClassifier() also provides a number of methods for us to use.

Data preprocessing: encoding string-valued datasets:

  • LabelEncoder: converts strings to incremental integer values
  • OneHotEncoder: converts strings to integers using one-of-K (one-hot) encoding
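A tiny illustration of the difference between the two encoders (the example values are made up, and the OneHotEncoder call assumes a reasonably recent scikit-learn version that accepts string inputs):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

ages = ['young', 'pre', 'presbyopic', 'young']
print(LabelEncoder().fit_transform(ages))                        # e.g. [2 0 1 2]
print(OneHotEncoder().fit_transform([[a] for a in ages]).toarray())
# each row becomes a one-of-K vector, e.g. [0. 0. 1.] for 'young'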

To encode the string-valued features, we first build a pandas DataFrame, which makes the encoding step convenient. The approach used here is: raw data -> dictionary -> pandas DataFrame. The code is as follows:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# import pydotplus
# from sklearn.externals.six import StringIO

if __name__ == '__main__':
    # load the file
    with open('lenses.txt', 'r') as fr:
        # parse the file: one tab-separated sample per line
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    # extract the class label of each sample into a list
    lenses_target = []
    for each in lenses:
        lenses_target.append(each[-1])
    # feature labels
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    # temporary list holding one column of the lenses data
    lenses_list = []
    # dictionary of columns, used to build the pandas DataFrame
    lenses_dict = {}
    # extract the columns and build the dictionary
    for each_label in lensesLabels:
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print the dictionary
    print(lenses_dict)
    # build the pandas.DataFrame
    lenses_pd = pd.DataFrame(lenses_dict)
    # print the pandas.DataFrame
    print(lenses_pd)
    # create a LabelEncoder object for encoding
    le = LabelEncoder()
    # encode each column
    for col in lenses_pd.columns:
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    print(lenses_pd)
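Continuing inside the if __name__ == '__main__': block above, a minimal sketch of actually training and querying the sklearn tree; max_depth=4 and the test sample are arbitrary choices for illustration:

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=4)
    # fit on the label-encoded features and the string class labels
    clf = clf.fit(lenses_pd.values.tolist(), lenses_target)
    # predict the lens type for one (encoded) test sample
    print(clf.predict([[1, 1, 1, 0]]))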

3.5.2 Visualizing decision trees using Graphviz

1) Install pydotplus

Enter directly under the anaconda prompt:

pip install pydotplus

2) Graphviz download

Install Graphviz following its installation steps, then press Win+R and run sysdm.cpl.


Find "PATH" in "System Variables" in "Environment Variables", and then set the path

C:\Anaconda\pkgs\graphviz-2.38.0-4\Library\binAdd it to the back and restart Pycharm.
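With pydotplus and Graphviz in place, the fitted tree can be exported to a PDF. A minimal sketch, assuming clf and lenses_pd are the objects built in the code of section 3.5.1:

import pydotplus
from io import StringIO
from sklearn import tree

dot_data = StringIO()
# write the fitted tree in Graphviz dot format
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=lenses_pd.keys(),
                     class_names=clf.classes_,
                     filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf('tree.pdf')   # renders the decision tree to tree.pdf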

3.6 Summary

Advantages:

  • Easy to understand and interpret; decision trees can be visualized.
  • Requires little data preprocessing. Other methods often require data normalization, the creation of dummy variables, and the removal of missing values. Note, however, that sklearn's decision trees do not yet support missing values.
  • The cost of using a tree (e.g., predicting data) is logarithmic in the number of training data points.
  • Can handle both numerical and categorical variables, whereas many other techniques are specialized for datasets with only one type of variable.
  • Can handle multi-output problems.
  • Uses a white-box model: an observed situation is easily explained by the tree's if-then logic. By contrast, results from black-box models such as artificial neural networks are very difficult to interpret.
  • Performs reasonably well even when its assumptions are somewhat violated by the true model from which the data were generated.

Disadvantages:

  • Decision tree learning can create overly complex trees that do not generalize well, i.e. overfitting. Mechanisms such as pruning (not currently supported by sklearn), setting the minimum number of samples required at a leaf node, or limiting the maximum depth of the tree help avoid this.
  • Decision trees can be unstable: small variations in the data can produce a completely different tree. This problem is mitigated by using decision trees within an ensemble.
  • Learning an optimal decision tree is NP-complete under several aspects of optimality, even for simple concepts. Practical decision tree algorithms are therefore based on heuristics such as the greedy algorithm, making a locally optimal decision at each node; they cannot guarantee a globally optimal tree. Randomly sampling samples and features, as in ensembles, reduces this bias.
  • Some concepts are hard for decision trees to express and therefore to learn, e.g. XOR, parity, or multiplexer problems.
  • Decision trees create biased trees if some classes dominate. It is therefore recommended to balance the dataset before training.

The decision tree algorithm mainly includes three parts: feature selection, tree generation, and tree pruning. Commonly used algorithms are ID3, C4.5, and CART.

  • feature selection. The purpose of feature selection is to select features that can classify the training set. The key to feature selection is the criterion: information gain, information gain ratio, Gini index;

  • Generation of decision trees. Usually, the maximum information gain, the maximum information gain ratio, and the minimum Gini index are used as the criteria for feature selection. Starting from the root node, a decision tree is recursively generated. It is equivalent to continuously selecting local optimal features, or dividing the training set into subsets that can basically be classified correctly;

  • Pruning of decision trees. The pruning of the decision tree is to prevent the overfitting of the tree and enhance its generalization ability. Includes pre-pruning and post-pruning.

3.7 Gradient Boosting Decision Tree (GBDT)

3.7.1 Overview of GBDT

Gradient Boosting Decision Tree (GBDT), also known as MART (Multiple Additive Regression Tree) or GBRT (Gradient Boosting Regression Tree), is likewise a decision-tree model based on the ensemble idea, but it differs from Random Forest in an essential way.

It is worth mentioning that GBDT is one of the most commonly used machine learning algorithms in competitions, because it applies to a wide variety of scenarios and, even more commendably, delivers outstanding accuracy. This is why many people call GBDT the "Dragon Slaying Knife" of machine learning.

Boosting means iterating: multiple trees jointly make the final decision. How is this achieved?

Could we simply train each tree independently? For example, for person A, the first tree predicts 10 years old, the second 0 years old, and the third 20 years old, and we take the average, 10 years old, as the final conclusion?

Of course not! Apart from the fact that this would be averaging rather than GBDT, three trees trained independently on the same unchanged training set would be exactly identical, which makes the exercise meaningless.

As mentioned before, GBDT sums the outputs of all trees to form the final conclusion, so the output of each individual tree is not the age itself but an increment of age.

The core of GBDT lies in:

Each tree learns the residual between the true value and the sum of the predictions of all previous trees; adding this residual to the accumulated prediction recovers the true value (a minimal code sketch follows the example below).

  • For example, the real age of A is 18 years old, but the predicted age of the first tree is 12 years old, the difference is 6 years old, that is, the residual is 6 years old.
  • Then in the second tree, we set the age of A to be 6 years old to learn. If the second tree can really divide A into the leaf node of 6 years old, then the conclusion of accumulating the two trees is the real age of A.
  • If the second tree instead predicts 5 years old, a residual of 1 year remains, so the target for A in the third tree becomes 1 year old, and learning continues.
  • If our number of iteration rounds is not over, we can continue to iterate below, and the error of the fitting age will decrease in each round of iteration. This is the meaning of Gradient Boosting in GBDT.
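A minimal sketch of this residual-fitting loop using scikit-learn's DecisionTreeRegressor, with a made-up toy dataset, squared-error loss, and a learning rate of 1 (real GBDT implementations add a learning rate and fit the negative gradient of an arbitrary loss):

from sklearn.tree import DecisionTreeRegressor
import numpy as np

# toy data (made up): one feature, the targets are the true ages
X = np.array([[5], [7], [21], [30]])
y = np.array([12.0, 18.0, 25.0, 30.0])

pred = np.zeros_like(y)              # current prediction of the ensemble
for t in range(3):                   # three boosting rounds
    residual = y - pred              # what the previous trees still get wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += stump.predict(X)         # add this tree's correction to the ensemble
    print("round %d, squared error %.3f" % (t, ((y - pred) ** 2).sum()))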

From this we can see the essential difference between GBDT and Random Forest: GBDT does not merely apply the ensemble idea, it is built on learning residuals, as the classic age example above illustrates.

In iteration t of GBDT, suppose the strong learner obtained in the previous iteration is $f_{t-1}(x)$ and its loss function is $L(y, f_{t-1}(x))$. The goal of the current iteration is to find a weak learner, a CART regression tree $h_t(x)$, that minimizes the loss of this round, $L(y, f_t(x)) = L(y, f_{t-1}(x) + h_t(x))$.

In other words, each iteration finds the decision tree that makes the loss on the samples as small as possible.
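In practice this loop is rarely written by hand; scikit-learn's GradientBoostingRegressor (and GradientBoostingClassifier) implements it. A minimal usage sketch on synthetic data, with illustrative parameter values only:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 shallow CART regression trees; each tree is fitted to the negative gradient
# of the current loss (for squared loss this is exactly the residual)
gbdt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, loss='huber', random_state=0)
gbdt.fit(X_train, y_train)
print(gbdt.score(X_test, y_test))   # R^2 on the held-out data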

The main advantages of GBDT are:

  • It can flexibly handle various types of data, including continuous and discrete values.

  • With relatively little time spent on parameter tuning, the prediction accuracy can still be fairly high (compared with SVM, for example).

  • It is quite robust to outliers when robust loss functions are used, such as the Huber loss and the Quantile loss.

The main disadvantages of GBDT are:

  • Because consecutive weak learners depend on each other, training is hard to parallelize. Partial parallelism can be achieved through subsampling, as in stochastic gradient boosting (SGBT).

Reference: Summary of the principle of gradient boosting trees (GBDT)

Another good reference: the blog post "Machine learning algorithm GBDT"

Origin blog.csdn.net/jiaoyangwm/article/details/79525237