Machine Learning Trees - Decision Trees (Part 3: CART Trees)

Foreword

It has been nine months since the last article, orz. Taking advantage of the end of exam review, I'm patching up the blog.

The previous article described the ID3 and C4.5 decision tree algorithms. We know that ID3 splits each node according to the information gain \( \operatorname{Gain}(D, a) = \operatorname{Ent}(D) - \sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \), but it is biased toward features with many values. C4.5 was therefore proposed, which splits on the information gain ratio \( \operatorname{Gain\_ratio}(D, a) = \frac{\operatorname{Gain}(D, a)}{IV(a)} \), where \( IV(a) = -\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log \frac{\left|D^{v}\right|}{|D|} \). However, as you may have noticed, neither ID3 nor C4.5 can do regression. This article introduces the principle of the CART (Classification and Regression Tree) and its implementation.
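As a quick refresher on those two criteria, here is a minimal NumPy sketch (the toy feature, toy labels, and helper names below are my own illustration, not code from the previous article) that computes the entropy, information gain, and gain ratio for one discrete feature.

import numpy as np

def entropy(y):
    # Ent(D) = -sum_k p_k * log2(p_k) over the class proportions in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_and_gain_ratio(x, y):
    # information gain and gain ratio of splitting labels y by discrete feature x
    ent_D = entropy(y)
    values, counts = np.unique(x, return_counts=True)
    weights = counts / counts.sum()
    cond_ent = sum(w * entropy(y[x == v]) for v, w in zip(values, weights))
    iv = -np.sum(weights * np.log2(weights))   # IV(a)
    gain = ent_D - cond_ent                    # Gain(D, a)
    return gain, gain / iv                     # Gain_ratio(D, a)

# toy example: a 3-valued feature and binary labels
x = np.array(['youth', 'youth', 'middle-aged', 'elderly', 'elderly'])
y = np.array([0, 0, 1, 1, 0])
print(gain_and_gain_ratio(x, y))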

CART tree

Gini Coefficient

Unlike the decision trees described earlier, the CART tree is a binary tree, and it chooses splits based on the Gini index (Gini coefficient).

Let's first look at the Gini value
\[ \operatorname{Gini}(D) = \sum_{k=1}^{K} \sum_{k^{\prime} \neq k} p_{k} p_{k^{\prime}} = 1 - \sum_{k=1}^{K} p_{k}^{2} \]
Intuitively, this formula reflects the probability that two samples drawn at random from the dataset D belong to different classes: the smaller the value, the higher the purity.

The Gini index of a split on attribute \(a\) is then
\[ \operatorname{Gini\_Index}(D, a) = \sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right) \]
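As a quick sanity check on these two formulas, here is a tiny sketch (the class counts are made-up toy numbers) that computes the Gini value of two child nodes and the Gini index of the resulting split.

def gini(class_counts):
    # Gini(D) = 1 - sum_k p_k^2, computed from raw class counts
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

# toy split: left child has 8 samples of class A and 2 of class B,
# right child has 1 of class A and 9 of class B
left, right = [8, 2], [1, 9]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right

gini_index = (n_left / n) * gini(left) + (n_right / n) * gini(right)
print(gini(left), gini(right), gini_index)  # 0.32 0.18 0.25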

How splits are chosen

Discrete values

Perhaps you have already noticed that when a CART tree splits on a discrete feature, things are easy if the feature has only two values: one value goes to each side. But what if the feature has three or more values? For example ['Youth', 'middle-aged', 'elderly'] - how do you split then? Of course, you enumerate every possible binary partition; here there are three: [(('Youth'), ('middle-aged', 'elderly')), (('middle-aged'), ('Youth', 'elderly')), (('elderly'), ('Youth', 'middle-aged'))] (see the sketch after the list below). This raised a couple of questions for me:

  1. In data mining competitions, experienced competitors often say that crossing features helps the decision tree find better splits. Isn't that precisely because of this binary partitioning?
  2. Isn't this way of splitting somewhat unsuitable for high-cardinality categorical variables? That may be why replacing such variables with statistical features such as counts often brings a sizeable improvement.
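As mentioned above, enumerating the binary partitions of a categorical feature is straightforward. Here is a minimal sketch (the function name binary_partitions is my own, not part of the implementation below) that lists them for the example feature:

from itertools import combinations

def binary_partitions(values):
    # enumerate all ways to split a set of category values into two
    # non-empty groups (the order of the two groups does not matter)
    values = list(values)
    n = len(values)
    partitions = []
    for size in range(1, n // 2 + 1):
        for left in combinations(values, size):
            right = tuple(v for v in values if v not in left)
            # when the subset is exactly half the set, skip mirrored duplicates
            if size == n - size and (right, left) in partitions:
                continue
            partitions.append((left, right))
    return partitions

print(binary_partitions(['Youth', 'middle-aged', 'elderly']))
# [(('Youth',), ('middle-aged', 'elderly')),
#  (('middle-aged',), ('Youth', 'elderly')),
#  (('elderly',), ('Youth', 'middle-aged'))]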

Continuous values

We've just dealt with discrete values, so how does a CART tree handle continuous values? Since it is a binary tree, it has to pick a threshold: samples whose value is greater than the threshold are assigned to one child node, and the rest are assigned to the other.

Concretely, the values of the feature column are sorted first. If there are N samples, there are at most N - 1 candidate splits; we traverse them from start to end, take the midpoint of each pair of adjacent values as a candidate split point, compute the Gini index for each, and finally choose the split with the smallest value.
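A minimal sketch of this procedure, under the assumption that the feature and labels are NumPy arrays (note that the full implementation later in this post splits on the raw feature values rather than on midpoints):

import numpy as np

def gini_from_labels(y):
    # Gini value of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_continuous_split(x, y):
    # try the midpoint between every pair of adjacent sorted values of x and
    # return the threshold with the smallest weighted Gini index
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_gini = None, np.inf
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:          # identical values give no new split
            continue
        t = (x[i] + x[i + 1]) / 2     # candidate midpoint
        left, right = y[x <= t], y[x > t]
        g = (len(left) * gini_from_labels(left) +
             len(right) * gini_from_labels(right)) / len(y)
        if g < best_gini:
            best_t, best_gini = t, g
    return best_t, best_gini

x = np.array([4.9, 5.1, 6.3, 6.7])
y = np.array(['setosa', 'setosa', 'virginica', 'virginica'])
print(best_continuous_split(x, y))  # (5.7, 0.0)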

If you care about algorithmic complexity, you'll notice that the CART tree has to traverse every candidate split at each division, which is slow. In XGBoost and LightGBM this computation seems to be accelerated with other strategies (digging a hole here; I'll fill it in later when writing about XGBoost and LightGBM).

CART regression tree

With ID3 and C4.5 as a foundation, plus the description above, I believe it is easy to work out how to do classification with CART. Here I'll talk about how to handle regression problems with a CART tree.

Think about this: a tree model is not like a linear model that can compute a value y directly, so how do we determine the predicted value at each leaf? Mathematically, a regression tree can be viewed as a piecewise function: each leaf node determines an interval of the input space, and on that interval the function outputs the constant value attached to that leaf.

Suppose the CART tree divides the feature space into \(|T|\) regions \(R_i\), and each region corresponds to a value \(b_i\). The corresponding function is
\[ h(x) = \sum_{i=1}^{|T|} b_{i} \mathbb{I}\left(x \in R_{i}\right) \]
So the problem becomes how to divide the regions \(R_i\) and how to determine the value \(b_i\) corresponding to each region \(R_i\).

Suppose the region \(R_j\) is known. Then we can use the squared loss \( \sum_{x^{(i)} \in R_j} (y^{(i)} - h(x^{(i)}))^2 = \sum_{x^{(i)} \in R_j} (y^{(i)} - b_j)^2 \) to find the corresponding \(b_j\); clearly \( b_j = \operatorname{avg}(y^{(i)} \mid x^{(i)} \in R_j) \).

To divide the regions, a heuristic method can be used: select an attribute \(u\) and a corresponding value \(v\) as the splitting attribute and splitting threshold, define two regions \( R_1(u, v) = \{x \mid x_u \le v\} \) and \( R_2(u, v) = \{x \mid x_u > v\} \), and find the optimal splitting attribute and threshold by solving
\[ \min_{u, v}\left[\min_{b_{1}} \sum_{x^{(i)} \in R_{1}(u, v)}\left(y^{(i)}-b_{1}\right)^{2}+\min_{b_{2}} \sum_{x^{(i)} \in R_{2}(u, v)}\left(y^{(i)}-b_{2}\right)^{2}\right], \quad b_{i} = \operatorname{avg}\left(y^{(i)} \mid x^{(i)} \in R_{i}\right) \]
Then the same splitting is repeated on the two resulting regions until the stopping condition is met.
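Here is a minimal sketch of a single regression split under those formulas (the helper names squared_error and best_regression_split are my own; a full regression tree would apply this recursively and add stopping conditions):

import numpy as np

def squared_error(y):
    # sum of squared deviations from the mean: the inner minimization over b
    return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

def best_regression_split(X, y):
    # search every attribute u and threshold v for the split that minimizes the
    # total squared error of the two regions; the leaf values are the means
    best = None  # (error, u, v, b1, b2)
    for u in range(X.shape[1]):
        for v in np.unique(X[:, u])[:-1]:  # the largest value would leave the right region empty
            left, right = y[X[:, u] <= v], y[X[:, u] > v]
            err = squared_error(left) + squared_error(right)
            if best is None or err < best[0]:
                best = (err, u, v, left.mean(), right.mean())
    return best

# toy data: one feature, a step-shaped target
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
print(best_regression_split(X, y))  # best split at v=3.0 with b1=1.0, b2=5.0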

Implementation

Now for the pleasant part, the code. I only implemented the classification tree here; for regression, just replace the Gini index inside with the squared-error criterion from the formula above and minimize that instead.

import csv
import numpy as np
import treePlotter  # tree-plotting helper module used with this code (not shown here)

def createDataSetIris():
    '''
    Function: load the iris dataset and preprocess it.
    Returns:
        rawData: the raw data rows (index column removed), kept for reference
        data: the dataset used to build the decision tree (shuffled, so somewhat random)
        features: list of feature names
        labels: list of class label names
    '''
    labels = ["setosa","versicolor","virginica"]
    with open('iris.csv','r') as f:
        rawData = np.array(list(csv.reader(f)))
        features = np.array(rawData[0,1:-1]) 
        dataSet = np.array(rawData[1:,1:]) # drop the header row and the index column
        np.random.shuffle(dataSet) # shuffle (without the array() copy above this would be a view and rawData would be shuffled too)
        data = dataSet[0:,1:] 
    return rawData[1:,1:], data, features, labels

rawData, data, features, labels = createDataSetIris()

rawData, data, features, labels = createDataSetIris()

def calcGiniIndex(dataSet):
    '''
    Function: compute the Gini value of a dataset.
    Parameter: dataSet: the dataset
    Returns: the Gini value
    ''' 
    counts = [] # number of occurrences of each label in the dataset
    count = len(dataSet) # size of the dataset
    for label in labels:
        counts.append([d[-1] == label for d in dataSet].count(True))
    
    gini = 0
    for value in counts:
        gini += (value / count) ** 2
    
    return 1 - gini

def binarySplitDataSet(dataSet, feature, value):
    '''
    Function: split the dataset into left and right subsets by one value of a feature column.
    Parameters: dataSet: the dataset
        feature: index of a feature column in the dataset
        value: a value of that feature column
    Returns: the left and right sub-datasets
    '''
    matLeft = [d for d in dataSet if d[feature] <= value]
    matRight = [d for d in dataSet if d[feature] > value]
    return matLeft,matRight

def classifyLeaf(dataSet, labels):
    '''
    Function: find the most frequent label in the dataset, used to classify a leaf node.
    Parameters: dataSet: the dataset
        labels: list of label names
    Returns: the index of that label
    '''
    counts = [] 
    for label in labels:
        counts.append([d[-1] == label for d in dataSet].count(True))
    return np.argmax(counts) # argmax: index at which counts is maximal

def chooseBestSplit(dataSet, labels, leafType=classifyLeaf, errType=calcGiniIndex, threshold=(0.01,4)):
    '''
    Function: use the Gini index to choose the best splitting feature and split point.
    Parameters: dataSet: the dataset
        leafType: leaf output function (classification in this experiment)
        errType: loss function used as the splitting criterion (Gini index for classification)
        threshold: (Gini threshold, sample-count threshold); stop splitting when the node's
                   Gini improvement or sample count falls below these
    Returns: bestFeatureIndex: the splitting feature
        bestFeatureValue: the best split point of that feature
    '''
    thresholdErr = threshold[0] # Gini threshold
    thresholdSamples = threshold[1] # sample-count threshold
    err = errType(dataSet)
    bestErr = np.inf
    bestFeatureIndex = 0 # index of the best feature
    bestFeatureValue = 0 # best split point of that feature

    # if all outputs in the dataset are equal, return a leaf node
    # (i.e. feature=None, value=the node's class)
    if err == 0:
        return None, dataSet[0][-1]
    # if the dataset has fewer samples than twice the threshold, stop splitting and return a leaf node
    if len(dataSet) < 2 * thresholdSamples:
        return None, labels[leafType(dataSet, labels)] #dataSet[0][-1]
    # try every value of every feature, binary-split the dataset,
    # compute err (the Gini index in this experiment) and keep the best
    for i in range(len(dataSet[0]) - 1):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList) # possible values of the i-th feature
        for value in uniqueVals:
            leftSet,rightSet = binarySplitDataSet(dataSet, i, value)
            if len(leftSet) < thresholdSamples or len(rightSet) < thresholdSamples:
                continue
#             print(len(leftSet), len(rightSet))
            gini = (len(leftSet) * calcGiniIndex(leftSet) + len(rightSet) * calcGiniIndex(rightSet)) / (len(leftSet) + len(rightSet))
            if gini < bestErr:
                bestErr = gini
                bestFeatureIndex = i
                bestFeatureValue = value
    # check the Gini threshold; if the improvement is too small, stop splitting and return a leaf node
    if err - bestErr < thresholdErr:
        return None, labels[leafType(dataSet, labels)] 
    
    return bestFeatureIndex,bestFeatureValue

def createTree_CART(dataSet, labels, leafType=classifyLeaf, errType=calcGiniIndex, threshold=(0.01,4)):
    '''
    Function: build the CART tree.
    Parameters: same as above
    Returns: the CART tree
    '''
    feature,value = chooseBestSplit(dataSet, labels, leafType, errType, threshold)
#     print(features[feature])
    # a leaf node: return its class (chooseBestSplit returns feature=None at a leaf)
    if feature is None:
        return value
    # otherwise create a branch and recursively build the subtrees
#     print(feature, value, len(dataSet))
    leftSet,rightSet = binarySplitDataSet(dataSet, feature, value)   
    myTree = {}
    myTree[features[feature]] = {}
    myTree[features[feature]]['<=' + str(value) + ' contains' + str(len(leftSet))] = createTree_CART(leftSet, np.array(leftSet)[:,-1], leafType, errType,threshold)
    myTree[features[feature]]['>' + str(value) + ' contains' + str(len(rightSet))] = createTree_CART(rightSet, np.array(rightSet)[:,-1], leafType, errType,threshold)
    
    return myTree

CARTTree = createTree_CART(data, labels, classifyLeaf, calcGiniIndex, (0.01,4))
treePlotter.createPlot(CARTTree)

Origin www.cnblogs.com/csu-lmw/p/12099511.html