[Basics of Machine Learning] Decision Tree Classification Algorithm


A decision tree is, at its core, a tree in which each leaf node represents a class. Starting from the root and following the branches that match a sample's feature values, you eventually reach a leaf node, which tells you which class the sample belongs to.


1. How to choose the best decision

1. Principle of Occam's Razor

"If it is not necessary, do not increase its value." That is to say, when making a decision, we have to choose the way that can get the fastest result. To put it more bluntly is "You can use three points, don't move five to succeed."

2. Information Entropy

To apply Occam's razor to decision trees, we need to introduce information entropy. I won't elaborate on it here; it is enough to understand that information entropy measures how disordered a set of information is. The more orderly a data set, the smaller its entropy. To reach the state of minimum entropy as quickly as possible, we should always split in the direction that reduces entropy the most, i.e. the direction with the largest information gain.
Information entropy is calculated as follows. Let X = {x_1, x_2, ..., x_n} be the set of possible values of a variable and P(x_i) the probability of each value:

H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)
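As a quick sketch (separate from the article's implementation further below), the entropy of a list of class labels, such as the deposit column of the data set, can be computed like this:

import math
from collections import Counter

def shannon_entropy(labels):
    # labels is assumed to be a list of class values, e.g. the deposit column
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(["yes"] * 10))              # 0.0  (perfectly ordered)
print(shannon_entropy(["yes"] * 5 + ["no"] * 5))  # 1.0  (maximum disorder)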

If the data set has several features, each feature also has a conditional entropy. Let T = {t_1, t_2, ..., t_m} be the possible values of a feature T, and let |S_i|/|S| be the fraction of samples in which T takes the value t_i. The conditional entropy is:

H(X|T) = \sum_{i=1}^{m} \frac{|S_i|}{|S|} H(X \mid T = t_i)

The information gain (the "entropy increment" used in the code below) is then the base information entropy minus the conditional entropy:

Gain(T) = H(X) - H(X|T)
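A small worked example with toy numbers (not taken from the bank data set): suppose a set S of 10 samples contains 6 yes and 4 no, and a binary feature T splits it into S_1 with 4 samples (all yes) and S_2 with 6 samples (2 yes, 4 no). Then

H(X) = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.971
H(X|T) = \frac{4}{10} \cdot 0 + \frac{6}{10} \cdot 0.918 \approx 0.551
Gain(T) = 0.971 - 0.551 \approx 0.420

where 0.918 is the entropy of S_2 alone. A feature whose split yields a larger gain than this would be preferred.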

3. Decision tree construction

Compute the information gain of every feature on the data set and select the feature with the largest gain. Split the entire data set by that feature's values to obtain smaller subsets, then repeat the process on each subset until a final predicted value is reached. (This may sound a bit abstract; see the example and code below.)

2. Examples

The data comes from a bank marketing data set on Kaggle (reference 2). The goal is to predict whether a customer will make a term deposit. The data set can be downloaded from the link in the references. A sample of the data is shown below; the final column, deposit, is the value to be predicted.

age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 admin. married secondary no 2343 yes no unknown 5 may 1042 1 -1 0 unknown yes
1 56 admin. married secondary no 45 no no unknown 5 may 1467 1 -1 0 unknown yes
2 41 technician married secondary no 1270 yes no unknown 5 may 1389 1 -1 0 unknown yes
3 55 services married secondary no 2476 yes no unknown 5 may 579 1 -1 0 unknown yes
4 54 admin. married tertiary no 184 no no unknown 5 may 673 2 -1 0 unknown yes
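A preview like the one above can be produced with pandas once the CSV has been downloaded from the Kaggle link in the references (the file name bank.csv below simply matches the path used later in this article; adjust it to wherever you saved the file):

import pandas as pd

# load the bank marketing CSV and show the first rows
df = pd.read_csv("bank.csv")
print(df.head())   # the five rows shown above
print(df.shape)    # (number of rows, number of columns)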

From the statistics of this data set we can list the feature variables and their values, divided into two types.
1. Features with string values
[1] job : admin, technician, services, management, retired, blue-collar, unemployed, entrepreneur, housemaid, unknown, self-employed, student
[2] marital : married, single, divorced
[3] education : secondary, tertiary, primary, unknown
[4] default : yes, no
[5] housing : yes, no
[6] loan : yes, no
[7] deposit : yes, no (Dependent Variable)
[8] contact : unknown, cellular, telephone
[9] month: jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
[10] poutcome : unknown, other, failure, success
2. Features with numeric values
[1] age
[2] balance
[3] day
[4] duration
[5] campaign
[6] pdays
[7] previous
To build the tree on this data set we need to select a feature to split on. Which one? The feature with the largest information gain. For the first split of this data set, the feature with the maximum gain turns out to be balance (by running the calculation). The data set is divided into subsets according to the values of balance; on each subset the information gains are computed again and another split is made, and so on until the final predicted value is obtained.
[Note] Some features of this data set are numeric, so splitting on them produces branches that are far too fine-grained and the tree does not generalize well. The numeric data can be discretized so that all values within a range are treated as one category. Discretization is not done in this article.
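Although discretization is skipped here, a minimal sketch with pandas would look like the following (the bin edges for balance are arbitrary and only for illustration):

import pandas as pd

df = pd.read_csv("bank.csv")
# bucket the numeric balance column into a few coarse ranges
df["balance_bin"] = pd.cut(
    df["balance"],
    bins=[-float("inf"), 0, 1000, 5000, float("inf")],
    labels=["<=0", "0-1000", "1000-5000", ">5000"],
)
print(df["balance_bin"].value_counts())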

3. Code implementation

# Decision tree prediction

import numpy as np
import pandas as pd
import math

class BankPredict():

    def __init__(self):
        pass

    # Read the data from the CSV file and return the rows plus the column names
    def import_data(self, filename):
        input_csvdata = pd.read_csv(filename)
        feature_name = np.array(input_csvdata.columns.values).tolist()
        input_data_set = np.array(input_csvdata).tolist()

        # print(feature_name)
        # print(input_data_set)

        return input_data_set, feature_name

    # Split the data set: keep the rows where the feature at index axis equals class_value
    def split_data(self, data_set, axis, class_value):
        ret_data_set = []

        for feat_vector in data_set:
            if feat_vector[axis] == class_value: # keep rows with this value and drop the used feature column
                reduce_feat_vector = feat_vector[0:axis]
                reduce_feat_vector.extend(feat_vector[axis+1:])
                ret_data_set.append(reduce_feat_vector)
        # print (ret_data_set)
        return ret_data_set

    # Compute the Shannon entropy of a data set
    def cal_shannon_evt(self, sub_data_set):
        # Only the class label (the last column of each row, yes/no for this problem)
        # is used here, so sub_data_set effectively acts as an n*1 column of labels
        class_count = {}
        for item in sub_data_set:
            class_count[item[-1]] = class_count.get(item[-1], 0) + 1
        # print(class_count)
        # entropy over the class label distribution
        shannon_evt = 0.0
        data_size = len(sub_data_set)
        for class_item in class_count:
            pxi = (float(class_count[class_item])/float(data_size))
            shannon_evt = shannon_evt - pxi*math.log2(pxi)
        return shannon_evt
    
    # Compute the conditional entropy
    def cal_condition_evt(self, data_set, axis, class_value):
        # Conditional entropy after splitting on the feature at index axis:
        # each subset's entropy weighted by the probability of that feature value;
        # class_value holds the distinct values the feature can take
        condition_evt = 0.0
        
        data_size = len(data_set)
        for value in class_value:
            sub_data_set = self.split_data(data_set, axis, value)
            sub_shannon_evt = self.cal_shannon_evt(sub_data_set)

            # weight each subset's entropy by its share of the samples
            condition_evt = condition_evt + (float(len(sub_data_set))/data_size)*sub_shannon_evt
        
        return condition_evt

    # Compute the information gain of a feature
    def inc_evt(self, data_set, base_evt, axis):
        # take the column of values for the feature at index axis
        feature_list = [item[axis] for item in data_set]
        # print(feature_list)
        class_value = set(feature_list)
        new_evt = self.cal_condition_evt(data_set, axis, class_value)
        # information gain = base entropy - conditional entropy
        ie = base_evt - new_evt 
        return ie

    # Choose the feature with the largest information gain as the split
    def choose_best_feature(self, data_set):
        feature_num = len(data_set[0]) - 1 # exclude the label column
        base_evt = self.cal_shannon_evt(data_set)
        best_evt = 0.0
        best_feature = -1
        for axis in range(feature_num):
            axis_feature_evt = self.inc_evt(data_set, base_evt, axis)
            if axis_feature_evt > best_evt:
                best_evt = axis_feature_evt
                best_feature = axis
        
        # return the index of the feature with the largest gain
        return best_feature

    # When all features have been used up, return the class that appears most often
    def majority_class(self, class_list):
        class_count = {}
        for item in class_list:
            class_count[item] = class_count.get(item, 0) + 1
        temp_num = 0
        result_class = ""
        for item in class_count:
            if temp_num < class_count[item]:
                temp_num = class_count[item]
                result_class = item
        return result_class

    # Build the decision tree recursively
    def create_decision_tree(self, data_set, label):
        # list of class labels; for this problem only yes/no
        class_list = [example[-1] for example in data_set]
        '''
        Recursion terminates in two cases:
        1. After this split the class list contains only one class: return it directly.
        2. All features have been used but the samples still belong to more than one
           class: return the class that appears most often.
        '''
        # case 1
        if class_list.count(class_list[0]) == len(class_list):
            return class_list[0]
        # case 2
        if len(data_set[0]) == 1:
            return self.majority_class(class_list)

        best_feature = self.choose_best_feature(data_set)
        best_feature_label = label[best_feature]
        my_tree = {best_feature_label: {}} # subtree rooted at the chosen feature
        # mark the used feature as consumed by removing it from the label list
        del(label[best_feature]) # this removes the entry from the list that label refers to

        feature_value = [example[best_feature] for example in data_set]
        values = set(feature_value)
        # print(best_feature_label, values)
        # build a subtree for each distinct value of the chosen feature
        for value in values:
            sub_label = label[:]
            # print(best_feature, value)
            my_tree[best_feature_label][value] = self.create_decision_tree(self.split_data(data_set, best_feature, value), sub_label)

        return my_tree

    # Classify a single test sample by walking down the tree
    def single_test(self, my_tree, testcase, labels):
        # feature at the root of this (sub)tree
        root_key = list(my_tree.keys())[0]
        # all subtrees under the root
        all_child_tree = my_tree[root_key]
        
        # look up the test sample's value for this feature
        feature_index = labels.index(root_key)
        testcase_key = testcase[feature_index]
        # print('-------------------')
        # print(labels)
        # print('root_key: ', root_key, '/all_child_tree: ', all_child_tree, '/feature_index: ', feature_index, '/testcase_key: ', testcase_key)
        # follow the branch that matches the test sample's value
        child_tree = all_child_tree[testcase_key]
        
        # print('root_key: ', root_key, '/all_child_tree: ', all_child_tree, '/testcase_key: ', testcase_key, '/child_tree: ', child_tree)

        if isinstance(child_tree, dict):
            result = self.single_test(child_tree, testcase, labels)
        else:
            result = child_tree

        return result

_DEBUG = True



if __name__ == "__main__":
    FILE_NAME = r'2020\ML\ML_action\\2.DecisionTree\data\bank.csv'
    bp = BankPredict()
    print("data loading...")
    train_size = 11000
    import_data_set, feature_name = bp.import_data(FILE_NAME)
    label = feature_name.copy()

    data_set = import_data_set[0:train_size]

    print("data load over.")
    print('building tree...')
    my_tree = bp.create_decision_tree(data_set, label)
    print('build tree end.')


    if _DEBUG == True:
        # test on a single training sample
        print("test result = ", bp.single_test(my_tree, data_set[2], feature_name))
        print("real result = ", data_set[2][-1])

Summary

References

  1. Machine Learning in Action (机器学习实战)
  2. https://www.kaggle.com/shirantha/bank-marketing-data-a-decision-tree-approach/data
  3. https://github.com/apachecn/AiLearning/blob/master/docs/ml/3.%E5%86%B3%E7%AD%96%E6%A0%91.md
  4. https://blog.csdn.net/colourful_sky/article/details/82056125
  5. https://www.cnblogs.com/starfire86/p/5749328.html


Original post: blog.csdn.net/qq_37753409/article/details/108884162