机器学习之决策树（上）

终于做到了决策树，这是一个很有意思的分类模型

什么是决策树

尽管我一直将决策树理解为一个分类模型，但实际上，他也是可以解决回归问题的，如何理解决策树呢？
我们应该从以下几个方面理解：

1、他是一棵树
2、他是一堆If - then规则的集合
3、定义在特征空间与类空间上的条件概率分布，决策树实际上是将特征空间划分成了互不相交的单元，每个从根到叶的路径对应着一个单元。决策树所表示的条件概率分布由各个单元给定条件下类的条件概率分布组成。实际中，哪个类别有较高的条件概率，就把该单元中的实例强行划分为该类别。

决策树有什么优点

模型具有可解释性，容易向业务部门人员描述。
分类速度快

如何用决策树分类

1、特征选择
2、决策树的生成
3、决策树的剪枝
4、分类

如何进行特征选择

先说为什么要进行特征选择?

我认为，我们要搭建决策树，就要找到一系列的If-then条件，而If-then条件又是依赖于特征空间的，所以我们就要找到真正“管事”的特征。

所以我们该如何找到一个管事的特征呢？

很容易知道，如果我们设计的分类规则带来的结果与随机分类的结果相差不大，那么就说明这个我们选取的特征不管事。

所以在这里，我们引入两个概念：

信息增益
信息增益率
我们要理解这两个概念的话，首先要明白什么是熵：

熵，表示随机变量的不确定性

Entropy

现在，来解释什么是信息增益：
按照上面讲的，理所当然，我们设置分类规则后，这个点属于的类别的不确定性应该或多或少的有所减少，也就是说，设置分类规则后的熵（条件熵）应该比未设置分类规则前的熵要小一些，而小的部分就是信息增益。

information gain

但是呢，信息增益有一些缺点，就是某个类别的样本数如果很大的话，那么他的信息增益一定会很大。
所以，为了解决这种不准确的问题，昆兰（大佬）在ID3算法的基础改进，提出了C4.5，利用信息增益率来解决问题，也就是用信息增益除以对应的条件熵。

information gain ratio

如何生成决策树

1、ID3算法
通过比较信息增益，获得最优的分类规则，然后递归，生成整个决策树。
需要注意的地方：

当训练集（子集）只有一种类别时，生成叶子节点（类别）
包括只有一个，和都是一种的情况。

2、c4.5算法

trick

将一些我在做的时候，遇到的，思考的一些有意思的地方：
1、经验熵与经验条件熵的估计方法，在书上我们可以看到，他简单的使用了频率估计概率的方法估计得到了概率，但是我们知道，这样其实是不够准确的（装逼的），所以理所当然的我们要用极大似然估计啊。
怎么用呢？

首先，我们先来回顾一下，在朴素贝叶斯的搭建过程中，我们用到的极大似然估计是怎么回事？
我们通过贝叶斯公式展开了待求的条件概率，然后有两个概率我们需要去估计一下：

待求概率

我们只说一下后面那个：
首先就是他的实际意义：当类别是Yi时，特征向量/第i维特征值是Xi的概率是多少，显然，对于这个条件概率，我们并不知道该怎么求解，但是利用极大似然估计的想法，我们假设这个条件概率符合参数为thete的某种分布，当然，如果我们想知道这个条件概率的话，我们只需要算出参数theta，然后我们就得到了概率密度函数，剩下的就不用多说了，但是怎么算theta呢？我们就要用到训练数据了，也就是说：X1, X3, X6都是Y1类别的（举例），那么出现X1,X3,X6同时为Y1类别的可能性（似然函数）就非常的大了，所以后面的大家就都知道了，求偏导，等于零，得到theta。

那么，对于决策树的这个问题呢？我们怎么利用极大似然估计解决，其实是一样的，只不过现在我们要求的不是刚才的第二个了，而是第一个了，同样，我们假设这个概率符合某种分布，我们需要求解这个参数，同样的我们将所有的为该类的数据全部带入到似然函数中，使他最大化，这样我们就算出来了这个概率。
下面是我是实现的代码：

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2019-02-02 14:19:55
# @Author  : Vophan Lee ([email protected])
# @Link    : https://www.jianshu.com/u/3e6114e983ad

from sklearn.datasets import make_classification
import numpy as np
import math


class decision_tree(object):
    """
    one step to access my dream
    """

    def __init__(self):
        super(decision_tree, self).__init__()
        self.features = 5
        self.samples = 100
        self.data = make_classification(n_samples=self.samples, n_features=self.features, n_classes=2)
        self.data_0 = []
        self.data_1 = []
        for i in enumerate(self.data[1]):
            if i[1] == 0:
                self.data_0.append(self.data[0][i[0]])
            else:
                self.data_1.append(self.data[0][i[0]])
        self.means_1 = np.mean(self.data_1, axis=0, keepdims=True).reshape(self.features)
        self.std_1 = np.std(self.data_1, axis=0, keepdims=True, ddof=1).reshape(self.features)
        self.means_0 = np.mean(self.data_0, axis=0, keepdims=True).reshape(self.features)
        self.std_0 = np.std(self.data_0, axis=0, keepdims=True, ddof=1).reshape(self.features)
        self.std = [self.std_0, self.std_1]
        self.means = [self.means_0, self.means_1]
        self.empirical_entropy = self.cal_emp_entropy()

    def cal_emp_entropy(self):
        all_p = []
        data_list = [self.data_0, self.data_1]
        for i in range(2):
            p = 1
            for dim in range(self.features):
                for data in data_list[i]:
                    # print(self.std[data_list.index(all_data)][dim])
                    p *= (1 / (pow(2 * math.pi, 0.5) * self.std[i][dim])) * pow(math.e, -(pow((data[dim] - self.means[i][dim]), 2) / 2 * pow(self.std[i][dim], 2)))
            all_p.append(p)

        entropy = 0
        for p in all_p:
            entropy += -p * math.log2(p)
        return entropy

2、我还想说些什么呢？
就是，书本上我们可以发现，给的特征都是离散的，如果我们给几组连续的特征呢？比如西瓜书上的例子，西瓜的含糖率以及密度，我们应该怎么处理呢？

在这里，有一种方法叫做二分法求最佳划分点
首先，我们先对所有数据按照某一个维度排序，然后在将每两个点的中点作为划分点，然后分别计算他们的信息增益，最终确定一个最佳的划分点

下面是我的全部代码：

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2019-02-03 15:17:08
# @Author  : Vophan Lee ([email protected])
# @Link    : https://www.jianshu.com/u/3e6114e983ad

from sklearn.datasets import make_classification
import numpy as np
import math


class Decision_Tree(object):
    """
    this is a class to build the decision tree
    """

    feature_list = []
    gain_list = []
    dim_list = []
    index = 0

    def __init__(self):
        super(Decision_Tree, self).__init__()
        self.features = 5
        self.samples = 100
        self.data = make_classification(
            n_samples=self.samples, n_features=self.features, n_classes=2)
        self.empirical_entropy = self.cal_emp_entropy(self.data)

    def cal_emp_entropy(self, data):
        """
        calculate the empirical entropy
        """
        data_0 = []
        data_1 = []
        for i in enumerate(data[1]):
            if i[1] == 0:
                data_0.append(data[0][i[0]])
            else:
                data_1.append(data[0][i[0]])
        entropy = 0
        for data_ in [data_0, data_1]:
            entropy += - \
                (len(data_) / len(data[0])) * \
                 math.log2(len(data_) / len(data[0]))
        return entropy

    def div_point(self, dim_data):
        """
        decide the divided point of each feature,here we sopposed that dim_data is a continuous dataset
        dim_data: tuple
        """
        def dichotomy(dim_data):
            div_points = np.zeros((1, self.samples)).reshape(self.samples)
            for i in enumerate(dim_data):
                if i[0] == len(dim_data) - 1:
                    break
                div_points[i[0]] = (dim_data[i[0] + 1] + i[1]) / 2
            return div_points
        dim_data = list(dim_data)
        dim_data = np.array(dim_data)
        dim_data = dim_data[:, dim_data[0].argsort()]
        dim_data = tuple(dim_data)
        div_points = dichotomy(dim_data[1])
        information_gain_list = []
        for i in div_points:
            div_index = list(div_points).index(i) + 1
            front = dim_data[1][:div_index]
            behind = dim_data[1][div_index:]
            front_flag = dim_data[0][:div_index]
            behind_flag = dim_data[0][div_index:]
            front_data = (front, front_flag)
            behind_data = (behind, behind_flag)
            if len(front_data[0]) == 1 or ((front_data[1] == front_data[1][::-1]).all() and len(front_data[0]) != len(dim_data[0]) / 2):
                behind_entropy = self.cal_emp_entropy(behind_data)
                information_gain = self.empirical_entropy - \
                    (behind_entropy * (len(behind) / len(dim_data[0])))
                information_gain_list.append(information_gain)
            elif len(behind_data[0]) == 1 or ((behind_data[1] == behind_data[1][::-1]).all() and len(front_data[0]) != len(dim_data[0]) / 2):
                front_entropy = self.cal_emp_entropy(front_data)
                information_gain = self.empirical_entropy - \
                    (front_entropy * (len(front) / len(dim_data[0])))
                information_gain_list.append(information_gain)
            elif (front_data[1] == front_data[1][::-1]).all() and len(front_data[0]) == len(dim_data[0]) / 2:

                return -1, div_points[int(len(dim_data[0]) / 2 - 1)]
            else:
                front_entropy = self.cal_emp_entropy(front_data)
                behind_entropy = self.cal_emp_entropy(behind_data)
                information_gain = self.empirical_entropy - (front_entropy * (len(front) / len(
                    dim_data[0])) + behind_entropy * (len(behind) / len(dim_data[0])))
                information_gain_list.append(information_gain)
        max_information_gain = max(information_gain_list)
        return max_information_gain, div_points[information_gain_list.index(max_information_gain)]

    def compare_features(self):
        """
        here we choose a maximium information gain among all features
        """
        gain_list_tmp = []
        point_list = []
        for i in range(self.features):
            information_gain, div_point = self.div_point((self.data[1], self.data[0].transpose()[i]))
            gain_list_tmp.append(information_gain)
            point_list.append(div_point)
        com_matrix = np.array([
            gain_list_tmp,
            point_list,
            range(self.features)
        ])
        com_matrix = com_matrix[:, com_matrix[0].argsort()]
        Decision_Tree.feature_list = list(com_matrix[1])
        Decision_Tree.gain_list = list(com_matrix[0])
        Decision_Tree.dim_list = list(com_matrix[2])

    def planet_tree(self, data):
        """
        here is the process of planeting the tree
        data: without flag
        """
        feature = Decision_Tree.feature_list[Decision_Tree.index]
        dim = Decision_Tree.dim_list[Decision_Tree.index]
        Decision_Tree.index += 1
        if Decision_Tree.gain_list[Decision_Tree.feature_list.index(feature)] == -1 or Decision_Tree.index >= len(Decision_Tree.feature_list) - 1:
            return tree_node([x for x in data.transpose()[int(dim)] if x < feature],
                             [x for x in data.transpose()[int(dim)] if x > feature],
                             feature)
        else:
            return tree_node(self.planet_tree([x for x in data[0] if x < feature]),self.planet_tree([x for x in data[0] if x > feature]), feature)

class tree_node(object):
    """
    this is the node of the decision tree
    """

    def __init__(self, left, right, data):
        self.left=left
        self.right=right
        self.data=data

这就是今天晚上的全部内容，剪枝以及回归的内容请关注我后续的文章哦

晚上01:28