Hand-coding a decision tree - detailed principles (1) (Python 3)

Preliminaries

Information entropy

A simple way to think about it, for those who have studied physical chemistry:
entropy is a measure of the disorder of molecular motion; the greater the entropy, the greater the disorder.
By analogy, information entropy measures the disorder (uncertainty) contained in an event.
In mathematical terms:
when an event is unlikely to happen, observing it gives us a large amount of information;
when an event is very likely to happen, observing it gives us only a small amount of information.

Example:
1. Trump is actually a Chinese undercover agent.
2. Trump is zz.
Conclusion:
the measure of information should depend on the probability distribution,
so the entropy h(X) should be a monotonic function of the probability P(X).

Derivation:
When event x is unrelated to event y (x and y are independent events),
the information obtained from observing both is the sum of the information of event x and the information of event y,
that is: h(X, Y) = h(X) + h(Y) ------------ (1)
Similarly, since x and y are independent,
the probability that both occur equals the product of their individual probabilities,
that is: P(X, Y) = P(X) * P(Y).
Taking the logarithm of both sides:
log P(X, Y) = log(P(X) * P(Y)) = log P(X) + log P(Y) ------ (2)
From (1) and (2) we conclude that
h(X) should be a function of log P(X).
Since P(X) lies between 0 and 1, its logarithm is negative, while entropy h(X) is normally expected to be positive,
so we add a minus sign in front of the formula.
As for the base of the logarithm: in computing, base 2 (bits) is common; in machine learning, base e is also typical.
Therefore, the information entropy of a single value xi of the random variable X is:
h(xi) = -log2 P(xi)
Averaging over all values xi that the random variable X can take (i.e. taking the expectation)
gives the information entropy of the whole event:
H(X) = -Σ P(xi) * log2 P(xi)
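As a quick sanity check, take a made-up data set of 5 melons with 2 good and 3 bad:

H = -(2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.529 + 0.442 ≈ 0.971 bits

A perfectly pure data set (all good or all bad) gives H = 0, and a 50/50 split gives H = 1 bit.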
The code follows.
PS:
Event: the entire data set.
Random variable: the classes in the data set.
Probability: the number of occurrences of a particular class divided by the total number of instances in the data set.
Suppose our data set consists of watermelon attribute values, with the last column holding the final judgement.
We want to find, for each result in the last column, its proportion over the whole data set,
i.e. the probability of a good melon and the probability of a bad melon.

from math import log

#  Measure the disorder of the data set (compute the Shannon entropy)
def calcShannonEnt(dataSet):  # calculate Shannon entropy
    numEntries = len(dataSet)  # number of instances (entries) in the data set
    labelCounts = {}  # new empty dictionary
    for featVec in dataSet:  # iterate over the instances
        currentLabel = featVec[-1]  # currentLabel holds the last column, which is the final judgement
        if currentLabel not in labelCounts.keys():  # if this label is not yet in the dictionary
            labelCounts[currentLabel] = 0  # extend the dictionary, setting the count for currentLabel to 0
        labelCounts[currentLabel] += 1  # increment the count, recording how often this class appears
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # probability of this class within the data set
        shannonEnt += -prob * log(prob, 2)  # Shannon entropy formula: the expected value of the information over all classes
    return shannonEnt
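A minimal usage sketch; the toy data set myDat below is made up for illustration, with two attribute columns and the good/bad judgement in the last column:

myDat = [[1, 1, 'yes'],
         [1, 1, 'yes'],
         [1, 0, 'no'],
         [0, 1, 'no'],
         [0, 1, 'no']]  # hypothetical toy data set: 2 'yes' and 3 'no'

print(calcShannonEnt(myDat))  # ≈ 0.971, matching the hand calculation above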

Information gain

The more disordered the information in a data set, the harder it is for us to deal with it. In general, when we take on a task, the information it contains is often chaotic, so we want to reduce the entropy: increase the purity of the information, make the useful information more orderly, and rule out the invalid information.
The resulting change in information entropy, i.e. the change in the purity of the information, is called the information gain.
Since we generally want the entropy to decrease, the information gain is usually greater than 0:
information gain = entropy before splitting - entropy after splitting
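A small worked example, reusing the made-up 5-row data set from the sketch above (entropy before splitting ≈ 0.971). Suppose splitting on the first attribute produces one subset with 2 'yes' and 1 'no' (entropy ≈ 0.918) and one subset with 2 'no' (entropy 0). Then:

entropy after splitting = (3/5) * 0.918 + (2/5) * 0 ≈ 0.551
information gain = 0.971 - 0.551 ≈ 0.420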

The first step: split the data set (group together data of the same kind)

#  Split the data set on a feature: remove the matching value and return the data set assembled from all rows that contained that value
def splitDataSet(dataSet, axis, value):  # inputs: the data set to split, axis (the attribute column), and value (the feature value to split on)
    retDataSet = []  # create a new list object so the original data set is not modified
    for featVec in dataSet:
        if featVec[axis] == value:  # compare the feature in column axis of each row with value; if equal, remove it
            # in other words, for every row whose axis-th column equals value, delete that column
            reducedFeatVec = featVec[:axis]  # columns 0 .. axis-1
            reducedFeatVec.extend(featVec[axis + 1:])  # columns axis+1 .. end, concatenated with the part above
            retDataSet.append(reducedFeatVec)  # becomes [[reducedFeatVec1], [reducedFeatVec2], [reducedFeatVec3]]
    return retDataSet  # holds every row that contained value (with value removed); rows without value are left out
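A quick usage sketch, again with the made-up myDat from above:

print(splitDataSet(myDat, 0, 1))  # rows whose column 0 equals 1, with that column removed
# [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))  # rows whose column 0 equals 0, with that column removed
# [[1, 'no'], [1, 'no']]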

The second step: import the data set and format it

In this step we import the data, strip unwanted whitespace, split each line on commas, and convert it into a list.

def file2matrix(filename):
    fr = open(filename)
    lists = fr.readlines()  # read all lines of the file
    listnum = []
    for k in lists:
        listnum.append(k.strip().split(','))  # strip whitespace/newlines and split each line on commas
    fr.close()
    return listnum  # list of rows, each row being a list of attribute values
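A usage sketch under the assumption of a hypothetical comma-separated file watermelon.txt (the file name and its contents are invented purely for illustration):

# watermelon.txt is assumed to contain one instance per line, with the
# good/bad judgement in the last column, e.g.
#   green,heavy,dull,yes
#   pale,light,crisp,no
dataSet = file2matrix('watermelon.txt')
# dataSet == [['green', 'heavy', 'dull', 'yes'], ['pale', 'light', 'crisp', 'no'], ...]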

The third step: choose the best way to split the data set

Suppose each instance in the data set has five attributes, i.e. five attribute values:
for a watermelon, say colour, size, texture, hardness, and stripes.
We try each attribute in turn as the way to split the data set, compute how much the entropy is reduced, and take the split that reduces the entropy the most, i.e. the one with the largest information gain, as the first split.

#  Choose the best way to split the data set
'''
 dataSet = [[1,2,3],[4,2,6,7],[8,3,2,11]]
     for fc in dataSet:
         if fc[1] == 2:
             print(fc[:1],fc[2:],"!")
     for i in range(3):
         featlist = [example[i] for example in dataSet]
         print(featlist)
[1] [3] !
[4] [6, 7] !
[1, 4, 8]
[2, 2, 3]
[3, 6, 2]

'''
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)  # Shannon entropy of the whole data set
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        # iterate over every row of the data set and pull out the i-th element of each row; each column i represents one attribute
        featList = [example[i] for example in dataSet]  # collect the values of attribute i across all rows into one list
        uniqueVals = set(featList)  # set removes duplicate values
        newEntropy = 0.0
        for value in uniqueVals:  # split the data set once for every distinct value of this attribute
            subDataSet = splitDataSet(dataSet, i, value)  # the subset obtained for this particular value of attribute i
            # in decision-tree terms: pick a test (feature value) and see whether the entropy of the remaining data decreases
            prob = len(subDataSet) / float(len(dataSet))  # fraction of all rows whose attribute i equals value
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy of this way of splitting
        infoGain = baseEntropy - newEntropy  # information gain of splitting on attribute i
        if infoGain > bestInfoGain:  # keep the largest information gain
            bestInfoGain = infoGain
            bestFeature = i  # remember the best splitting attribute
    return bestFeature  # return after all features have been tried
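Tying it together with the made-up myDat from the earlier sketches:

print(chooseBestFeatureToSplit(myDat))  # 0: splitting on the first attribute gives the largest gain (≈ 0.420)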