Information entropy in decision trees, with source code for computing the entropy of a data set's class labels

Table of contents

1. Information entropy

① Basic concepts

② Calculation formula

2. Information entropy in decision trees

3. Source code for computing the Shannon entropy of a data set's class labels


Note: I had forgotten this material, so I reviewed the relevant references and wrote this article as a refresher.

Please note that all content in square brackets is purely my own understanding; I ask readers to bear with me.


1. Information entropy

① Basic concepts

Information entropy measures the expected amount of information carried by a random variable [a decision is jointly determined by many random variables, and these random variables are the factors people weigh when deciding anything]. The greater a variable's information entropy, the more situations it contains [that is, this factor is also tied to many other factors, so the sooner it is dealt with, the better the decision; otherwise it keeps causing headaches, because after you have settled the other factors, considering this one forces you to reconsider them all, and the decision is then likely to go wrong]. In other words, more information is needed to fully determine it.

Shannon's view is that information is what eliminates uncertainty. Generally speaking, when a piece of information occurs with higher probability, it indicates that it has been disseminated more widely, or cited to a higher degree.

Shannon was the first to propose the concept of information entropy, which is why "information entropy" and "Shannon entropy" refer to the same thing.


② Calculation formula
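As a reference, the standard definition of Shannon entropy for a random variable X that takes the value x_i with probability p_i (i = 1, ..., n) is:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

In the two-class case (probabilities p and 1 - p), this reduces to H = -p log2(p) - (1 - p) log2(1 - p).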

For multi-class problems the two-class formula extends directly, as the summation above shows. In the decision tree below, the probability of occurrence can be estimated as the number of samples taking a particular value of a feature divided by the total number of samples; that is, frequency stands in for probability. The meaning of p differs from scenario to scenario and can be applied flexibly.

Note: a feature or a category mentioned in this article can also be called a random variable (e.g. income), which can take different values (e.g. high and low). In general, a random variable is a factor that needs to be considered when making a decision.


2. Information entropy in decision trees

In decision trees, information entropy can be used not only to measure the uncertainty of the class labels, but also to measure the uncertainty of the data samples and classes under different features. The greater the information entropy of a feature column vector, the greater that vector's uncertainty [this is easy to understand: the feature column vectors are exactly the factors, mentioned above, that influence a person's decision], that is, the greater its degree of disorder, and the higher the priority that should be given to partitioning on that feature vector [my understanding: the key to building a decision tree is the split attribute, i.e. the feature used as the basis for classification at a given node (the node is where all the data currently sits). The data set arriving at a node is divided according to that node's feature, for example an age feature, and splits into subsets such as a youth set, a middle-aged set, and an old-age set; these subsets then flow to the next feature node and the same logic repeats. The goal is to make every resulting subset as pure as possible, i.e. belonging to a single class, such as the group of people who decide to buy seafood at noon today versus the group who do not. Therefore it is more efficient to partition first on the feature vectors with large information entropy when making decisions].
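To make "as pure as possible" concrete, a standard reference formula (not derived in this article; tree construction is deferred to follow-up articles) for the purity of a split is the weighted entropy of the subsets obtained when a data set D is partitioned by a feature A into subsets D_1, ..., D_V:

$$H(D \mid A) = \sum_{v=1}^{V} \frac{|D_v|}{|D|}\, H(D_v)$$

The smaller this value, the purer the subsets; the reduction H(D) - H(D | A) is the information gain of splitting on A.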

For example, suppose we are predicting whether a customer will buy seafood at noon. The variable [age] contains many situations (of course, the actual amount depends on calculating its information entropy): age is divided into youth, middle age, and old age; whether people in a given age group eat seafood may depend on their income level; within each income group the decision may depend on whether they are students; and among students it may depend on gender. That is a lot of information to untangle. It might be better to split directly on the variable [income]. Information entropy is what lets the decision tree choose the feature best suited for the first split.
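As a purely illustrative calculation (the probabilities are invented for this example, not taken from the data set below): if age took the three values youth, middle age, and old age, each with probability 1/3, its entropy would be -3 * (1/3) log2(1/3) = log2(3) ≈ 1.585 bits, while a two-valued income variable with probabilities 1/2 and 1/2 would have entropy -2 * (1/2) log2(1/2) = 1 bit, so the uniform three-valued variable does indeed "contain more situations" than the uniform two-valued one.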


3. Source code for computing the Shannon entropy of a data set's class labels

Notes:

1. Computing the information entropy of each classification feature, such as income (split into high and low), is more involved: not everyone with a high income eats seafood, and not everyone with a low income avoids it. The code for that is more complicated and will be explained in detail later.

2. The source code and comments below are original. If anything seems inappropriate or unclear, please leave a message in the comments section.


Source code:

from numpy import *


# This source code only computes the information entropy of the sample class labels, purely to illustrate how information entropy is calculated.
# How to compute the entropy of each individual feature and how to construct the decision tree will be covered in detail in follow-up articles.

def calculate_xns(dataset, n=-1):
    """
    Compute the Shannon entropy (information entropy) of the given data set.
    :param dataset: the data set
    :param n: index of the column to use; defaults to -1, i.e. the class label of each sample
    :return: the Shannon entropy of the data set
    """
    xns = 0.0  # Shannon entropy
    num = len(dataset)  # total number of samples, used to compute the probability of each class label

    # 1. Collect the value of column n of every sample (the class labels) into a list
    all_labels = [c[n] for c in dataset]  # c[n] with n = -1: the label of each record, "eat" or "don't eat"
    # print(all_labels)  # gives ["eat", "eat", "don't eat", "don't eat", "don't eat", "don't eat"]
    # 2. Count the occurrences of each label: 2 samples of "eat", 4 samples of "don't eat"
    every_label = {}  # dictionary of each class (key) and its count (value), e.g. {"eat": 2, "don't eat": 4}
    for item in set(all_labels):  # count each class, where set(all_labels) = {"eat", "don't eat"}
        every_label[item] = all_labels.count(item)
    # Compute the Shannon entropy of the sample labels, i.e. the Shannon entropy of the data set
    for item2 in every_label:
        prob = every_label[item2] / num  # probability (frequency) of each label value
        xns -= prob * log2(prob)  # accumulate; the same loop works for any random variable relevant to the decision (e.g. the income feature)
    return xns


# Problem: how do we compute the information entropy of the data set's class labels?

# Below is the known statistical data (more data would be even better):
# Column 1: student (Yes: is a student, No: is not)
# Column 2: gender (Yes: male, No: female)
# Column 3: income (Yes: high, No: low)
# Column 4: the final class label of each sample (eats seafood or does not)

data = [['Yes', 'Yes', 'Yes', "eat"],
        ['No', 'No', 'Yes', "eat"],
        ['No', 'Yes', 'No', "don't eat"],
        ['No', 'No', 'No', "don't eat"],
        ['No', 'Yes', 'No', "don't eat"],
        ['No', 'No', 'No', "don't eat"]]

# Print the information entropy of the data set's class labels
print("Information entropy of the data set's class labels: %s" % (calculate_xns(data, -1)))  # 0.9182958340544896

Output:

Information entropy of the data set's class labels: 0.9182958340544896
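As a sanity check, the printed value matches a hand calculation from the label counts (2 samples of "eat", 4 of "don't eat"):

$$H = -\tfrac{2}{6}\log_2\tfrac{2}{6} - \tfrac{4}{6}\log_2\tfrac{4}{6} \approx 0.9183$$

And since the n parameter selects a column, the same function can also be applied to a single feature column, for example calculate_xns(data, 2) for the income column; note that this gives the entropy of that column's own value distribution, which is not yet the per-feature conditional entropy deferred to the follow-up articles.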


Origin blog.csdn.net/qq_40506723/article/details/127156554