Use of Information Entropy and Information Gain in Decision Tree Generation

    A decision tree is a machine learning model that, for a given data set, builds a tree-structured decision mechanism based on the relevant attributes (features).

    In principle such a tree can be built arbitrarily: as long as each fork branches on the values of some feature and all features are eventually used, the result is a decision tree. To build a good decision tree, however, we need to choose a suitable feature for the root node.

    One algorithm for selecting the root node is ID3, which chooses the splitting feature according to information gain.

     The definition of information entropy: Shannon introduced the concept of entropy as a measure of the uncertainty of a random variable.

     From this description, the uncertainty is ultimately a matter of probability, and the entropy formula combines these probabilities into a single number. Suppose X is a discrete random variable taking finitely many values, with probability distribution:

P(X=x_{i}) = p_{i}, i = 1,2,3,...,n

, then the entropy of a random variable X is defined as:

Ent(X) = -\sum_{i=1}^{n}p_{i} log_{2}p_{i}

    This formula may look a bit strange at first. Entropy should come out as a number greater than or equal to 0, so why is there a minus sign? The reason is that each probability p_{i} lies between 0 and 1, and the logarithm of a number in that range is less than or equal to 0, so each term p_{i} log_{2} p_{i} is non-positive.

    The minus sign therefore just turns a non-positive sum into a non-negative one, so the final result is never negative.

    Entropy only describes the uncertainty of the information: the larger the entropy, the greater the uncertainty and the more dispersed the sample distribution; the smaller the entropy, the smaller the uncertainty and the more concentrated the samples.
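
    To make the formula concrete, here is a minimal sketch (the helper name entropy is only for this illustration and is not part of the full code listing later in the article) showing that a more dispersed distribution yields a larger entropy:

from math import log2


def entropy(probs):
    # Ent(X) = -sum(p * log2(p)); terms with p == 0 are skipped,
    # following the convention that 0 * log2(0) = 0.
    return -sum(p * log2(p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))   # 1.0    -> most uncertain two-class distribution
print(entropy([0.9, 0.1]))   # ~0.469 -> more concentrated, lower entropy
print(entropy([1.0]))        # 0.0    -> completely certain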

     For example, consider the entropy of three sample distributions, where each sample is either red or blue:

    1. All samples are one color, so the entropy is 0.

    2. If a few red samples are mixed in so that red and blue occur in a 1:3 ratio, the entropy is 0.811.

    3. Red and blue samples are equally common, each with probability 50%, so the entropy is 1.

    The value of entropy depends only on the distribution of the sample outcomes, not on the feature values.
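
    The three values can be checked with a short snippet (the exact sample counts are not stated in the text, so a total of 4 samples is assumed in each case; only the ratios matter):

from math import log2


def entropy_from_counts(counts):
    # Entropy of a label distribution given as raw counts; zero counts are skipped.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


print(round(entropy_from_counts([4]), 3))     # case 1: one color only   -> 0.0
print(round(entropy_from_counts([1, 3]), 3))  # case 2: 1 red, 3 blue    -> 0.811
print(round(entropy_from_counts([2, 2]), 3))  # case 3: 2 red, 2 blue    -> 1.0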

    The definition of information gain: literally, it is a difference, namely a reduction in entropy. This difference is tied to a feature and its values: each feature value defines a weight, the proportion of all samples that take that value, which is itself another kind of probability.

     The definition is as follows. Assume that feature a has V possible values, i.e. branches: \{a^{1}, a^{2}, ..., a^{V}\}. If a is used to split the data, V branches are generated. The vth branch contains the samples in X whose value of feature a is a^{v}, denoted X^{v}, and its entropy Ent(X^{v}) can be computed from the definition of information entropy above. Since the branches contain different numbers of samples, each branch is given the weight \frac{\left | X^{v} \right |}{\left | X \right |}. The information gain of splitting the data set X on feature a can then be calculated as:

Gain(X,a) = Ent(X) - \sum_{v=1}^{V}\frac{\left | X^{v} \right |}{\left | X \right |}Ent(X^{v})

    Information gain measures how much splitting on feature a improves the purity of the whole sample. The greater the improvement, the better the feature, so when building a decision tree we give priority to the feature with the largest gain. After the current feature is selected, it is removed from consideration, and the remaining features are used to make further splits until all features have been used.

    Let's take a look at how to choose a good root node based on a specific example.

     Consider a bank deciding whether to grant a loan based on the applicant's age, work, house ownership, and loan status. The data set contains 15 samples (9 granted, 6 refused); the raw samples are listed in the create_datasets function in the code further below, and the counts used in the calculations are derived from them.

    We now apply the definitions of information entropy and information gain above to compute the relevant quantities.

    Overall information entropy: this only requires the probabilities of "yes" (9/15) and "no" (6/15) among the samples.

    Ent(X) = -\frac{6}{15}log_{2}\frac{6}{15} - \frac{9}{15}log_{2}\frac{9}{15} =  0.971

    Information Gain:

     Gain(X, age) = Ent(X) - \sum_{v=1}^{V}\frac{\left | X^{v} \right |}{\left | X \right |}Ent(X^{v}) = \\ \\ 0.971 - \frac{5}{15}(-\frac{2}{5}log_{2}\frac{2}{5}-\frac{3}{5}log_{2}\frac{3}{5}) - \\ \\ \frac{5}{15}(-\frac{3}{5}log_{2}\frac{3}{5}-\frac{2}{5}log_{2}\frac{2}{5}) - \\ \\ \frac{5}{15}(-\frac{4}{5}log_{2}\frac{4}{5}-\frac{1}{5}log_{2}\frac{1}{5}) = 0.083

     Gain(X, job) = Ent(X) - \sum_{v=1}^{V}\frac{\left | X^{v} \right |}{\left | X \right |}Ent(X^{v}) = \\ \\ 0.971 - \frac{5}{15}(-\frac{5}{5}log_{2}\frac{5}{5}-\frac{0}{5}log_{2}\frac{0}{5}) - \\ \\ \frac{10}{15}(-\frac{4}{10}log_{2}\frac{4}{10}-\frac{6}{10}log_{2}\frac{6}{10}) = 0.324

     Gain(X, property) = Ent(X) - \sum_{v=1}^{V}\frac{\left | X^{v} \right |}{\left | X \right |}Ent(X^{v}) = \\ \\ 0.971 - \frac{6}{15}(-\frac{6}{6}log_{2}\frac{6}{6}-\frac{0}{6}log_{2}\frac{0}{6}) - \\ \\ \frac{9}{15}(-\frac{3}{9}log_{2}\frac{3}{9}-\frac{6}{9}log_{2}\frac{6}{9}) = 0.420

     Gain(X, loan status) = Ent(X) - \sum_{v=1}^{V}\frac{\left | X^{v} \right |}{\left | X \right |}Ent(X^{v}) = \\ \\ 0.971 - \frac{5}{15}(-\frac{1}{5}log_{2}\frac{1}{5}-\frac{4}{5}log_{2}\frac{4}{5}) - \\ \\ \frac{6}{15}(-\frac{4}{6}log_{2}\frac{4}{6}-\frac{2}{6}log_{2}\frac{2}{6}) - \\ \\ \frac{4}{15}(-\frac{4}{4}log_{2}\frac{4}{4}-\frac{0}{4}log_{2}\frac{0}{4}) = 0.363

    (Terms of the form \frac{0}{n}log_{2}\frac{0}{n} are taken to be 0 by convention, so a pure branch contributes zero entropy.)

    The calculation process above can be reproduced with the following code:

from math import log2


def create_datasets():
    datasets = [[0, 0, 0, 0, 'no'],
                [0, 0, 0, 1, 'no'],
                [0, 1, 0, 1, 'yes'],
                [0, 1, 1, 0, 'yes'],
                [0, 0, 0, 0, 'no'],
                [1, 0, 0, 0, 'no'],
                [1, 0, 0, 1, 'no'],
                [1, 1, 1, 1, 'yes'],
                [1, 0, 1, 2, 'yes'],
                [1, 0, 1, 2, 'yes'],
                [2, 0, 1, 2, 'yes'],
                [2, 0, 1, 1, 'yes'],
                [2, 1, 0, 1, 'yes'],
                [2, 1, 0, 2, 'yes'],
                [2, 0, 0, 0, 'no']]
    labels = ['F-Age', 'F-Work', 'F-House', 'F-Loan', 'Target']
    return datasets, labels


def calc_shannon_entropy(datasets):
    # Empirical entropy of the class label (the last column of each sample).
    data_len = len(datasets)
    label_count = {}
    for i in range(data_len):
        label = datasets[i][-1]
        if label not in label_count:
            label_count[label] = 0
        label_count[label] += 1
    entropy = -sum([(p / data_len) * log2(p / data_len) for p in label_count.values()])
    return entropy


def cal_condition_entropy(datasets, axis=0):
    # Weighted entropy after splitting on the feature in the given column (axis):
    # sum over feature values of (|X^v| / |X|) * Ent(X^v).
    data_len = len(datasets)
    feature_sets = {}
    for i in range(data_len):
        feature = datasets[i][axis]
        if feature not in feature_sets:
            feature_sets[feature] = []
        feature_sets[feature].append(datasets[i])
    condition_entropy = sum([(len(p) / data_len) * calc_shannon_entropy(p) for p in feature_sets.values()])
    return condition_entropy


def info_gain(entropy, condition_entropy):
    # Information gain = overall entropy minus the conditional (weighted) entropy.
    return entropy - condition_entropy


def info_gain_train(datasets, labels):
    # Compute the information gain of every feature and return the name of the best one.
    count = len(datasets[0]) - 1
    entropy = calc_shannon_entropy(datasets)
    best_feature = []
    for i in range(count):
        info_gain_i = info_gain(entropy, cal_condition_entropy(datasets, axis=i))
        best_feature.append((i, info_gain_i))
        print('feature : {},info_gain : {:.3f}'.format(labels[i], info_gain_i))
    best_ = max(best_feature, key=lambda x: x[-1])
    return labels[best_[0]]


if __name__ == '__main__':
    datasets, labels = create_datasets()
    ent = calc_shannon_entropy(datasets)
    print('entropy : {}'.format(ent))
    feature = info_gain_train(datasets, labels)
    print('best feature : {}'.format(feature))

    Output:
entropy : 0.9709505944546686
feature : F-Age,info_gain : 0.083
feature : F-Work,info_gain : 0.324
feature : F-House,info_gain : 0.420
feature : F-Loan,info_gain : 0.363
best feature : F-House 

   In the overall process of decision tree generation, the steps above are only the beginning: they find the most suitable root node. We then need to continue recursively, choosing suitable split nodes for each branch from the remaining features.
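
    As a rough illustration of that recursion, the sketch below reuses create_datasets, calc_shannon_entropy, cal_condition_entropy and info_gain from the listing above; it is an ID3-style builder written for this article's data format, not the only way to continue.

def split_dataset(datasets, axis, value):
    # Keep the samples whose feature at `axis` equals `value`, and drop that column.
    return [row[:axis] + row[axis + 1:] for row in datasets if row[axis] == value]


def build_tree(datasets, labels):
    targets = [row[-1] for row in datasets]
    # Stop if all samples share one class, or if no features are left to split on.
    if targets.count(targets[0]) == len(targets):
        return targets[0]
    if len(datasets[0]) == 1:
        return max(set(targets), key=targets.count)
    # Pick the feature with the largest information gain, exactly as above.
    base_entropy = calc_shannon_entropy(datasets)
    gains = [info_gain(base_entropy, cal_condition_entropy(datasets, axis=i))
             for i in range(len(datasets[0]) - 1)]
    best = gains.index(max(gains))
    tree = {labels[best]: {}}
    sub_labels = labels[:best] + labels[best + 1:]
    # Recurse on each branch with the chosen feature removed from the data and labels.
    for value in set(row[best] for row in datasets):
        tree[labels[best]][value] = build_tree(split_dataset(datasets, best, value), sub_labels)
    return tree


# Example usage (pass the feature labels only, without the 'Target' column name):
# datasets, labels = create_datasets()
# print(build_tree(datasets, labels[:-1]))
# For this data set the tree splits on F-House first, then on F-Work for the houseless branch.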
