Information entropy, information gain, and information gain ratio

Entropy (Information Entropy)

Entropy is used to assess the purity of a sample set. Given a set of samples, they may belong to many different classes, or they may all belong to a single class; if they span many different classes we say the set is impure, and if they all belong to one class we say the set is pure.
Information entropy is exactly the quantity used to calculate whether a sample data set is pure or impure. It is given by the following formula:
$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$
The meaning of the formula is in fact easy to understand: to calculate the purity of a set, take the proportion $p_k$ of each class in the set (with $k$ running from 1 to $|\mathcal{Y}|$, where $|\mathcal{Y}|$ is the number of classes), multiply it by its logarithm, and add the terms together; the result is the information entropy of the data set, from which we can judge whether the data set is pure. The smaller the entropy, the purer the data set. The minimum entropy is 0, in which case the data set D contains only one class.
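As a concrete illustration, here is a minimal sketch in Python of how this formula might be computed from a list of class labels (the function name and the use of `collections.Counter` are my own choices for illustration, not from the original post):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy Ent(D) of a list of class labels."""
    n = len(labels)
    # p_k is the fraction of samples that belong to class k
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

# A pure set has entropy 0; an even split of two classes has entropy 1.
print(entropy(["good", "good", "good"]))        # 0.0
print(entropy(["good", "bad", "good", "bad"]))  # 1.0
```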

Information Gain

Now let's introduce information gain. Information gain is defined with respect to a specific attribute. For example, suppose the data set D contains two classes, good and bad, and we pick an attribute, say gender, which takes two values, man and woman. If we split the data set D by gender, we obtain two subsets, $D_{man}$ and $D_{woman}$. Each of the two subsets still contains both good and bad samples, so we can compute the purity of each subset after the split. We then take the weighted average of the two entropies, $\frac{|D_{man}|}{|D|}\mathrm{Ent}(D_{man}) + \frac{|D_{woman}|}{|D|}\mathrm{Ent}(D_{woman})$, and compare it with the information entropy $\mathrm{Ent}(D)$ before the split: the former subtracted from the latter is the information gain obtained by splitting the sample set D on the attribute gender. Put plainly, information gain is the increase in purity: it is the drop in information entropy achieved by splitting the original data set on the attribute. The information gain formula is as follows:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$
To explain the parameters of the formula: D is the data set, and a is the chosen attribute, which takes V distinct values in total. Splitting the data set D by these V values yields the subsets $D^1, \dots, D^V$; we compute the information entropy of each of the V subsets and take their weighted average. The difference between $\mathrm{Ent}(D)$ and this weighted average is the information gain.
So what is information gain good for? It can be used to judge whether an attribute is suitable for splitting a data set D: if the information gain is relatively large, the attribute is a good one to split D on; otherwise the attribute is not considered suitable for splitting D. This is helpful for building decision trees.
The well-known ID3 algorithm uses information gain as exactly this criterion for deciding which attribute to split the data set on.
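Here is a minimal sketch of how information gain could be computed on top of the `entropy` helper above (representing attribute a as a list of per-sample values is my own assumption for illustration):

```python
def information_gain(values, labels):
    """Gain(D, a): the entropy drop from splitting `labels` by `values`,
    where values[i] is sample i's value for attribute a."""
    n = len(labels)
    # Group the class labels by attribute value v, giving the subsets D^v.
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Weighted average of subset entropies: sum over v of |D^v|/|D| * Ent(D^v)
    after = sum(len(d) / n * entropy(d) for d in subsets.values())
    return entropy(labels) - after

# Gender splits D into two subsets that are each still half good, half bad,
# so here this attribute yields no gain at all.
gender = ["man", "man", "woman", "woman"]
label  = ["good", "bad", "good", "bad"]
print(information_gain(gender, label))  # 0.0
```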

Information Gain Ratio

Why introduce the information gain ratio as a way to judge splitting attributes? Isn't information gain good enough? In fact, using information gain as the criterion for choosing a splitting attribute has a defect: as the book points out, the information gain criterion has a preference for attributes that take many values; that is, a decision method based on information gain will tend to choose attributes with many values. So what is wrong with choosing an attribute with many values? Take an extreme example: use the ID number as an attribute. Every person's ID number is different, so the attribute has as many values as there are people. If we split the original data set D by this ID-number attribute, D is divided into as many subsets as it has samples, each subset containing a single person. In this extreme case, since one person can belong to only one class, good or bad, the information entropy of every subset is 0; that is, every subset is perfectly pure. The second term of the information gain formula, $\sum_{v=1}^{V}\frac{|D^v|}{|D|}\mathrm{Ent}(D^v)$, is then 0 as a whole, so the computed information gain comes out especially large, and the decision tree would split the original data set D on the ID-number attribute, even though such a split is meaningless. Therefore, to correct the harmful effect of this preference, the information gain ratio was proposed as the criterion for judging splitting attributes.
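This bias is easy to demonstrate with the sketches above: give every sample a unique ID value and the weighted subset entropy collapses to 0, so the gain equals the full entropy of D (continuing the hypothetical example data):

```python
# Every subset D^v holds a single sample, so Ent(D^v) = 0 for all v and the
# gain is maximal, even though splitting on an ID is useless for prediction.
ids = ["001", "002", "003", "004"]
print(information_gain(ids, label))  # 1.0, equal to entropy(label)
```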
The formula for the information gain ratio is as follows:
$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$
where $\mathrm{IV}(a)$ is calculated as follows:
$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$
$\mathrm{IV}(a)$ is called the "intrinsic value" of attribute a. Doesn't its formula look familiar? It is essentially the same as the entropy formula, except that it measures the purity of attribute a itself: if a takes only a small number of values, a's purity is relatively high; conversely, the more values a takes, the lower a's purity and the larger $\mathrm{IV}(a)$ becomes, and therefore the lower the resulting gain ratio.
Using the information gain ratio solves ID3's problem (its preference for attributes with many values, such as a watermelon-color attribute with 10 colors), and the method that uses the information gain ratio to judge whether a splitting attribute is good or bad is called C4.5.
It should be noted that the gain-ratio criterion in turn has a preference for attributes that take fewer values. To address this, C4.5 does not directly select the attribute with the maximum gain ratio as the splitting attribute; it first applies a screening step, weeding out the attributes whose information gain is below the average, and then selects, from the remaining attributes, the one with the highest gain ratio. This way both aspects are taken into account (information gain used in conjunction with the gain ratio).
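Below is a minimal sketch of the gain ratio and of C4.5's two-step heuristic, building on the helpers above (the dictionary-of-attributes layout and the function names are my own assumptions, not from any particular library):

```python
def intrinsic_value(values):
    """IV(a): the entropy of attribute a's own value distribution."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    return information_gain(values, labels) / intrinsic_value(values)

def c45_choose(attributes, labels):
    """C4.5 heuristic: among attributes whose information gain is at least
    average, pick the one with the highest gain ratio."""
    gains = {a: information_gain(v, labels) for a, v in attributes.items()}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg]
    return max(candidates, key=lambda a: gain_ratio(attributes[a], labels))

# color predicts the label perfectly with only two values, so it beats the
# unique-ID attribute: both pass the average-gain screen, but color's gain
# ratio is 1.0 while id's is only 1.0 / log2(4) = 0.5.
color = ["green", "black", "green", "black"]
print(c45_choose({"gender": gender, "color": color, "id": ids}, label))  # color
```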

Author: DawnChau
Source: CSDN
Original: https://blog.csdn.net/u012351768/article/details/73469813
Copyright: This is an original article by the blogger; please include a link to the original post when reproducing it.
