As mentioned earlier, in addition to the chi-square test (CHI), Information Gain (IG) is also a very effective feature selection method. Feature selection always works by first quantifying the importance of each feature and then selecting accordingly, and how that importance is quantified is the biggest difference between the various methods. In the chi-square test, the quantity used is the correlation between the feature and the category: the stronger the correlation, the higher the feature's score, and the more the feature deserves to be retained.

With information gain, importance is measured by how much information the feature brings to the classification system: the more information it brings, the more important the feature is.

So first recall the definition of the amount of information (that is, "entropy") in information theory. Suppose there is a variable X with n possible values $x_1, x_2, \ldots, x_n$, occurring with probabilities $P_1, P_2, \ldots, P_n$. The entropy of X is then defined as:

$$H(X) = -\sum_{i=1}^{n} P_i \log P_i$$
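As a quick check that the formula behaves as described, here is a minimal Python sketch (I assume a base-2 logarithm, which measures information in bits; the text does not fix a base):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log(p)) of a discrete distribution (base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A variable with more, and more evenly spread, possible values carries more information.
print(entropy([0.5, 0.5]))                # 1.0 bit: a fair coin
print(entropy([0.25] * 4))                # 2.0 bits: four equally likely values
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits: almost no uncertainty
```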

The meaning is that the more a variable can vary (note this has nothing to do with the specific values the variable takes, only with how many kinds of values there are and their probabilities of occurring), the more information it carries (which is why I always feel our policies and regulations carry a huge amount of information: they vary a lot and basically change all the time, lol).

For a classification system, the category C is a variable whose possible values are $C_1, C_2, \ldots, C_n$, and the probability of each category appearing is $P(C_1), P(C_2), \ldots, P(C_n)$, where n is the total number of categories. The entropy of the classification system can then be expressed as:

$$H(C) = -\sum_{i=1}^{n} P(C_i) \log P(C_i)$$

Some students say this is hard to grasp, so think of it this way: the job of a text classification system is to output a value indicating which category a text belongs to, and that value may be any of $C_1, C_2, \ldots, C_n$, so the amount of information it carries is exactly the quantity given by the formula above.

Information gain is computed per feature. For a feature t, look at the amount of information in the system with it and without it; the difference between the two is the amount of information the feature brings to the system, that is, the gain. The amount of information when the system contains feature t is easy to compute: it is just the formula above, which gives the information content of the system when all features are included.

The question is how to compute the amount of information when the system does not contain t. Let's look at the problem from a different angle and imagine the following situation: a classroom has many seats, and students may sit wherever they like each time they come to class, so the seating varies a great deal (countless possible arrangements). But now there is one seat from which you can see the blackboard clearly and hear the teacher well, so the daughter of the principal's brother-in-law's sister pulled some strings (quite a convoluted chain of them) and had this seat reserved: every class, only she may sit there, and nobody else. What is the situation now? In terms of the possible seating arrangements, it is easy to see that the following two cases are equivalent: (1) the classroom does not have that seat at all; (2) the classroom has the seat, but nobody else may sit in it (since it no longer participates in the variation, it is constant).

Correspondingly, for our system the following two cases are equivalent: (1) the system does not contain feature t; (2) the system contains feature t, but t is fixed and cannot vary.

So when we compute the amount of information in a classification system that does not contain feature t, we use case (2) instead: we compute the amount of information in the system when feature t is not allowed to vary. This quantity actually has its own name, "conditional entropy", where the condition is, naturally, that "t has been fixed".

But the problems keep coming. Suppose a feature X has n possible values $x_1, x_2, \ldots, x_n$. When computing the conditional entropy we need to fix it, but fix it at which value? The answer is to fix it at each possible value in turn, compute the n resulting values, and take their average as the conditional entropy. And the average is not simply adding them up and dividing by n; it is a weighted average using the probability of each value occurring (intuitively, the more likely a value is to occur, the larger the share contributed by the entropy computed with X fixed at that value).

So there are two expressions for conditional entropy:

$$H(C \mid X = x_i)$$

This denotes the conditional entropy when the feature X is fixed at the value $x_i$;

$$H(C \mid X)$$

This denotes the conditional entropy when the feature X is fixed (note how its meaning differs from the expression above). From the discussion of the weighted average just now, the relationship between the second expression and the first is:

$$H(C \mid X) = \sum_{i=1}^{n} P(x_i)\, H(C \mid X = x_i)$$
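In code, this relationship is just a probability-weighted average of the per-value entropies. A minimal sketch under the same base-2 assumption as before (the function and argument names are mine, not from the original):

```python
import math

def entropy(probs):
    # Same helper as in the earlier sketch: Shannon entropy, base 2.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(value_probs, class_dists_given_value):
    """H(C|X) = sum over i of P(x_i) * H(C | X = x_i)."""
    # value_probs:             [P(x_1), ..., P(x_n)] for the feature's values
    # class_dists_given_value: for each x_i, the distribution [P(C_1|x_i), ..., P(C_m|x_i)]
    return sum(p_x * entropy(dist)
               for p_x, dist in zip(value_probs, class_dists_given_value))

# Example: a binary feature that takes its first value with probability 0.3.
print(conditional_entropy([0.3, 0.7], [[0.9, 0.1], [0.4, 0.6]]))  # ~0.82 bits
```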

Turning to the feature t in our text classification system, how many possible values does t have? Note that t denotes one specific feature, for example the keyword "economy" or "sports". When we ask what values the feature "economy" can take, there are really only two: "economy" either appears or does not appear. In general, t takes only the values t (meaning t appears) and $\bar{t}$ (meaning t does not appear). Note that "the system includes t but t does not appear in a given document" and "the system does not include t at all" are two different things.

Therefore, the conditional entropy of the system with t fixed can now be written down. To distinguish the symbol for "t appears" from the symbol for the feature itself, we write T for the feature and t for "T appears". Then:

$$H(C \mid T) = P(t)\, H(C \mid t) + P(\bar{t})\, H(C \mid \bar{t})$$

Compare this with the formula just above and the meaning is clear, right? P(t) is the probability that T appears, and $P(\bar{t})$ is, naturally, the probability that T does not appear. This formula can be expanded further, where

$$H(C \mid t) = -\sum_{i=1}^{n} P(C_i \mid t) \log P(C_i \mid t)$$

The other half can be expanded to:

$$H(C \mid \bar{t}) = -\sum_{i=1}^{n} P(C_i \mid \bar{t}) \log P(C_i \mid \bar{t})$$

Therefore, the information gain brought by the feature T to the system can be written as the difference between the original entropy of the system and the conditional entropy after fixing the feature T:

$$IG(T) = H(C) - H(C \mid T) = -\sum_{i=1}^{n} P(C_i) \log P(C_i) + P(t) \sum_{i=1}^{n} P(C_i \mid t) \log P(C_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(C_i \mid \bar{t}) \log P(C_i \mid \bar{t})$$

The formula contains many terms, but all of them are easy to compute. For example, $P(C_i)$, the probability that category $C_i$ occurs: you can simply take 1 divided by the total number of categories (this treats every category as equal and ignores their sizes; if you want to account for category size, its effect must be added in). Another example is P(t), the probability that feature T appears: just divide the number of documents in which T appears by the total number of documents. And $P(C_i \mid t)$, the probability of category $C_i$ given that T appears, is just the number of documents that contain T and belong to category $C_i$, divided by the number of documents that contain T.
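Putting the pieces together, here is a sketch of the whole computation from document counts. It is only an illustration under my own assumptions: $P(C_i)$ is estimated from class document counts (the "account for category size" variant mentioned above), the counts themselves are made up, and the function and argument names are not from any particular library:

```python
import math

def entropy(probs):
    # Shannon entropy (base 2), as in the earlier sketches.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(n_docs, docs_per_class, docs_with_t_per_class):
    """IG(T) = H(C) - [ P(t) H(C|t) + P(t_bar) H(C|t_bar) ], estimated from counts.

    n_docs:                total number of documents
    docs_per_class:        {category: number of documents in that category}
    docs_with_t_per_class: {category: number of its documents that contain T}
    """
    n_t = sum(docs_with_t_per_class.values())   # documents containing T
    p_t = n_t / n_docs                          # P(t)
    p_not_t = 1.0 - p_t                         # P(t_bar)

    # H(C): entropy of the category distribution, estimated from counts.
    h_c = entropy([n / n_docs for n in docs_per_class.values()])

    # P(C_i | t): docs containing T that belong to C_i, over docs containing T.
    p_c_given_t = [docs_with_t_per_class.get(c, 0) / n_t
                   for c in docs_per_class] if n_t else []
    # P(C_i | t_bar): docs without T that belong to C_i, over docs without T.
    p_c_given_not_t = [(docs_per_class[c] - docs_with_t_per_class.get(c, 0)) / (n_docs - n_t)
                       for c in docs_per_class] if n_t < n_docs else []

    h_c_given_T = p_t * entropy(p_c_given_t) + p_not_t * entropy(p_c_given_not_t)
    return h_c - h_c_given_T

# Made-up counts: 1000 documents, two categories, the word appears in 300 of them,
# almost always in "sports" documents, so it tells us a fair amount about the category.
print(information_gain(1000,
                       {"sports": 600, "economy": 400},
                       {"sports": 280, "economy": 20}))
```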

From the discussion above you can see that information gain, like the chi-square test, considers both the presence and the absence of a feature, so it is fairly comprehensive and works well. But the biggest problem with information gain is that it can only measure a feature's contribution to the system as a whole, not to any specific category, which makes it suitable only for so-called "global" feature selection (all categories share the same feature set) and not for "local" feature selection (each category gets its own feature set, since a word may be highly discriminative for one category and insignificant for another).

See, the derivation is actually quite simple; there is nothing mysterious about it, right? Yet some academic papers like to write such straightforward things in the most obscure way, as if the author only truly succeeds when the readers fail to understand.

We are a new generation of scholars: if we lack knowledge, we are not afraid of others seeing it, and if we have knowledge, we are not afraid of teaching it to others. So let's keep things simple and say them clearly; only when everyone benefits is it truly good.