Selecting text features

In text clustering and categorization tasks, we usually need to extract from the text the features that are valuable for classification learning, rather than use every word, which would cause the curse of dimensionality. Some words contribute little to classification, such as stop words like "the", "was", "at". Below are several common feature selection methods:

Unsupervised methods:

  • TF-IDF

Supervised methods:

  • Chi-square test
  • Information gain
  • Mutual information

1. TF-IDF
An idea that comes to mind easily is to find the words that appear the most times. If a word is important, it should appear many times in the article. So we compute the "term frequency" (TF) of each word.

The result, as you may have guessed, is that the words with the most occurrences are "the", "is", "in" and the like: the most commonly used words. They are called "stop words", words that are no help in finding the result and must be filtered out.

Suppose we filter those out and consider only the remaining meaningful words. We then run into another problem: we may find that the three words "China", "bees", and "farming" appear equally often. Does this mean that, as keywords, they are equally important?

Clearly not. "China" is a very common word, while "bees" and "farming" are comparatively uncommon. If these three words appear the same number of times in an article, it is reasonable to conclude that "bees" and "farming" are more important than "China"; that is, in the keyword ranking, "bees" and "farming" should rank ahead of "China".

So we need an importance adjustment factor that measures whether a word is common. If a word is relatively rare but appears many times in this article, it most likely reflects the character of the article, and it is exactly the keyword we need.

In statistical terms: on top of the term frequency, each word is assigned an "importance" weight. The most common words ("the", "is", "in") get the smallest weight, fairly common words ("China") get a smaller weight, and relatively rare words ("bees", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF), and its size is inversely proportional to how common the word is.

Knowing the term frequency (TF) and the inverse document frequency (IDF) of a word, we multiply the two to get its TF-IDF value. The more important a word is to the article, the larger its TF-IDF value. So the few words at the top of the ranking are the keywords of the article.
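Written out, one common formulation is (the normalization and smoothing details vary between implementations):

    \mathrm{TF}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
    \mathrm{IDF}(t) = \log \frac{N}{1 + |\{d : t \in d\}|}, \qquad
    \text{TF-IDF}(t,d) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t)

where n_{t,d} is the number of times word t occurs in document d, and N is the total number of documents; the +1 in the IDF denominator is one common smoothing choice.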

The advantage of the TF-IDF algorithm is that it is simple and fast, and it matches reality fairly well. The drawback is that it measures a word's importance purely by term frequency, which is not comprehensive enough; sometimes important words do not appear many times. Moreover, the method cannot reflect a word's position: a word that appears early in the text and a word that appears late are treated as equally important, which is not true. (One remedy is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)
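A minimal sketch of that positional remedy, simplified to boosting words that appear in the first sentence (the boost factor, tokenization, and sentence splitting below are arbitrary illustrative choices, not part of the algorithm):

    import re
    from collections import Counter

    def weighted_tf(text, boost=2.0):
        # Count term frequency, crediting words from the first sentence
        # with `boost` instead of 1.0 per occurrence.
        words = re.findall(r"[a-z']+", text.lower())
        first_sentence = re.split(r"[.!?]", text.lower())[0]
        first = set(re.findall(r"[a-z']+", first_sentence))
        tf = Counter()
        for w in words:
            tf[w] += boost if w in first else 1.0
        return tf

    print(weighted_tf("Bees matter here. China farms bees at scale.").most_common(3))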

The TF-IDF strategy can be used for unsupervised learning, since it does not need to know the documents' categories. Note, however, that the same word has different TF-IDF values in different documents; my approach here is to take the top K words of each document and then deduplicate the union.
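A minimal sketch of that take-top-K-then-deduplicate step, assuming scikit-learn is available (the corpus and K below are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative corpus; in practice these are the documents to cluster.
    docs = ["china is a big country with many farms",
            "bee farming is common in rural china",
            "the price of honey depends on bee farming"]
    K = 3                                        # top-K terms kept per document

    vec = TfidfVectorizer(stop_words="english")  # filter stop words first
    tfidf = vec.fit_transform(docs)              # rows: documents, columns: terms
    terms = vec.get_feature_names_out()

    selected = set()                             # set union = deduplication
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:K]            # indices of the K largest values
        selected.update(terms[i] for i in top if row[i] > 0)

    print(sorted(selected))                      # final feature vocabulary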

2. Chi-square test
The chi-square test is a method commonly used in mathematical statistics to test the independence of two variables.

The basic idea of the chi-square test is to judge whether a theory is correct by observing the difference between actual values and theoretical values. Concretely, one first assumes that the two variables really are independent (in jargon, the "null hypothesis"), then measures how far the actual values (also called observed values) deviate from the theoretical values (the values we would expect "if the two really were independent"). If the deviation is small enough, we attribute it to natural sampling error, imprecise measurement, or chance; we conclude the two are indeed independent and accept the null hypothesis. If the deviation is so large that it is unlikely to be caused by chance or inaccurate measurement, we conclude the two are actually related: we reject the null hypothesis and accept the alternative hypothesis.
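As a concrete illustration, count how often a term such as "bees" occurs in documents of each class and run the independence test on the resulting 2×2 contingency table; a sketch with scipy (all counts below are invented):

    from scipy.stats import chi2_contingency

    # Rows: documents containing / not containing the term "bees".
    # Columns: class "farming" / all other classes. Counts are invented.
    observed = [[40, 10],
                [60, 190]]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(chi2, p)    # large chi2 (small p) => reject the null hypothesis:
                      # term occurrence and class are not independent
    print(expected)   # theoretical counts under the null hypothesis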

So how do we measure the degree of deviation? Let E be the theoretical (expected) value and x the observed value. Simply summing the differences x − E over all samples does not work, because positive and negative deviations cancel out; the standard statistic instead sums the squared deviations, each normalized by its expected value:

    \chi^2 = \sum_i \frac{(x_i - E_i)^2}{E_i}

(For a fuller derivation, see https://zhuanlan.zhihu.com/p/28053918.)
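Applied to feature selection, the test is run for each (term, class) pair and the highest-scoring terms are kept. A minimal sketch using scikit-learn's chi2 scorer, on an invented toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    # Invented toy corpus with labels: 1 = "farming", 0 = other.
    docs = ["bees farming in rural china",
            "china trade policy news",
            "honey and bee farming tips",
            "china exports grew this year"]
    labels = [1, 0, 1, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)            # term-count matrix (docs x terms)

    selector = SelectKBest(chi2, k=3)      # keep the 3 highest-scoring terms
    selector.fit(X, labels)

    terms = vec.get_feature_names_out()
    print(terms[selector.get_support()])   # boolean column mask -> kept terms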
