Machine Learning Basics - TF-IDF in Natural Language Processing, Explained in Detail

This article first appeared on my personal public account: TechFlow. Original content is not easy to produce, so please follow me for more.


Today we are going to talk about a simple but famous algorithm in text analysis: TF-IDF. It is an important method in natural language processing, and it is so well known that, even though I do not work in NLP, I have still been asked about it several times in interviews, which shows how important the algorithm is.

Fortunately, the algorithm itself is not difficult. The name may look intimidating, but once you understand the principles behind it everything falls into place, and you will no longer worry about forgetting it in an interview. Without further ado, let's get started.


Algorithm theory


The name TF-IDF has a hyphen in the middle, and TF and IDF are clearly not somebody's first and last name; the name itself tells us that the algorithm is made up of two parts, TF and IDF. Let's look at the TF part first.


TF explanation

TF stands for Term Frequency. "Frequency" is easy to understand: how often something occurs. "Term" is harder to translate directly; in this context it simply refers to a word or phrase in the text. Put together, Term Frequency is the frequency of a phrase. Once you know what the name means, you can pretty much guess the rest.

The idea is very simple and very literal: the importance of a word within an article is related to the frequency with which it appears there.

This view is quite intuitive. For example, if we search the web for "TechFlow" and a result page does not contain "TechFlow" even once, the search quality is obviously poor. If a page contains "TechFlow" many times, it is far more likely to match the query correctly, and that page is probably what we want.

Frequency can also reflect the importance of a word. If one Term appears much more often than another within the same text, then under normal circumstances it is clearly more important.

It is said that early search engines actually used this strategy: for each page, they measured how frequently the user's search keywords appeared in its text. Pages with higher keyword frequencies tended to be ranked nearer the front, and top-ranked pages attract a lot of traffic. Driven by this incentive, more and more pages started stuffing popular search keywords into their content in order to gain higher rankings and more traffic. I believe many of us have had a similar experience: you type a keyword into a search engine, the result claims to be a match, but when you actually click through there is nothing relevant, or just a screen full of ads.

In the early days of the Internet a large number of pages made a living this way, by cramming in popular search keywords. This also spawned a whole profession, SEO (search engine optimization), dedicated to using various tricks to push pages up the rankings of the major search engines.

Search engine engineers soon discovered this problem, and it was precisely to solve it that the concept of IDF was introduced.


IDF concept


IDF stands for Inverse Document Frequency. The concept is hard to translate elegantly and hard to explain in one plain sentence, which is why we usually just use the abbreviation. The idea it expresses is nonetheless very simple: the more widespread a Term is, the less important it is. In other words, how widely a Term appears is inversely related to its importance.

For example, the most common words such as "the", "a" and "is" certainly appear widely across all kinds of articles, whereas phrases such as "search" or "machine learning" appear in far fewer of them. Clearly, for a search engine or any other model, the words that appear in fewer documents carry more reference value, because they often point more precisely at what a document is about. So IDF can be loosely understood as the reciprocal of how widespread a term is, and its definition is simple:

\[\displaystyle idf_i=\log\frac{|D|}{1 + |\{j:t_i \in d_j \}|}\]

Here \(|D|\) is the total number of documents, \(t_i\) is the i-th phrase, and \(|\{j: t_i \in d_j\}|\) is the number of documents that contain the i-th phrase. To prevent the denominator from ever being zero, we add the constant 1 to it.
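As a quick worked example with made-up numbers: suppose the corpus contains 1000 documents and a phrase appears in 99 of them, then

\[idf = \log\frac{1000}{1 + 99} = \log 10 \approx 2.3\]

whereas a phrase that appears in every one of the 1000 documents gets \(\log\frac{1000}{1001} \approx 0\), i.e. essentially no weight.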

Similarly, we can write the formula for TF:

\[TF(t_i) = \frac{TF_i}{TN_d}\]

The denominator \(TN_d\) is the total number of Terms in document \(d\), and the numerator \(TF_i\) is the number of times \(Term_i\) appears in that document.

Comparing these two concepts, we can see that TF measures the relationship between a phrase and a single document, while IDF measures the relationship between a phrase and the whole collection of documents. In other words, the former measures how important a phrase is for one particular document, and the latter measures how important it is across all documents. It is a bit like a local view versus a global view: multiplying the two gives an importance score for a Term that takes both into account. That is exactly what TF-IDF is, an algorithm for computing the importance of a phrase within a document.

The TF-IDF algorithm itself is therefore very simple: we just multiply the TF value by the IDF value.
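Written out, the score for phrase \(t_i\) in document \(d\) is simply

\[tfidf_{i,d} = TF(t_i) \times idf_i\]

With the made-up numbers from the example above, a phrase that accounts for 3 of the 100 Terms in a document and has an idf of about 2.3 gets a score of \(\frac{3}{100} \times 2.3 \approx 0.069\).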

Once we understand the principle, we can write a TF-IDF implementation by hand. It is not complicated; the whole thing is under 40 lines:

import math
from collections import Counter


class TFIdfCalculator:

    # Initialization
    def __init__(self, text=[]):
        # Custom text preprocessing: stop-word filtering, tokenization, normalization, etc.
        self.preprocessor = SimpleTextPreprocessing()
        # Be tolerant of a user passing a single piece of text instead of a list
        if isinstance(text, list):
            rows = self.preprocessor.preprocess(text)
        else:
            rows = self.preprocessor.preprocess([text])

        self.count_list = []
        # Use Counter to count term occurrences in each document
        for row in rows:
            self.count_list.append(Counter(row))

    # fit interface: (re)initialize with new text
    def fit(self, text):
        self.__init__(text)

    # Term frequency: occurrences of the word divided by the total number of terms
    # Call after initialization
    def tf(self, word, count):
        return count[word] / sum(count.values())

    # Number of documents that contain the word
    def num_containing(self, word):
        return sum(1 for count in self.count_list if word in count)

    # idf: log(number of documents / (1 + number of documents containing the word))
    def idf(self, word):
        return math.log(len(self.count_list) / (1 + self.num_containing(word)))

    # tf-idf: tf * idf
    def tf_idf(self, word, count_id):
        if isinstance(count_id, int) and count_id < len(self.count_list):
            return self.tf(word, self.count_list[count_id]) * self.idf(word)
        else:
            return 0.0

SimpleTextPreprocessing is a text preprocessing class I wrote myself, covering basic operations such as tokenization, stop-word removal, and part-of-speech normalization. These were covered in an earlier article on the Naive Bayes classifier; interested readers can follow the link below.

Machine Learning Basics - Naive Bayes Text Classification in Code
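Since SimpleTextPreprocessing itself is not listed in this article, here is a minimal stand-in, just enough to make the snippet above runnable on its own: it lower-cases the text, splits on whitespace, and drops words from a tiny hand-picked stop-word list. This is only a sketch; the real class also performs part-of-speech normalization and uses a proper stop-word list.

# Minimal stand-in for the SimpleTextPreprocessing class used above.
# The stop-word list is illustrative only.
class SimpleTextPreprocessing:

    STOP_WORDS = {'a', 'an', 'the', 'is', 'are', 'there', 'until'}

    def preprocess(self, texts):
        rows = []
        for text in texts:
            # lower-case, split on whitespace, drop stop words
            tokens = [w for w in text.lower().split() if w not in self.STOP_WORDS]
            rows.append(tokens)
        return rows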

Let's test the code:

tfidf = TFIdfCalculator()
tfidf.fit(['go until jurong', 'point craze go', 'cine there got amore', 'cine point until'])
print(tfidf.tf_idf('jurong', 0))
print(tfidf.tf_idf('go', 0))

We pass in a few meaningless test sentences and compute the importance of the words "go" and "jurong" within the first sentence. By the definition of TF-IDF, "go" appears in both the first and the second sentence, so it occurs in more documents and its idf is smaller; since the two words have the same term frequency in the first sentence, "jurong" should end up with the larger TF-IDF.

The final results match our expectations: the TF-IDF of "jurong" is 0.345, while the TF-IDF of "go" is 0.143.
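As a sanity check, we can reproduce the two numbers by hand. Assuming the preprocessor drops "until" as a stop word (an assumption about the stop-word list, which is not shown here), the first sentence reduces to ['go', 'jurong'], and:

import math

# 'jurong' appears in 1 of the 4 sentences, 'go' appears in 2 of them
tf = 1 / 2                          # each word is one of the two remaining terms
idf_jurong = math.log(4 / (1 + 1))  # log(2)   ~= 0.693
idf_go = math.log(4 / (1 + 2))      # log(4/3) ~= 0.288

print(tf * idf_jurong)  # ~0.345
print(tf * idf_go)      # ~0.143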


Deep Thoughts


We now understand the principle of TF-IDF and the code works, so it looks like a success, but there is actually a key point we have glossed over. One thing is rather strange: why do we take the logarithm of the document ratio when computing the idf value? Yes, taking the log makes the resulting values look more reasonable and better distributed, but that is a consequence, not a cause. What is the principled reason for the log?

In fact, when TF-IDF first appeared, nobody had thought much about this; you could say the formula was pulled out of thin air. Only later did an explanation based on Shannon's information theory appear, and then everything suddenly made sense.

In an earlier article deriving cross entropy, we discussed the fact that if an event A occurs with probability \(P(A)\), the amount of information it carries is \(-\log(P(A))\), which depends on how likely the event is. In other words, the smaller the probability of an event, the more information it carries. The appearance of the log here is no mystery: the essence of information theory is to quantify information, and the unit of that quantification is the bit. As we all know, one bit can represent the two values 0 and 1, that is, two possible messages. As the number of bits increases, the amount of information we can express also increases, not linearly but exponentially.

Here is a simple but classic example. Thirty-two teams reach the World Cup finals, and only one of them can win. Suppose in the end the French or the Spanish team wins; when we hear the news we are not surprised. But if the winner turns out to be the Japanese team, most people would probably be astonished. The reason lies in the amount of information: although on the face of it all 32 teams are equal and each has the same chance of winning, in reality each team's probability of winning is different.

Suppose the probability of a powerhouse such as France or Spain winning is 1/4; since \(-\log_2(\frac{1}{4}) = 2\), we only need 2 bits to express that outcome. Suppose the probability of the Japanese team winning is 1/128; then we need 7 bits to express it, which is obviously a much larger amount of information.
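The same arithmetic in a couple of lines of Python (the winning probabilities are of course invented for the example):

import math

print(-math.log2(1 / 4))    # 2.0 bits for a favourite such as France or Spain
print(-math.log2(1 / 128))  # 7.0 bits for the long-shot Japanese team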

At this point everything should be clear: taking the log of a quantity of information essentially gives the corresponding number of bits. The number of bits grows linearly while the amount of information grows exponentially, so the log converts an exponential-scale quantity of information into a linear number of bits. For most models, linear features are easier to fit, and this is the essential reason why TF-IDF works so well.

Finally, let's explain idf from the information theory point of view. Suppose there are \(2^{30}\) documents on the whole Internet. A user now searches for "Sino-US trade war". The number of documents containing the words "China" or "US" is \(2^{14}\), so the amount of information carried by each of these two words is \(\log_2(\frac{2^{30}}{2^{14}}) = 16\) bits, while the number of documents containing the phrase "trade war" is only \(2^{6}\), so the information carried by that phrase is \(\log_2(\frac{2^{30}}{2^{6}}) = 24\) bits. Clearly "trade war" carries much more information than "China" or "US", so it should play a bigger role when ranking documents.
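The same calculation in code, using a base-2 log so the result is measured in bits (all document counts here are the made-up numbers from the example):

import math

total_docs = 2 ** 30
docs_with_china_or_us = 2 ** 14   # documents containing 'China' or 'US'
docs_with_trade_war = 2 ** 6      # documents containing 'trade war'

print(math.log2(total_docs / docs_with_china_or_us))  # 16.0 bits
print(math.log2(total_docs / docs_with_trade_war))    # 24.0 bits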

If you can explain the principles of TF-IDF from the perspective of information theory, rather than merely knowing how the formula works, then I would say you have truly mastered this piece of knowledge, and when it comes up in an interview you will naturally be able to handle it with ease.

That's all for today's article. If you got something out of it, please take a second to scan the QR code and follow; your small gesture matters a lot to me.
