TF-IDF (term frequency-inverse document frequency) algorithm

I. Introduction

  1. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval.

  2. TF-IDF is a statistical method used to evaluate how important a word is to a particular document in a document collection or corpus.

  3. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as its frequency across the corpus increases.

II. Term frequency

  It refers to the number of times a given word appears in a given document. This count is usually normalized to prevent a bias toward longer documents (the same word may have a higher raw count in a long document than in a short one, regardless of whether the word is actually important).

  Formula:

    tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

  n_{i,j} is the number of times the word t_i appears in document d_j, and the denominator Σ_k n_{k,j} is the sum of the counts of all words in document d_j.
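
  As a quick illustration of the normalization, here is a minimal sketch in Python (the sample document below is a made-up example, not data from this post):

    from collections import Counter

    def term_frequency(document_tokens):
        """Normalized term frequency: each word's count divided by the total number of tokens."""
        counts = Counter(document_tokens)
        total = len(document_tokens)
        return {word: count / total for word, count in counts.items()}

    # "the" and "cat" each appear twice out of five tokens, so their tf is 0.4
    doc = ["the", "cat", "sat", "the", "cat"]
    print(term_frequency(doc))  # {'the': 0.4, 'cat': 0.4, 'sat': 0.2}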

III. Inverse document frequency

  It is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents that contain the word, and then taking the logarithm of the resulting quotient.

  Formula:

    idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

  |D|: the total number of documents in the corpus

  |{ j : t_i ∈ d_j }|: the number of documents that contain the word t_i (i.e. documents for which n_{i,j} ≠ 0). If the word does not appear in any document this denominator is zero, so in practice 1 + |{ j : t_i ∈ d_j }| is often used instead.
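
  A matching sketch for the IDF side, again in plain Python with a made-up two-document corpus:

    import math

    def inverse_document_frequency(corpus_tokens):
        """idf_i = log(|D| / number of documents containing t_i) for every word in the corpus."""
        num_docs = len(corpus_tokens)
        doc_freq = {}
        for doc in corpus_tokens:
            for word in set(doc):  # count each word at most once per document
                doc_freq[word] = doc_freq.get(word, 0) + 1
        return {word: math.log(num_docs / df) for word, df in doc_freq.items()}

    # "the" occurs in both documents, so its idf is log(2/2) = 0;
    # the other words occur in only one, so their idf is log(2/1) ≈ 0.693
    corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
    print(inverse_document_frequency(corpus))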

IV. TF-IDF

  Formula: TF-IDF = TF * IDF

  Features: a word with a high frequency in a given document and a low frequency across the whole document corpus tends to receive a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.

  Idea: if a word or phrase appears with a high frequency (TF) in one article but rarely appears in other articles, it is considered to have good discriminating power between categories and to be well suited for classification.

V. Code implementation

  To be continued...
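
  Until then, here is a minimal end-to-end sketch in plain Python (the corpus below is illustrative only; a real implementation would normally add proper tokenization and IDF smoothing, or use an existing library):

    import math
    from collections import Counter

    def tf_idf(corpus_tokens):
        """TF-IDF(i, j) = tf(i, j) * idf(i) for every word of every document."""
        num_docs = len(corpus_tokens)

        # Document frequency: how many documents each word occurs in.
        doc_freq = Counter(word for doc in corpus_tokens for word in set(doc))

        weights = []
        for doc in corpus_tokens:
            counts = Counter(doc)
            total = len(doc)
            weights.append({
                word: (count / total) * math.log(num_docs / doc_freq[word])
                for word, count in counts.items()
            })
        return weights

    # "tf" and "idf" each appear in only one document, so they get a positive
    # weight; the shared word "weight" scores 0 and is effectively filtered out.
    corpus = [
        ["tf", "weight", "weight"],
        ["idf", "weight", "document"],
    ]
    for doc_weights in tf_idf(corpus):
        print(doc_weights)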

 
