Machine Learning (14) TF-IDF Algorithm

Concept

     TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. It is a statistical method for assessing how important a word is to a document within a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure of the degree of relevance between documents and user queries. In addition to TF-IDF, web search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.

  

Principle

      In a given document, term frequency (TF) refers to the number of times a given word appears in that document. This count is usually normalized (the numerator is generally smaller than the denominator, which distinguishes it from IDF) to prevent a bias toward long documents. (The same word may have a higher raw count in a long document than in a short one, regardless of whether the word is actually important.)

  Inverse document frequency (IDF) is a measure of a word's general importance. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient.

  A high term frequency within a particular document, combined with a low document frequency of that word across the whole document set, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.

      The main idea of TF-IDF is this: if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it is considered to have good discriminating power between categories and is suitable for classification. TF-IDF is simply TF * IDF, where TF is term frequency and IDF is inverse document frequency. TF measures how often a term appears in a document d (in other words, term frequency is the number of times a given word appears in the document). The main idea behind IDF is: the fewer the documents that contain term t, i.e. the smaller n is, the larger the IDF (see the formula below), indicating that term t discriminates well between categories. Suppose the number of documents of a certain class C that contain term t is m, and the number of documents of all other classes that contain t is k; then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF obtained from the formula will be small, suggesting that term t's ability to discriminate the class is weak. (Put another way: the fewer documents contain a term, the larger its IDF and the better the term separates categories.) In practice, however, if a term appears frequently within the documents of one class, it often represents the features of that class of text well; such a term should be given a higher weight and selected as a feature word of that class to distinguish it from other classes of documents. This is where IDF falls short.

      In a given document, term frequency (TF) refers to the frequency with which a given word appears in that document. The raw count is normalized by the total number of terms in the document to prevent a bias toward long documents. (The same word may occur more times in a long document than in a short one, regardless of whether the word is important.) For a word t_{i} in a particular document, its importance can be expressed as:

\mathrm{tf_{i,j}} = \frac{n_{i,j}}{\sum_k n_{k,j}}

      In the formula above, n_{i,j} is the number of occurrences of the word t_{i} in the document d_{j}, and the denominator is the sum of the occurrences of all words in the document d_{j}.
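A minimal sketch in Python of the normalized term frequency defined above (the function name term_frequency and the toy tokens are illustrative assumptions, not from the original article):

```python
from collections import Counter

def term_frequency(document_tokens):
    """Normalized term frequency tf_{i,j}: the count of each word
    divided by the total number of tokens in the document."""
    counts = Counter(document_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# "cow" appears 3 times among 6 tokens, so its tf is 0.5
print(term_frequency(["cow", "eats", "grass", "cow", "cow", "grass"]))
```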

      Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word can be calculated by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient:

\mathrm{idf_{i}} =  \log \frac{|D|}{|\{j: t_{i} \in d_{j}\}|}

where

  • |D|: the total number of documents in the corpus
  • |\{j: t_{i} \in d_{j}\}|: the number of documents containing the word t_{i} (i.e., the number of documents for which n_{i,j} \neq 0). If the word does not appear in the corpus at all, this divisor would be zero, so in practice 1 + |\{j : t_{i} \in d_{j}\}| is generally used.

Then

\mathrm{tf{}idf_{i,j}} = \mathrm{tf_{i,j}} \times  \mathrm{idf_{i}}

      A high term frequency within a particular document, together with a low document frequency of that word across the entire document set, yields a highly weighted TF-IDF. TF-IDF therefore tends to filter out common words and retain important ones.
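A small, self-contained Python sketch that puts the two pieces together, computing the tf and the smoothed idf defined above and multiplying them (the helper name tf_idf and the toy corpus are illustrative assumptions, not from the article):

```python
import math
from collections import Counter

def tf_idf(corpus_tokens):
    """tf-idf weights for each tokenized document, using
    tf_{i,j} = n_{i,j} / sum_k n_{k,j} and
    idf_i = log(|D| / (1 + |{j : t_i in d_j}|))."""
    n_docs = len(corpus_tokens)
    # document frequency: in how many documents each word appears
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    idf = {word: math.log(n_docs / (1 + df[word])) for word in df}

    weights = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({w: (c / total) * idf[w] for w, c in counts.items()})
    return weights

corpus = [["the", "cow", "eats", "grass"],
          ["the", "dog", "chases", "the", "cow"],
          ["the", "sun", "is", "bright"]]
for doc_weights in tf_idf(corpus):
    print(doc_weights)

# Note: with the 1 + df smoothing, a word that occurs in every document
# (here "the") gets a slightly negative idf; libraries such as scikit-learn
# use other smoothing variants to avoid this.
```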

 

Examples

Example 1: There are many different mathematical formulas that can be used to compute TF-IDF. This example uses the formulas above. Term frequency (TF) is the number of times a word appears divided by the total number of words in the document. If a document contains 100 words in total and the word "母牛" (cow) appears 3 times, then the term frequency of "母牛" in that document is 3/100 = 0.03. One way to compute document frequency (DF) is to determine how many documents contain the word "母牛" and divide by the total number of documents in the collection. So if "母牛" appears in 1,000 documents and there are 10,000,000 documents in total, its inverse document frequency is log(10,000,000 / 1,000) = 4. The final TF-IDF score is 0.03 * 4 = 0.12.
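The arithmetic in this example can be checked with a few lines of Python (note that a base-10 logarithm is used here, unlike the natural logarithm in the next two examples):

```python
import math

# Example 1 recomputed: "母牛" appears 3 times in a 100-word document
# and in 1,000 of 10,000,000 documents.
tf = 3 / 100                             # 0.03
idf = math.log10(10_000_000 / 1_000)     # 4.0
print(tf * idf)                          # 0.12
```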

Example 2: For a search on keywords k1, k2, k3, the relevance score becomes TF1*IDF1 + TF2*IDF2 + TF3*IDF3. Suppose, for example, that document1 contains 1,000 terms in total, and that k1, k2, k3 appear in document1 100, 200, and 50 times respectively. The numbers of documents containing k1, k2, k3 are 1,000, 10,000, and 5,000 respectively, and the document set contains 10,000 documents in total. Then TF1 = 100/1000 = 0.1, TF2 = 200/1000 = 0.2, TF3 = 50/1000 = 0.05; IDF1 = log(10000/1000) = log(10) = 2.3, IDF2 = log(10000/10000) = log(1) = 0, IDF3 = log(10000/5000) = log(2) = 0.69 (natural logarithm). The relevance of keywords k1, k2, k3 to document1 is therefore 0.1*2.3 + 0.2*0 + 0.05*0.69 = 0.2645, where k1 carries more weight in document1 than k3, and k2's weight is 0.
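The same calculation, reproduced in Python with the natural logarithm (the variable names are illustrative):

```python
import math

# Example 2 recomputed with the natural logarithm.
tf = [100 / 1000, 200 / 1000, 50 / 1000]        # 0.1, 0.2, 0.05
idf = [math.log(10000 / 1000),                  # log(10) ~ 2.30
       math.log(10000 / 10000),                 # log(1)  = 0
       math.log(10000 / 5000)]                  # log(2)  ~ 0.69
relevance = sum(t * i for t, i in zip(tf, idf))
print(round(relevance, 4))  # ~0.2649; the text rounds the idf values first and gets 0.2645
```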

Example 3: Suppose a web page with a total of one thousand words contains the words "原子能" (atomic energy), "的" (of), and "应用" (application) 2 times, 35 times, and 5 times respectively; their term frequencies are then 0.002, 0.035, and 0.005. Adding these three numbers, the sum 0.042 is a simple measure of the relevance of this page to the query "原子能的应用" (applications of atomic energy). Broadly speaking, if a query contains keywords w1, w2, ..., wN, and their term frequencies in a particular page are TF1, TF2, ..., TFN (TF: term frequency), then the relevance of the query to that page is TF1 + TF2 + ... + TFN.

The reader may have already noticed another loophole. In the example above, the word "的" (of) accounts for more than 80% of the total term frequency, yet it is nearly useless for determining the topic of the page. Such words are called stop words, meaning their frequency should not be considered when measuring relevance. In Chinese, stop words also include "是", "和", "中", "地", "得", and dozens of others. After ignoring these stop words, the relevance of the page above becomes 0.007, of which "原子能" (atomic energy) contributes 0.002 and "应用" (application) contributes 0.005. A careful reader may also notice another small flaw: in Chinese, "应用" is a very common word while "原子能" is a very specialized one, and the latter should count for more than the former in relevance ranking. We therefore need to give every word a weight, and the weights must satisfy the following two conditions:

1. The stronger a word's ability to predict the topic, the larger its weight; conversely, the weaker that ability, the smaller the weight. When we see the word "原子能" (atomic energy) on a page, we learn at least something about the page's topic; seeing the word "应用" (application) tells us essentially nothing about it. Therefore "原子能" should be weighted more heavily than "应用".

2. Stop words should have a weight of zero.

It is easy to see that if a keyword appears in only a few pages, it helps us pin down the search target, so its weight should be large; conversely, if a word appears in a huge number of pages, seeing it still does not tell us clearly what is being looked for, so its weight should be small. In short, suppose a keyword w appears in Dw pages: the larger Dw is, the smaller the weight of w should be, and vice versa. In information retrieval, the most widely used weight is the inverse document frequency (IDF), given by the formula log(D/Dw), where D is the total number of pages. For example, assume the number of Chinese web pages is D = 1 billion, and the stop word "的" (of) appears on every page, i.e. Dw = 1 billion; its IDF is then log(1 billion / 1 billion) = log(1) = 0. If the specialized term "原子能" (atomic energy) appears on 2 million pages, i.e. Dw = 2 million, its weight is IDF = log(500) = 6.2. And if the common word "应用" (application) appears on 500 million pages, its weight IDF = log(2) is only 0.7. In other words, finding one match for "原子能" on a page counts as much as finding nine matches for "应用". Using IDF, the relevance formula above changes from a simple sum of term frequencies to a weighted sum: TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN. In the example above, the relevance of the page to "原子能的应用" (applications of atomic energy) is 0.0161, of which "原子能" contributes 0.0126 and "应用" contributes only 0.0035. This ratio agrees much better with our intuition.
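This example, too, can be reproduced approximately in Python (natural logarithm; the small difference from the 0.0161 quoted above comes from rounding):

```python
import math

# Example 3 recomputed: the stop word "的" is ignored.
tf_atomic, tf_application = 0.002, 0.005
idf_atomic = math.log(1_000_000_000 / 2_000_000)         # log(500) ~ 6.21
idf_application = math.log(1_000_000_000 / 500_000_000)  # log(2)   ~ 0.69
relevance = tf_atomic * idf_atomic + tf_application * idf_application
print(round(relevance, 4))  # ~0.0159, close to the 0.0161 quoted in the text
```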

Reposted from: http://www.cnblogs.com/biyeymyhjob/archive/2012/07/17/2595249.html
