TF-IDF的算法原理

预处理过程中，我们已经把停词都过滤掉了。如果只考虑剩下的有实际意义的词，前我们已经讲过，显然词频（TF，Term Frequency）较高的词之于一篇文章来说可能是更为重要的词（也就是潜在的关键词）。但这样又会遇到了另一个问题，我们可能发现在上面例子中，madefortv、california、includ 都出现了2次（madefortv其实是原文中的made-for-TV，因为我们所选分词法的缘故，它被当做是一个词来看待），但这显然并不意味着“作为关键词，它们的重要性是等同的”。

因为”includ”是很常见的词（注意 includ 是 include 的词干）。相比之下，california 可能并不那么常见。如果这两个词在一篇文章的出现次数一样多，我们有理由认为，california 重要程度要大于 include ，也就是说，在关键词排序上面，california 应该排在 include 的前面。

于是，我们需要一个重要性权值调整参数，来衡量一个词是不是常见词。如果某个词比较少见，但是它在某篇文章中多次出现，那么它很可能就反映了这篇文章的特性，它就更有可能揭示这篇文字的话题所在。这个权重调整参数就是“逆文档频率”（IDF，Inverse Document Frequency），它的大小与一个词的常见程度成反比。

知道了 TF 和 IDF 以后，将这两个值相乘，就得到了一个词的TF-IDF值。某个词对文章的重要性越高，它的TF-IDF值就越大。如果用公式来表示，则对于某个特定文件中的词语 ti

它的 TF 可以表示为：

最后，便可以来计算 TF-IDF(t)=TF(t)×IDF(t)。

下面的代码实现了计算TF-IDF值的功能。

def tf(word, count):
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

再给出一段测试代码：

countlist = [count1, count2, count3]
for i, count in enumerate(countlist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

输出结果如下：

Top words in document 1
    Word: film, TF-IDF: 0.02829
    Word: madefortv, TF-IDF: 0.00943
    Word: california, TF-IDF: 0.00943
Top words in document 2
    Word: genu, TF-IDF: 0.03686
    Word: 7, TF-IDF: 0.01843
    Word: among, TF-IDF: 0.01843
Top words in document 3
    Word: revolv, TF-IDF: 0.02097
    Word: colt, TF-IDF: 0.02097
    Word: manufactur, TF-IDF: 0.01398

利用Scikit-Learn实现的TF-IDF

因为 TF-IDF 在文本数据挖掘时十分常用，所以在Python的机器学习包中也提供了内置的TF-IDF实现。主要使用的函数就是TfidfVectorizer()，来看一个简单的例子。

>>> corpus = ['This is the first document.',
      'This is the second second document.',
      'And the third one.',
      'Is this the first document?',]
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse Row format>
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> vectorizer.fit_transform(corpus).toarray()
array([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674],
       [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
         0.85322574,  0.22262429,  0.        ,  0.27230147],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.55280532,  0.        ],
       [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674]])

最终的结果是一个 4×9

矩阵。每行表示一个文档，每列表示该文档中的每个词的评分。如果某个词没有出现在该文档中，则相应位置就为 0 。数字 9 表示语料库里词汇表中一共有 9 个（不同的）词。例如，你可以看到在文档1中，并没有出现 and，所以矩阵第一行第一列的值为 0 。单词 first 只在文档1中出现过，所以第一行中 first 这个词的权重较高。而 document 和 this 在 3 个文档中出现过，所以它们的权重较低。而 the 在 4 个文档中出现过，所以它的权重最低。

总的感觉就是，要找一个权衡一个词的重要性的参数，但是这个参数是两个指标的乘积形式，其中一个指标是该词在当前文章中的出现的次数，这个词在文章中出现的次数越多，对该文章的重要性越大，但是如果该词在所有文章中出现的次数都很大，就不能说明该词对某一篇文章重要性了，比如说汉字"的"在每一篇文章中都会大量出现，你就不能说，这个字对其中某一篇文章来说很重。我们要找的结果就是需要这个词在该文章中大量出现，但针对所有文章来说，出现的频率又属于较小的，大概类似于"垄断现象"。

利用Scikit-Learn实现的TF-IDF

猜你喜欢