TF-IDF算法介绍和基于Python的实现

TF-IDF算法概念

TF-IDF（term frequency–inverse document frequency）是一种用于信息检索与数据挖掘的常用加权技术。TF是词频(Term Frequency)，IDF是逆文本频率(Inverse Document Frequency)

TF-IDF是一种统计方法，用以评估一字词对于一个语料库中的其中一份文件的重要程度。字词的重要性随着它在文件中出现的次数成正比增加，但同时会随着它在语料库中出现的频率成反比下降。
TFIDF的主要思想是：如果某个词或短语在一篇文章中出现的频率TF高，并且在其他文章中很少出现，则认为此词或者短语具有很好的类别区分能力，适合用来分类。

（1）词频TF(Term Frequency)

TF表示某一个给定的词语w在该文件中出现的频率
公式：
在这里插入图片描述

（2）逆文本频率IDF(Inverse Document Frequency）

IDF的主要思想是：如果包含词条w的文档越少，IDF越大，则说明词条w具有很好的类别区分能力。
公式：

注意：取以10为底的对数

（3）TF-IDF实际上是：TF * IDF

公式：
在这里插入图片描述

TF-IDF算法的Python实现

1.计算TF值

def computeTF(in_word, words_num_dic):
    """
    计算单词in_word在每篇文档的TF

    :param in_word: 单词
    :param words_num_dic: 文本词数字典｛txt1:{word1:num1,word2:num2},...｝
    :return: tfDict: 单词in_word在所有文本中的tf值字典 ｛文件名1：tf1,文件名2：tf2,...｝
    """
    allcount_dic = {}   # 各文档的总词数
    tfDict = {}     # in_word的tf字典
    # 计算每篇文档总词数
    for filename, num in words_num_dic.items():
        count = 0
        for value in num.values():
            count += value
        allcount_dic[filename] = count
    # 计算tf
    for filename, num in words_num_dic.items():
        if in_word in num.keys():
            tfDict[filename] = num[in_word] / allcount_dic[filename]
    return tfDict

2.计算IDF值

def computeIDF(in_word, words_num_dic):
    """
    计算in_word的idf值

    :param in_word: 单词
    :param words_num_dic: 文本词数字典｛txt1:{word1:num1,word2:num2},...｝
    :return: 单词in_word在整个文本库中的idf值
    """
    docu_count = len(words_num_dic)     # 总文档数
    count = 0
    for num in words_num_dic.values():
        if in_word in num.keys():
            count += 1
    import math
    return math.log10((docu_count) / (count + 1))

3.计算TF-IDF值

def computeTFIDF(in_word, words_num_dic):
    """
    计算in_word在每篇文档的tf-idf值

    :param in_word: 单词
    :param words_num_dic: 文本词数字典｛txt1:{word1:num1,word2:num2},...｝
    :return: tfidf_dic:单词in_word在所有文本中的tf-idf值字典 ｛文件名1：tfidf1,文件名2：tfidf2,...｝
    """
    tfidf_dic = {}
    idf = computeIDF(in_word, words_num_dic)
    tf_dic = computeTF(in_word, words_num_dic)
    for filename, tf in tf_dic.items():
        tfidf_dic[filename] = tf * idf
    return tfidf_dic

测试：

s = {'1.txt': 'apple maths apple maths maths', '2.txt': 'maths maths'}
s_num = {'1.txt': {'apple': 2, 'maths': 3}, '2.txt': {'maths': 2}}
print("tf:")
print(computeTF("maths", s_num))
print("idf:")
print(computeIDF("maths", s_num))
print("tf-idf:")
print(computeTFIDF("maths", s_num))

输出：
在这里插入图片描述