A brief description of the TF_IDF algorithm with a worked calculation example

These are my study notes, based mainly on the book "Recommendation System" by Xiang Liang and the course "Design of Movie Recommendation System" by Wu Shengran on Xiaopozhan. Corrections are welcome if anything here is wrong or incomplete.

TF_IDF algorithm principle

  • TF (Term Frequency), the normalized term frequency: $TF_{i,j}=\frac{n_{i,j}}{n_{*,j}}$, where $TF_{i,j}$ is the frequency of word i in document j, $n_{i,j}$ is the number of times word i occurs in document j, and $n_{*,j}$ is the total number of words in document j.
  • IDF (Inverse Document Frequency):
    $IDF_{i}=\log \left(\frac{N+1}{N_{i}+1}\right)$
    where N is the total number of documents in the document set and $N_i$ is the number of documents in the set that contain word i. The implementation below uses the base-10 logarithm.
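
The two quantities are multiplied to give the TF_IDF weight of a word in a document, which is what the last implementation step computes. As a worked check, take the word "cat" in the first example document ("The cat sat on my bed": 6 words, 2 documents in total, "cat" contained in 1 of them), assuming the base-10 logarithm:

```latex
\begin{aligned}
\mathrm{TFIDF}_{i,j} &= TF_{i,j} \times IDF_{i} \\
TF_{\text{cat},A} &= \frac{1}{6} \approx 0.1667 \\
IDF_{\text{cat}} &= \log_{10}\left(\frac{2+1}{1+1}\right) = \log_{10}(1.5) \approx 0.1761 \\
\mathrm{TFIDF}_{\text{cat},A} &\approx 0.1667 \times 0.1761 \approx 0.0293
\end{aligned}
```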

Implementation examples

  1. Define data and preprocessing
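# Import libraries
import numpy as np
import pandas as pd

# Define the two sample documents
docA = "The cat sat on my bed"
docB = "The dog sat on my kness"

# Split each document into a bag of words
bowA = docA.split(" ")
bowB = docB.split(" ")
bowA
# Build the vocabulary as the union of both bags
wordSet = set(bowA).union(set(bowB))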
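  2. Count the number of words
# Count word occurrences
# Use one dictionary per document to store how often each word appears
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

# Walk through each document and count its words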
for word in bowA:
    wordDictA[word] +=1
for word in bowB:
    wordDictB[word] +=1

pd.DataFrame([wordDictA,wordDictB])

   The  bed  cat  dog  kness  my  on  sat
0    1    1    1    0      0   1   1    1
1    1    0    0    1      1   1   1    1
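
As an alternative sketch of the counting step, Python's standard `collections.Counter` can build the same counts without initializing the dictionaries by hand. The helper name `count_words` is hypothetical and not part of the original notes; the documents are the two samples from step 1.

```python
from collections import Counter

docA = "The cat sat on my bed"
docB = "The dog sat on my kness"

def count_words(doc, vocabulary):
    """Return {word: count} for every word in vocabulary (0 if absent)."""
    counts = Counter(doc.split(" "))
    # Counter returns 0 for missing keys, so absent words need no special case
    return {word: counts[word] for word in sorted(vocabulary)}

vocabulary = set(docA.split(" ")) | set(docB.split(" "))
countsA = count_words(docA, vocabulary)
countsB = count_words(docB, vocabulary)
print(countsA)
print(countsB)
```

Sorting the vocabulary only fixes the column order; the counts themselves are the same as in the table above.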

  3. Calculate TF
def computeTF(wordDict, bow):
    # TF of a word = its count divided by the number of words in the document
    tfDict = {}
    nbowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / nbowCount
    return tfDict
    
tfA=computeTF(wordDictA,bowA)
tfB=computeTF(wordDictB,bowB)
tfA
{'The': 0.16666666666666666,
 'cat': 0.16666666666666666,
 'on': 0.16666666666666666,
 'kness': 0.0,
 'sat': 0.16666666666666666,
 'bed': 0.16666666666666666,
 'dog': 0.0,
 'my': 0.16666666666666666}
  4. Calculate IDF
import math

def computeIDF(wordDictList):
    # Store the IDF results in a dictionary: each word is a key, initial value 0
    idfDict = dict.fromkeys(wordDictList[0], 0)
    N = len(wordDictList)

    for wordDict in wordDictList:
        # Go through every word in the dictionary and count Ni
        for word, count in wordDict.items():
            if count > 0:
                # The word occurs in this document, so increase its Ni by 1
                idfDict[word] += 1

    # idfDict now holds Ni for every word i; replace it with the IDF value from the formula
    for word, ni in idfDict.items():
        idfDict[word] = math.log10((N + 1) / (ni + 1))

    return idfDict

idfs = computeIDF([wordDictA, wordDictB])
idfs
{'The': 0.0,
 'cat': 0.17609125905568124,
 'on': 0.0,
 'kness': 0.17609125905568124,
 'sat': 0.0,
 'bed': 0.17609125905568124,
 'dog': 0.17609125905568124,
 'my': 0.0}
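
To see why the words shared by both documents get an IDF of 0, it may help to evaluate the smoothed formula directly. This is only a sketch: the function name `smoothed_idf` and the corpus size of 10 are illustrative assumptions, and the base-10 logarithm is used to match `computeIDF`.

```python
import math

def smoothed_idf(n_docs, n_docs_with_word):
    """IDF = log10((N + 1) / (N_i + 1)), the formula used in these notes."""
    return math.log10((n_docs + 1) / (n_docs_with_word + 1))

# The two-document example: a word in both documents vs. in only one
print(smoothed_idf(2, 2))  # word appears in every document
print(smoothed_idf(2, 1))  # word appears in one document

# Hypothetical corpus of 10 documents: IDF shrinks as the word gets more common
for n_i in (1, 5, 10):
    print(n_i, round(smoothed_idf(10, n_i), 4))
```

When a word occurs in every document, the fraction is (N+1)/(N+1) = 1 and its logarithm is 0, so such words contribute nothing to the final weight.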
  5. Calculate TF_IDF
def computeTFIDF(tf, idfs):
    # TF_IDF of a word = its TF in the document times its IDF
    tfidf = {}
    for word, tfval in tf.items():
        tfidf[word] = tfval * idfs[word]
    return tfidf

tfidfA = computeTFIDF( tfA, idfs )
tfidfB = computeTFIDF( tfB, idfs )

pd.DataFrame( [tfidfA, tfidfB] )
   The       bed       cat       dog     kness   my   on  sat
0  0.0  0.029349  0.029349  0.000000  0.000000  0.0  0.0  0.0
1  0.0  0.000000  0.000000  0.029349  0.029349  0.0  0.0  0.0
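
The five steps can also be folded into a single function. The following is a minimal sketch under the same definitions (splitting on spaces, smoothed IDF with the base-10 logarithm); the name `tf_idf` and its list-of-strings interface are assumptions made for illustration, not taken from the book or the course.

```python
import math

def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document in docs (a list of strings)."""
    bags = [doc.split(" ") for doc in docs]
    vocabulary = set().union(*bags)
    n_docs = len(bags)

    # N_i: number of documents that contain each word
    doc_freq = {w: sum(1 for bag in bags if w in bag) for w in vocabulary}
    idf = {w: math.log10((n_docs + 1) / (doc_freq[w] + 1)) for w in vocabulary}

    result = []
    for bag in bags:
        # TF: occurrences of the word divided by the document length
        result.append({w: bag.count(w) / len(bag) * idf[w] for w in vocabulary})
    return result

weights = tf_idf(["The cat sat on my bed", "The dog sat on my kness"])
# Words with a non-zero weight are the ones that distinguish each document
for doc_weights in weights:
    print(sorted(w for w, v in doc_weights.items() if v > 0))
```

For the two sample documents, only the words unique to each document end up with a non-zero weight, which agrees with the table above.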

Origin blog.csdn.net/Zengmeng1998/article/details/107283582