These are mainly study notes on the book "Recommendation System" by Xiang Liang and the Bilibili course "Design of a Movie Recommendation System" by Wu Shengran. I hope readers will point out any deficiencies.
TF-IDF algorithm principle
- TF (Term Frequency): the normalized term frequency

$$TF_{i,j}=\frac{n_{i,j}}{n_{*,j}}$$

where $TF_{i,j}$ is the frequency of word $i$ in document $j$, $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $n_{*,j}$ is the total number of words in document $j$.
- IDF (Inverse Document Frequency):

$$IDF_{i}=\log\left(\frac{N+1}{N_{i}+1}\right)$$

where $N$ is the total number of documents in the document set and $N_i$ is the number of documents that contain word $i$; the $+1$ terms smooth the ratio so it is defined even when no document contains the word. The TF-IDF weight of word $i$ in document $j$ is then the product $TF_{i,j}\times IDF_{i}$.
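For intuition, a small worked example with hypothetical numbers (not the corpus used below): suppose word $i$ occurs 3 times in a 100-word document $j$, and 4 of the $N=9$ documents in the set contain it. Then

$$TF_{i,j}=\frac{3}{100}=0.03,\qquad IDF_{i}=\log\left(\frac{9+1}{4+1}\right)=\log 2,$$

so the TF-IDF weight is $0.03\log 2$. A word that appears in every document gets $IDF_{i}=\log\frac{N+1}{N+1}=0$, which is how TF-IDF suppresses words common to the whole document set.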
Implementation examples
- Define data and preprocessing
# import libraries
import numpy as np
import pandas as pd
# define the sample documents
docA = "The cat sat on my bed"
docB = "The dog sat on my kness"
# split each document into a bag of words
bowA = docA.split(" ")
bowB = docB.split(" ")
bowA
# build the vocabulary: the union of both word bags
wordSet = set(bowA).union(set(bowB))
- Count the number of words
# count word frequencies
# use dictionaries keyed by the vocabulary, initialized to 0
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
# traverse each document and count its words
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1
pd.DataFrame([wordDictA, wordDictB])
|   | The | bed | cat | dog | kness | my | on | sat |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
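As an aside, the same frequency table can be built with the standard library's `collections.Counter`, which skips the manual zero-initialization. A sketch reusing the two documents above:

```python
from collections import Counter

docA = "The cat sat on my bed"
docB = "The dog sat on my kness"
bowA = docA.split(" ")
bowB = docB.split(" ")
wordSet = set(bowA).union(set(bowB))

# Counter tallies each token; indexing a word it has not seen returns 0
countsA = Counter(bowA)
countsB = Counter(bowB)
wordDictA = {word: countsA[word] for word in wordSet}
wordDictB = {word: countsB[word] for word in wordSet}
```

The resulting dictionaries are identical to the ones produced by the explicit loops, so either style can feed the TF computation below.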
- Calculate TF
def computeTF(wordDict, bow):
    # normalized term frequency: word count divided by document length
    tfDict = {}
    nbowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / nbowCount
    return tfDict

tfA = computeTF(wordDictA, bowA)
tfB = computeTF(wordDictB, bowB)
tfA
{'The': 0.16666666666666666,
'cat': 0.16666666666666666,
'on': 0.16666666666666666,
'kness': 0.0,
'sat': 0.16666666666666666,
'bed': 0.16666666666666666,
'dog': 0.0,
'my': 0.16666666666666666}
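A useful sanity check on `computeTF`: since every token of a document is counted exactly once, a document's TF values always sum to 1. A minimal sketch, restating `computeTF` so the snippet runs on its own:

```python
def computeTF(wordDict, bow):
    # normalized term frequency: count / document length
    nbowCount = len(bow)
    return {word: count / nbowCount for word, count in wordDict.items()}

bowA = "The cat sat on my bed".split(" ")
# counts over this document's own words; absent words would contribute 0
wordDictA = {word: bowA.count(word) for word in set(bowA)}
tfA = computeTF(wordDictA, bowA)
# each of the six distinct words contributes 1/6, so the sum is 1
total = sum(tfA.values())
```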
- Calculate IDF
def computeIDF(wordDictList):
    import math
    # use a dict to hold each word's idf, every word as a key, initial value 0
    idfDict = dict.fromkeys(wordDictList[0], 0)
    N = len(wordDictList)
    # traverse every document's word dict and count Ni,
    # the number of documents containing word i
    for wordDict in wordDictList:
        for word, count in wordDict.items():
            if count > 0:
                idfDict[word] += 1
    # with every Ni collected, replace each with its idf value from the formula
    for word, ni in idfDict.items():
        idfDict[word] = math.log10((N + 1) / (ni + 1))
    return idfDict

idfs = computeIDF([wordDictA, wordDictB])
idfs
{'The': 0.0,
'cat': 0.17609125905568124,
'on': 0.0,
'kness': 0.17609125905568124,
'sat': 0.0,
'bed': 0.17609125905568124,
'dog': 0.17609125905568124,
'my': 0.0}
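One detail worth noting: the IDF formula above is written with an unspecified $\log$, while `computeIDF` uses `math.log10`. The base only rescales every IDF by the same constant, so relative rankings are unchanged. A quick check with this corpus's values ($N=2$, $N_i=1$ for 'cat'):

```python
import math

N = 2   # two documents in the corpus
ni = 1  # 'cat' appears in exactly one document
idf_base10 = math.log10((N + 1) / (ni + 1))  # as in computeIDF
idf_natural = math.log((N + 1) / (ni + 1))   # natural-log variant
# the two differ only by the constant factor ln(10)
same_up_to_scale = math.isclose(idf_natural, idf_base10 * math.log(10))
```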
- Calculate TF-IDF
def computeTFIDF(tf, idfs):
    # tf-idf is the element-wise product of tf and idf
    tfidf = {}
    for word, tfval in tf.items():
        tfidf[word] = tfval * idfs[word]
    return tfidf

tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
pd.DataFrame([tfidfA, tfidfB])
|   | The | bed | cat | dog | kness | my | on | sat |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.029349 | 0.029349 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.000000 | 0.000000 | 0.029349 | 0.029349 | 0.0 | 0.0 | 0.0 |
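The three steps can also be folded into one small function. A sketch with the same smoothing and `log10` base as above, which reproduces these scores (e.g. about 0.029349 for 'cat' in the first document):

```python
import math

def tfidf_corpus(docs):
    # tokenize and build the shared vocabulary
    bows = [doc.split(" ") for doc in docs]
    vocab = set(word for bow in bows for word in bow)
    # per-document raw counts over the full vocabulary
    counts = [{word: bow.count(word) for word in vocab} for bow in bows]
    # smoothed idf, matching computeIDF: log10((N+1)/(Ni+1))
    N = len(docs)
    idf = {word: math.log10((N + 1) / (sum(c[word] > 0 for c in counts) + 1))
           for word in vocab}
    # tf-idf = tf * idf, per document
    return [{word: c[word] / len(bow) * idf[word] for word in vocab}
            for c, bow in zip(counts, bows)]

scores = tfidf_corpus(["The cat sat on my bed", "The dog sat on my kness"])
```

Words shared by both documents ('The', 'sat', 'on', 'my') get an IDF of $\log_{10}(3/3)=0$ and therefore a TF-IDF of 0, while each document's distinctive words carry all the weight.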