Natural Language Processing: Converting TF to TF-IDF (Formula-Based Approach)

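Term frequency (TF) must be weighted by inverse document frequency (IDF) so that the most important and most meaningful words receive the weight they deserve. In other words, TF has to be converted to TF-IDF. Any web-scale search engine with millisecond response times has the power of TF-IDF behind it.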

Formula-based approach:

from nlpia.data.loaders import kite_text, kite_history
from nltk.tokenize import TreebankWordTokenizer
from collections import Counter

# First, lowercase and tokenize each document in the corpus (the intro text kite_text and the history text kite_history)
tokenizer = TreebankWordTokenizer()
kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)
kite_history = kite_history.lower()
history_tokens = tokenizer.tokenize(kite_history)
intro_total = len(intro_tokens)
print(intro_total)
history_total = len(history_tokens)
print(history_total)

# Compute the TF of "kite", "and", and "china" in each document
intro_tf = {}
history_tf = {}
intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite'] / intro_total
history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite'] / history_total
print('Term Frequency of "kite" in intro is: {:.4f}'.format(intro_tf['kite']))
print('Term Frequency of "kite" in history is: {:.4f}'.format(history_tf['kite']))
intro_tf['and'] = intro_counts['and'] / intro_total
history_tf['and'] = history_counts['and'] / history_total
print('Term Frequency of "and" in intro is: {:.4f}'.format(intro_tf['and']))
print('Term Frequency of "and" in history is: {:.4f}'.format(history_tf['and']))
intro_tf['china'] = intro_counts['china'] / intro_total
history_tf['china'] = history_counts['china'] / history_total

# Compute the IDF of the three words: count how many documents contain each one
num_docs_containing_and = 0
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1
num_docs_containing_kite = 0
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1
num_docs_containing_china = 0
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

num_docs = 2
intro_idf = {}
history_idf = {}
intro_idf['and'] = num_docs / num_docs_containing_and
history_idf['and'] = num_docs / num_docs_containing_and
intro_idf['kite'] = num_docs / num_docs_containing_kite
history_idf['kite'] = num_docs / num_docs_containing_kite
intro_idf['china'] = num_docs / num_docs_containing_china
history_idf['china'] = num_docs / num_docs_containing_china

# Compute the TF-IDF of the three words: tf * idf
# For the intro document:
intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']
# For the history document:
history_tfidf = {}
history_tfidf['and'] = history_tf['and'] * history_idf['and']
history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
history_tfidf['china'] = history_tf['china'] * history_idf['china']
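
The per-word steps above repeat the same three calculations, so they can be folded into one helper. The following is a minimal sketch, not the original author's code: the function name `tf_idf_table` and the two toy documents are hypothetical stand-ins (the real `kite_text` and `kite_history` require the `nlpia` package), and the IDF is the same unscaled ratio `num_docs / num_docs_containing_term` used above.

```python
from collections import Counter


def tf_idf_table(docs, terms):
    """Return {doc_name: {term: tf * idf}} using idf = num_docs / doc_freq (no log)."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each term occur at least once?
    doc_freq = {
        term: sum(1 for tokens in docs.values() if term in tokens)
        for term in terms
    }
    table = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        table[name] = {}
        for term in terms:
            tf = counts[term] / total if total else 0.0
            # A term absent from every document gets idf 0 to avoid dividing by zero.
            idf = num_docs / doc_freq[term] if doc_freq[term] else 0.0
            table[name][term] = tf * idf
    return table


# Hypothetical toy documents, already lowercased and tokenized.
toy_docs = {
    'intro': 'a kite is a tethered craft and it flies'.split(),
    'history': 'kites were invented in china and spread'.split(),
}
scores = tf_idf_table(toy_docs, ['kite', 'and', 'china'])
for name, row in scores.items():
    for term, value in row.items():
        print('TF-IDF of "{}" in {} is: {:.4f}'.format(term, name, value))
```

In this toy corpus "and" occurs in both documents, so its IDF is 1 and its score is just its TF, while "kite" and "china" each occur in only one document and have their TF doubled there.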

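Note:
In practice, word frequencies (and document frequencies) are usually scaled with the logarithm log(), the inverse of exp(). This makes TF-IDF scores more uniformly distributed and compresses them into a narrower range, so that words with similar occurrence counts do not end up with exponentially different TF-IDF scores. The code above uses the unscaled ratio num_docs / num_docs_containing_term as the IDF.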
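
To make the effect of log scaling concrete, here is a small sketch that compares the raw IDF ratio with its logarithm. The helper names `raw_idf` and `log_idf` and the corpus size are assumptions chosen for illustration, and the natural logarithm is only one common convention (other bases and smoothed variants are also in use).

```python
import math


def raw_idf(num_docs, doc_freq):
    """Unscaled IDF: the ratio used in the walkthrough above."""
    return num_docs / doc_freq


def log_idf(num_docs, doc_freq):
    """Log-scaled IDF: natural logarithm of the same ratio."""
    return math.log(num_docs / doc_freq)


# Hypothetical corpus of one million documents.
num_docs = 1_000_000
for doc_freq in (1, 10, 100_000):
    print('doc_freq={:>7}  raw idf={:>11.1f}  log idf={:.2f}'.format(
        doc_freq, raw_idf(num_docs, doc_freq), log_idf(num_docs, doc_freq)))
```

A tenfold change in document frequency changes the raw IDF by a factor of ten, but changes the log-scaled IDF only by the additive constant log(10), roughly 2.30. A word that occurs in every document gets a log-scaled IDF of log(1) = 0, so it contributes nothing to the score.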
