Natural Language Processing: Converting TF to TF-IDF (Mathematical-Formula Approach)

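Term frequency (TF) must be weighted by inverse document frequency (IDF) so that the most important, most meaningful words receive the weight they deserve. In other words, TF needs to be converted to TF-IDF. Web-scale search engines with millisecond response times typically rely on TF-IDF-style term weighting behind the scenes.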

Mathematical-formula approach:
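As a rough summary of what the code below computes, the three quantities can be written as follows. The notation is introduced here for illustration and does not appear in the original post: count(t, d) is the number of times term t occurs in document d, |d| is the number of tokens in d, N is the number of documents in the corpus (2 here), and df(t) is the number of documents containing t. Note that the IDF in this form is the plain ratio, with no logarithm applied.

```latex
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{|d|}, \qquad
\mathrm{idf}(t) = \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```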

from nlpia.data.loaders import kite_text, kite_history
from nltk.tokenize import TreebankWordTokenizer
from collections import Counter

# First, tokenize each document in the corpus (the intro and history texts)
tokenizer = TreebankWordTokenizer()
kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)
kite_history = kite_history.lower()
history_tokens = tokenizer.tokenize(kite_history)
intro_total = len(intro_tokens)
print(intro_total)
history_total = len(history_tokens)
print(history_total)

# Compute the TF of 'kite', 'and', and 'china' in each document
intro_tf = {}
history_tf = {}
intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite'] / intro_total
history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite'] / history_total
print('Term Frequency of "kite" in intro is: {:.4f}'.format(intro_tf['kite']))
print('Term Frequency of "kite" in history is: {:.4f}'.format(history_tf['kite']))
intro_tf['and'] = intro_counts['and'] / intro_total
history_tf['and'] = history_counts['and'] / history_total
print('Term Frequency of "and" in intro is: {:.4f}'.format(intro_tf['and']))
print('Term Frequency of "and" in history is: {:.4f}'.format(history_tf['and']))
intro_tf['china'] = intro_counts['china'] / intro_total
history_tf['china'] = history_counts['china'] / history_total

# Compute the IDF of the three words: first count how many documents contain each one
num_docs_containing_and = 0
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1
num_docs_containing_kite = 0
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1
num_docs_containing_china = 0
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

num_docs = 2
intro_idf = {}
history_idf = {}
intro_idf['and'] = num_docs / num_docs_containing_and
history_idf['and'] = num_docs / num_docs_containing_and
intro_idf['kite'] = num_docs / num_docs_containing_kite
history_idf['kite'] = num_docs / num_docs_containing_kite
intro_idf['china'] = num_docs / num_docs_containing_china
history_idf['china'] = num_docs / num_docs_containing_china

# Compute the TF-IDF of the three words: tf * idf
# For the intro document:
intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']
# For the history document:
history_tfidf = {}
history_tfidf['and'] = history_tf['and'] * history_idf['and']
history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
history_tfidf['china'] = history_tf['china'] * history_idf['china']
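
The steps above repeat the same three calculations for every word and document, so they can be folded into one helper. The following is a minimal sketch, not the original author's code: the function name `tf_idf` and the toy token lists `doc_a` and `doc_b` are hypothetical stand-ins for `intro_tokens` and `history_tokens`, chosen so the example runs without the nlpia data. It assumes every document has at least one token.

```python
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Unscaled TF-IDF of `term` in `doc_tokens`, mirroring the steps above.

    Assumes `doc_tokens` is a non-empty list of tokens and `corpus`
    is a list of such token lists.
    """
    # TF: occurrences of the term divided by the document's token count
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: how many documents contain the term at least once
    docs_containing = sum(1 for doc in corpus if term in doc)
    if docs_containing == 0:
        # The term appears nowhere; return 0.0 rather than divide by zero
        return 0.0
    # IDF: number of documents over document frequency (no log applied)
    idf = len(corpus) / docs_containing
    return tf * idf


# Hypothetical toy documents standing in for intro_tokens and history_tokens
doc_a = ['the', 'kite', 'and', 'the', 'wind']
doc_b = ['china', 'and', 'the', 'kite', 'kite', 'history']
corpus = [doc_a, doc_b]

for term in ['kite', 'and', 'china']:
    print(term, tf_idf(term, doc_a, corpus), tf_idf(term, doc_b, corpus))
```

The guard for a zero document count is an addition: the step-by-step code above would raise a ZeroDivisionError if one of the three words appeared in neither document.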

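Note:
In practice, a logarithm log() (the inverse of exp()) is usually applied to scale term frequencies (and document frequencies), so that TF-IDF scores are more uniformly distributed, i.e., the values are compressed into a limited numeric range. This prevents words with similar occurrence counts from ending up with exponentially different TF-IDF scores.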
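
To make the note concrete, here is a small sketch of one common log-scaled variant, tf * log(N / df). The post does not say which scaling it has in mind, so applying the log to the IDF only is an assumption, and the function name `log_scaled_tf_idf` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_scaled_tf_idf(term, doc_tokens, corpus):
    """TF times log(N / df): one common log-scaled variant (assumed here)."""
    # TF: occurrences of the term divided by the document's token count
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: how many documents contain the term at least once
    docs_containing = sum(1 for doc in corpus if term in doc)
    if docs_containing == 0:
        # The term appears nowhere; avoid dividing by zero
        return 0.0
    # The natural log compresses the ratio N / df
    return tf * math.log(len(corpus) / docs_containing)


# Hypothetical toy documents, not the kite texts from nlpia
doc_a = ['the', 'kite', 'and', 'the', 'wind']
doc_b = ['china', 'and', 'the', 'kite', 'kite', 'history']
corpus = [doc_a, doc_b]

for term in ['kite', 'and', 'china']:
    print(term, log_scaled_tf_idf(term, doc_b, corpus))
```

With this variant, a word that occurs in every document (such as 'kite' or 'and' in the toy corpus) scores exactly 0, because log(1) = 0, while a word found in only one of the two documents is weighted by log(2).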

Reposted from blog.csdn.net/fgg1234567890/article/details/111827485