NLP: TF-IDF and BM25

1 Terminology

  • TF: Term Frequency; measures how often a given term occurs within a single document.
  • IDF: Inverse Document Frequency; a measure of a term's general importance across the whole collection (terms that occur in fewer documents receive a higher IDF).
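
A quick worked example with made-up numbers: suppose the term "learning" occurs 3 times in a 100-word document and appears in 10 of the 1,000 documents in the collection. Then

\[ TF = \frac{3}{100} = 0.03, \qquad IDF = \log\left(\frac{1000}{10}\right) = \log(100) \]

and the term's TF-IDF weight in that document is \( 0.03 \times \log(100) \).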

2 TF-IDF

  • Classic TF-IDF
    • Term frequency (TF) of a word
      \[ TF\ Score = \frac{ \text{number of occurrences of } word \text{ in document } documents[i] }{ \text{length of } documents[i] } \]
    • Inverse document frequency (IDF) of a word
      \[ IDF\ Score = \log\left( \frac{ \text{total number of documents in the collection } documents }{ \text{number of documents in } documents \text{ that contain } word } \right) \]
    • Relevance score (TF-IDF) between a word and a document documents[j]
      \[ TF\text{-}IDF(word \mid documents[j]) = Similarity(word \mid documents[j]) \]
      \[ Similarity(word \mid documents[j]) = TF\ Score \times IDF\ Score \]
    • Relevance score (TF-IDF) between a sentence and a document documents[j]: the sum of the per-word scores
      \[ sentence = [word_1, word_2, \ldots, word_i, \ldots, word_n] \]
      \[ TF\text{-}IDF_{sentence}(sentence \mid documents[j]) = TF\text{-}IDF_{word_1} + TF\text{-}IDF_{word_2} + \ldots + TF\text{-}IDF_{word_i} + \ldots + TF\text{-}IDF_{word_n} \]
  • Early Lucene version of TF-IDF
    \[ TF\text{-}IDF(word \mid documents[j]) = Similarity(word \mid documents[j]) \]
    \[ Similarity(word \mid documents[j]) = \log\left( \frac{ \text{total number of documents in } documents }{ \text{number of documents in } documents \text{ that contain } word + 1 } \right) \times \sqrt{tf} \times \frac{1}{ \sqrt{ \text{length of } documents[j] } } \]
    Here tf is the raw number of occurrences of word in documents[j], matching Lucene's shorthand: log(numDocs / (docFreq + 1)) * sqrt(tf) * (1/sqrt(length)). A Python sketch of both the classic and the Lucene-style scores follows this list.
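
To make the formulas above concrete, here is a minimal Python sketch that follows them as written. The function names, the toy corpus, and the whitespace tokenization are illustrative assumptions, not any library's API.

```python
import math

def tf(word, doc_tokens):
    """Classic TF: occurrences of `word` in the document, divided by document length."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus_tokens):
    """Classic IDF: log(total number of documents / number of documents containing `word`)."""
    doc_freq = sum(1 for doc in corpus_tokens if word in doc)
    # Guard against a term that appears in no document (illustrative choice).
    return math.log(len(corpus_tokens) / doc_freq) if doc_freq else 0.0

def tf_idf(word, doc_tokens, corpus_tokens):
    """Relevance of one word to one document: TF * IDF."""
    return tf(word, doc_tokens) * idf(word, corpus_tokens)

def sentence_tf_idf(sentence_tokens, doc_tokens, corpus_tokens):
    """Relevance of a sentence to one document: sum of the per-word TF-IDF scores."""
    return sum(tf_idf(w, doc_tokens, corpus_tokens) for w in sentence_tokens)

def lucene_tf_idf(word, doc_tokens, corpus_tokens):
    """Early-Lucene-style score: log(numDocs / (docFreq + 1)) * sqrt(tf) * 1/sqrt(length)."""
    doc_freq = sum(1 for doc in corpus_tokens if word in doc)
    raw_tf = doc_tokens.count(word)  # raw occurrence count, not normalized by length
    return (math.log(len(corpus_tokens) / (doc_freq + 1))
            * math.sqrt(raw_tf)
            / math.sqrt(len(doc_tokens)))

if __name__ == "__main__":
    # Toy corpus; corpus_tokens[1] plays the role of documents[j].
    corpus_tokens = [doc.split() for doc in [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]]
    doc = corpus_tokens[1]
    print(tf_idf("dog", doc, corpus_tokens))                       # single word vs. documents[j]
    print(sentence_tf_idf("the dog".split(), doc, corpus_tokens))  # sentence vs. documents[j]
    print(lucene_tf_idf("dog", doc, corpus_tokens))                # early Lucene variant
```

The +1 in the Lucene denominator guards against a zero document frequency; apart from that, the functions are direct transcriptions of the formulas above.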



Reposted from www.cnblogs.com/johnnyzen/p/11298273.html