Data Analysis Study Notes, Day 30: Natural Language Processing with NLTK + Text Similarity and Classification (with Case) + TF-IDF (Term Frequency - Inverse Document Frequency) (with Case)

Text similarity and classification

  • Measuring the similarity between texts
  • Term frequency is used as the text feature
  • Term frequency: the number of times a word appears in a text
  • Word-frequency statistics implemented with NLTK (see the case below; a cosine-similarity sketch follows it)

Text similarity case:

import nltk
from nltk import FreqDist

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

text = text1 + text2 + text3 + text4 + text5
words = nltk.word_tokenize(text)
freq_dist = FreqDist(words)
print(freq_dist['is'])
# Output:
# 4


# Take the n = 5 most common words
n = 5
# Build the "common words" list
most_common_words = freq_dist.most_common(n)
print(most_common_words)
# Output:
# [('a', 4), ('movie', 4), ('is', 4), ('This', 2), ('That', 2)]



def lookup_pos(most_common_words):
    """Look up the position of each common word."""
    result = {}
    for pos, (word, count) in enumerate(most_common_words):
        result[word] = pos
    return result

# Record each word's position
std_pos_dict = lookup_pos(most_common_words)
print(std_pos_dict)
# Output (dict ordering may vary across runs and Python versions):
# {'movie': 0, 'is': 1, 'a': 2, 'That': 3, 'This': 4}


# New text
new_text = 'That one is a good movie. This is so good!'
# Initialize the frequency vector
freq_vec = [0] * n
# Tokenize
new_words = nltk.word_tokenize(new_text)

# Count frequencies of the common words in the new text
for new_word in new_words:
    if new_word in std_pos_dict:
        freq_vec[std_pos_dict[new_word]] += 1

print(freq_vec)
# Output:
# [1, 2, 1, 1, 1]
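The case above stops at the frequency vector; to actually measure similarity between two texts, a common choice is the cosine similarity of their frequency vectors. Below is a minimal sketch (not from the original notes) that reuses n, std_pos_dict, new_text, and text2 from the case above; the helper names text_to_vec and cosine_similarity are my own:

import math
import nltk

def text_to_vec(text, std_pos_dict, n):
    """Map a text to its frequency vector over the common-word positions."""
    vec = [0] * n
    for w in nltk.word_tokenize(text):
        if w in std_pos_dict:
            vec[std_pos_dict[w]] += 1
    return vec

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

vec_new = text_to_vec(new_text, std_pos_dict, n)
vec_2 = text_to_vec(text2, std_pos_dict, n)
print(cosine_similarity(vec_new, vec_2))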

Text Classification

TF-IDF (term frequency - inverse document frequency)

  • TF, Term Frequency, is the number of times a word appears in a document
  • IDF, Inverse Document Frequency, measures how important a word is: the fewer documents contain it, the higher its weight.
  • TF-IDF = TF * IDF

TF = (number of times the term appears in the document) / (total number of terms in the document)

IDF = log(total number of documents / number of documents containing the term)

  • Worked example:

Suppose the word "cat" appears 3 times in a document of 100 words; then TF = 3/100 = 0.03.

Suppose the corpus contains 10,000,000 documents in total, of which 1,000 contain "cat"; then IDF = log(10,000,000 / 1,000) = 4 (base-10 logarithm).

TF-IDF = TF * IDF = 0.03 * 4 = 0.12
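A quick sanity check of this arithmetic in Python (a minimal sketch, assuming the base-10 logarithm used in the example above):

import math

tf = 3 / 100                           # "cat" appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)   # base-10 log, as in the example: log10(10000) = 4
print(tf * idf)                        # 0.12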

  • Implementing TF-IDF with NLTK

TextCollection.tf_idf()

Case:

from nltk.text import TextCollection

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

# Build a TextCollection object
tc = TextCollection([text1, text2, text3, 
                        text4, text5])
new_text = 'That one is a good movie. This is so good!'
word = 'That'
tf_idf_val = tc.tf_idf(word, new_text)
print('TF-IDF value of {}: {}'.format(word, tf_idf_val))

Output:

TF-IDF value of That: 0.02181644599700369
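Note (not in the original notes): the TextCollection above was built from raw strings, so NLTK's tf() counts substring matches over characters (text.count(term) / len(text)), which is why the value is so small. The more typical usage passes tokenized texts; a sketch, assuming the same texts as above:

import nltk
from nltk.text import TextCollection

# Tokenize each document before building the collection
corpus = [nltk.word_tokenize(t) for t in [text1, text2, text3, text4, text5]]
tc_tok = TextCollection(corpus)

new_tokens = nltk.word_tokenize(new_text)
# tf is now computed over tokens rather than characters;
# with NLTK's natural-log IDF this should be roughly ln(5/2) / 12, about 0.076
print(tc_tok.tf_idf('That', new_tokens))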


Source: blog.csdn.net/qq_35456045/article/details/104084966