Python data analysis based on the sentiment of words (jieba library)

Keyword Extraction

There are generally two ways to remove punctuation: deleting stop words (Stop Words), or extracting keywords according to part of speech.


# Segment the sample review with jieba; words1 is recovered here from the printed result below.
import jieba

words1 = "速度快,包装好,看着特别好,喝着肯定不错!价廉物美"
words2 = jieba.cut(words1)
words3 = list(words2)
print("/".join(words3))
# 速度/快/,/包装/好/,/看着/特别/好/,/喝/着/肯定/不错/!/价廉物美

# Remove punctuation by filtering against a stop-word list.
stop_words = [",", "!"]
words4 = [x for x in words3 if x not in stop_words]
print(words4)
# ['速度', '快', '包装', '好', '看着', '特别', '好', '喝', '着', '肯定', '不错', '价廉物美']
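In practice the stop-word list is usually much longer and loaded from a file. Here is a minimal sketch, assuming a hypothetical one-entry-per-line file named stopwords.txt:

# Sketch: load stop words from a file (the filename is a hypothetical example).
with open('stopwords.txt', encoding='utf-8') as f:
    stop_words = {line.strip() for line in f if line.strip()}

words4 = [x for x in words3 if x not in stop_words]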

Another way to optimize word segmentation results is to extract keywords based on part of speech. The advantage of this method is that jieba tags each word with its part of speech, so there is no need to prepare a stop-word list in advance.

For reference, here is the part-of-speech tag table for jieba's paddle mode (paddle is Baidu's open-source deep learning platform, and jieba can use paddle's model library). Based on the part-of-speech tags that jieba assigns automatically, you can manually remove auxiliary words, function words, and punctuation.
[Figure: paddle mode part-of-speech tag table]
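As a minimal sketch of enabling paddle mode (my own illustration; it requires the paddlepaddle package to be installed, while the plain psg.cut() used below works without it):

# Sketch: tag words in paddle mode to inspect the flags before deciding what to remove.
import jieba
import jieba.posseg as psg

jieba.enable_paddle()  # switch jieba to paddle mode
for pair in psg.cut("速度快,包装好", use_paddle=True):
    print(pair.word, pair.flag)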


# words5: remove punctuation based on part of speech.
import jieba.posseg as psg

words5 = [(w.word, w.flag) for w in psg.cut(words1)]
# Keep only adjectives (flag 'a').
saved = ['a']
words5 = [x for x in words5 if x[1] in saved]
print(words5)
# [('快', 'a'), ('好', 'a'), ('好', 'a'), ('不错', 'a')]

Semantic Sentiment Analysis

For sentences that have already been segmented into words, we need another library to measure the positive and negative sentiment of the words: the snownlp library.

Because of how snownlp's own algorithm works, it can misclassify negated words. For example, snownlp will split "不喜欢" ("dislike") into two separate words, "不" ("not") and "喜欢" ("like"), which introduces a large error into the computed sentiment. Therefore, we first segment the text with jieba, and only then run snownlp's sentiment analysis on the segmented result.
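A quick sketch of the issue (the exact numbers depend on snownlp's bundled model, so the printed values are illustrative):

from snownlp import SnowNLP

s = SnowNLP("不喜欢")
print(s.words)       # snownlp's own segmentation of the phrase
print(s.sentiments)  # sentiment score computed on that segmentation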


# Score the whole jieba-segmented sentence with snownlp.
from snownlp import SnowNLP

words6 = [x[0] for x in words5]  # the adjectives kept above, used further below
s1 = SnowNLP(" ".join(words3))   # join jieba's tokens with spaces
print(s1.sentiments)
# 0.99583439264303

This code relies on snownlp's Bayes model training: the module reads its bundled positive and negative samples into memory, and then the classify() function of the Bayes model performs the classification that yields the sentiments attribute. The value of sentiments indicates the direction of sentiment: in snownlp, if the orientation is positive, the result of sentiments will be close to 1; if it is negative, the result will be close to 0.
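The sentiments property is backed by the module-level classify() in snownlp.sentiment, so the same positive-class probability can also be obtained directly, as in this small sketch:

from snownlp import sentiment

print(sentiment.classify("价廉物美"))  # probability of the positive class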

# Score each kept word individually and tally the sentiment direction.
positive = 0
negative = 0
for word in words6:
    s2 = SnowNLP(word)

    if s2.sentiments > 0.7:  # 0.7 is the cutoff chosen here for "positive"
        positive += 1
    else:
        negative += 1

    print(word, str(s2.sentiments))
print(f"正向评价数量:{positive}")
print(f"负向评价数量:{negative}")
# 快 0.7164835164835165
# 好 0.6558628208940429
# 好 0.6558628208940429
# 不错 0.8612132352941176
# 价廉物美 0.7777777777777779
# 正向评价数量:3
# 负向评价数量:2

In snownlp, after a model has been trained and saved through the train() and save() functions, the default dictionary can effectively be extended. In my own work, I also use this method to add sentiment analysis for emoji, which further improves the accuracy of snownlp's sentiment analysis.

from snownlp import sentiment

sentiment.train('neg2.txt', 'pos2.txt')  # train on user-defined negative/positive sample files
sentiment.save('sentiment2.marshal')     # save the trained model
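To use the retrained model in a later session, one option is to load it into the module-level classifier. This is a sketch under my own assumptions about the saved filename (snownlp's save() and load() share the same iszip default); another common approach is to point the data_path in snownlp/sentiment/__init__.py at the new file.

from snownlp import sentiment

sentiment.classifier.load('sentiment2.marshal')  # assumption: same base filename round-trips
print(sentiment.classify("价廉物美"))  # now scored by the retrained model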

Origin: blog.csdn.net/david2000999/article/details/121516770