[Data Analysis] Data Analysis Talent Competition 1: Visual Analysis of User Emotion

Table of contents

Background

Question data

1. Import and group the data

2. Text cleaning

3. Draw a word cloud

4. Sentiment analysis (SnowNLP calculates sentiment score)

5. Bar chart of sentiment scores under different themes

6. Top-10 frequency bar chart of sentiment words

7. Draw a frequency bar chart under different sentiment values

8. Draw a correlation coefficient heat map


Background

The competition topic centres on online public-opinion analysis: contestants are asked to analyse and visualise brand issues from user comments. This write-up uses the question to walk through commonly used data-visualisation charts and data-analysis methods, and to run an exploratory analysis of the content of interest.

Question data

Data source: earphone_sentiment.csv — 10,000+ comments on earphones from industry users

Link: https://pan.baidu.com/s/1wlHzYVi2QO9xGfisD-al8A?pwd=myfb 
Extraction code: myfb

1. Import and group the data

# Import modules
# _*_ coding:utf-8 _*_
import pandas as pd 
import numpy as np 
from collections import defaultdict
import os
import re
import jieba
import codecs

data=pd.read_csv("earphone_sentiment.csv")
data.head(10)

# Split the comments by their labelled sentiment value
s1 = data[data['sentiment_value'] == 1]   # positive
s2 = data[data['sentiment_value'] == 0]   # neutral
s3 = data[data['sentiment_value'] == -1]  # negative

print(s2['content'])

1 The distortion of the left channel of this HD650 at 1k is about 6 times that of the right channel, which is also beyond the official specification parameter range (0.05%). See...
3 Consumers of bose, beats, and apple don't know the existence of curves at all
5 I think anyone can clearly distinguish the difference between high-end headphones without making a sound. After all, the wearing feeling is different, and it is not possible to achieve blind listening
6 Hearing a difference is one thing; hearing which one is better or worse is a much higher bar.
7 Can anyone tell which one is the most expensive among the 10 power cords?
                               ...                        
17170 can push the high frequency of HD650 to be harsh. What kind of weird system is this?
17172 hd800 cracked skin is normal, if you change the cable, you don’t have to worry about it
17173 You can just solder it yourself. By the way, my 820 original wire is brand new, and the 800s original wire is 99 new. I put it in the box without moving it.
17174 So before it explodes, hurry up.
17175 sommer black refers to his own diy two-meter thread, the cost is about 600, and the original thread is hung
Name: content, Length: 12210, dtype: object

2. Text cleaning

Stop word document download link:

Link: https://pan.baidu.com/s/1jXV_aHcbQrWES78FoIhU-g?pwd=bceb 
Extraction code: bceb

with open('stop_word/HGD_StopWords.txt', 'r', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)


# Clean the comment text
segs = data['content']

def clean_data(content):
    """Tokenise each comment with jieba, then drop stopwords,
    blanks and single-character tokens."""
    words = ''
    for seg_text in content:
        for seg in jieba.cut(seg_text):
            if seg not in stopwords and seg != ' ' and len(seg) != 1:  # text cleaning
                words = words + seg + ' '
    return words

print(clean_data(s1['content']))
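The filtering rule inside `clean_data` can be checked in isolation; a minimal sketch, where the token list stands in for jieba's output and the stopword set is made up:

```python
def clean_tokens(tokens, stopwords):
    """Drop stopwords, blanks and single-character tokens, join with spaces."""
    return ' '.join(t for t in tokens
                    if t not in stopwords and t.strip() and len(t) != 1)

# Tokens as jieba.cut might yield them (invented for illustration)
print(clean_tokens(['耳机', '的', '音质', ' ', '好'], {'的'}))  # 耳机 音质
```

The single-character filter is what drops '好' here, which is also why very short sentiment words disappear from the word clouds below.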

3. Draw a word cloud 

import wordcloud as wc
import matplotlib.pyplot as plt
from PIL import Image

def Cloud_words(s, words):
    # Font that supports Chinese
    font = r"C:/Windows/Fonts/simhei.ttf"
    mask = np.array(Image.open('词云2.png'))
    # Colour generator from the mask (not used below; pass color_func=image_colors
    # to WordCloud to recolour the words from the mask image)
    image_colors = wc.ImageColorGenerator(mask)

    # Generate the word cloud from the cleaned text
    cloud = wc.WordCloud(font_path=font,            # font
                         background_color='black',  # black background
                         height=600,
                         width=900,
                         scale=20,                  # up-scaling factor for sharper output
                         prefer_horizontal=0.2,     # 20% of words laid out horizontally
                         mask=mask,                 # mask image
                         max_font_size=100,         # largest font size
                         max_words=2000,            # at most 2000 words shown
                         relative_scaling=0.3,      # how strongly frequency drives font size
                         )
    # Draw and save the word cloud
    mywc = cloud.generate(words)
    plt.imshow(mywc)
    plt.axis('off')
    mywc.to_file(str(s) + '.png')

# One word cloud per sentiment group (saved as 0.png, 1.png, 2.png)
s = [s1, s2, s3]
for i in range(len(s)):
    Cloud_words(i, clean_data(s[i]['content']))

# Draw a styled word cloud
import stylecloud
clouds = stylecloud.gen_stylecloud(
        text=clean_data(s1['content']),               # cleaned, space-joined tokens
        size=512,
        font_path='C:/Windows/Fonts/msyh.ttc',        # font that supports Chinese
        palette='cartocolors.qualitative.Pastel_7',   # palette chosen from palettable
        gradient='horizontal',                        # horizontal colour gradient
        icon_name='fab fa-weixin',                    # mask icon from Font Awesome
        output_name='test_ciyun.png')                 # output file

  • Here my word cloud did not take the shape of the background image; the mask material is probably at fault. To be revisited.
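A frequent cause is the mask itself: wordcloud treats pure-white pixels as empty space, so a mask with a transparent or off-white background is effectively ignored. A minimal sketch (the helper name `to_mask` is mine) that flattens any transparency onto white before building the mask array:

```python
import numpy as np
from PIL import Image

def to_mask(img):
    """Flatten any alpha channel onto a white background so that
    wordcloud's 'white = leave empty' convention applies."""
    img = img.convert('RGBA')
    white = Image.new('RGBA', img.size, (255, 255, 255, 255))
    return np.array(Image.alpha_composite(white, img).convert('RGB'))

# mask = to_mask(Image.open('词云2.png'))  # then pass mask=mask to WordCloud
```

If the mask still has no effect, check that the shape's interior is dark (non-white) pixels rather than an outline on a white canvas.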

4. Sentiment analysis (SnowNLP calculates sentiment score)

# The sentiment score is a float in (0, 1): the closer to 1, the more positive;
# the closer to 0, the more negative.
# SnowNLP computes the sentiment scores
from snownlp import SnowNLP
# Sentiment analysis of the comments
with open('stop_word/HGD_StopWords.txt', 'r', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)
total = 0
count = 0
for i in range(len(data['content'])):
    line = jieba.cut(data.loc[i, 'content'])       # tokenise
    words = ''
    for seg in line:
        if seg not in stopwords and seg != ' ':    # text cleaning
            words = words + seg + ' '
    if len(words) != 0:
        print(words)
        d = SnowNLP(words)
        print('{}'.format(d.sentiments))           # score of this comment
        data.loc[i, 'sentiment_score'] = float(d.sentiments)  # add a score column
        total += d.sentiments
        count += 1
score = total / count
print('finalscore={}'.format(score))               # overall mean sentiment score

# Save the scored data to a new csv file
data.to_csv('result.csv', encoding='gbk', header=True)

# Use the sentiment scores computed above
# Unique values of the subject column
a = list(data['subject'].unique())
sum_scores = dict()
# Mean sentiment score per subject
for r in range(len(a)):
    de = data.loc[data['subject'] == a[r]]
    sum_scores[a[r]] = round(de['sentiment_score'].mean(), 2)
print(sum_scores)
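The per-subject loop above is equivalent to a single pandas groupby; a minimal sketch on toy data (the column names match the competition file, the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'subject': ['价格', '价格', '音质'],
    'sentiment_score': [0.9, 0.5, 0.4],
})

# Mean sentiment score per subject, rounded as in the loop version
sum_scores = df.groupby('subject')['sentiment_score'].mean().round(2).to_dict()
print(sum_scores)  # {'价格': 0.7, '音质': 0.4}
```

groupby also avoids scanning the whole frame once per subject, which matters less here than its brevity.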

5. Bar chart of sentiment scores under different themes

# Bar chart of sentiment scores by subject
import seaborn as sns
import matplotlib.pyplot as plt
# These two lines let matplotlib render Chinese labels and minus signs
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Visualise
sns.barplot(x=list(sum_scores.values()), y=list(sum_scores.keys()))
plt.xlabel('情感值')
plt.ylabel('主题')
plt.title('不同主题下的情感得分柱形图')
for x, y in enumerate(sum_scores.values()):
    plt.text(y, x, y, va='center')  # annotate each bar with its mean score
plt.show()

  

6. Top-10 frequency bar chart of sentiment words

# Frequency bar chart of the top-10 sentiment words
bar = data['sentiment_word'].value_counts().head(10)
print(bar)
labels = bar.index
sns.barplot(x=bar.values, y=labels)
plt.xlabel('频数')
plt.ylabel('情感词')
plt.title('不同情感词下Top10频数柱形图')
for x, y in enumerate(bar.values):
    plt.text(y + 0.2, x, y, va='center')  # annotate each bar with its count
plt.show()

7. Draw a frequency bar chart under different sentiment values

# Bar chart of comment counts per sentiment value
# bar carries both the index (the values) and the counts
bar = data['sentiment_value'].value_counts()

sentiment_score = dict(bar)
print(sentiment_score)

sns.barplot(x=list(sentiment_score.keys()), y=list(sentiment_score.values()))
plt.xlabel('情感值')
plt.ylabel('频数')
plt.title('不同情感值下的频数柱形图')
# Bars are positioned by list order, not by the sentiment value itself,
# so enumerate gives the correct x position for each annotation
for x, y in enumerate(sentiment_score.values()):
    plt.text(x, y, y, ha='center', va='bottom', fontsize=12)
plt.show()

 8. Draw a correlation coefficient heat map 

# Draw the heat map
import seaborn as sns
# Group by subject and sentiment value, taking the mean sentiment score of each cell
w = data.groupby(['subject', 'sentiment_value'], as_index=False)['sentiment_score'].mean()
print(w)
print('****************')

# Pivot to a subject x sentiment_value table before printing and plotting
data_heatmap = w.pivot(index='subject', columns='sentiment_value', values='sentiment_score')
print(data_heatmap)

ax = sns.heatmap(data_heatmap, vmax=1, cmap='rainbow', annot=True, linewidths=0.05)
ax.set_title('耳机评论热力图', fontsize=20)
plt.show()

Origin blog.csdn.net/m0_51933492/article/details/127395156