Sentiment Analysis of E-commerce Shopping Reviews

As online shopping has grown in popularity, consumer demand has risen sharply, creating great development opportunities for e-commerce platforms such as JD.com and Taobao. The same demand, however, has also spawned many more platforms and intense competition among them. Against this competitive backdrop, besides improving product quality and lowering prices, it has become increasingly important for platforms to understand the voice of the consumer. One very effective way to do this is to mine the text of consumer reviews, and the information obtained also helps manufacturers improve the competitiveness of the corresponding products.

1. Data Preparation

#-*- coding: utf-8 -*-
import pandas as pd

inputfile = 'huizong.csv'    # aggregated review file
outputfile = 'meidi_jd.txt'  # path for the extracted reviews
data = pd.read_csv(inputfile, encoding = 'utf-8')
data.head()

(Screenshot: output of data.head())
Next, list all the brand names present in the JD platform data.

data['品牌'].unique()

The result is an array of six brand names (dtype = object), corresponding to AO, Haier, Midea, Galanz, Vanward, and Macro.
So the data covers six brands in total; here we analyze only the review text for the Midea (美的) brand.

# Extract only the reviews (u'评论') where the brand (u'品牌') is Midea (u'美的')
data = data[[u'评论']][data[u'品牌'] == u'美的']

# Save the extracted reviews as a headerless file, then read them back
data.to_csv(outputfile, index = False, header = False, encoding = 'utf-8')
outdata_1 = pd.read_csv(outputfile, encoding = 'utf-8', header = None)
outdata_1.head()

(Screenshot: output of outdata_1.head())

2. Data Preprocessing

After obtaining the review text, we must first preprocess it. The text contains a large number of entries whose content has little or no value; if these were carried into word segmentation, word-frequency statistics, or even sentiment analysis, they would inevitably distort the analysis and compromise the quality of the results. So before using the data, we must preprocess the review text and strip out this worthless content.

2.1 Review deduplication

Text deduplication means removing the duplicated parts of the review text. To stop customers from going a long time without reviewing, some e-commerce platforms run a program that automatically fills in a review on the customer's behalf once the allotted time has passed, and these auto-filled reviews are mostly stock praise. Such reviews clearly have no analytical value, and because they recur in large numbers, they must be removed.

outputfile = 'meidi_jd_process_1.txt'  # path for the deduplicated reviews
l1 = len(outdata_1)
data_unique = pd.DataFrame(outdata_1[0].unique())
l2 = len(data_unique)
data_unique.to_csv(outputfile, index = False, header = False, encoding = 'utf-8')
print('%s comments in total, %s removed as duplicates.' % (l1, l1 - l2))

Result: 55,400 comments in total; 2,352 duplicate comments were removed.
We can use the value_counts function to count the repeated comments: it tells us how many times each review text recurs, and the comment with the highest count is most likely the platform's default review.

series_data = pd.Series(outdata_1[0])
fre_data = pd.DataFrame(series_data.value_counts())
fre_data.head(20)

Here we print only the first 20 rows. The comment 'Very satisfied, five stars' has the highest frequency, appearing 107 times in total, and should be the default review text. The second most frequent entry, appearing 75 times, is the review prompt itself (asking the user to describe, in 5-200 characters, how the product or one of its features helped them, or problems encountered during use); it should be a prompt comment or the text shown when the user is asked for a review.
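The article stops at identifying these template comments, but one could also drop them. Below is a minimal sketch, not from the original, that treats any comment repeated at least 30 times in the raw data as a platform template and removes it; the threshold of 30 is an arbitrary assumption.

counts = series_data.value_counts()
# Treat heavily repeated comments as platform templates (threshold is assumed)
template_comments = set(counts[counts >= 30].index)
data_clean = data_unique[~data_unique[0].isin(template_comments)]
print('%s template comments removed.' % (len(data_unique) - len(data_clean)))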

2.2 Word segmentation

In Chinese, only sentences and paragraphs can be delimited simply by obvious markers; words and phrases have blurry boundaries and no formal delimiter. Chinese text mining therefore starts with word segmentation: the process of recombining the continuous sequence of characters into a sequence of words according to certain rules. Here we use the jieba segmenter.
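The code below reads meidi_jd_neg.txt and meidi_jd_pos.txt, so it assumes the deduplicated reviews have already been split into negative and positive subsets; the article does not show that step. As one possible way to produce these two files, here is a minimal sketch using the SnowNLP library's sentiment score, with an assumed cut-off of 0.5:

# A sketch of one way to create the positive/negative files used below;
# this is not necessarily how the original files were produced.
import pandas as pd
from snownlp import SnowNLP  # pip install snownlp

reviews = pd.read_csv('meidi_jd_process_1.txt', encoding = 'utf-8', header = None)
scores = reviews[0].astype(str).apply(lambda s: SnowNLP(s).sentiments)  # score in [0, 1]

reviews.loc[scores >= 0.5, 0].to_csv('meidi_jd_pos.txt', index = False, header = False, encoding = 'utf-8')
reviews.loc[scores < 0.5, 0].to_csv('meidi_jd_neg.txt', index = False, header = False, encoding = 'utf-8')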

import jieba  # the jieba segmenter; install it separately (pip install jieba)

inputfile1 = 'meidi_jd_neg.txt'
inputfile2 = 'meidi_jd_pos.txt'
outputfile1 = 'meidi_jd_neg_cut.txt'
outputfile2 = 'meidi_jd_pos_cut.txt'

data1 = pd.read_csv(inputfile1, encoding = 'utf-8', header = None)  # read the data
data2 = pd.read_csv(inputfile2, encoding = 'utf-8', header = None)

mycut = lambda s: ' '.join(jieba.cut(s))  # simple custom segmentation function
data1 = data1[0].apply(mycut)  # apply() "broadcasts" the function over the column for speed
data2 = data2[0].apply(mycut)

data1.to_csv(outputfile1, index = False, header = False, encoding = 'utf-8')  # save the results
data2.to_csv(outputfile2, index = False, header = False, encoding = 'utf-8')
data1.head()

(Screenshot: output of data1.head() after segmentation)
From these results, words such as 'still', 'battery', 'no power', 'water heater', 'switch', and 'install' are segmented well, while items like 'will' and 'don't know' come out slightly less cleanly. Overall, jieba's segmentation is still very good, and most of the output matches Chinese reading habits.
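For the few terms jieba handles poorly, its user-dictionary API can steer the segmenter. A small illustrative sketch follows; the registered words and the sample sentence are only examples:

import jieba

# Register terms we want kept as single tokens (illustrative choices)
jieba.add_word('不知道')   # "don't know"
jieba.add_word('热水器')   # "water heater"

# An entire user dictionary (one term per line) can also be loaded:
# jieba.load_userdict('userdict.txt')

print(' '.join(jieba.cut('我不知道热水器什么时候安装')))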

2.3 Removing stop words

Stop words (Stop Words) are words or phrases filtered out before or after text processing. In search engines, to save storage space and improve retrieval efficiency, certain words or phrases are automatically ignored when indexing pages or handling search requests; these are the stop words. In natural language processing, stop words generally carry no valuable information, so we choose to remove them.

stoplist = 'stoplist.txt'

neg = pd.read_csv(outputfile1, encoding = 'utf-8', header = None)  # read the segmented reviews
pos = pd.read_csv(outputfile2, encoding = 'utf-8', header = None)
stop = pd.read_csv(stoplist, encoding = 'utf-8', header = None, sep = 'tipdm', engine = 'python')
# sep sets the delimiter. read_csv splits on the half-width comma by default, but that
# character appears in the stop-word list and would break the read, so the workaround
# is to set a delimiter that never occurs in the file, such as 'tipdm'.
stop = [' ', ''] + list(stop[0])  # pandas strips whitespace automatically, so add the space back by hand

neg[1] = neg[0].apply(lambda s: s.split(' '))  # split each review on spaces, broadcast with apply
neg[2] = neg[1].apply(lambda x: [i for i in x if i not in stop])  # check each token against the stop list
pos[1] = pos[0].apply(lambda s: s.split(' '))
pos[2] = pos[1].apply(lambda x: [i for i in x if i not in stop])
neg.head()

(Screenshot: output of neg.head() after stop-word removal)
We can see that after the filtering, the common stop words that appeared earlier have been removed.

3. LDA Topic Model Analysis

A topic model, in machine learning and natural language processing, is a statistical model for discovering the abstract topics in a collection of documents. When a document covers several topics, the particular words that represent each topic appear repeatedly; a topic model uses this to uncover the patterns of word use in the text, link texts with similar patterns together, and extract useful information from an unstructured collection. LDA (Latent Dirichlet Allocation) is one such model: an unsupervised generative probabilistic topic model.

# If gensim is not installed, you can install it with: !pip install gensim
from gensim import corpora, models

# Topic analysis of the negative reviews
neg_dict = corpora.Dictionary(neg[2])  # build the dictionary
neg_corpus = [neg_dict.doc2bow(i) for i in neg[2]]  # build the corpus
neg_lda = models.LdaModel(neg_corpus, num_topics = 3, id2word = neg_dict)  # train the LDA model

# Topic analysis of the positive reviews
pos_dict = corpora.Dictionary(pos[2])
pos_corpus = [pos_dict.doc2bow(i) for i in pos[2]]
pos_lda = models.LdaModel(pos_corpus, num_topics = 3, id2word = pos_dict)
pos_theme = pos_lda.show_topics()  # display the topics
pos_theme

(Screenshot: output of pos_lda.show_topics())
The output above shows the high-frequency feature words of the three topics found in the positive reviews.
To make the topics and their high-frequency feature words easier to inspect, we convert them into DataFrame format, first using a regular expression to extract the feature words.

import re
# Match runs of Chinese characters
pattern = re.compile(r'[\u4e00-\u9fa5]+')
# Feature words of the first topic
pattern.findall(pos_theme[0][1])

(Screenshot: feature words extracted for the first topic)

Then extract the feature words of every topic and convert them into a DataFrame:

# Extract the feature words of each topic
pos_key_words = []
for i in range(3):
    pos_key_words.append(pattern.findall(pos_theme[i][1]))
# Convert to DataFrame format
pos_key_words = pd.DataFrame(data=pos_key_words, index=['主题1', '主题2', '主题3'])
pos_key_words

(Screenshot: DataFrame of feature words for each topic)
We can see that topic 1 is mainly about water-heater installation and after-sales service, topic 2 about quality, price, and delivery, and topic 3 about installation, heating, and heat retention. In short, presenting the topic feature words in DataFrame format makes the key points of each topic, and the sentiment leanings of the reviews, very clear.
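The number of topics (three) was fixed by hand here. If you want to sanity-check that choice, gensim's CoherenceModel can score models trained with different topic counts; a brief sketch on the positive corpus, with an assumed search range of 2 to 5:

from gensim.models import CoherenceModel

# Higher coherence generally indicates more interpretable topics
for k in range(2, 6):
    lda_k = models.LdaModel(pos_corpus, num_topics = k, id2word = pos_dict)
    cm = CoherenceModel(model = lda_k, texts = pos[2], dictionary = pos_dict, coherence = 'c_v')
    print(k, cm.get_coherence())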

4. Summary

This article took the consumer review text for Midea brand water heaters on JD Mall, performed basic text preprocessing, Chinese word segmentation, and stop-word filtering, then built an LDA topic model to separate the review data by sentiment orientation and present the high-frequency feature words of each topic in DataFrame format.
You can fork the project at its source address: https://momodel.cn/explore/5d37d3ea1afd94479ffa37b0?type=app
References:
https://github.com/goto456/stopwords
https://github.com/fxsjy/jieba


Mo (URL: momodel.cn) is a Python-based online AI modeling platform that helps you quickly develop, train, and deploy models.


The Mo AI Club is initiated by the site's R&D and product design team and is committed to lowering the barrier to developing and using artificial intelligence. The team has experience in big-data processing and analysis, visualization, and data modeling, has undertaken multidisciplinary intelligence projects, and has full-stack design and development capability from the back end to the front end. Its main research directions are big-data management and analysis and artificial-intelligence technology, as a way to promote data-driven scientific research.
The club currently holds offline machine-learning salons in Hangzhou every week, and shares articles and academic exchanges from time to time. It hopes to bring together friends from all walks of life who are interested in artificial intelligence, to keep exchanging ideas and growing, and to promote the democratization and wider application of artificial intelligence.



Source: blog.csdn.net/weixin_44015907/article/details/97972429