Product review analysis 2

Continuation of the previous post: https://blog.csdn.net/m0_49621298/article/details/107603652

2. Data processing

1. Remove newlines, spaces, and other whitespace from the comment field.
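The post shows no code for this step; a minimal sketch, assuming the comments sit in a pandas DataFrame column named 评论 (as in the later snippets) loaded from a hypothetical input file:

import pandas as pd

data = pd.read_csv('./file/comments.csv')  # hypothetical input file
# Drop newlines/tabs and collapse runs of whitespace in the comment field
data['评论'] = (data['评论'].astype(str)
               .str.replace(r'[\r\n\t]', '', regex=True)
               .str.replace(r'\s+', ' ', regex=True)
               .str.strip())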

2. Use jieba for word segmentation; the custom lexicon needs to be designed for the specific business scenario.

import jieba

jieba.load_userdict('./file/自定义词库.txt')  # load the custom lexicon once, not inside the loop
comment = data['评论'].tolist()
commentArr = []  # tokenized comments
for i in range(len(comment)):
    cut = jieba.cut(comment[i])
    commentArr.append(list(cut))

3. Process the segmented results: keep only Chinese characters, unify synonyms, strip punctuation marks, and so on.

import re

pattern = re.compile(r'[^\u4e00-\u9fa5]')  # matches anything that is not a Chinese character
chinese = []
for review in commentArr:
    new_review = []
    for token in review:
        new_token = re.sub(pattern, '', token)  # keep Chinese characters only
        new_token = new_token.replace('安装师傅', '师傅')  # unify synonyms
        if new_token != '':
            new_review.append(new_token)
    chinese.append(new_review)  # append once per review (indentation fixed)

4. Build the required stop-word list and use it to remove stop words.

stopwordArr = []  # stop words
with open('./file/停用词.txt', 'r', encoding='utf-8') as f:
    stopwordArr = [line.replace('\n', '') for line in f.readlines()]
# print(stopwordArr)
no_stopword = []  # reviews with stop words removed
for i in range(len(chinese)):
    temparr = []
    for j in range(len(chinese[i])):
        if chinese[i][j] not in stopwordArr:
            temparr.append(chinese[i][j])
    no_stopword.append(temparr)

5. Save the processed data to a file for later use.
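No code is shown for this step; one possible sketch (the output path is an assumption) stores the cleaned tokens in a clean_word column, which the analysis section reads back later:

data['clean_word'] = no_stopword  # one token list per comment
data.to_csv('./file/clean_comments.csv', index=False, encoding='utf-8')  # hypothetical path

After a CSV round trip the token lists come back as strings like "['词1', '词2']", which is why the training-set code further down strips brackets and quotes before splitting.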

3. Data Analysis

1. Use the comment counts per sub-category to see which models sell best.

import numpy as np

xilei = {}  # sub-category -> [comment count, merged comments]
for i in data['细类'].unique():
    indices = np.where(data['细类'] == i)
    xilei[i] = [len(indices[0]), 0]
    gailei = np.take(data['clean_word'].values, indices)
    sumgailei = np.sum(gailei)  # concatenate all comments under this sub-category
    xilei[i][1] = sumgailei

The Top 10 are as follows.
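A minimal sketch for producing that Top 10 from the xilei dict built above:

# Sort sub-categories by comment count, descending, and take the first ten
top10 = sorted(xilei.items(), key=lambda kv: kv[1][0], reverse=True)[:10]
for name, (count, _) in top10:
    print(name, count)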

2. Extract keywords to find out what buyers care about.

import jieba.analyse

output = []  # [sub-category, comment count, top-10 keywords]
for k in xilei.keys():
    temparr = []
    tfidf = jieba.analyse.extract_tags(str(xilei[k][1]), topK=10, withWeight=True)
    temparr.append(k)
    temparr.append(xilei[k][0])
    temparr.append([i[0] for i in tfidf])
    output.append(temparr)

Looking at the top 10 high-frequency words, beyond the usual praise a few words stand out, such as "physical store", "garbage", and "Jianyi". "Jianyi" refers to a characteristic of the product, so it is clearly something users focus on. Checking the original comments containing "physical store" turned up two things: first, offline physical stores are gradually expanding, with "just opened" appearing in many comments; second, many buyers describe experiencing the product in a store before placing the order. "Garbage" is the real outlier: most of these comments actually mention that the installers took the initiative to clean up the garbage after finishing, so this service detail has also caught users' attention.

3. Sentiment classification

(1) SnowNLP

First try SnowNLP, and use it to pick out negative comments along the way (for various reasons, negative reviews of products sold online are generally scarce and the classes unbalanced); these then serve as training data for the other algorithms.

from snownlp import SnowNLP

def fenlei(text):
    s = SnowNLP(text)
    return s.sentiments  # sentiment score in [0, 1]; lower means more negative

data['评论分类'] = data['评论'].apply(fenlei)

SnowNLP scores each comment between 0 and 1; the lower the score, the more negative the comment. Reading the first 20 entries in ascending order of score did surface some genuinely bad reviews, but also several misjudgments, 4 of which involved the high-frequency word "garbage" mentioned above. In this data set the word mostly refers to garbage taken away by the installer, but it is a derogatory term in the third-party library's model. Removing the word "garbage" before scoring did improve the results; sure enough, data analysis has to be grounded in the business. At first glance, comments scoring above roughly 0.000025 can all be counted as positive.
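A minimal sketch of that cutoff, assuming the scores live in the 评论分类 column created above (the column name 预测标签 is invented here):

THRESHOLD = 0.000025  # rough cutoff observed above
data['预测标签'] = data['评论分类'].apply(
    lambda score: '正面' if score > THRESHOLD else '负面')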

Confusion matrix: the focus here is the negative class (the rare one), with recall = 72.7%, precision = 88.9%, F1 score = 80.0%. Not too bad. Also, hand-labeling is genuinely tiring! Besides data processing, labeling is the other big, unavoidable chunk of data-analysis work.

Why focus on negative reviews? Because some positive reviews are generated automatically by the system when the review window expires, the follow-up negative comments that customers add later can be buried without any response from customer service. Focusing on these negative reviews allows timely remediation; although they are few in number, word of mouth spreads from one to ten, and from ten to a hundred.

Confusion matrix      Predicted negative    Predicted positive
Actual negative       16                    6
Actual positive       2                     1954
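As a quick sanity check, the reported metrics follow directly from this matrix:

# The negative class is the class of interest here
tp, fn, fp = 16, 6, 2
recall = tp / (tp + fn)           # 16/22 ≈ 72.7%
precision = tp / (tp + fp)        # 16/18 ≈ 88.9%
f1 = 2 * precision * recall / (precision + recall)  # ≈ 80.0%
print(f'recall={recall:.1%}, precision={precision:.1%}, F1={f1:.1%}')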

(2) Try Naive Bayes and SVM for classification

There are few negative labels, so pick them all out:

neg = data[data['手工标签'] == "负面"]['clean_word'].tolist()
# print(len(neg))  # 22
negArr = []
for i in range(len(neg)):
    # clean_word came back from CSV as the string form of a list, e.g. "['词1', '词2']";
    # strip the brackets and quotes, then split back into tokens
    neg[i] = neg[i].replace('[', "")
    neg[i] = neg[i].replace(']', "")
    neg[i] = neg[i].replace("'", "")
    temparr = neg[i].split(',')
    negArr.append(temparr)

Randomly select 30 positive-labeled comments for balance:

import random

posindex = np.where(data['手工标签'] == "正面")  # indices of positive labels
# print(type(posindex[0]))  # <class 'numpy.ndarray'>
randompos = random.sample(list(posindex[0]), 30)  # randomly sample 30 positives
pos = data.loc[randompos]['clean_word'].tolist()
posArr = []
for i in range(len(pos)):
    pos[i] = pos[i].replace('[', "")
    pos[i] = pos[i].replace(']', "")
    pos[i] = pos[i].replace("'", "")
    temparr = pos[i].split(',')
    posArr.append(temparr)

Construct the training set:

classVec = [0 if i < 30 else 1 for i in range(30 + 22)]  # labels: 1 marks a negative review
posnegList = posArr[:]
posnegList.extend(negArr)  # positives first, then negatives, matching classVec

def features(words):
    return dict([(word, True) for word in words])  # NLTK expects {feature: value} dicts

train = []  # training set as (features, label) pairs
for i in range(len(posnegList)):
    train.append((features(posnegList[i]), classVec[i]))

Bayesian training with NLTK; note the input format it requires, a list of (feature-dict, label) pairs:

from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train)
classifier.show_most_informative_features(10)  # prints its own table; wrapping it in print() only adds a trailing None
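To close the loop, a hedged usage sketch: classify a new, already-tokenized comment with the trained model (the sample tokens are invented for illustration):

new_comment = ['师傅', '态度', '差']  # hypothetical tokenized comment
print(classifier.classify(features(new_comment)))  # 1 = negative, 0 = positive

The heading also mentions SVM, which this part of the post does not show; a minimal sketch of that half with scikit-learn (DictVectorizer plus LinearSVC are assumptions, not the author's code) could look like:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Vectorize the same {word: True} feature dicts and train a linear SVM
vec = DictVectorizer()
X = vec.fit_transform([features(words) for words in posnegList])
svm = LinearSVC()
svm.fit(X, classVec)
print(svm.predict(vec.transform([features(new_comment)])))  # array([1]) means negative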
