Graduation Project ---- Sentiment Analysis of Hotel Reviews Based on Deep Learning

Overview

This article takes roughly 7,000 Ctrip hotel reviews as text data, imports them into the Keras framework, and trains a model that can be used to predict the sentiment of real-world reviews.

Required modules for the project

import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
# import the jieba word-segmentation library
import jieba
import re

Data

Data overview

More than 7,000 Ctrip hotel reviews, more than 5,000 positive reviews, and more than 2,000 negative reviews.

Field description
  • Number of comments (overall): 7766
  • Number of comments (positive): 5322
  • Number of comments (negative): 2444


Data processing

# read the data
data = pd.read_csv("/home/kesci/input/labelreview5456/ChnSentiCorp_htl_all.csv")
# look at the first 5 rows
data.head()


Word segmentation
# Remove punctuation and digits.
# The usual approach is a regular expression; a hand-written
# character-replacement function would also work.

# pattern string: ASCII and full-width Chinese punctuation, whitespace, digits
pattern = r"[!\"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~—!,。?·¥、《》···【】:" "''\s0-9]+"
re_obj = re.compile(pattern)

# replacement function -- strips punctuation and digits
def clear(text):
    return re_obj.sub('', text)

# apply the regex replacement to every row
data["review"] = data["review"].apply(clear)
# check the first 5 rows after replacement
data["review"][:5]
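As a quick standalone check, a simplified version of the character class (a sketch, not the exact pattern above) behaves like this:

```python
import re

# Simplified version of the cleaning pattern: ASCII punctuation,
# common full-width Chinese punctuation, whitespace, and digits.
pattern = r"[!\"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~—!,。?·¥、《》【】:""''\s0-9]+"
re_obj = re.compile(pattern)

def clear(text):
    # replace every punctuation/digit run with the empty string
    return re_obj.sub('', text)

print(clear("房间很好, 性价比高!! 123"))  # -> 房间很好性价比高
```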


Segmentation uses jieba's precise mode, with HMM (Hidden Markov Model) processing enabled for unknown words.

def cut_words(words):
    return jieba.lcut(words)  # lcut segments a string into a list of words

# apply processes each row of the Series
data["review"] = data["review"].apply(cut_words)
data["review"][:5]


Stop word processing
# use a Chinese stop-word list
stop_words = "/home/kesci/work/stopwords-master/stopwords.txt"
with open(stop_words, encoding='utf-8') as f:
    stop_list = {i.strip() for i in f}  # read the stop words into a set for fast lookup


def remove_stop(words):  # stop-word removal function
    texts = []

    for word in words:  # iterate over every word in the token list
        if word not in stop_list:  # keep the word only if it is not a stop word
            texts.append(word)

    return texts


data['review'] = data['review'].apply(remove_stop)
# check the first 5 rows
data["review"][:5]
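The filtering step can be sketched standalone with a tiny in-memory stop list (hypothetical; the project reads stopwords.txt instead):

```python
# Hypothetical miniature stop list standing in for the full Chinese
# stop-word file used by the project.
stop_list = {"的", "了", "很", "也"}

def remove_stop(words):
    # keep only the tokens that are not in the stop list
    return [w for w in words if w not in stop_list]

tokens = ["房间", "很", "干净", "服务", "也", "不错"]
print(remove_stop(tokens))  # -> ['房间', '干净', '服务', '不错']
```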


Sample balance
data["label"].value_counts().plot(kind='bar')
plt.text(0, 6000, str(data["label"].value_counts()[1]),
         ha='center', va='top')
plt.text(1, 3000, str(data["label"].value_counts()[0]),
         ha='center', va='top')
plt.ylim(0, 6500)
plt.title('Number of positive and negative samples')
plt.show()


The bar chart shows that the data set contains 7,766 rows in total: 5,322 positive samples (label = 1) and 2,444 negative samples (label = 0), with no duplicate data.

There is clearly a serious class-imbalance problem. Two balancing strategies are considered here:
(1) undersampling: 2,000 positive and 2,000 negative samples, 4,000 in total;
(2) oversampling: 3,000 positive and 3,000 negative samples, 6,000 in total.

To reduce the amount of computation and to compare the two strategies fairly, the full data set is preprocessed first, and the balanced sampling is performed afterwards.

def get_balanced_words(size,
                       positive_comment=data[data['label'] == 1],
                       negative_comment=data[data['label'] == 0]):
    word_size = size // 2
    # number of positive / negative reviews
    num_pos = positive_comment.shape[0]
    num_neg = negative_comment.shape[0]
    # When a class has fewer rows than size/2, oversample it; otherwise undersample.
    # pandas' sample() draws with replacement (oversampling) when replace=True;
    # the default is replace=False.
    balanced_words = pd.concat([
        positive_comment.sample(word_size,
                                replace=num_pos < word_size,
                                random_state=0),
        negative_comment.sample(word_size,
                                replace=num_neg < word_size,
                                random_state=0)
    ])
    # print the sample counts
    print('total samples:', balanced_words.shape[0])
    print('positive samples:', balanced_words[balanced_words['label'] == 1].shape[0])
    print('negative samples:', balanced_words[balanced_words['label'] == 0].shape[0])
    print('')
    return balanced_words
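A standalone sketch of the same sampling logic on synthetic labels shows how `sample(replace=...)` realizes both strategies:

```python
import pandas as pd

# Synthetic stand-in for the review DataFrame: 5322 positive, 2444 negative
data = pd.DataFrame({"label": [1] * 5322 + [0] * 2444})
pos = data[data["label"] == 1]
neg = data[data["label"] == 0]

def get_balanced(size):
    half = size // 2
    # replace=True enables sampling with replacement (oversampling)
    # whenever a class holds fewer rows than half the requested size
    return pd.concat([
        pos.sample(half, replace=len(pos) < half, random_state=0),
        neg.sample(half, replace=len(neg) < half, random_state=0),
    ])

under = get_balanced(4000)  # strategy (1): both classes undersampled
over = get_balanced(6000)   # strategy (2): negatives (2444 < 3000) oversampled
print(len(under), len(over))
```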

Establish a multi-layer perceptron classification model

The model summary shows four layers: the flatten layer, which can be regarded as the input layer here, has 1,600 units; the hidden layer has 256 neurons; the output layer has a single neuron. There are 474,113 trainable parameters. Generally, the more parameters a model has, the more complex it is and the more time training takes.
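This description is consistent with an Embedding → Flatten → Dense stack. A hedged reconstruction follows; the vocabulary size of 2,000, the 32-dimensional embedding, and the sequence length of 50 are assumptions, chosen only because they reproduce the 1,600-unit flatten layer and the 474,113-parameter total:

```python
import tensorflow.keras as keras

# Assumed shapes: 50 tokens x 32-dim embedding = 1600 flattened inputs;
# 2000*32 + (1600*256 + 256) + (256*1 + 1) = 474,113 trainable parameters.
model = keras.Sequential([
    keras.layers.Input(shape=(50,)),              # padded token-id sequence
    keras.layers.Embedding(2000, 32),             # embedding lookup
    keras.layers.Flatten(),                       # 1600 units ("input layer")
    keras.layers.Dense(256, activation="relu"),   # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # output: positive probability
])
model.summary()
```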

Training model
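The screenshots showed the training run; a minimal sketch of a typical compile/fit/evaluate loop for this model (the data here is a random stand-in; in the project, `x_train`/`y_train` would come from the balanced sample after tokenization and padding):

```python
import numpy as np
import tensorflow.keras as keras

# Random stand-in data with the assumed shapes (token ids < 2000, length 50)
x_train = np.random.randint(0, 2000, size=(200, 50))
y_train = np.random.randint(0, 2, size=(200,))

model = keras.Sequential([
    keras.layers.Input(shape=(50,)),
    keras.layers.Embedding(2000, 32),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Binary sentiment -> binary cross-entropy loss; Adam is a common default
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train, epochs=2, batch_size=32,
                    validation_split=0.2, verbose=0)

# Evaluation as in the "detection results" section (here on the training
# stand-in, since this sketch has no real held-out split)
loss, acc = model.evaluate(x_train, y_train, verbose=0)
print(f"loss={loss:.4f}, accuracy={acc:.4f}")
```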


Model accuracy and prediction results


input_text = """
去之前会有担心,因为疫情,专门打了电话给前台,前台小哥哥好评,耐心回答,打消了我的顾虑,nice!! 
看得出有做好防疫情清洁消毒工作,前台登记反复询问,确保出行轨迹安全,体温测量登记,入住好评,选了主题房,设计是我喜欢的.
总之下次有需要还是会自住或推荐!!
"""

predict_review(input_text)
result: 正面评价! (positive review!)

This concludes the sentiment analysis of Ctrip hotel reviews with a simple multi-layer perceptron model. Due to space limits, subsequent model optimization and comparison with other deep learning models are not covered here; interested readers can look out for follow-up articles. Thank you!

Finally

How to get help and the source code of this project



Origin blog.csdn.net/HUXINY/article/details/110243973