Sentiment Analysis of Hotel Chinese Reviews Based on Python

Sentiment polarity analysis, also called sentiment classification, analyzes and summarizes text that carries subjective emotion. There are two main approaches: methods based on sentiment knowledge and methods based on machine learning. Sentiment-knowledge methods compute the polarity (positive or negative) of a text from existing sentiment dictionaries, judging its category by counting the positive and negative sentiment words it contains or by summing their sentiment scores. Machine-learning methods use a learning algorithm to train a classification model on a data set labeled with sentiment categories, and then use that model to predict the sentiment category of new text. This project uses machine learning to classify the sentiment of hotel review data, with the model construction and prediction implemented in Python. It does not cover the theory; the goal is to understand and implement Chinese sentiment polarity analysis step by step through practice.

1 Development environment preparation

1.1 Python environment

Download the Python version that matches your computer from the official website (Download Python | Python.org). This project uses Python 3.8.

1.2 Third-party modules

The implementation of this project relies on several third-party modules; the main ones are as follows:

  • 1) Jieba is currently the most widely used Chinese word segmentation component. Download address: jieba · PyPI
  • 2) Gensim is a Python library for topic modeling, document indexing, and similarity retrieval over large corpora, aimed mainly at natural language processing (NLP) and information retrieval (IR). Download address: gensim · PyPI. This module is required in this project for processing the Wikipedia Chinese corpus and building the Chinese word vector model.
  • 3) Pandas is a Python library for efficiently handling large data sets and performing data analysis tasks; it is a toolkit built on Numpy. Download address: https://pypi.python.org/pypi/pandas/
  • 4) Numpy is a toolkit for storing and manipulating large matrices. Download address: numpy · PyPI
  • 5) Scikit-learn is a Python toolkit for machine learning; the module is imported under the name sklearn. The Numpy and Scipy libraries are required before installation. Official website: scikit-learn: machine learning in Python — scikit-learn 1.2.0 documentation
  • 6) Matplotlib is a Python plotting framework for drawing two-dimensional graphics. Download address: matplotlib · PyPI
  • 7) Tensorflow is an open-source software library that uses data flow graphs for numerical computation and is used in the field of artificial intelligence. Download address: tensorflow · PyPI

      Remarks: all of the above modules can be installed from requirement.txt with pip install -r requirement.txt
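      For reference, a requirement.txt covering the modules above might look like this (add version pins to match your own environment):

jieba
gensim
pandas
numpy
scipy
scikit-learn
matplotlib
tensorflow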

2 Data acquisition

2.1 Stop word dictionary

This project uses the Chinese stop word list released by the Chinese Natural Language Processing Open Platform of the Institute of Computing Technology, Chinese Academy of Sciences, which contains 1208 stop words. The file is stopWord.txt in the data folder, and you can add words to it by yourself.

2.2 Positive and negative corpus

In this project, a balanced corpus (ChnSentiCorp_htl_ba_2000) containing 1000 positive and 1000 negative reviews is selected as the data set for analysis.

3 Data preprocessing

3.1 Positive and negative corpus preprocessing

The data/ChnSentiCorp_htl_ba_2000/ folder contains two subfolders, neg (negative corpus) and pos (positive corpus), in which each review is a separate txt file; you can add your own corpus, and the text must be UTF-8 encoded. To simplify the subsequent steps, the positive and negative reviews are each merged into a single txt file: a collection document of the positive corpus (named 2000_pos.txt) and a collection document of the negative corpus (named 2000_neg.txt). The specific Python implementation is as follows:
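A minimal sketch of this step, assuming the folder layout described above (the helper name merge_reviews is illustrative):

import os

def merge_reviews(src_dir, dst_file):
    """Concatenate every review txt file in src_dir into dst_file, one review per line."""
    with open(dst_file, 'w', encoding='utf-8') as out:
        for name in sorted(os.listdir(src_dir)):
            if not name.endswith('.txt'):
                continue
            with open(os.path.join(src_dir, name), 'r', encoding='utf-8', errors='ignore') as f:
                # collapse a multi-line review into a single line
                text = ' '.join(line.strip() for line in f if line.strip())
            out.write(text + '\n')

if __name__ == '__main__':
    merge_reviews('data/ChnSentiCorp_htl_ba_2000/pos', 'data/2000_pos.txt')
    merge_reviews('data/ChnSentiCorp_htl_ba_2000/neg', 'data/2000_neg.txt')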

After the code runs, two text files, 2000_pos.txt and 2000_neg.txt, are obtained, storing the positive reviews and negative reviews respectively, with one review per line.

3.2 Chinese text word segmentation

This project uses Jieba to segment the positive and negative corpora separately. Before segmentation, digits, letters, and special symbols must be removed from the text; this can be done with Python's built-in string and re modules (string for string operations, re for regular expressions). The specific implementation is as follows:
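A sketch of the segmentation step, using the file names above (the cleaning pattern and the function names are illustrative):

import re
import string
import codecs
import jieba

def clean_text(text):
    # remove digits and latin letters
    text = re.sub(r'[a-zA-Z0-9]+', '', text)
    # remove ascii punctuation and common full-width special symbols
    text = re.sub(r'[%s，。！？、；：“”‘’（）《》【】…—·]+' % re.escape(string.punctuation), ' ', text)
    return text

def segment_file(src_file, dst_file):
    with codecs.open(src_file, 'r', encoding='utf-8') as fin, \
         codecs.open(dst_file, 'w', encoding='utf-8') as fout:
        for line in fin:
            words = jieba.cut(clean_text(line.strip()))  # jieba word segmentation
            fout.write(' '.join(words) + '\n')

if __name__ == '__main__':
    segment_file('data/2000_pos.txt', 'data/2000_pos_cut.txt')
    segment_file('data/2000_neg.txt', 'data/2000_neg_cut.txt')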

After processing, two txt files, 2000_pos_cut.txt and 2000_neg_cut.txt, are obtained, storing the segmentation results of the positive and negative corpora respectively.

3.3 Removing stop words

After word segmentation is completed, the stop words in the stop word table are read and matched against the segmented positive and negative corpora so they can be removed. Removing stop words is straightforward and involves two main steps:

  • 1) Read the stop word list;
  • 2) Traverse each segmented sentence, look up every word in the stop word list, and drop the word if it appears there.

The specific implementation code is as follows:
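A sketch of these two steps, with the file names used above (the function name remove_stopwords is illustrative):

import codecs

# 1) read the stop word list, stripping line breaks and tabs from every entry
stopkey = [w.strip() for w in codecs.open('data/stopWord.txt', 'r', encoding='utf-8').readlines()]

def remove_stopwords(src_file, dst_file):
    # 2) traverse every segmented sentence and drop words that appear in the stop word list
    with codecs.open(src_file, 'r', encoding='utf-8') as fin, \
         codecs.open(dst_file, 'w', encoding='utf-8') as fout:
        for line in fin:
            words = [w for w in line.strip().split(' ') if w and w not in stopkey]
            fout.write(' '.join(words) + '\n')

if __name__ == '__main__':
    remove_stopwords('data/2000_pos_cut.txt', 'data/2000_pos_cut_stopword.txt')
    remove_stopwords('data/2000_neg_cut.txt', 'data/2000_neg_cut_stopword.txt')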

As the code shows, reading the stop word list uses a Python list comprehension, which takes a single line of code:

stopkey = [w.strip() for w in codecs.open('data/stopWord.txt', 'r', encoding='utf-8').readlines()]

Each stop word that is read must be stripped with w.strip(), because the raw lines also contain line breaks and tabs; if these are not removed, the words will not match. After the code is executed, two txt files are obtained: 2000_neg_cut_stopword.txt and 2000_pos_cut_stopword.txt.

Since stop word removal happens right after sentence segmentation, the two are usually placed in the same piece of code: once a sentence has been segmented, the stop word removal function is called directly and the result is written to the output file. To make the steps easier to follow, this project splits them into two separate code files; you can adjust this to suit your own needs.

3.4 Obtaining feature word vectors

The steps above produce the feature word text of the positive and negative corpora. Since the input of a model must be numerical, each sentence of words has to be converted into a numerical vector. Common conversion methods include Bag of Words (BOW), TF-IDF, and Word2Vec; this project uses the Word2Vec word vector model to convert the corpus into word vectors.

Extracting feature word vectors relies on a trained word vector model, and the Chinese Wikipedia corpus is widely recognized as a large Chinese corpus, so this project extracts the feature word vectors from word vectors generated from the Chinese Wikipedia corpus. You can search for wiki.zh.text.vector to download it, or contact me for a network disk download (the file is close to 4 GB).

The main steps to obtain feature word vectors are as follows:

  • 1) Read the word vector matrix of the model;
  • 2) Traverse each word in a sentence and extract its numerical vector from the model word vector matrix; one sentence then yields a two-dimensional matrix whose number of rows is the number of words and whose number of columns is the dimension set by the model;
  • 3) Take the mean of this matrix as the feature word vector of the current sentence;
  • 4) After all sentences have been processed, concatenate each vector with the value representing its sentence category and write the result to a csv file.

The main code is as follows:
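A sketch of the four steps, assuming the wiki.zh.text.vector model mentioned above is loaded with gensim's KeyedVectors; the helper name and the pandas-based csv layout are illustrative:

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors

def build_sentence_vecs(file_path, model, dim):
    """Average the word vectors of each segmented review to get one vector per review."""
    vecs = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            words = line.strip().split(' ')
            rows = [model[w] for w in words if w in model]      # step 2): one row per word
            if rows:
                vecs.append(np.mean(np.array(rows), axis=0))    # step 3): matrix mean
            else:
                vecs.append(np.zeros(dim))
    return np.array(vecs)

if __name__ == '__main__':
    dim = 256
    model = KeyedVectors.load_word2vec_format('wiki.zh.text.vector', binary=False)  # step 1)
    pos_vecs = build_sentence_vecs('data/2000_pos_cut_stopword.txt', model, dim)
    neg_vecs = build_sentence_vecs('data/2000_neg_cut_stopword.txt', model, dim)
    # step 4): prepend the category value (1-pos, 0-neg) and write everything to csv
    labels = np.concatenate([np.ones(len(pos_vecs)), np.zeros(len(neg_vecs))])
    data = np.column_stack([labels, np.vstack([pos_vecs, neg_vecs])])
    pd.DataFrame(data).to_csv('data/2000_data.csv', index=False, header=False)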

After the code is executed, a file named 2000_data.csv is obtained. The first column is the value corresponding to the category (1-pos, 0-neg), the following columns hold the numerical feature vector, and each row represents one review.

3.5 Dimensionality reduction

The Word2Vec model was trained with 256 dimensions, so each word vector obtained has 256 dimensions. This project uses the PCA algorithm to reduce the dimensionality of the result. The specific implementation code is as follows:
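A sketch of the PCA step with scikit-learn, assuming the 2000_data.csv layout described above; the cumulative explained variance curve is what the choice of 100 dimensions below is read from:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

df = pd.read_csv('data/2000_data.csv', header=None)
y = df.iloc[:, 0].values      # first column: category value
x = df.iloc[:, 1:].values     # remaining 256 columns: feature word vector

# fit PCA on all components and plot the cumulative explained variance ratio
pca = PCA(n_components=256)
pca.fit(x)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

# keep the first 100 components as the model input, as chosen from the curve
x_pca = PCA(n_components=100).fit_transform(x)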

Running the code and inspecting the resulting plot shows that the first 100 dimensions already capture most of the information in the original data, so the first 100 dimensions are selected as the input of the model.

4 Classification model construction

This experiment uses a support vector machine (SVM) as the Chinese text classification model. Other classification models follow the same analysis process, so they are not covered in detail here.

A support vector machine (SVM) is a supervised machine learning model. The classic SVM algorithm is used first as the classifier, and its effectiveness is verified by computing the prediction accuracy on the test set and plotting the ROC curve. Generally speaking, the larger the area under the ROC curve (AUC), the better the model performs. SVM is used as the classifier algorithm, and the ROC curve is then built with matplotlib and the sklearn metrics module. The specific Python code is as follows:
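A sketch of the SVM training and ROC evaluation; the train/test split ratio and the SVC parameters are illustrative assumptions:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics

df = pd.read_csv('data/2000_data.csv', header=None)
y = df.iloc[:, 0].values
x = PCA(n_components=100).fit_transform(df.iloc[:, 1:].values)  # first 100 dimensions as input

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

clf = SVC(C=2, probability=True)          # SVM classifier
clf.fit(x_train, y_train)
print('Test Accuracy: %.2f' % clf.score(x_test, y_test))

# ROC curve from the predicted probability of the positive class
pred_prob = clf.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, pred_prob)
plt.plot(fpr, tpr, label='AUC = %.2f' % metrics.auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()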

Running the code gives Test Accuracy: 0.85, that is, a prediction accuracy of 85% on the experimental test set, together with the ROC curve plot.

 Resource download address: https://download.csdn.net/download/wouderw/87353433 

