Python: Sentiment Analysis of User Reviews

Introduction

In this section we will perform sentiment analysis on real user-generated review data.

Knowledge Points

  • Text word segmentation
  • Word2Vec method
  • Decision Tree Classification

This article covers sentiment analysis, also known as text sentiment analysis, a natural language processing and text mining task applied to a piece of content. In short, we use an algorithm to judge the emotional tendency of a text such as a comment, so that we can quickly understand the subjective feelings its author expresses.

In reality, the emotions we express through content can include happiness, excitement, agitation, indifference, disappointment, depression, tension, confusion, and so on. In natural language processing, we usually do not classify at such a fine granularity, so sentiment analysis of text often handles only two emotional states: positive and negative.

Of course, it is not quite accurate to say that computers cannot handle finer-grained emotion categories. In principle, an algorithm can distinguish more categories; the key is that we would need to provide a manually labeled training set covering those complex emotions, which is very difficult to produce. So the sentiment analysis carried out here deals only with the positive and negative states.

Dictionary-based approach

Currently, there are two kinds of methods for text sentiment analysis: one based on dictionaries and the other based on machine learning. First, let's describe the principle of dictionary-based text sentiment analysis.

Dictionary-based sentiment analysis is simple and easy to understand. In short, we first need a manually annotated dictionary in which every term carries a negative or positive label. For example:

Term             Label
good             positive
not good         negative
happy            positive
uncomfortable    negative
love you         positive
hate             negative
...              ...

Such a dictionary might contain tens of thousands or even hundreds of thousands of entries; the more, the better. Once we have the dictionary, we can begin text sentiment analysis.

Now, we have received a user review:

This course is very good ah!

Then, we segment this sentence into words. The segmentation result is as follows:

['This', 'course', 'very', 'good', 'ah', '!']

Next, we match the segmented words against the dictionary one by one. The matching rules are very simple:

  1. If the word exists in the dictionary with a positive label, we record +1;
  2. If the word exists in the dictionary with a negative label, we record -1;
  3. If the word does not exist in the dictionary, we record 0.

After matching every word in a sentence, we can calculate the sentence's total score. A total score > 0 indicates that the sentence is positive, a total score < 0 indicates that it is negative, and a total score = 0 means the sentiment cannot be determined. Dictionary-based sentiment analysis is very simple, but its disadvantages are also obvious. It usually needs a large dictionary that is constantly updated, which is a great test of human and material resources.
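The matching rules above can be sketched in a few lines of Python. This is only an illustration: the tiny `SENTIMENT_DICT`, the function names `score_tokens` and `judge`, and the sample token lists are assumptions made up for this example, and a real dictionary would be far larger.

```python
# Hypothetical miniature sentiment dictionary; a real one would hold
# tens of thousands of manually labeled entries.
SENTIMENT_DICT = {
    "good": "positive",
    "happy": "positive",
    "bad": "negative",
    "hate": "negative",
}


def score_tokens(tokens, dictionary=SENTIMENT_DICT):
    """Add +1 for positive terms, -1 for negative terms, 0 for unknown terms."""
    score = 0
    for token in tokens:
        label = dictionary.get(token)
        if label == "positive":
            score += 1
        elif label == "negative":
            score -= 1
        # Tokens missing from the dictionary contribute 0.
    return score


def judge(tokens, dictionary=SENTIMENT_DICT):
    """Map the total score to 'positive', 'negative', or 'undetermined'."""
    score = score_tokens(tokens, dictionary)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "undetermined"


print(judge(["This", "course", "very", "good", "ah", "!"]))  # positive
print(judge(["I", "hate", "bad", "service"]))                # negative
print(judge(["not", "good", "but", "not", "bad"]))           # undetermined
```

The last call hints at the weakness of word-by-word matching: "not" is ignored, and the positive and negative terms simply cancel out.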

In addition, this method has problems that cannot be solved by expanding the dictionary. For example, when we humans judge the sentiment of a sentence, we tend to grasp it as a whole (its context), and we pay particular attention to how modal particles affect the mood. Dictionary-based sentiment analysis cannot do this: splitting the sentence into words affects the emotion the sentence expresses as a whole.

At present, very few sentiment-labeled dictionaries exist for Chinese. Commonly used ones are:

  1. The NTUSD sentiment dictionary from National Taiwan University.
  2. The "HowNet" sentiment analysis word set.

Taking the "HowNet" sentiment dictionary as an example, it contains five files that enumerate positive and negative sentiment words as well as degree-level words.

  • "Positive emotion" words, such as: love, appreciation, joy, empathy, curiosity, cheering, dreaming, recognition ...
  • "Negative emotion" words, such as: sad, doubtful, contempt, not satisfied, 不是滋味儿 (feeling upset), regret, disappointment ...
  • "Positive evaluation" words, such as: indispensable, ministerial, adequacy, beauties, inspiring, beautiful, Duijin Er ...
  • "Negative evaluation" words, such as: ugly, bitter, excessive, flashy, desolate, cloudy, Jiqingjichong, prices, wishy-washy ...
  • "Degree level" words
  • "Advocate" words

Because the simple dictionary comparison method described above is not very accurate, we will not use it for the sentiment analysis of user comments here.

Bag of words or Word2Vec-based approach

Bag of words model

Besides dictionary-based sentiment analysis, there is another approach called the bag-of-words model. The bag-of-words model no longer treats a sentence as a sequence of individual words, but as a 1 × N vector. For example, suppose we need to process the following two sentences:

I love you, I love you very much.

I like you, I like you very much.

After segmenting the two sentences and removing duplicate words, we get the bag of words (here "very much" counts as the single word 'very'):

['I', 'love', 'like', 'you', 'very']

Then, based on the bag of words, we convert each original sentence into a vector. The vector length N is the size of the bag of words, and each value is the number of times the corresponding word in the bag appears in the sentence.

I love you, I love you very much. → [2, 2, 0, 2, 1]

I like you, I like you very much. → [2, 0, 2, 2, 1]
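The conversion can be reproduced with a short sketch. It assumes the sentences are already segmented into tokens (with "very much" treated as the single token "very"), and the helper names `build_bag` and `vectorize` are invented for this illustration.

```python
from collections import Counter


def build_bag(tokenized_sentences):
    """Collect the unique tokens of all sentences in order of first appearance."""
    bag = []
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in bag:
                bag.append(token)
    return bag


def vectorize(tokens, bag):
    """Count how often each word of the bag occurs in one tokenized sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in bag]


# Pre-segmented example sentences (an assumption; real text needs a tokenizer).
sentence_1 = ["I", "love", "you", "I", "very", "love", "you"]
sentence_2 = ["I", "like", "you", "I", "very", "like", "you"]

# Deduplicated bag in order of first appearance.
print(build_bag([sentence_1, sentence_2]))  # ['I', 'love', 'you', 'very', 'like']

# Using the same word order as the bag ['I', 'love', 'like', 'you', 'very'].
bag = ["I", "love", "like", "you", "very"]
print(vectorize(sentence_1, bag))  # [2, 2, 0, 2, 1]
print(vectorize(sentence_2, bag))  # [2, 0, 2, 2, 1]
```

The order of the words in the bag only permutes the vector positions; what matters is that every sentence is encoded against the same bag.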

With the bag-of-words vectors and manually labeled sentences, we have our training data. We can then build a predictive model with a machine learning classification method and use it to judge the sentiment of new input sentences.

You will notice that the bag-of-words model is very similar to the one-hot encoding we mentioned before. In fact, it is just one-hot encoding applied to the words of a sentence.

The bag-of-words model is certainly better than simple dictionary comparison, but one-hot encoding cannot measure the distance between contexts, so it cannot judge sentiment in context. Here we introduce the Word2Vec word vector method, which overcomes this shortcoming well.

Word2Vec

Word2Vec, as the name suggests, converts words into vectors, that is, word vectors. Word2Vec was open-sourced by Google in 2013; it is a shallow neural network model for producing word vectors.

The input to Word2Vec is generally a large-scale corpus, and the output is a vector space. Its key property is that every word in the corpus corresponds to a vector in that space, and words that share contexts are mapped to vectors that lie closer together.
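To make "closer together" concrete, closeness between word vectors is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors (`toy_vectors` is purely hypothetical, not output from a trained model) and contrasts them with one-hot vectors, whose similarity is always zero for different words.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense vectors, invented for illustration only.
toy_vectors = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "terrible": [-0.7, -0.9, 0.3],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_terrible = cosine_similarity(toy_vectors["good"], toy_vectors["terrible"])
print(sim_good_great > sim_good_terrible)  # True

# One-hot vectors of two different words are orthogonal, so they carry
# no notion of "similar" versus "dissimilar" words.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0
```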

Word2Vec mainly combines two model structures: CBOW (Continuous Bag-of-Words) and Skip-gram (Continuous Skip-gram). Simply put, both estimate the probability of a word occurring from context.

The CBOW model predicts the current word from its context (N words on each side). Skip-gram does just the opposite: it predicts the context from a single word, which yields many context samples for each current word, so it is useful on larger datasets.
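The difference between the two models is easiest to see in the training samples each one draws from a sentence. The following sketch only generates those samples; it does not train any network, and the function names and the example sentence are assumptions made for illustration.

```python
def context_window(tokens, position, n):
    """Return up to n tokens on each side of tokens[position]."""
    left = tokens[max(0, position - n):position]
    right = tokens[position + 1:position + 1 + n]
    return left + right


def cbow_samples(tokens, n=2):
    """CBOW: one sample per word, (context words) -> current word."""
    return [(context_window(tokens, i, n), tokens[i]) for i in range(len(tokens))]


def skipgram_samples(tokens, n=2):
    """Skip-gram: one sample per (current word, context word) pair."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, n)
    ]


tokens = ["this", "course", "is", "very", "good"]

print(cbow_samples(tokens)[2])
# (['this', 'course', 'very', 'good'], 'is')

print(len(cbow_samples(tokens)), len(skipgram_samples(tokens)))
# 5 14
```

With the same window, Skip-gram produces several samples per word while CBOW produces one, which matches the remark that Skip-gram obtains many context samples for each current word.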

The structures of CBOW (N = 2) and Skip-gram are shown below:

[Figure: structures of the CBOW and Skip-gram models]

In the figure, w(t) denotes the current word, while w(t−n), w(t+n), and so on denote the context words.

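Sentiment analysis of user comments

To ensure good accuracy, this time we use a text sentiment analysis method that combines Word2Vec with a decision tree. First, we use Word2Vec to build the vector space, and then we train a text sentiment classification model with a decision tree.

Since we have not manually labeled the comments as a corpus, we need to choose another labeled corpus for model training. Here we use the corpus provided by Su Jianlin (苏剑林), which combines review data from 7 domains, including books and computers.

You can download the dataset needed for this section from the network drive link below:

Link: https://pan.baidu.com/s/1qpy23EDHG2rea9SzxphIsQ Extraction code: g837

A preview of the three data files follows.

The negative sentiment text, neg.xls, has 10428 rows.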

import pandas as pd

pd.read_excel("data_09/data/neg.xls", header=None).head()

The positive sentiment text, pos.xls, has 10679 rows.

pd.read_excel("data_09/data/pos.xls", header=None).head()

The user comment text, comments.csv, has 12377 rows.

pd.read_csv("data_09/comments.csv").head()

Corpus word segmentation

Before using Word2Vec, we need to segment the training corpus into words. Here we again use jieba for segmentation.

import jieba
import numpy as np

# Load the corpus files (read_excel has no `index` argument, so it is omitted)
neg = pd.read_excel('data_09/data/neg.xls', header=None)
pos = pd.read_excel('data_09/data/pos.xls', header=None)


# Segment a text into a list of words with jieba
def word_cut(x):
    return jieba.lcut(x)


pos['words'] = pos[0].apply(word_cut)
neg['words'] = neg[0].apply(word_cut)

# Use 1 for positive sentiment and 0 for negative sentiment, then concatenate the arrays
x = np.concatenate((pos['words'], neg['words']))
y = np.concatenate((np.ones(len(pos)), np.zeros(len(neg))))

# Save the ndarrays as binary files for later use
np.save('X_train.npy', x)
np.save('y_train.npy', y)

print('done.')

You can preview the array, taking x as an example:

np.load('X_train.npy', allow_pickle=True)

Word2Vec processing

With the segmented word arrays, we can start the Word2Vec step and convert the words into word vectors. Many open-source tools currently provide a Word2Vec implementation, such as Gensim, TensorFlow, and PaddlePaddle. Here we use Gensim.

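from gensim.models.word2vec import Word2Vec
import warnings
warnings.filterwarnings('ignore')  # Ignore warnings

# Load the segmented word array saved above
X_train = np.load('X_train.npy', allow_pickle=True)

# Train the Word2Vec shallow neural network model
# Note: `size` is the gensim < 4.0 name; gensim >= 4.0 renamed it to `vector_size`
w2v = Word2Vec(size=300, min_count=10)
w2v.build_vocab(X_train)
w2v.train(X_train, total_examples=w2v.corpus_count, epochs=w2v.epochs)


def sum_vec(text):
    # Sum the word vectors of every word in a sentence
    vec = np.zeros(300).reshape((1, 300))
    for word in text:
        try:
            # Look vectors up through `wv`; indexing the model directly is deprecated
            vec += w2v.wv[word].reshape((1, 300))
        except KeyError:
            # Skip words that are not in the vocabulary
            continue
    return vec


# Stack the sentence vectors into one ndarray
train_vec = np.concatenate([sum_vec(z) for z in X_train])
# Save the Word2Vec model and the sentence vectors
w2v.save('w2v_model.pkl')
np.save('X_train_vec.npy', train_vec)
print('done.')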

Word2Vec training may take several minutes.
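The `sum_vec` function above turns a sentence into one vector by summing the vectors of its known words and skipping unknown ones. The sketch below reproduces that idea on a hypothetical four-dimensional toy vocabulary (`toy_word_vectors` is invented, not taken from a trained model) and also shows averaging, which is a common alternative rather than the method used in this article, so that sentence length does not scale the vector.

```python
import numpy as np

# Hypothetical word vectors standing in for a trained Word2Vec model.
DIM = 4
toy_word_vectors = {
    "good": np.array([1.0, 0.5, 0.0, 0.2]),
    "course": np.array([0.1, 0.3, 0.9, 0.0]),
}


def sentence_vector(tokens, word_vectors, dim, average=False):
    """Combine the vectors of known tokens by sum (default) or mean."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known word: fall back to a zero vector.
        return np.zeros((1, dim))
    stacked = np.vstack(known)
    vec = stacked.mean(axis=0) if average else stacked.sum(axis=0)
    return vec.reshape(1, dim)


short_tokens = ["good", "course"]
long_tokens = ["good", "course"] * 3 + ["unknown"]

sum_short = sentence_vector(short_tokens, toy_word_vectors, DIM)
sum_long = sentence_vector(long_tokens, toy_word_vectors, DIM)
mean_short = sentence_vector(short_tokens, toy_word_vectors, DIM, average=True)
mean_long = sentence_vector(long_tokens, toy_word_vectors, DIM, average=True)

print(np.allclose(sum_long, 3 * sum_short))  # True: sums grow with length
print(np.allclose(mean_long, mean_short))    # True: means do not
```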

Training the sentiment classification model

With the word vectors, we have the input a machine learning model needs and can train the sentiment classification model. You should already be familiar with this part; we choose the faster decision tree method and use scikit-learn to implement it.

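import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.tree import DecisionTreeClassifier

# Load the word vectors as training features
X = np.load('X_train_vec.npy')
# Load the sentiment labels as the target
y = np.load('y_train.npy')
# Build the decision tree classification model
model = DecisionTreeClassifier()
# Train the model
model.fit(X, y)
# Save the model as a binary file
joblib.dump(model, 'dt_model.pkl')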

Training the decision tree classifier takes some time.
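The training step above fits the model on the whole labeled corpus, so it does not tell us how accurate the classifier is. One common way to estimate this is to hold out part of the labeled data. The sketch below shows the idea on synthetic data only; `X_demo` and `y_demo` are random stand-ins, not the article's corpus, and the 80/20 split is an arbitrary assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features: two noisy clusters labeled 1 and 0.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(1.0, 1.0, size=(200, 10)),
    rng.normal(-1.0, 1.0, size=(200, 10)),
])
y_demo = np.concatenate([np.ones(200), np.zeros(200)])

# Keep 20% of the labeled samples aside for evaluation.
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_fit, y_fit)

# Accuracy on data the model has not seen during training.
held_out_accuracy = accuracy_score(y_held, clf.predict(X_held))
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

With the real data, the same pattern would apply to the arrays loaded from X_train_vec.npy and y_train.npy.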

Sentiment prediction for the comments

Now that we have the Word2Vec model and the decision tree sentiment classification model, we can predict the sentiment of the user comments.

# Load the Word2Vec model once, then compute vectors for new input
w2v = Word2Vec.load('w2v_model.pkl')


def sum_vec(words):
    # Sum the word vectors of the given words
    vec = np.zeros(300).reshape((1, 300))
    for word in words:
        try:
            vec += w2v.wv[word].reshape((1, 300))
        except KeyError:
            # Skip words that are not in the vocabulary
            continue
    return vec

Predict the sentiment of each comment:

# Read the comments
df = pd.read_csv("data_09/comments.csv", header=0)
# Load the decision tree model once, outside the loop
model = joblib.load('dt_model.pkl')
comment_sentiment = []
for string in df['评论内容']:  # '评论内容' is the comment text column
    # Segment the comment into words
    words = jieba.lcut(str(string))
    words_vec = sum_vec(words)
    result = model.predict(words_vec)
    comment_sentiment.append(result[0])

    # Print the positive or negative result as we go
    if int(result[0]) == 1:
        print(string, '[positive]')
    else:
        print(string, '[negative]')

# Merge the sentiment results into the original data ('用户情绪' is the new user sentiment column)
merged = pd.concat([df, pd.Series(comment_sentiment, name='用户情绪')], axis=1)
merged.to_csv('comment_sentiment.csv')  # Save the file for later use

Finally, we can use a pie chart to view the distribution of user sentiment. Overall, 73% of the comments are positive, so the users are full of positive energy.
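The share of each sentiment that a pie chart would display can be computed directly from the list of predicted labels. The sketch below is a minimal illustration; `sentiment_shares` is a hypothetical helper and `demo_labels` is made-up data, not the actual prediction results.

```python
from collections import Counter


def sentiment_shares(labels):
    """Return the fraction of positive (1) and negative (0) labels."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {"positive": 0.0, "negative": 0.0}
    return {
        "positive": counts[1] / total,
        "negative": counts[0] / total,
    }


# Made-up predictions in the same 1.0 / 0.0 format the classifier returns.
demo_labels = [1.0, 1.0, 0.0, 1.0]
shares = sentiment_shares(demo_labels)
print(shares)  # {'positive': 0.75, 'negative': 0.25}
```

Applied to the real results, the input would be the comment_sentiment list built in the prediction loop.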

Summary

In this section, we used machine learning to perform sentiment analysis on real user comments. We covered a lot of ground, especially the Word2Vec processing. Given the purpose of this course, we did not present Word2Vec in much detail; if you are interested in natural language processing, you will need a thorough understanding of it. Beyond the machine learning method used here, text sentiment is often analyzed with a recurrent neural network classifier built on LSTM (Long Short-Term Memory) units; on large-scale corpora, it may perform better than classical methods such as the decision tree used here. If you have spare capacity, you can study it on your own.

In addition, the corpus used here comes from reviews on third-party shopping sites, so the trained model may not generalize well when predicting the sentiment of our own users' comments. If you can manually label some of the existing data and use it to predict future data, the results should be better.

Origin www.cnblogs.com/wwj99/p/12425879.html