Implementing sentiment analysis in NLP: two methods and their comparative advantages and disadvantages, with worked examples

Lead

"NLP" the most current and a fiery field, the business has gradually penetrated into more and more industries, a small series of decisions on common application functions one by one to try ......

0. Introduction

"Sentiment polarity analysis" is a subjective text with emotional analysis, processing, and inductive reasoning process. According to the text of the different treatment categories can be divided based on news sentiment analysis and commentary on product reviews sentiment analysis. The former are used for monitoring public opinion and forecast information, which can help users understand a product's reputation in the public mind.
Current common sentiment polarity analysis methods are mainly two ways: based on emotion dictionary method based on machine learning methods.

1. Sentiment polarity analysis based on a sentiment lexicon

Here the author judges a text's sentiment polarity by a sentiment score: a score > 0 is judged positive, a score < 0 is judged negative.

1.1 Data Preparation

1.1.1 Sentiment lexicon and corresponding scores

The lexicon is the sentiment dictionary downloadable from BosonNLP's data page. It was built from social-media text, so it is well suited to sentiment analysis of social-media content.

Labeling every common word with a single fixed score has several shortcomings:

  • First, stop words that carry no emotional color still affect the text's sentiment score.
  • Second, Chinese is subtle, and shifts in word sense and part of speech are an important factor affecting model accuracy.
    One case is that the same word can carry exactly opposite emotional meanings in different contexts. Take the sentence on which my model's prediction deviated most (from my WeChat Moments) as an example:
    有车一族都用了这个宝贝,后果很严重哦[偷笑][偷笑][偷笑]1,交警工资估计会打5折,没有超速罚款了[呲牙][呲牙][呲牙]2,移动联通公司大幅度裁员,电话费少了[呲牙][呲牙][呲牙]3,中石化中石油裁员2成,路痴不再迷路,省油[悠闲][悠闲][悠闲]5,保险公司裁员2成,保费折上折2成,全国通用[憨笑][憨笑][憨笑]买不买你自己看着办吧[调皮][调皮][调皮]
    Here 严重 ("serious") actually expresses the opposite meaning, and the whole sentence is meant ironically. I have not yet studied in depth how to solve this with a dictionary-based method, but machine learning or neural networks might offer a preliminary solution. In addition, the same word can be used as different parts of speech, and its sentiment score should then not be the same, for example:
    这部电影真垃圾
    垃圾分类
    In the first sentence 垃圾 ("garbage") is clearly strongly derogatory, while in the second it is neutral (part of "garbage sorting"); a single score for this kind of word is inevitably biased.

1.1.2 Negation word dictionary

A negation word directly flips the sentiment of a sentence in the opposite direction, and the effect usually stacks. Common negation words: 不、没、无、非、莫、弗、勿、毋、未、否、别、無、休、难道, and so on.
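For instance, the stacking effect can be modeled as one sign flip per negation word, which is exactly how the scoring formula in section 1.3.2 treats it. A tiny illustration (the counts are made up):

# an odd number of negation words flips polarity, an even number cancels out
num_negations = 2                        # e.g. a doubly negated phrase such as 不是不好
polarity_factor = (-1) ** num_negations  # -> 1, the sentiment direction is unchanged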

1.1.3 Degree adverb dictionary

Since the positive or negative sentiment of a text is judged by a score, the absolute value of that score usually expresses the emotional intensity. Once intensity comes into play, introducing degree adverbs is essential. The dictionary is downloaded from the HowNet "Sentiment Analysis Word Set (beta)". Its data come in two columns: the first column is the degree adverb, the second is its degree value; a value > 1 strengthens the emotion, a value < 1 weakens it.

Adverbs of degree dictionary
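For illustration only, the comma-separated two-column layout that the dictionary-loading code in section 1.3.1 expects looks roughly like the lines below; the words and values here are placeholders, not actual HowNet entries:

很,1.75
稍稍,0.8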

1.1.4 Stop word dictionary

The Chinese natural language processing open platform of the Institute of Computing Technology, Chinese Academy of Sciences, has released a Chinese stop-word list containing 1,208 stop words; there are also other download channels that do not require membership points.

1.2 Data Preprocessing

1.2.1 Segmentation

Split each sentence into a collection of words. For example:
e.g. such / a / hotel / paired / with / such / a / price / pretty / good

Commonly used Python word-segmentation tools:

  • jieba (结巴分词)
  • Pymmseg-cpp
  • Loso
  • smallseg
 
from collections import defaultdict
import os
import re
import jieba
import codecs

"""
1. Text segmentation (文本切割)
"""

def sent2word(sentence):
    """
    Segment a sentence into words and delete stopwords.
    """
    segList = jieba.cut(sentence)
    segResult = []
    for w in segList:
        segResult.append(w)

    stopwords = readLines('stop_words.txt')
    newSent = []
    for word in segResult:
        if word in stopwords:
            # print "stopword: %s" % word
            continue
        else:
            newSent.append(word)

    return newSent

Here we use jieba to segment words.
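readLines is a small helper used here and in the later snippets; it simply reads a dictionary file into a list of lines. A minimal sketch, assuming one entry per line in a UTF-8 file (the exact implementation is not shown in this post):

def readLines(filename):
    # read a file into a list of stripped lines, one dictionary entry per line
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]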

1.2.2 Remove stop words

Go through all words in the corpus and delete the stop words.
e.g. such / a / hotel / paired / with / such / a / price / pretty / good
-> hotel / paired with / price / pretty / good

1.3 Build the model

1.3.1 Classify the words and record their positions

Store each class of word (sentiment word, negation word, degree adverb) found in the sentence and mark its position.
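classifyWords below expects wordDict to map each word to its position in the segmented sentence. A minimal helper for building it (the name listToDict is only illustrative):

def listToDict(wordList):
    # map each segmented word to its absolute position in the sentence;
    # if a word occurs more than once, only its last position is kept
    wordDict = {}
    for i in range(len(wordList)):
        wordDict[wordList[i]] = i
    return wordDict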

 
"""

2. 情感定位

"""

def classifyWords(wordDict):

    # (1) 情感词

    senList = readLines('BosonNLP_sentiment_score.txt')

    senDict = defaultdict()

    for s in senList:

        senDict[s.split(' ')[0]] = s.split(' ')[1]

    # (2) 否定词

    notList = readLines('notDict.txt')

    # (3) 程度副词

    degreeList = readLines('degreeDict.txt')

    degreeDict = defaultdict()

    for d in degreeList:

        degreeDict[d.split(',')[0]] = d.split(',')[1]



    senWord = defaultdict()

    notWord = defaultdict()

    degreeWord = defaultdict()



    for word in wordDict.keys():

        if word in senDict.keys() and word not in notList and word not in degreeDict.keys():

            senWord[wordDict[word]] = senDict[word]

        elif word in notList and word not in degreeDict.keys():

            notWord[wordDict[word]] = -1

        elif word in degreeDict.keys():

            degreeWord[wordDict[word]] = degreeDict[word]

        return senWord, notWord, degreeWord

 

1.3.2 Sentence score calculation

Here the computation is simplified: a sentence's sentiment score is the sum of the scores of all the sentiment word groups it contains.

Definition of a sentiment word group: all negation words and degree adverbs between two sentiment words, together with the second sentiment word, form one sentiment word group, i.e. notWords + degreeWords + sentiWords. For example, in 不是很交好, 不是 is a negation word, 很 is a degree adverb and 交好 is the sentiment word, so the score of this sentiment word group is:
finalSentiScore = (-1) ^ 1 * 1.25 * 0.747127733968 ≈ -0.93
where the exponent 1 is the number of negation words, 1.25 is the degree adverb's value, and 0.747127733968 is the sentiment score of 交好. Pseudo-code as follows:

finalSentiScore = (-1) ^ (num of notWords) * degreeNum * sentiScore
finalScore = sum(finalSentiScore)

 
"""

3. 情感聚合

"""

def scoreSent(senWord, notWord, degreeWord, segResult):

    W = 1

    score = 0

# 存所有情感词的位置的列表

    senLoc = senWord.keys()

    notLoc = notWord.keys()

    degreeLoc = degreeWord.keys()

    senloc = -1

    # notloc = -1

    # degreeloc = -1



    # 遍历句中所有单词segResult,i为单词绝对位置

    for i in range(0, len(segResult)):

    # 如果该词为情感词

        if i in senLoc:

    # loc为情感词位置列表的序号

            senloc += 1

    # 直接添加该情感词分数

            score += W * float(senWord[i])

            # print "score = %f" % score

            if senloc < len(senLoc) - 1:

            # 判断该情感词与下一情感词之间是否有否定词或程度副词

            # j为绝对位置

                for j in range(senLoc[senloc], senLoc[senloc + 1]):

                # 如果有否定词

                    if j in notLoc:

                        W *= -1

                    # 如果有程度副词

                    elif j in degreeLoc:

                        W *= float(degreeWord[j])

                    # i定位至下一个情感词


                    i = senLoc[senloc + 1]

                    return score
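Putting the pieces together, a sketch of scoring one sentence end to end, reusing the illustrative listToDict helper from 1.3.1:

words = sent2word(u'这部电影真垃圾')                      # segment and drop stop words
senWord, notWord, degreeWord = classifyWords(listToDict(words))
print scoreSent(senWord, notWord, degreeWord, words)      # a negative score is expected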

1.4 Model Evaluation

After scoring more than 600 WeChat Moments texts and sorting them by score, the scatter plot is as follows:

Score Distribution

Most texts are judged positive, which matches the actual situation, and the absolute value of the sentiment score of the vast majority of texts is within 10. This is because, when computing a text's score, the author uses the full stop as the end-of-sentence marker; within one sentence the scores of the sentiment word groups are accumulated, and if a text contains several sentences, the average of all its sentence scores is taken.
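A sketch of that text-level aggregation (splitting on the Chinese full stop and averaging the sentence scores; this conveys the idea rather than the author's exact code):

def scoreText(text):
    # score every sentence of a text and return the average
    scores = []
    for sent in text.split(u'。'):
        words = sent2word(sent)
        if not words:
            continue
        senWord, notWord, degreeWord = classifyWords(listToDict(words))
        scores.append(scoreSent(senWord, notWord, degreeWord, words))
    return sum(scores) / len(scores) if scores else 0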

 

However, the model's drawbacks and limitations are also quite obvious:

  • First, a paragraph's score is the average of its sentence scores, which does not match reality. Just as earlier and later paragraphs of an article differ in importance, so do the sentences within a paragraph.
  • Second, some texts use derogatory words to express positive meaning; this often occurs in promotional texts. Again, the earlier example:
    有车一族都用了这个宝贝,后果很严重哦[偷笑][偷笑][偷笑]1,交警工资估计会打5折,没有超速罚款了[呲牙][呲牙][呲牙]2,移动联通公司大幅度裁员,电话费少了[呲牙][呲牙][呲牙]3,中石化中石油裁员2成,路痴不再迷路,省油[悠闲][悠闲][悠闲]5,保险公司裁员2成,保费折上折2成,全国通用[憨笑][憨笑][憨笑]买不买你自己看着办吧[调皮][调皮][调皮]2980元轩辕魔镜带回家,推广还有返利[得意]
    The few texts in the Score Distribution plot with scores below -10 are all of this kind. Solving this type of problem effectively may require deep learning; it is also very hard for ordinary machine learning methods.
  • For judging positive vs. negative texts, the algorithm ignores many other combinations of negation words, degree adverbs and sentiment words, and the way it judges sentiment strength is also too simplistic.

In short, this model can only serve as a baseline (BENCHMARK)...

2. Text sentiment polarity analysis based on machine learning

2.1 Data preparation, again

2.1.1 Stop words

(same as 1.1.4)

2.1.2 Positive and negative corpora

The corpus comes from hotel-review data used for Chinese sentiment mining, with 7,000 positive and 3,000 negative reviews (may the author take this as evidence that the world is still full of goodwill...). Of course, other corpora listed in the (reposted) sentiment analysis resources can also be used as the training set.

2.1.3 Validation set

Reviews of the iPhone 6s on Amazon; the exact source can no longer be traced...

2.2 Data preprocessing

2.2.1 Segmentation, again

(same as 1.2.1)

 
"""

3. 情感聚合

"""

def scoreSent(senWord, notWord, degreeWord, segResult):

    W = 1

    score = 0

# 存所有情感词的位置的列表

    senLoc = senWord.keys()

    notLoc = notWord.keys()

    degreeLoc = degreeWord.keys()

    senloc = -1

    # notloc = -1

    # degreeloc = -1



# 遍历句中所有单词segResult,i为单词绝对位置

    for i in range(0, len(segResult)):

# 如果该词为情感词

        if i in senLoc:

# loc为情感词位置列表的序号

            senloc += 1

# 直接添加该情感词分数

            score += W * float(senWord[i])

            # print "score = %f" % score

        if senloc < len(senLoc) - 1:

# 判断该情感词与下一情感词之间是否有否定词或程度副词

# j为绝对位置

            for j in range(senLoc[senloc], senLoc[senloc + 1]):

# 如果有否定词

                if j in notLoc:

                    W *= -1

# 如果有程度副词

                elif j in degreeLoc:

                    W *= float(degreeWord[j])

# i定位至下一个情感词


                    i = senLoc[senloc + 1]

                return score

2.2.2 Remove stop words, again

(same as 1.2.2)

2.2.3 Train word vectors

(Here comes the key point!) The model's input must be numerical, so the word combination of each data item needs to be converted into a numerical vector.

Common conversion algorithms include, but are not limited to, the following:

  • Bag of Words
  • TF-IDF
  • Word2Vec

Here the author uses Word2Vec to convert the corpus into vectors.
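The pre-trained corpus.model.bin loaded below is used as-is; for reference, such a model could be trained with an older gensim release roughly as follows (the parameters are guesses, except that 400 dimensions matches the "400 dimensions" mentioned in the MLP section):

from gensim.models import word2vec

sentences = word2vec.LineSentence('segmented_corpus.txt')  # one segmented sentence per line; file name is illustrative
w2v = word2vec.Word2Vec(sentences, size=400, window=5, min_count=5, workers=4)
w2v.save_word2vec_format('corpus.model.bin', binary=True)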

 
import numpy as np
from gensim.models import word2vec

def getWordVecs(wordList):
    # look up the word2vec vector of every word; skip words not in the vocabulary
    vecs = []
    for word in wordList:
        word = word.replace('\n', '')
        try:
            vecs.append(model[word])
        except KeyError:
            continue
    # vecs = np.concatenate(vecs)
    return np.array(vecs, dtype = 'float')


def buildVecs(filename):
    posInput = []
    with open(filename, "rb") as txtfile:
        # print txtfile
        for lines in txtfile:
            lines = lines.split('\n ')
            for line in lines:
                line = jieba.cut(line)
                resultList = getWordVecs(line)
                # for each sentence, the mean of all its word vectors
                # is used to represent the sentence
                if len(resultList) != 0:
                    resultArray = sum(np.array(resultList)) / len(resultList)
                    posInput.append(resultArray)
    return posInput


# load the pre-trained word2vec model
model = word2vec.Word2Vec.load_word2vec_format("corpus.model.bin", binary = True)
# txtfile = [u'标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.', u'在这个西部小城市能住上这样的酒店让我很欣喜,提供的免费接机服务方便了我的出行,地处市中心,购物很方便。早餐比较丰富,服务人员很热情。推荐大家也来试试,我想下次来这里我仍然会住这里']

posInput = buildVecs('pos.txt')
negInput = buildVecs('neg.txt')   # the original read 'pos.txt' here, almost certainly a typo

# use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(posInput)), np.zeros(len(negInput))))

X = posInput[:]
for neg in negInput:
    X.append(neg)
X = np.array(X)

2.2.4 Standardization

Although the author feels that, for this problem, standardization has little effect on model accuracy, other standardization methods can of course also be tried.

# standardization
from sklearn.preprocessing import scale
X = scale(X)

2.2.5 Dimensionality reduction

According to the PCA results, the first 100 dimensions cover more than 95% of the variance.

 
# PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Plot the PCA spectrum
pca = PCA()
pca.fit(X)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

X_reduced = PCA(n_components = 100).fit_transform(X)
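The training and test splits used below (X_reduced_train, X_train, and so on) come from an ordinary train/test split; a sketch with sklearn (the split ratio is a guess):

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

X_reduced_train, X_reduced_test, y_reduced_train, y_reduced_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=1)
# the MLP in 2.3.2 uses the original 400-dimensional vectors
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)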

2.3 Build the models

2.3.1 SVM (RBF) + PCA

The SVM (RBF) classifier behaves more leniently, and using PCA dimensionality reduction clearly improves the model; the misclassified cases are mostly negative texts classified as positive. AUC = 0.92, KS value = 0.7.
 

"""

2.1 SVM (RBF)

using training data with 100 dimensions

"""



clf = SVC(C = 2, probability = True)

clf.fit(X_reduced_train, y_reduced_train)



print 'Test Accuracy: %.2f'% clf.score(X_reduced_test, y_reduced_test)



pred_probas = clf.predict_proba(X_reduced_test)[:,1]

print "KS value: %f" % KSmetric(y_reduced_test, pred_probas)[0]



# plot ROC curve

# AUC = 0.92

# KS = 0.7

fpr,tpr,_ = roc_curve(y_reduced_test, pred_probas)

roc_auc = auc(fpr,tpr)

plt.plot(fpr, tpr, label = 'area = %.2f' % roc_auc)

plt.plot([0, 1], [0, 1], 'k--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.legend(loc = 'lower right')

plt.show()



joblib.dump(clf, "SVC.pkl")
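KSmetric above is the author's own helper; one common way to compute the KS statistic from the ROC curve, consistent with KSmetric(...)[0] being the statistic itself, is sketched below (not necessarily the author's exact implementation):

def KSmetric(y_true, y_score):
    # the KS statistic is the maximum gap between the cumulative TPR and FPR curves
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    gaps = tpr - fpr
    return gaps.max(), thresholds[gaps.argmax()]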

 

2.3.2 MLP

Compared with SVM (RBF), the MLP classifies more strictly, and PCA dimensionality reduction has little effect on its accuracy; the misclassified cases are mostly positive texts classified as negative. It is actually more prone to overfitting because the corpus is too small, and using a neural network here is arguably overkill. AUC = 0.91.

 
"""

2.2 MLP

using original training data with 400 dimensions

"""

model = Sequential()

model.add(Dense(512, input_dim = 400, init = 'uniform', activation = 'tanh'))

model.add(Dropout(0.5))

model.add(Dense(256, activation = 'relu'))

model.add(Dropout(0.5))

model.add(Dense(128, activation = 'relu'))

model.add(Dropout(0.5))

model.add(Dense(64, activation = 'relu'))

model.add(Dropout(0.5))

model.add(Dense(32, activation = 'relu'))

model.add(Dropout(0.5))

model.add(Dense(1, activation = 'sigmoid'))



model.compile(loss = 'binary_crossentropy',

optimizer = 'adam',

metrics = ['accuracy'])



model.fit(X_train, y_train, nb_epoch = 20, batch_size = 16)

score = model.evaluate(X_test, y_test, batch_size = 16)

print ('Test accuracy: ', score[1])



pred_probas = model.predict(X_test)

# print "KS value: %f" % KSmetric(y_reduced_test, pred_probas)[0]



# plot ROC curve

# AUC = 0.91

fpr,tpr,_ = roc_curve(y_test, pred_probas)

roc_auc = auc(fpr,tpr)

plt.plot(fpr, tpr, label = 'area = %.2f' % roc_auc)

plt.plot([0, 1], [0, 1], 'k--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.legend(loc = 'lower right')

plt.show()

2.4 Model evaluation

  • In fact, the second drawback of the first method still exists, but compared with the lexicon-based approach, the machine-learning-based approach is more objective.
  • In addition, because the training set and the test set come from different domains, there is reason to believe the training data are insufficient; expanding the training set in the future should improve accuracy.


Origin blog.csdn.net/weixin_44995023/article/details/91546092