Sentiment Analysis with Doc2Vec

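Word vectors are widely used in NLP. Doc2Vec goes a step further: besides representing individual words as vectors, it can represent an entire sentence or document. Once a whole sentence is encoded as a fixed-length vector, any standard classification algorithm can be applied to it.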

Building on Word2Vec, this chapter uses Doc2Vec to tackle a sentiment analysis task.

The task is to classify the IMDB corpus (positive or negative): http://www.cs.cornell.edu/people/pabo/movie-review-data/

Doc2Vec paper: "Distributed Representations of Sentences and Documents"

The dataset can also be downloaded from: https://download.csdn.net/download/shenfuli/11435197

Data preprocessing

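The dataset consists of 5 files. The four labeled files used below are:

Negative samples
test-neg.txt: 12500 lines
train-neg.txt: 12500 lines
Positive samples
test-pos.txt: 12500 lines
train-pos.txt: 12500 lines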

Our approach: since all of these samples are labeled, we merge them, then split the merged data into training and test sets.

'''
    Convert the raw data into "label features" format

    Example:
    1. Before conversion
    1.1 Negative samples (label 0)
    bromwell high is
    homelessness or
    brilliant over
    1.2 Positive samples (label 1)
    bromwell high
    homelessness
    brilliant over actin

    2. After conversion
    0 bromwell high is
    0 homelessness or
    0 brilliant over
    1 bromwell high
    1 homelessness
    1 brilliant over actin
'''
import random

'''
    Training data
'''
train_sentences = []
train_sources = {'train-pos.txt':1,'train-neg.txt':0}
for k, v in train_sources.items():
    with open(k,'r') as f: # input file to process
        documents = f.readlines()
        for document in documents:
            # prefix every review with its label
            data = "{} {}".format(v, document.strip())
            train_sentences.append(data)
random.shuffle(train_sentences)
with open('IMDB-train.txt','w') as fw: # output file
    for text in train_sentences:
        fw.write("{}\n".format(text.strip()))
'''
    Test data
'''
test_sentences = []
test_sources = {'test-pos.txt':1,'test-neg.txt':0}
for k, v in test_sources.items():
    with open(k,'r') as f: # input file to process
        documents = f.readlines()
        for document in documents:
            # prefix every review with its label
            data = "{} {}".format(v, document.strip())
            test_sentences.append(data)
random.shuffle(test_sentences)
with open('IMDB-test.txt','w') as fw: # output file
    for text in test_sentences:
        fw.write("{}\n".format(text.strip()))
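
Before training on the merged files, it can be worth checking that the labels came through as intended. The following is a minimal sketch, not part of the original pipeline: the helper name `count_labels` is hypothetical, and it only assumes the `label text` line format produced above.

```python
from collections import Counter


def count_labels(lines):
    """Count how many `label text` lines carry each leading label."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        label = line.split(maxsplit=1)[0]  # first whitespace-separated field
        counts[int(label)] += 1
    return counts


# Tiny in-memory sample in the converted format
sample = ["0 bromwell high is", "1 bromwell high", "1 homelessness", ""]
print(count_labels(sample))
```

On the real data, passing the open file object of IMDB-train.txt to `count_labels` should report 12500 lines for each label, given the file sizes listed earlier.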

Training the doc2vec model

# -*- coding: utf-8 -*-
"""
@brief : Convert the raw text into doc2vec features and save the result locally
@author: shenfuli
"""
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import pickle


"""=====================================================================================================================
1 Read the raw data and apply simple preprocessing

1.0 Input format
label word1 word2 word3

1.1 Format of sentences
[['bromwell', 'high', 'is'],['bromwell-1', 'high-1', 'is-1']]

1.2 Format of labels
[1, 0]

"""
sentences = []
labels = []
total_examples = 0
doc2vec_file = ['IMDB-train.txt','IMDB-test.txt']

for file_name in doc2vec_file:
    with open(file_name,'r') as f:
        lines = f.readlines()
        for line in lines:
            fields = line.split()
            sentences.append(fields[1:])   # tokens after the label
            labels.append(int(fields[0]))  # leading label
            total_examples = total_examples + 1
print('total_examples = ',total_examples)
"""=====================================================================================================================
2 doc2vec
"""
import time
t_start = time.time()
# Tag each document with its index so that its vector can be looked up later
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
# vector_size/epochs replace the older size/iter argument names, which gensim 4.0 removed
model = Doc2Vec(documents,min_count=1, window=5, vector_size=200, sample=1e-4, negative=5, workers=4,epochs=10)
t_end = time.time()
print("train doc2vec：{}min".format((t_end-t_start)/60))

total_examples =  50000
train doc2vec：2.4591749986012776min

# Find the words most similar to 'good' (word vectors are exposed on model.wv)
model.wv.most_similar('good')

[('great', 0.7389441728591919),
 ('decent', 0.7134653329849243),
 ('bad', 0.67735356092453),
 ('nice', 0.6550177931785583),
 ('alright', 0.6080873012542725),
 ('solid', 0.593616247177124),
 ('okay', 0.5911738872528076),
 ('fine', 0.5862981081008911),
 ('excellent', 0.5857179760932922),
 ('ok', 0.5735211372375488)]
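
The scores in this list are cosine similarities between word vectors. As a rough illustration of what that number measures, here is a pure-Python sketch; the function name `cosine_similarity` is ours, and gensim computes the same quantity with its own vectorized code. Because the measure reflects how similar the contexts of two words are, an antonym such as 'bad' can plausibly rank close to 'good'.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Same direction, orthogonal, and opposite direction
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```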

# Save the model
model.save('./imdb.d2v')

# Load the model
model = Doc2Vec.load('./imdb.d2v')

# Get the trained document vectors (this attribute is named model.dv in gensim 4.x)
docvecs = model.docvecs
print('docvecs size ',len(docvecs))

docvecs size  50000

Inferring text features with the doc2vec model

For a new piece of text, the trained doc2vec model can extract a feature vector as follows:

vector = model.infer_vector(["system", "response"])
vector

array([ 0.00306992,  0.09880129, -0.06528975, -0.00518309,  0.00975273,
        0.01179409,  0.06942781, -0.04205238, -0.08501416, -0.05424332,
       -0.06491904, -0.03669789,  0.08301632,  0.00646909, -0.04971279,
        0.0516566 ,  0.07641225, -0.00031443, -0.03912795,  0.02805902,
       -0.08065759,  0.01329106,  0.01258634,  0.01845446, -0.02991258,
       -0.01598231,  0.0235482 ,  0.0395251 ,  0.02470151, -0.02085371,
       -0.07026218, -0.03540489,  0.07731623,  0.03987392, -0.03982773,
       -0.05196444, -0.02742134, -0.0139687 ,  0.04230239,  0.03385718,
       -0.03240554,  0.05211783, -0.05808661,  0.03530499, -0.0125432 ,
        0.02180364,  0.01356464,  0.00464311, -0.001227  ,  0.07899355,
        0.02907492,  0.03771575, -0.04223124,  0.06633858, -0.12505487,
       -0.02383417, -0.03901726, -0.05491796,  0.0612916 , -0.0878048 ,
        0.10616346,  0.07216624, -0.00333762,  0.08654431,  0.05445853,
        0.02146732, -0.04320022,  0.05699158,  0.07443393,  0.01716116,
       -0.1271982 ,  0.01133513,  0.04388591, -0.01050479, -0.07112896,
        0.0483281 ,  0.05188625,  0.04284834, -0.07416691,  0.05976778,
        0.03085455, -0.11998461, -0.04980956,  0.07434805,  0.05816012,
       -0.06460897,  0.0726747 , -0.09000178, -0.00420167, -0.06699272,
       -0.01415103,  0.05362554, -0.05598104,  0.06874936,  0.07766063,
       -0.02767419, -0.05826729,  0.0177945 , -0.00466617, -0.00092517,
        0.04546773, -0.01783472, -0.05895144,  0.0861815 ,  0.06412905,
       -0.06007013, -0.00795694,  0.01949856,  0.05685847,  0.01587577,
        0.04039665,  0.0010751 ,  0.0091335 , -0.05000968, -0.03424346,
        0.03919376,  0.01162066,  0.03560687, -0.08205847, -0.05824827,
        0.07500459,  0.11294474,  0.07223377, -0.0019428 ,  0.04507867,
       -0.05088584, -0.10753464,  0.00572739, -0.03234218,  0.01748291,
        0.01449764, -0.0109925 , -0.02165018,  0.05683647, -0.00354433,
        0.11134962,  0.02013531,  0.0195157 , -0.00552291,  0.04200043,
       -0.05738139,  0.06727964, -0.06666224, -0.00524103, -0.03265556,
       -0.08127008,  0.01383475,  0.08094133,  0.03243712,  0.02253861,
       -0.03518593,  0.01116148, -0.01571034,  0.02591805, -0.08882834,
       -0.13974166,  0.10083718,  0.01317242,  0.03611537,  0.13182731,
        0.01886472,  0.00244197,  0.06576921, -0.00229551, -0.08704369,
       -0.02204782,  0.00840817, -0.07599292,  0.03328447,  0.00425508,
       -0.06202874, -0.01922429, -0.04022821,  0.00466727,  0.02703563,
       -0.09933387,  0.0278736 ,  0.02954238, -0.03181218, -0.02765816,
       -0.01375741,  0.04625296,  0.0182731 , -0.09098109, -0.00719439,
       -0.03677104,  0.04303071, -0.10935018, -0.04317935,  0.05853789,
        0.04001805,  0.03835495,  0.02120748, -0.07935818, -0.05919037,
        0.01373552,  0.05611926,  0.00383895, -0.02817624, -0.01121639],
      dtype=float32)

len(vector)

Next, convert the input data into "label + doc2vec features" form. This step is somewhat slow, so be patient.

X = [] # doc2vec features
y = [] # label of each line
total_examples = 0

doc2vec_file = ['IMDB-train.txt','IMDB-test.txt']
for file_name in doc2vec_file:
    with open(file_name,'r') as f:
        lines = f.readlines()
        for line in lines:
            # features
            words_list = line.split()[1:] # format: ["system", "response"]
            words_list_doc2vec = model.infer_vector(words_list) # infer the doc2vec features of this line
            X.append(words_list_doc2vec)
            # label
            y.append(int(line.split()[0]))
            total_examples = total_examples + 1

Saving the doc2vec features

'''
Output format:
label feats
For example:
0 f1 f2 .... f200

where f1, f2, ... are the numeric doc2vec feature values
'''

with open('feat_doc2vec.txt','w') as f:
    for i in range(total_examples):
        x_str = [str(x) for x in list(X[i])]
        results = "{} {}".format(y[i]," ".join(x_str))
        f.write('{}\n'.format(results))
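
The line format used in this file is simple enough to check with a round trip. The sketch below is illustrative only: `format_feature_line` and `parse_feature_line` are hypothetical helpers that mirror the writing code above and the reading code in the next step, using a 3-dimensional vector instead of 200 dimensions.

```python
def format_feature_line(label, vector):
    """Serialize one sample as 'label f1 f2 ... fn'."""
    return "{} {}".format(label, " ".join(str(x) for x in vector))


def parse_feature_line(line):
    """Parse 'label f1 f2 ... fn' back into (label, list of floats)."""
    fields = line.split()
    return int(fields[0]), [float(f) for f in fields[1:]]


# Round trip on a small made-up vector
line = format_feature_line(1, [0.25, -0.5, 0.125])
print(line)
print(parse_feature_line(line))
```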

Splitting into training and test sets

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

X = []
y = []
with open('feat_doc2vec.txt','r') as fr:
    lines = fr.readlines()
    for line in lines:
        line = line.strip()
        feat = [float(f) for f in line.split()[1:]] # e.g. [0.2501368, 0.29046363, -0.1135941]
        X.append(feat) # feature columns
        y.append(int(line.split()[0])) # label column

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('X_train total_examples= ',len(X_train))
print('X_test  total_examples= ',len(X_test))

X_train total_examples=  35000
X_test  total_examples=  15000
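
Conceptually, the split shuffles the sample indices with a fixed seed and cuts them at the requested ratio, keeping each feature row aligned with its label. The following pure-Python sketch illustrates that idea. `simple_split` is a hypothetical stand-in, not scikit-learn's implementation, and it will not reproduce the exact same partition as `train_test_split`.

```python
import random


def simple_split(X, y, test_size=0.3, seed=42):
    """Shuffle indices with a fixed seed and split rows and labels together."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Ten toy samples whose label is the parity of the single feature
X_toy = [[i] for i in range(10)]
y_toy = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(len(X_tr), len(X_te))
```

With 50000 samples and `test_size=0.3`, the same arithmetic gives the 35000/15000 split shown in the output above.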

Classification

Train a logistic regression (LR) model on the training data:

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Evaluate the classifier on the test data:

classifier.score(X_test,y_test)

0.8206

On this sentiment analysis task we reach about 82% accuracy (in our experiments, 10 doc2vec iterations gave 82%, while 2 iterations gave 79%).

We could increase the number of doc2vec iterations further and see how the result changes; that is not demonstrated here.
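
For a classifier, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation follows; the function name `accuracy` is ours, and the labels in the example are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for an empty sample")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Three of four predictions are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Read this way, a score of 0.8206 on the 15000 test samples corresponds to 12309 correctly classified reviews.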

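Online prediction for a single text

This section simulates online prediction. In production we would publish a web service and run predictions through it.

First, predict a positive sample (label 1):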

text = 'my mother took me to see this film as a child and i long to see it every year as i do all of my other christmas favorites what i remember most was the silly\
 devil and santa looking through his telescope i waited and looked through the t v guide each year after that to see when it would be shown i would usually \
 find it playing on a saturday afternoon i only found the movie in english which took something special away from the film and have longed to find a copy of\
 it in spanish i hold this film dear to my heart and have never suffered from nightmares as others might suggest yes it is a different film about santa claus\
 and that is what makes it special and unique i can t wait to get a copy of this film and watch it with my children as i explain to them my favorite parts and memories '

words_list = text.split() # token list; for Chinese text, a tokenizer such as jieba can produce this list
feat = model.infer_vector(words_list) # infer the doc2vec features of the input
print(classifier.predict([feat])[0])

Next, predict a negative sample (label 0):

text2 = 'once again mr costner has dragged out a movie for far longer than necessary aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care the character we should really care about is a very cocky overconfident ashton kutcher the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a cluttered closet his only obstacle appears to be winning over costner finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts we are told why kutcher is driven to be the best with no prior inkling or foreshadowing no magic here it was all i could do to keep from turning it off an hour in '
feat2 = model.infer_vector(text2.split()) # infer the doc2vec features of the input
print(classifier.predict([feat2])[0])
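
The two predictions above follow the same three steps: tokenize, infer a vector, classify. One way to prepare them for a web service is to wrap the steps in a function that takes a JSON request body. The sketch below is only an assumption about how such a wrapper might look: `predict_sentiment`, `handle_request`, the `{"text": ...}` request shape, and the two `fake_*` stand-ins are all hypothetical, and the stand-ins exist only so the example runs without a trained model.

```python
import json


def predict_sentiment(text, infer_vector, predict):
    """Tokenize on whitespace, infer a feature vector, return the label as int."""
    tokens = text.split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    features = infer_vector(tokens)
    return int(predict([features])[0])


def handle_request(body, infer_vector, predict):
    """Map a JSON body such as {"text": "..."} to a JSON response string."""
    payload = json.loads(body)
    label = predict_sentiment(payload["text"], infer_vector, predict)
    return json.dumps({"label": label})


# Hypothetical stand-ins for model.infer_vector and classifier.predict
def fake_infer_vector(tokens):
    return [float(len(tokens))]  # one made-up feature: the token count


def fake_predict(rows):
    return [1 if row[0] > 2 else 0 for row in rows]  # made-up decision rule


print(handle_request('{"text": "a film i hold dear"}', fake_infer_vector, fake_predict))
```

With the real objects from this chapter, the call would pass `model.infer_vector` and `classifier.predict` in place of the stand-ins; the HTTP layer itself is left out here.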