Natural Language Processing (NLP): 05 doc2vec Feature Extraction + Movie Review Sentiment Classification

This chapter covers: extracting sentence features with doc2vec, and text classification based on doc2vec feature engineering.

by@ 沈福利

Sentiment Analysis with Doc2Vec

Word vectors are widely used in NLP. By introducing Doc2Vec, we can represent not only individual words as vectors, but also entire sentences or documents. Imagine representing a whole sentence with a single fixed-length vector and then feeding it to any standard classification algorithm; that is quite remarkable.

Building on Word2Vec, this chapter uses Doc2Vec for a sentiment analysis task.

IMDB corpus for classification (positive or negative): http://www.cs.cornell.edu/people/pabo/movie-review-data/

Doc2Vec paper: "Distributed Representations of Sentences and Documents"

The dataset can also be downloaded here: https://download.csdn.net/download/shenfuli/11435197

Data Preprocessing

The dataset contains the following files:

  1. Negative samples
    test-neg.txt: 12,500 lines;
    train-neg.txt: 12,500 lines

  2. Positive samples
    test-pos.txt: 12,500 lines;
    train-pos.txt: 12,500 lines

Our approach: since all of this data is already labeled, we merge everything, then split it ourselves for training and testing.

'''
    Convert the raw data into "label features" format.

    Example:
    1. Before conversion
    1.1 Negative samples (label 0):
    bromwell high is
    homelessness or
    brilliant over
    1.2 Positive samples (label 1):
    bromwell high
    homelessness
    brilliant over actin

    2. After conversion
    0 bromwell high is
    0 homelessness or
    0 brilliant over
    1 bromwell high
    1 homelessness
    1 brilliant over actin
'''
import random

'''
    Training data
'''
train_sentences = []
train_sources = {'train-pos.txt': 1, 'train-neg.txt': 0}
for k, v in train_sources.items():
    with open(k, 'r') as f:  # input file
        documents = f.readlines()
        for document in documents:
            data = "{} {}".format(v, document.strip())
            train_sentences.append(data)
random.shuffle(train_sentences)
with open('IMDB-train.txt', 'w') as fw:  # output file
    for text in train_sentences:
        fw.write("{}\n".format(text.strip()))
'''
    Test data
'''
test_sentences = []
test_sources = {'test-pos.txt': 1, 'test-neg.txt': 0}
for k, v in test_sources.items():
    with open(k, 'r') as f:  # input file
        documents = f.readlines()
        for document in documents:
            data = "{} {}".format(v, document.strip())
            test_sentences.append(data)
random.shuffle(test_sentences)
with open('IMDB-test.txt', 'w') as fw:  # output file
    for text in test_sentences:
        fw.write("{}\n".format(text.strip()))

Training the doc2vec Model

# -*- coding: utf-8 -*-
"""
@brief : Convert the raw data into numeric doc2vec features and save the results locally
@author: shenfuli
"""
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

"""=====================================================================================================================
1 读取原始数据,并进行简单处理

1.0 数据数据格式
label word1 word2 word3

1.1 sentences 格式
[['bromwell', 'high', 'is'],['bromwell-1', 'high-1', 'is-1']]

1.2 labels 格式
[1, 0]

"""
sentences = []
labels = []
total_examples = 0
doc2vec_file = ['IMDB-train.txt','IMDB-test.txt']

for file_name in doc2vec_file:
    with open(file_name,'r') as f:
        lines = f.readlines()
        for line in lines:
            words_list = line.split()[1:]
            sentences.append(words_list)
            labels.append(int(line.split()[0]))
            total_examples = total_examples + 1
print('total_examples = ',total_examples)
"""=====================================================================================================================
2 doc2vec
"""
import time
t_start = time.time()
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
# Note: parameter names follow gensim < 4.0; in gensim >= 4.0 use vector_size= and epochs=
model = Doc2Vec(documents, min_count=1, window=5, size=200, sample=1e-4, negative=5, workers=4, iter=10)
t_end = time.time()
print("train doc2vec:{}min".format((t_end-t_start)/60))
total_examples =  50000
train doc2vec:2.4591749986012776min
# Find words most similar to a given word (gensim >= 4.0: model.wv.most_similar('good'))
model.most_similar('good')
[('great', 0.7389441728591919),
 ('decent', 0.7134653329849243),
 ('bad', 0.67735356092453),
 ('nice', 0.6550177931785583),
 ('alright', 0.6080873012542725),
 ('solid', 0.593616247177124),
 ('okay', 0.5911738872528076),
 ('fine', 0.5862981081008911),
 ('excellent', 0.5857179760932922),
 ('ok', 0.5735211372375488)]
# Save the model
model.save('./imdb.d2v')
# Load the model
model = Doc2Vec.load('./imdb.d2v')
# Get the final matrix of trained document vectors (gensim >= 4.0: model.dv)
docvecs = model.docvecs
print('docvecs size ',len(docvecs))
docvecs size  50000
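
As a quick check, each trained document vector can be looked up by the integer tag assigned through TaggedDocument; a small sketch using the same gensim 3.x API as above:

# Look up the trained vector for the document tagged 0
# (in gensim >= 4.0 this would be model.dv[0] instead of model.docvecs[0])
vec0 = model.docvecs[0]
print(len(vec0))  # 200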

Inferring Features for New Text with the doc2vec Model

For a new piece of text, extract its feature vector with the trained doc2vec model using infer_vector:

vector = model.infer_vector(["system", "response"])
vector
array([ 0.00306992,  0.09880129, -0.06528975, -0.00518309,  0.00975273,
        0.01179409,  0.06942781, -0.04205238, -0.08501416, -0.05424332,
       -0.06491904, -0.03669789,  0.08301632,  0.00646909, -0.04971279,
        0.0516566 ,  0.07641225, -0.00031443, -0.03912795,  0.02805902,
       -0.08065759,  0.01329106,  0.01258634,  0.01845446, -0.02991258,
       -0.01598231,  0.0235482 ,  0.0395251 ,  0.02470151, -0.02085371,
       -0.07026218, -0.03540489,  0.07731623,  0.03987392, -0.03982773,
       -0.05196444, -0.02742134, -0.0139687 ,  0.04230239,  0.03385718,
       -0.03240554,  0.05211783, -0.05808661,  0.03530499, -0.0125432 ,
        0.02180364,  0.01356464,  0.00464311, -0.001227  ,  0.07899355,
        0.02907492,  0.03771575, -0.04223124,  0.06633858, -0.12505487,
       -0.02383417, -0.03901726, -0.05491796,  0.0612916 , -0.0878048 ,
        0.10616346,  0.07216624, -0.00333762,  0.08654431,  0.05445853,
        0.02146732, -0.04320022,  0.05699158,  0.07443393,  0.01716116,
       -0.1271982 ,  0.01133513,  0.04388591, -0.01050479, -0.07112896,
        0.0483281 ,  0.05188625,  0.04284834, -0.07416691,  0.05976778,
        0.03085455, -0.11998461, -0.04980956,  0.07434805,  0.05816012,
       -0.06460897,  0.0726747 , -0.09000178, -0.00420167, -0.06699272,
       -0.01415103,  0.05362554, -0.05598104,  0.06874936,  0.07766063,
       -0.02767419, -0.05826729,  0.0177945 , -0.00466617, -0.00092517,
        0.04546773, -0.01783472, -0.05895144,  0.0861815 ,  0.06412905,
       -0.06007013, -0.00795694,  0.01949856,  0.05685847,  0.01587577,
        0.04039665,  0.0010751 ,  0.0091335 , -0.05000968, -0.03424346,
        0.03919376,  0.01162066,  0.03560687, -0.08205847, -0.05824827,
        0.07500459,  0.11294474,  0.07223377, -0.0019428 ,  0.04507867,
       -0.05088584, -0.10753464,  0.00572739, -0.03234218,  0.01748291,
        0.01449764, -0.0109925 , -0.02165018,  0.05683647, -0.00354433,
        0.11134962,  0.02013531,  0.0195157 , -0.00552291,  0.04200043,
       -0.05738139,  0.06727964, -0.06666224, -0.00524103, -0.03265556,
       -0.08127008,  0.01383475,  0.08094133,  0.03243712,  0.02253861,
       -0.03518593,  0.01116148, -0.01571034,  0.02591805, -0.08882834,
       -0.13974166,  0.10083718,  0.01317242,  0.03611537,  0.13182731,
        0.01886472,  0.00244197,  0.06576921, -0.00229551, -0.08704369,
       -0.02204782,  0.00840817, -0.07599292,  0.03328447,  0.00425508,
       -0.06202874, -0.01922429, -0.04022821,  0.00466727,  0.02703563,
       -0.09933387,  0.0278736 ,  0.02954238, -0.03181218, -0.02765816,
       -0.01375741,  0.04625296,  0.0182731 , -0.09098109, -0.00719439,
       -0.03677104,  0.04303071, -0.10935018, -0.04317935,  0.05853789,
        0.04001805,  0.03835495,  0.02120748, -0.07935818, -0.05919037,
        0.01373552,  0.05611926,  0.00383895, -0.02817624, -0.01121639],
      dtype=float32)
len(vector)
200

Convert all the input data into label + doc2vec format. This step is fairly slow, so be patient.

X = []  # doc2vec features
y = []  # label of each text line
total_examples = 0

doc2vec_file = ['IMDB-train.txt','IMDB-test.txt']
for file_name in doc2vec_file:
    with open(file_name,'r') as f:
        lines = f.readlines()
        for line in lines:
            # x_train
            words_list = line.split()[1:]  # format: ["system", "response"]
            words_list_doc2vec = model.infer_vector(words_list)  # extract doc2vec features for this line
            X.append(words_list_doc2vec)
            # y_train
            y.append(int(line.split()[0]))
            total_examples = total_examples + 1

Saving the doc2vec Features

'''
Output format:
label feats
For example:
0 f1 f2 ... f200

where f1 ... f200 are the numeric doc2vec feature values
'''

with open('feat_doc2vec.txt','w') as f:
    for i in range(total_examples):
        x_str = [str(x) for x in list(X[i])]
        results = "{} {}".format(y[i]," ".join(x_str))
        f.write('{}\n'.format(results))

Splitting the Training and Test Sets

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

X = []
y = []
with open('feat_doc2vec.txt','r') as fr:
    lines = fr.readlines()
    for line in lines:
        line = line.strip()
        feat = [float(f) for f in line.split()[1:]]  # e.g. [0.2501368, 0.29046363, -0.1135941]
        X.append(feat)  # feature columns
        y.append(int(line.split()[0]))  # label column
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('X_train total_examples= ', len(X_train))
print('X_test  total_examples= ', len(X_test))
X_train total_examples=  35000
X_test  total_examples=  15000

Classification

Using the training data, train a logistic regression (LR) model.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Evaluate the classifier on the test data.

classifier.score(X_test,y_test)
0.8206

For this sentiment analysis task we reach about 82% accuracy (experimental results: 82% with 10 doc2vec training iterations vs. 79% with 2 iterations).
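
Beyond plain accuracy, per-class precision and recall are worth a look; a small sketch using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 on the held-out test split
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))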

You can try increasing the number of doc2vec training iterations and see how the results change; we won't run the full demonstration here.
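
A minimal sketch of that experiment, reusing the documents list built earlier (parameter names again follow gensim < 4.0; after retraining, the feature extraction and classification steps above would be rerun):

# Retrain doc2vec with more passes over the corpus
model_20 = Doc2Vec(documents, min_count=1, window=5, size=200,
                   sample=1e-4, negative=5, workers=4, iter=20)
model_20.save('./imdb_iter20.d2v')  # hypothetical filename for the retrained model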

Online Prediction on Text

This simulates an online prediction feature. In real work we would deploy a web service and then predict online.
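
For illustration only, such a service might look like the sketch below (Flask is an assumption here, not part of the original setup; it reuses the trained model and classifier from above):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON like {"text": "some movie review ..."}
    text = request.json['text']
    feat = model.infer_vector(text.split())     # doc2vec features
    label = int(classifier.predict([feat])[0])  # 1 = positive, 0 = negative
    return jsonify({'label': label})

if __name__ == '__main__':
    app.run(port=5000)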

First we predict on a positive sample (label 1):

text = 'my mother took me to see this film as a child and i long to see it every year as i do all of my other christmas favorites what i remember most was the silly\
 devil and santa looking through his telescope i waited and looked through the t v guide each year after that to see when it would be shown i would usually \
 find it playing on a saturday afternoon i only found the movie in english which took something special away from the film and have longed to find a copy of\
 it in spanish i hold this film dear to my heart and have never suffered from nightmares as others might suggest yes it is a different film about santa claus\
 and that is what makes it special and unique i can t wait to get a copy of this film and watch it with my children as i explain to them my favorite parts and memories '
words_list = text.split()  # token list; for Chinese you could tokenize with jieba and return a list
feat = model.infer_vector(words_list)  # extract doc2vec features for the input
print(classifier.predict([feat])[0])
1

Next we predict on a negative sample (label 0):

text2 = 'once again mr costner has dragged out a movie for far longer than necessary aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care the character we should really care about is a very cocky overconfident ashton kutcher the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a cluttered closet his only obstacle appears to be winning over costner finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts we are told why kutcher is driven to be the best with no prior inkling or foreshadowing no magic here it was all i could do to keep from turning it off an hour in '
feat2 = model.infer_vector(text2.split())  # extract doc2vec features for the input
print(classifier.predict([feat2])[0])
0

Summary: extracting features with doc2vec and then classifying

  1. The raw data needs to be preprocessed into the doc2vec training format with Python.
  2. doc2vec tuning options include the number of training iterations, the window size, and more; all are worth experimenting with.
  3. The doc2vec model can extract document features, and it can also retrieve words similar to a given word.
  4. We used a simple classifier here; other models are worth training for comparison (see the sketch after this list).
  5. This article does sentiment analysis on English movie data; the same pipeline can be switched to Chinese text classification, where the only change needed is tokenization, e.g. with jieba (see the sketch after this list).
  6. Environment: the above was tested on Windows 7 + Python 3.7.x.
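
For point 4, a quick sketch that swaps in a linear SVM on the same doc2vec features (reusing the X_train/X_test split from above):

from sklearn.svm import LinearSVC

# Same features, different classifier
svc = LinearSVC()
svc.fit(X_train, y_train)
print('LinearSVC accuracy:', svc.score(X_test, y_test))

For point 5, the only change for Chinese text is tokenization; a sketch assuming jieba is installed:

import jieba

# jieba.cut replaces the simple split() used for English text
words_list = list(jieba.cut('这部电影非常精彩'))  # example Chinese sentence
feat = model.infer_vector(words_list)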

References

[1] Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
[2] Paper that inspired this: http://arxiv.org/abs/1405.4053

