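Topic of this chapter: extracting sentence features with doc2vec, then using those doc2vec features for text classification
by @沈福利
Sentiment analysis with Doc2Vec
Word vectors are widely used in NLP. With Doc2Vec we can build a vector representation not only for a single word but also for a whole sentence or document. Being able to represent an entire sentence as a fixed-length vector means we can feed it straight into a standard classification algorithm, which is a remarkably convenient property.
Building on Word2Vec, this chapter uses Doc2Vec for a sentiment-analysis task.
IMDB corpus used for classification (positive or negative): http://www.cs.cornell.edu/people/pabo/movie-review-data/
Doc2Vec paper: "Distributed Representations of Sentences and Documents"
The dataset can also be downloaded from: https://download.csdn.net/download/shenfuli/11435197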
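Data preprocessing
The download contains 5 files. The four labeled files used in this chapter are:
- Negative samples
  test-neg.txt: 12,500 lines
  train-neg.txt: 12,500 lines
- Positive samples
  test-pos.txt: 12,500 lines
  train-pos.txt: 12,500 lines
Approach: because all of these files are labeled, we merge all the data and later split it again into training and test sets.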
'''
Convert the raw data to "label features" format.
Example:
1. Before conversion
1.1 Negative samples (label 0)
bromwell high is
homelessness or
brilliant over
1.2 Positive samples (label 1)
bromwell high
homelessness
brilliant over actin
2. After conversion
0 bromwell high is
0 homelessness or
0 brilliant over
1 bromwell high
1 homelessness
1 brilliant over actin
'''
import random
'''
Training data
'''
train_sentences = []
train_sources = {'train-pos.txt': 1, 'train-neg.txt': 0}
for k, v in train_sources.items():
    with open(k, 'r') as f:  # input file
        documents = f.readlines()
    for document in documents:
        data = "{} {}".format(v, document.strip())
        train_sentences.append(data)
random.shuffle(train_sentences)
with open('IMDB-train.txt', 'w') as fw:  # output file
    for text in train_sentences:
        fw.write("{}\n".format(text.strip()))
'''
Test data
'''
test_sentences = []
test_sources = {'test-pos.txt': 1, 'test-neg.txt': 0}
for k, v in test_sources.items():
    with open(k, 'r') as f:  # input file
        documents = f.readlines()
    for document in documents:
        data = "{} {}".format(v, document.strip())
        test_sentences.append(data)
random.shuffle(test_sentences)
with open('IMDB-test.txt', 'w') as fw:  # output file
    for text in test_sentences:
        fw.write("{}\n".format(text.strip()))
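The labeling step above can also be expressed as a small reusable function. The following is a minimal sketch rather than the author's code: the helper name `label_lines` and the `seed` argument are assumptions introduced for illustration, it works on in-memory lists instead of the IMDB files, and unlike the loops above it skips blank lines.

```python
import random


def label_lines(sources, seed=None):
    """Prefix every non-empty line with its label and shuffle the result.

    `sources` maps a label (0 or 1) to a list of raw text lines.
    `seed` is a hypothetical convenience for reproducible shuffling.
    """
    labeled = []
    for label, lines in sources.items():
        for line in lines:
            text = line.strip()
            if text:  # skip blank lines (the loops above keep them)
                labeled.append("{} {}".format(label, text))
    random.Random(seed).shuffle(labeled)
    return labeled


# Sample lines taken from the conversion example in the docstring above.
negative = ["bromwell high is\n", "homelessness or\n", "brilliant over\n"]
positive = ["bromwell high\n", "homelessness\n", "brilliant over actin\n"]
labeled = label_lines({0: negative, 1: positive}, seed=42)
for row in labeled:
    print(row)
```

Passing a fixed seed makes the shuffled order repeatable between runs, which can help when comparing experiments.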
Training the doc2vec model
# -*- coding: utf-8 -*-
"""
@brief : Convert the raw data to numeric doc2vec features and save the result locally
@author: shenfuli
"""
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import pickle
"""=====================================================================================================================
1 Read the raw data and do some simple processing
1.0 Input data format
label word1 word2 word3
1.1 Format of sentences
[['bromwell', 'high', 'is'],['bromwell-1', 'high-1', 'is-1']]
1.2 Format of labels
[1, 0]
"""
sentences = []
labels = []
total_examples = 0
doc2vec_file = ['IMDB-train.txt', 'IMDB-test.txt']
for file_name in doc2vec_file:
    with open(file_name, 'r') as f:
        lines = f.readlines()
    for line in lines:
        words_list = line.split()[1:]
        sentences.append(words_list)
        labels.append(int(line.split()[0]))
        total_examples = total_examples + 1
print('total_examples = ', total_examples)
"""=====================================================================================================================
2 doc2vec
"""
import time
t_start = time.time()
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
# Parameter names follow gensim 3.x; gensim 4.0+ renamed size to vector_size and iter to epochs
model = Doc2Vec(documents, min_count=1, window=5, size=200, sample=1e-4, negative=5, workers=4, iter=10)
t_end = time.time()
print("train doc2vec:{}min".format((t_end-t_start)/60))
total_examples = 50000
train doc2vec:2.4591749986012776min
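To make the tagging step concrete: each review is paired with its position in `sentences` as its only tag, so the vector learned for review `i` can later be looked up by that index. The sketch below uses a stand-in named tuple with the same two fields (`words`, `tags`) so that it runs without gensim; `Tagged` is a hypothetical name, not part of the library.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which has the same two fields.
Tagged = namedtuple("Tagged", "words tags")

sentences = [["bromwell", "high", "is"], ["brilliant", "over", "actin"]]
documents = [Tagged(words, [i]) for i, words in enumerate(sentences)]

for doc in documents:
    print(doc.tags, doc.words)
```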
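# Find the words most similar to a given word
# (model.wv.most_similar works in gensim 3.x and 4.x; model.most_similar was removed in 4.0)
model.wv.most_similar('good')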
[('great', 0.7389441728591919),
('decent', 0.7134653329849243),
('bad', 0.67735356092453),
('nice', 0.6550177931785583),
('alright', 0.6080873012542725),
('solid', 0.593616247177124),
('okay', 0.5911738872528076),
('fine', 0.5862981081008911),
('excellent', 0.5857179760932922),
('ok', 0.5735211372375488)]
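The scores in this list are cosine similarities between word vectors. As a minimal sketch of how such a score is computed (plain Python on made-up 3-dimensional vectors, not gensim's implementation):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
good = [1.0, 2.0, 0.0]
great = [2.0, 4.0, 0.0]   # same direction as `good`
other = [0.0, 0.0, 3.0]   # orthogonal to `good`

print(cosine_similarity(good, great))
print(cosine_similarity(good, other))
```

Note that 'bad' ranks high for 'good' in the output above: words that appear in similar contexts tend to end up close together even when their sentiment is opposite.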
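# Save the model
model.save('./imdb.d2v')
# Load the model
model = Doc2Vec.load('./imdb.d2v')
# Get the model's final matrix of document vectors (gensim 4.0+ names this attribute model.dv)
docvecs = model.docvecs
print('docvecs size ', len(docvecs))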
docvecs size 50000
Inferring text features with the doc2vec model
For a new piece of text, the trained doc2vec model can extract its features as follows:
vector = model.infer_vector(["system", "response"])
vector
array([ 0.00306992, 0.09880129, -0.06528975, -0.00518309, 0.00975273,
0.01179409, 0.06942781, -0.04205238, -0.08501416, -0.05424332,
-0.06491904, -0.03669789, 0.08301632, 0.00646909, -0.04971279,
0.0516566 , 0.07641225, -0.00031443, -0.03912795, 0.02805902,
-0.08065759, 0.01329106, 0.01258634, 0.01845446, -0.02991258,
-0.01598231, 0.0235482 , 0.0395251 , 0.02470151, -0.02085371,
-0.07026218, -0.03540489, 0.07731623, 0.03987392, -0.03982773,
-0.05196444, -0.02742134, -0.0139687 , 0.04230239, 0.03385718,
-0.03240554, 0.05211783, -0.05808661, 0.03530499, -0.0125432 ,
0.02180364, 0.01356464, 0.00464311, -0.001227 , 0.07899355,
0.02907492, 0.03771575, -0.04223124, 0.06633858, -0.12505487,
-0.02383417, -0.03901726, -0.05491796, 0.0612916 , -0.0878048 ,
0.10616346, 0.07216624, -0.00333762, 0.08654431, 0.05445853,
0.02146732, -0.04320022, 0.05699158, 0.07443393, 0.01716116,
-0.1271982 , 0.01133513, 0.04388591, -0.01050479, -0.07112896,
0.0483281 , 0.05188625, 0.04284834, -0.07416691, 0.05976778,
0.03085455, -0.11998461, -0.04980956, 0.07434805, 0.05816012,
-0.06460897, 0.0726747 , -0.09000178, -0.00420167, -0.06699272,
-0.01415103, 0.05362554, -0.05598104, 0.06874936, 0.07766063,
-0.02767419, -0.05826729, 0.0177945 , -0.00466617, -0.00092517,
0.04546773, -0.01783472, -0.05895144, 0.0861815 , 0.06412905,
-0.06007013, -0.00795694, 0.01949856, 0.05685847, 0.01587577,
0.04039665, 0.0010751 , 0.0091335 , -0.05000968, -0.03424346,
0.03919376, 0.01162066, 0.03560687, -0.08205847, -0.05824827,
0.07500459, 0.11294474, 0.07223377, -0.0019428 , 0.04507867,
-0.05088584, -0.10753464, 0.00572739, -0.03234218, 0.01748291,
0.01449764, -0.0109925 , -0.02165018, 0.05683647, -0.00354433,
0.11134962, 0.02013531, 0.0195157 , -0.00552291, 0.04200043,
-0.05738139, 0.06727964, -0.06666224, -0.00524103, -0.03265556,
-0.08127008, 0.01383475, 0.08094133, 0.03243712, 0.02253861,
-0.03518593, 0.01116148, -0.01571034, 0.02591805, -0.08882834,
-0.13974166, 0.10083718, 0.01317242, 0.03611537, 0.13182731,
0.01886472, 0.00244197, 0.06576921, -0.00229551, -0.08704369,
-0.02204782, 0.00840817, -0.07599292, 0.03328447, 0.00425508,
-0.06202874, -0.01922429, -0.04022821, 0.00466727, 0.02703563,
-0.09933387, 0.0278736 , 0.02954238, -0.03181218, -0.02765816,
-0.01375741, 0.04625296, 0.0182731 , -0.09098109, -0.00719439,
-0.03677104, 0.04303071, -0.10935018, -0.04317935, 0.05853789,
0.04001805, 0.03835495, 0.02120748, -0.07935818, -0.05919037,
0.01373552, 0.05611926, 0.00383895, -0.02817624, -0.01121639],
dtype=float32)
len(vector)
200
Convert the input data to label + doc2vec format. This step is somewhat slow, so be patient.
X = []  # doc2vec features
y = []  # label of each text line
total_examples = 0
doc2vec_file = ['IMDB-train.txt', 'IMDB-test.txt']
for file_name in doc2vec_file:
    with open(file_name, 'r') as f:
        lines = f.readlines()
    for line in lines:
        # features
        words_list = line.split()[1:]  # format: ["system", "response"]
        words_list_doc2vec = model.infer_vector(words_list)  # extract doc2vec features for this line
        X.append(words_list_doc2vec)
        # label
        y.append(int(line.split()[0]))
        total_examples = total_examples + 1
Saving the doc2vec features
'''
Output format:
label feats
For example:
0 f1 f2 .... f200
where f1 ... f200 are the numeric doc2vec values
'''
with open('feat_doc2vec.txt', 'w') as f:
    for i in range(total_examples):
        x_str = [str(x) for x in list(X[i])]
        results = "{} {}".format(y[i], " ".join(x_str))
        f.write('{}\n'.format(results))
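The "label f1 f2 ..." text format is easy to check with a write-then-read round trip. The following is a minimal sketch on a short made-up vector; the helper names `format_feature_line` and `parse_feature_line` are hypothetical and not part of the original code.

```python
def format_feature_line(label, vector):
    """Serialize one example as 'label f1 f2 ... fn'."""
    return "{} {}".format(label, " ".join(str(x) for x in vector))


def parse_feature_line(line):
    """Inverse of format_feature_line: return (label, list of floats)."""
    fields = line.split()
    return int(fields[0]), [float(f) for f in fields[1:]]


line = format_feature_line(1, [0.2501368, 0.29046363, -0.1135941])
print(line)
label, feats = parse_feature_line(line)
print(label, feats)
```

Splitting each line once and reusing the fields avoids calling `split()` twice per line, which adds up over 50,000 lines of 200 values each.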
Splitting into training and test sets
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X = []
y = []
with open('feat_doc2vec.txt', 'r') as fr:
    lines = fr.readlines()
for line in lines:
    line = line.strip()
    feat = [float(f) for f in line.split()[1:]]  # e.g. [0.2501368, 0.29046363, -0.1135941]
    X.append(feat)  # feature columns
    y.append(int(line.split()[0]))  # label column
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('X_train total_examples= ', len(X_train))
print('X_test total_examples= ', len(X_test))
X_train total_examples= 35000
X_test total_examples= 15000
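Conceptually, a 70/30 split shuffles the example indices once and cuts the shuffled list in two, keeping each feature row aligned with its label. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation: the name `shuffle_split` is hypothetical, rounding the test size up is an assumption of this sketch, and the same seed will not reproduce the rows `train_test_split` selects.

```python
import math
import random


def shuffle_split(X, y, test_size=0.3, seed=42):
    """Shuffle indices once, then cut off the first part as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)  # assumption: round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: ten 1-dimensional feature rows with alternating labels.
X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = shuffle_split(X, y)
print(len(X_train), len(X_test))
```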
Classification
Train a logistic regression (LR) model on the training data
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Evaluate the classifier on the test data
classifier.score(X_test,y_test)
0.8206
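On this sentiment-analysis task we reach about 82% accuracy (experimental data: 82% with 10 doc2vec iterations, 79% with 2 iterations).
We could increase the number of doc2vec iterations further and see how the result changes; that is not demonstrated here.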
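For a classifier, `score` reports mean accuracy: the fraction of test examples whose predicted label matches the true label. A minimal sketch of that calculation on made-up labels (the `accuracy` helper is hypothetical, not scikit-learn code):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Invented labels for illustration: 4 of the 5 predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 0.8
```

On the 15,000 test examples above, a score of 0.8206 corresponds to 12,309 correct predictions.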
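Online prediction for a single line of text
Here we simulate online prediction. In real work we would deploy a web service and run predictions through it.
First we predict on a positive sample (label 1)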
# A space precedes each line-continuation backslash so that words on adjacent lines are not fused together
text = 'my mother took me to see this film as a child and i long to see it every year as i do all of my other christmas favorites what i remember most was the silly \
devil and santa looking through his telescope i waited and looked through the t v guide each year after that to see when it would be shown i would usually \
find it playing on a saturday afternoon i only found the movie in english which took something special away from the film and have longed to find a copy of \
it in spanish i hold this film dear to my heart and have never suffered from nightmares as others might suggest yes it is a different film about santa claus \
and that is what makes it special and unique i can t wait to get a copy of this film and watch it with my children as i explain to them my favorite parts and memories '
words_list = text.split()  # token list; for Chinese text, jieba can be used to segment into a list
feat = model.infer_vector(words_list)  # extract doc2vec features for the input
print(classifier.predict([feat])[0])
1
Next we predict on a negative sample (label 0)
text2 = 'once again mr costner has dragged out a movie for far longer than necessary aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care the character we should really care about is a very cocky overconfident ashton kutcher the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a cluttered closet his only obstacle appears to be winning over costner finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts we are told why kutcher is driven to be the best with no prior inkling or foreshadowing no magic here it was all i could do to keep from turning it off an hour in '
feat2 = model.infer_vector(text2.split())  # extract doc2vec features for the input
print(classifier.predict([feat2])[0])
0
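The two predictions above repeat the same three steps: tokenize, infer a vector, classify. If this were to sit behind a web service, those steps could be wrapped in one function. The following is a minimal sketch under stated assumptions: `predict_sentiment` is a hypothetical helper, and `StubModel` and `StubClassifier` are toy stand-ins that only exist so the example runs without the trained doc2vec model and LR classifier.

```python
def predict_sentiment(text, model, classifier):
    """Tokenize on whitespace, infer a doc2vec vector, and classify it.

    `model` is expected to offer infer_vector(list_of_words) and
    `classifier` predict(list_of_vectors), as the objects above do.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no tokens")
    features = model.infer_vector(words)
    return int(classifier.predict([features])[0])


# Hypothetical stand-ins so the sketch runs without a trained model.
class StubModel:
    def infer_vector(self, words):
        # Toy 1-dimensional "embedding": +1 per "good", -1 per "bad".
        return [float(words.count("good") - words.count("bad"))]


class StubClassifier:
    def predict(self, rows):
        return [1 if row[0] > 0 else 0 for row in rows]


print(predict_sentiment("a good film with good acting", StubModel(), StubClassifier()))
print(predict_sentiment("a bad film", StubModel(), StubClassifier()))
```

With the real objects, the call would be `predict_sentiment(text, model, classifier)`, using the `model` and `classifier` trained earlier.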
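Summary: using doc2vec to extract features and then classify
- The raw data has to be preprocessed in Python into the doc2vec training format
- doc2vec tuning mainly concerns the number of iterations, the window size and similar parameters, all of which are worth trying
- A doc2vec model can extract features and can also return the words most similar to a given word
- The classifier here is deliberately simple; other models can be trained to compare results
- This article runs sentiment analysis on English movie reviews. The same approach carries over to Chinese text classification; the only part that needs to change is tokenization, for example by using jieba
- Environment: the above was tested on Windows 7 + Python 3.7.x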
References
[1]Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
[2]Paper that inspired this: http://arxiv.org/abs/1405.4053