Main topics of this chapter: feature extraction based on word2vec + text classification
Fine-tuning
Fine-tuning means taking a model that someone else has already trained and training it further on our own data. It is roughly equivalent to reusing the first few layers of the existing model to extract shallow features, and then attaching our own classifier at the end. The advantage is that we do not have to train a model completely from scratch, which improves efficiency: a freshly initialized model usually starts from very low accuracy and climbs slowly, whereas fine-tuning reaches good results after relatively few iterations. When the amount of data is not very large, fine-tuning is usually a good choice; if you want to define your own network architecture, however, you have to start from scratch.
Word vectors are widely used in NLP. With word2vec we can produce vector representations not only for individual words but also for whole sentences or documents. Imagine representing an entire sentence with a fixed-length vector and then running a standard classification algorithm on it: that is a remarkably useful idea.
This chapter builds on Word2Vec to solve a text classification problem.
The main topics covered are:
- Data preprocessing
- Training a word2vec model from scratch on the full corpus
- Incrementally training the word2vec model (fine-tuning)
- Feature engineering
- Training LR and xgboost models
- Online prediction
Processing pipeline
Text data -> data preprocessing -> incremental word2vec training -> word2vec feature representation of each text + label -> LR and xgboost training -> online prediction
Classification results for LR and xgboost

Label | Precision(xgb) | Recall(xgb) | F1(xgb) | Precision(lr) | Recall(lr) | F1(lr)
---|---|---|---|---|---|---
entertainment | 0.899384 | 0.918056 | 0.908624 | 0.910312 | 0.919852 | 0.915057
technology | 0.820596 | 0.868474 | 0.843856 | 0.843728 | 0.862625 | 0.853072
sports | 0.934944 | 0.925174 | 0.930033 | 0.943003 | 0.933942 | 0.938450
military | 0.911779 | 0.883162 | 0.897242 | 0.903895 | 0.898511 | 0.901195
car | 0.913521 | 0.798527 | 0.852162 | 0.899191 | 0.854260 | 0.876150
Overall | 0.894573 | 0.893277 | 0.893347 | 0.902478 | 0.902138 | 0.902216
Import libraries
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# train_word2vec_model.py: trains the word2vec model
import logging
import os.path
import sys
import multiprocessing
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
# sklearn
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
# silence warnings
import warnings
warnings.filterwarnings('ignore')
# category -> label id mapping for the articles
categories_label = {'entertainment': 0, 'technology': 1, 'sports': 2, 'military': 3, 'car': 4}
print('gensim version = ', gensim.__version__)
gensim version = 3.1.0
Prepare the data
"""=====================================================================================================================
1 读取原始数据,并进行简单处理
1.0 数据数据格式
word1 word2 word3 label
1.1 sentences 格式
[['bromwell', 'high', 'is'],['bromwell-1', 'high-1', 'is-1']]
1.2 labels 格式
[1, 0]
"""
sentences = []
labels = []
total_examples = 0
with open("../data/news.csv", 'r', encoding='utf8') as f:
    lines = f.readlines()
    for line in lines:
        splits = line.split(' ')
        feat = splits[:-1]    # all tokens except the last
        label = splits[-1]    # the last token is the label
        if label.strip() in categories_label:
            sentences.append(feat)
            labels.append(categories_label[label.strip()])  # collect each label id in a list
            total_examples = total_examples + 1
print(sentences[:2])
print(labels[:2])
print('total_examples = ',total_examples)
[['另一边', '舞王', '韩庚', '跟随', '欢乐', '起舞', '八十年代', '迪斯科', '舞步', '轮番上阵', '场面', '精彩', '歌之夜', '敬请期待', '浙江', '卫视', '2017', '周五', '00', '畅意', '100%', '乳酸菌', '饮品', '独家', '冠名', '二十四', '小时', '第二季', '水手', '欢乐', '出发'], ['三是', '改变', '割裂', '状况', '建立', '一体化', '防御', '体系']]
[0, 1]
total_examples = 109477
word2vec model training
Train the word2vec model
# size: vector dimensionality; window: context window; min_count: ignore rarer words
model = Word2Vec(sentences, size=200, window=5, min_count=5, workers=multiprocessing.cpu_count())
model.vector_size
200
# Save the model in two formats
model.save('../model/news_word2vec_200.w2v')
model.wv.save_word2vec_format('../model/news_word2vec_200.bin', binary=False)
# Load the saved word vectors back
model_ = model.wv.load_word2vec_format('../model/news_word2vec_200.bin', binary=False)
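The two formats serve different purposes: model.save keeps the full training state, so the model can be fine-tuned later (as done below), while save_word2vec_format writes only the word vectors. A minimal sketch of the two load paths, reusing the paths above:

from gensim.models import KeyedVectors, Word2Vec
full_model = Word2Vec.load('../model/news_word2vec_200.w2v')  # full model: can be trained further
vectors_only = KeyedVectors.load_word2vec_format(
    '../model/news_word2vec_200.bin', binary=False)           # vectors only: lookup, no further training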
Get the words most similar to a given word
model_.most_similar('马云')
[('董事局', 0.8884702920913696),
('联席', 0.8768831491470337),
('致辞', 0.8764935731887817),
('会上', 0.8717509508132935),
('揭牌', 0.8496181964874268),
('马化腾', 0.8490221500396729),
('内塔尼亚胡', 0.8426049947738647),
('清华大学', 0.8364063501358032),
('萧泓', 0.8296477794647217),
('全国政协', 0.8289929032325745)]
model_.most_similar('人工智能')
[('AI', 0.9426432251930237),
('派别', 0.7992518544197083),
('科学家', 0.7718576192855835),
('Argo', 0.7710253596305847),
('人类', 0.7707951664924622),
('李彦宏', 0.7664409875869751),
('吴恩达', 0.7654942870140076),
('机器', 0.7499843835830688),
('李开复', 0.7363733649253845),
('围棋', 0.7349368333816528)]
Incremental training of the word2vec model
# Load the pre-trained word2vec model (the full model, so it can be trained further)
pre_trained = '../model/news_word2vec_200.w2v'  # path to the pre-trained word vectors
clf = Word2Vec.load(pre_trained)
clf.most_similar('人工智能')
[('AI', 0.9426432847976685),
('派别', 0.7992516160011292),
('科学家', 0.7718576192855835),
('Argo', 0.7710260152816772),
('人类', 0.7707951664924622),
('李彦宏', 0.7664410471916199),
('吴恩达', 0.7654943466186523),
('机器', 0.7499844431877136),
('李开复', 0.736373245716095),
('围棋', 0.7349367737770081)]
# Incremental training of the model
clf.build_vocab(sentences, update=True)  # update the vocabulary with the new corpus
# epochs: number of passes over the corpus (clf.iter, default 5); total_examples: number of sentences
clf.train(sentences, total_examples=clf.corpus_count, epochs=clf.iter)
16234895
clf.vector_size
200
clf.most_similar('人工智能')
[('AI', 0.8695564866065979),
('人类', 0.6560510396957397),
('计算技术', 0.651934802532196),
('自然语言', 0.6484375),
('AlphaGo', 0.6366993188858032),
('李彦宏', 0.6268306970596313),
('算法', 0.6251788139343262),
('工业界', 0.624036431312561),
('围棋', 0.6237311363220215),
('科学家', 0.6062170267105103)]
Feature engineering
# Element-wise addition of two lists, with the help of np.array()
w1 = [1, 2, 4]
w2 = [3, 2, 7]
import numpy as np
[w_vec / 2 for w_vec in list(np.array(w1) + np.array(w2))]
[2.0, 2.0, 5.5]
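The same element-wise average can be computed in a single numpy call (a small equivalent sketch):

np.mean([w1, w2], axis=0)  # -> array([2. , 2. , 5.5])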
# A quick check that exceptions can be swallowed (used below for out-of-vocabulary words)
try:
    print(2 / 0)
except Exception as e:
    print('ok1')
ok1
clf['中国']
array([ 0.96740896, -1.6865271 , 1.2900462 , -1.3705163 , -1.915153 ,
-1.1933858 , 0.8096983 , 0.48648018, -0.9326265 , -1.4908819 ,
0.56121856, 2.015987 , -0.5252904 , 0.27580634, 2.238747 ,
0.89014125, 0.4018909 , 2.0528162 , -0.32533732, 0.8737432 ,
1.6745373 , -0.38107952, -0.86439705, -0.40904617, 0.9813328 ,
-1.145166 , 0.39486435, -0.21861482, -0.7214359 , -1.7244024 ,
-1.8221827 , -0.29753602, 1.004355 , 0.7812221 , 1.2393422 ,
-0.15155244, 1.5179119 , 1.6139843 , -0.9574509 , 0.8910288 ,
0.82247436, -1.7296616 , 0.36545 , -0.2216244 , 1.2839543 ,
0.1663613 , 1.7028248 , 0.91052604, -1.9714876 , 1.7771893 ,
-0.4010989 , -1.3458189 , 0.07643531, 0.9614052 , -0.12865292,
0.18776867, 1.0256236 , 0.2441487 , 1.7315062 , -0.94587404,
1.8154999 , -0.68458503, 3.1634037 , 1.3197032 , -0.42643756,
0.56886035, 1.71931 , 0.8004137 , -0.0793217 , -0.35664517,
1.336351 , -0.96460015, 0.55825627, 1.1372244 , 0.3129831 ,
1.4270234 , -0.05844976, -1.2512728 , 1.6184946 , -0.10997374,
-0.54428136, -0.27977905, -0.25261492, -0.00664744, 0.5957666 ,
-2.2385325 , 0.2673022 , -1.6742384 , -1.7813787 , 0.3536596 ,
-0.46775097, 0.5429149 , 0.03727146, -0.16933414, 1.3233528 ,
0.5205234 , -1.8943106 , 0.31667498, -0.50464225, 1.316418 ,
0.47614253, -0.50109404, 1.2957754 , -0.678785 , -0.4088726 ,
1.0855002 , -0.01045603, 0.10198852, 2.016549 , 2.0824132 ,
-1.0874176 , 0.38534838, 0.6253132 , -1.1764988 , -1.0011297 ,
1.5049071 , 0.12407725, 0.9971452 , -0.32189086, 0.18175083,
0.43934143, 0.10077734, 0.11211097, 1.0505838 , -0.5502816 ,
-1.0500243 , 1.8056464 , -1.3403738 , 0.01678614, 0.48729068,
1.0449888 , -0.13342255, 0.2285603 , -1.5818268 , 1.3460234 ,
-1.9437159 , 2.231045 , -0.15514365, 0.2713421 , -0.06743097,
0.32509503, 3.378762 , 0.6910759 , -0.37092376, -0.9336253 ,
0.58801264, -0.098266 , -0.06897759, -0.22015525, 0.4771631 ,
0.9061742 , 0.976404 , 0.43377542, -1.0387566 , -0.8317068 ,
-0.9258896 , -1.8652458 , 0.7119291 , -1.0494678 , -0.57748735,
0.07536792, -1.8017843 , 0.8706263 , 0.9734037 , -0.08668977,
0.10478871, 0.61765945, -0.10407893, -1.1222907 , 0.7849265 ,
-1.1379696 , -0.46689153, 0.84639096, -2.6367438 , 0.96062094,
0.1998642 , -1.1485023 , 0.19304404, -1.5591619 , -1.2127926 ,
-0.6445032 , 1.4753993 , -0.622278 , 0.56400865, -0.38170522,
0.40417033, -0.0315093 , 1.2562262 , 0.38456756, 0.3380754 ,
0.346896 , -0.3114489 , 1.1451244 , -1.4440751 , 1.4981574 ,
-0.7149278 , 1.6307881 , -1.0674877 , 0.14921786, 0.711246 ],
dtype=float32)
def get_words_vec(words):
    '''
    Get the vector of a sentence: the average of all its word vectors.
    words format: ['中国2', '中国1']
    '''
    w2v_sum = []
    total = 0
    for word in words:
        try:
            w2v_tmp = clf[word]
            total = total + 1
            if len(w2v_sum) > 0:
                w2v_sum = list(np.array(w2v_tmp) + np.array(w2v_sum))
            else:
                w2v_sum = list(np.array(w2v_tmp))
        except Exception as e:
            pass  # out-of-vocabulary word: skip it
    if len(w2v_sum) == 200:
        w2v_documents = [float("{:.6f}".format(w2v / total)) for w2v in w2v_sum]
        return w2v_documents
    else:
        return []
words = ['杨凯']  # '杨凯' is out of vocabulary, so an empty list comes back
words_vec = get_words_vec(words)
print(len(words_vec))
0
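For reference, a more compact vectorized variant of get_words_vec (a sketch under the same assumptions: the fine-tuned clf model is in scope, and get_words_vec_np is a hypothetical name). It collects the in-vocabulary vectors and lets numpy average them in one step:

def get_words_vec_np(words):
    # keep only the words present in the vocabulary (hypothetical helper)
    vecs = [clf[w] for w in words if w in clf.wv]
    if not vecs:
        return []  # every word was out of vocabulary
    return [float('{:.6f}'.format(v)) for v in np.mean(vecs, axis=0)]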
X = []  # averaged word2vec (doc2vec-style) features
y = []  # label of each text line
total_examples = 0
with open("../data/news.csv", 'r', encoding='utf8') as f:
    lines = f.readlines()
    for line in lines:
        splits = line.split(' ')
        words_list = splits[:-1]
        label = splits[-1]
        if label.strip() in categories_label:
            # x_train: extract the doc2vec-style feature for this line
            words_list_doc2vec = get_words_vec(words_list)
            if len(words_list_doc2vec) > 0:
                X.append(words_list_doc2vec)
                # y_train
                y.append(categories_label[label.strip()])
                total_examples = total_examples + 1
print('total_examples = ',total_examples)
print('X size = ',len(X))
print('y size = ',len(y))
total_examples = 109472
X size = 109472
y size = 109472
print(X[0])
[-0.186026, -0.471656, -0.403095, -0.16298, -0.045337, -0.062552, -0.156995, 0.267478, 0.225991, 0.369533, -0.335204, 0.179532, -0.074346, -0.293882, 0.19305, 0.079494, 0.270315, -0.626937, 0.560467, -0.266492, 0.02279, 0.037634, 0.538005, -0.285794, 0.341812, -0.400007, 0.339306, -0.191563, -0.56323, -0.405104, -0.190517, 0.273558, -0.244152, 0.354516, 0.032482, 0.455198, 0.549414, -0.140646, 0.136085, -0.299392, 0.104765, 0.190382, -0.05006, -0.124277, 0.285367, 0.234464, 0.240897, 0.072967, 0.478128, -0.193547, -0.192899, 0.033353, -0.326732, 0.101749, 0.197146, -0.240307, 0.063487, 0.198056, -0.254162, -0.270419, 0.195258, 0.415513, 0.673777, 0.518012, -0.108143, 0.261305, 0.377806, -0.203772, 0.272787, 0.282682, -0.552932, 0.072241, -0.091994, -0.100113, -0.400602, 0.118488, 0.317893, 0.434966, 0.113871, -0.081931, -0.133279, -0.500441, 0.07054, 0.160294, 0.110109, 0.363486, 0.073197, -0.074275, -0.366876, 0.17156, 0.262335, -0.469497, 0.26584, -0.516757, 0.339812, 0.515125, -0.272341, -0.039567, 0.38705, 0.163687, -0.588046, -0.363793, -0.091529, 0.13896, 0.104893, -0.071686, 0.3933, -0.686791, 0.213298, -0.558621, 0.010017, -0.0913, -0.447628, 0.103621, 0.312466, -0.197491, -0.079608, 0.496554, -0.41856, -0.286083, -0.043195, -0.730198, 0.182539, -0.082423, -0.115626, 0.188141, -0.419775, 0.179798, 0.066709, 0.088201, 0.59903, -0.132807, 0.302099, 0.10871, -0.380457, -0.220652, -0.65525, -0.439337, -0.417222, 0.244528, 0.876633, 0.471831, 0.326137, -0.370162, -0.228948, 0.376438, -0.406891, -0.098517, 0.360538, -0.290551, -0.192426, 0.080742, 0.244631, -0.371981, 0.591938, -0.193152, 0.390026, 0.810614, -0.076158, -0.001152, 0.631901, 0.324699, 0.393992, -0.095888, 0.1158, -0.191288, 0.08339, -0.582966, 0.182854, 0.440838, 0.276029, 0.325805, 0.046083, 0.197688, -0.512184, -0.155932, 0.258983, 0.255194, 0.204922, -0.10756, 0.174984, 0.236015, -0.025575, -0.315343, -0.292815, 0.202672, -0.119458, -0.251092, 0.127497, 0.066032, 0.209923, -0.28848, -0.366751, 0.290316, -0.07157, -0.038532, -0.194492, 0.400465, 0.155835, 0.396084]
print(y[9])
1
Save the word2vec features
'''
Output format:
label feats
e.g.
0 f1 f2 .... f200
where f1 ... f200 are the numeric doc2vec features
'''
with open('../data/news_feat_doc2vec.txt', 'w') as f:
    for i in range(total_examples):
        x_str = [str(x) for x in list(X[i])]
        results = "{} {}".format(y[i], " ".join(x_str))
        f.write('{}\n'.format(results))
Split into training and test sets
X = []
y = []
with open('../data/news_feat_doc2vec.txt', 'r') as fr:
    lines = fr.readlines()
    for line in lines:
        line = line.strip()
        feat = [float(f) for f in line.split()[1:]]  # feature columns, e.g. [0.2501368, 0.29046363, -0.1135941]
        X.append(feat)
        y.append(int(line.split()[0]))  # label column
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('X_train total_examples = ', len(X_train))
print('X_test total_examples = ', len(X_test))
X_train total_examples =  76630
X_test total_examples =  32842
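If the split should preserve the class proportions, train_test_split also accepts a stratify argument (a variant sketch, not used in the runs below):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)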
Classifier training
# the category names (insertion order matches the label ids 0-4)
category = list(categories_label.keys())
category
['entertainment', 'technology', 'sports', 'military', 'car']
# Build a report table of per-class precision, recall, F1-score and support
import pandas as pd
def eval_model(y_true, y_pred, labels):
    # Precision, recall, F1 and support for each class
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Support-weighted averages over all classes
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        'Label': labels,
        'Precision': p,
        'Recall': r,
        'F1': f1,
        'Support': s
    })
    res2 = pd.DataFrame({
        'Label': ['Overall'],
        'Precision': [tot_p],
        'Recall': [tot_r],
        'F1': [tot_f1],
        'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]
Classification with LR
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
lr.score(X_test,y_test)
0.9021375068509835
y_pred = lr.predict(X_test)
eval_model(y_test,y_pred, category)
  | Label | Precision | Recall | F1 | Support
---|---|---|---|---|---
0 | entertainment | 0.910312 | 0.919852 | 0.915057 | 10019
1 | technology | 0.843728 | 0.862625 | 0.853072 | 7010
2 | sports | 0.943003 | 0.933942 | 0.938450 | 8326
3 | military | 0.903895 | 0.898511 | 0.901195 | 4365
4 | car | 0.899191 | 0.854260 | 0.876150 | 3122
999 | Overall | 0.902478 | 0.902138 | 0.902216 | 32842
Classification with xgboost
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 5,                # number of classes, required with multi:softmax
    'gamma': 0.1,                  # post-pruning control; larger is more conservative, typically 0.1-0.2
    'max_depth': 12,               # tree depth; deeper trees overfit more easily
    'lambda': 2,                   # L2 regularization weight; larger values make overfitting less likely
    'subsample': 0.7,              # random row subsampling of the training examples
    'colsample_bytree': 0.7,       # column subsampling when growing each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 suppresses the run log; set it to 0 to see the output
    'eta': 0.007,                  # works like a learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}
watchlist = [(dtrain,'train'),(dval,'val')]
xgb_model = xgb.train(params,dtrain,num_boost_round=50,evals = watchlist,early_stopping_rounds=20)
[0] train-merror:0.098839 val-merror:0.157847
Multiple eval metrics have been passed: 'val-merror' will be used for early stopping.
Will train until val-merror hasn't improved in 20 rounds.
[1] train-merror:0.07698 val-merror:0.137233
[2] train-merror:0.067898 val-merror:0.12685
[3] train-merror:0.063904 val-merror:0.122404
[4] train-merror:0.06102 val-merror:0.11872
[5] train-merror:0.059311 val-merror:0.116467
[6] train-merror:0.058202 val-merror:0.116589
[7] train-merror:0.057249 val-merror:0.115523
[8] train-merror:0.056244 val-merror:0.113726
[9] train-merror:0.056571 val-merror:0.112843
[10] train-merror:0.056114 val-merror:0.111595
[11] train-merror:0.055696 val-merror:0.111473
[12] train-merror:0.055253 val-merror:0.11123
[13] train-merror:0.054952 val-merror:0.110773
[14] train-merror:0.054913 val-merror:0.110286
[15] train-merror:0.054691 val-merror:0.109524
[16] train-merror:0.054 val-merror:0.110103
[17] train-merror:0.053713 val-merror:0.109738
[18] train-merror:0.05336 val-merror:0.109403
[19] train-merror:0.052812 val-merror:0.109311
[20] train-merror:0.052421 val-merror:0.109159
[21] train-merror:0.052225 val-merror:0.108794
[22] train-merror:0.051833 val-merror:0.108124
[23] train-merror:0.051507 val-merror:0.108154
[24] train-merror:0.051546 val-merror:0.108093
[25] train-merror:0.051024 val-merror:0.108306
[26] train-merror:0.050711 val-merror:0.108246
[27] train-merror:0.050672 val-merror:0.107728
[28] train-merror:0.05032 val-merror:0.107971
[29] train-merror:0.050515 val-merror:0.107911
[30] train-merror:0.050254 val-merror:0.10788
[31] train-merror:0.050176 val-merror:0.107606
[32] train-merror:0.049915 val-merror:0.107667
[33] train-merror:0.050033 val-merror:0.107606
[34] train-merror:0.049889 val-merror:0.107667
[35] train-merror:0.049902 val-merror:0.107484
[36] train-merror:0.049641 val-merror:0.107576
[37] train-merror:0.049563 val-merror:0.107941
[38] train-merror:0.049498 val-merror:0.107758
[39] train-merror:0.049302 val-merror:0.10785
[40] train-merror:0.049237 val-merror:0.107576
[41] train-merror:0.04908 val-merror:0.107545
[42] train-merror:0.049119 val-merror:0.107332
[43] train-merror:0.049171 val-merror:0.10718
[44] train-merror:0.049002 val-merror:0.107302
[45] train-merror:0.049028 val-merror:0.107119
[46] train-merror:0.048741 val-merror:0.106723
[47] train-merror:0.048767 val-merror:0.106936
[48] train-merror:0.048545 val-merror:0.10651
[49] train-merror:0.04861 val-merror:0.106723
y_pred_xgb = xgb_model.predict(xgb.DMatrix(X_test))
eval_model(y_test,y_pred_xgb, category)
  | Label | Precision | Recall | F1 | Support
---|---|---|---|---|---
0 | entertainment | 0.899384 | 0.918056 | 0.908624 | 10019
1 | technology | 0.820596 | 0.868474 | 0.843856 | 7010
2 | sports | 0.934944 | 0.925174 | 0.930033 | 8326
3 | military | 0.911779 | 0.883162 | 0.897242 | 4365
4 | car | 0.913521 | 0.798527 | 0.852162 | 3122
999 | Overall | 0.894573 | 0.893277 | 0.893347 | 32842
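Because early_stopping_rounds was passed to xgb.train, the booster also records the best round seen on the validation set; prediction can optionally be limited to that many trees (a sketch for this 0.x-era xgboost API):

y_pred_best = xgb_model.predict(xgb.DMatrix(X_test), ntree_limit=xgb_model.best_ntree_limit)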
Once training is finished, the models above should be saved; we skip that here.
Artifacts that need to be saved (a persistence sketch follows this list):
- the LR model
- the xgboost model
- the label <-> category name mapping
- the word2vec word vectors
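A minimal persistence sketch; the file paths are illustrative assumptions, and joblib is just one common way to pickle sklearn models:

import json
import joblib

joblib.dump(lr, '../model/lr_model.pkl')                  # LR model (illustrative path)
xgb_model.save_model('../model/xgb_model.bin')            # xgboost model (illustrative path)
with open('../model/label_map.json', 'w') as f:           # label <-> category mapping
    json.dump(categories_label, f, ensure_ascii=False)
clf.save('../model/news_word2vec_200_finetuned.w2v')      # fine-tuned word2vec model (illustrative path)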
Online prediction
In practice, the models above are serialized and then loaded when the server starts.
Text -> word vector
label_category_name = dict((v, k) for k, v in categories_label.items())  # invert the mapping: id -> category name
text = '世界杯 扩军 残酷 国足 这股 东风'
text_vec = get_words_vec(text.split())
print('text_vec length = ',len(text_vec))
text_vec length = 200
LR model prediction
lr_y_pred_results = lr.predict([text_vec])[0]
print("lr predict id:{}->name:{}".format(lr_y_pred_results, label_category_name[lr_y_pred_results]))
lr predict id:2->name:sports
xgboost prediction
y_pred_xgb_results = int(xgb_model.predict(xgb.DMatrix([text_vec]))[0])
print("xgboost predict id:{}->name:{}".format(y_pred_xgb_results, label_category_name[y_pred_xgb_results]))
xgboost predict id:2->name:sports
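Putting the serving path together: a small helper can route a raw text through the shared feature step and either model (a sketch; predict_category is a hypothetical name, everything else follows the code above):

def predict_category(text, model):
    vec = get_words_vec(text.split())
    if not vec:
        return None  # every word was out of vocabulary
    if isinstance(model, LogisticRegression):
        label_id = int(model.predict([vec])[0])
    else:  # assume an xgboost Booster
        label_id = int(model.predict(xgb.DMatrix([vec]))[0])
    return label_category_name[label_id]

print(predict_category('世界杯 扩军 残酷 国足 这股 东风', lr))         # -> sports
print(predict_category('世界杯 扩军 残酷 国足 这股 东风', xgb_model))  # -> sports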