Natural Language Processing (NLP): 06 Training a Chinese word2vec Model for Text Classification

This chapter covers: word2vec-based feature extraction + text classification.

finetune

Fine-tuning means taking a model that someone else has already trained and continuing to train it on our own data. It is equivalent to reusing the early layers of the pretrained model to extract shallow features, which are then fed into our own classifier at the end. The advantage is that we do not have to retrain the model from scratch, which is far more efficient: a newly trained model usually starts at very low accuracy and improves slowly, whereas fine-tuning reaches a good result after relatively few iterations. When the amount of data is limited, fine-tuning is usually the better choice; if you want to define your own network architecture, however, you have to start from scratch.
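
As a preview, this fine-tune pattern takes just three gensim calls: load the pretrained model, extend its vocabulary with the new corpus, and continue training. A minimal sketch (the file name and toy corpus are placeholders; the full version appears in the incremental-training section below):

from gensim.models import Word2Vec

# Toy corpus standing in for our own data (each sentence is a list of tokens).
new_sentences = [['国足', '比赛'], ['人工智能', '围棋']]

model = Word2Vec.load('pretrained.w2v')        # hypothetical pretrained model file
model.build_vocab(new_sentences, update=True)  # add new words to the vocabulary
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.iter)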

Word vectors are used widely in NLP. By introducing word2vec, we can represent not only individual words as vectors but also entire sentences or documents. Imagine being able to represent a whole sentence as a fixed-length vector; we can then apply any standard classification algorithm to it. That is a remarkably powerful idea.

This chapter builds on word2vec to solve a text classification problem.

The main steps are:

  1. Data preprocessing
  2. Full training of a word2vec model
  3. Incremental training of the word2vec model (fine-tuning)
  4. Feature engineering
  5. Training LR and xgboost models
  6. Online prediction

Processing pipeline

Text data -> data preprocessing -> incremental word2vec training -> word2vec feature representation of each text + label -> LR and xgboost training -> online prediction

LR and xgboost classification results

Label          Precision(xgb)  Recall(xgb)  F1(xgb)   Precision(lr)  Recall(lr)  F1(lr)
entertainment  0.899384        0.918056     0.908624  0.910312       0.919852    0.915057
technology     0.820596        0.868474     0.843856  0.843728       0.862625    0.853072
sports         0.934944        0.925174     0.930033  0.943003       0.933942    0.938450
military       0.911779        0.883162     0.897242  0.903895       0.898511    0.901195
car            0.913521        0.798527     0.852162  0.899191       0.854260    0.876150
Overall        0.894573        0.893277     0.893347  0.902478       0.902138    0.902216

Importing libraries

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# train_word2vec_model.py: train the word2vec model
import logging
import os.path
import sys
import multiprocessing
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# sklearn metrics
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# suppress warnings
import warnings
warnings.filterwarnings('ignore')
# category -> label id mapping for the corpus
categories_label = {'entertainment': 0, 'technology': 1, 'sports': 2, 'military': 3, 'car': 4}
print('gensim version = ', gensim.__version__)
gensim version =  3.1.0

Preparing the data

"""=====================================================================================================================
1 读取原始数据,并进行简单处理

1.0 数据数据格式
word1 word2 word3 label

1.1 sentences 格式
[['bromwell', 'high', 'is'],['bromwell-1', 'high-1', 'is-1']]

1.2 labels 格式
[1, 0]

"""

sentences = []
labels = []
total_examples = 0
with open("../data/news.csv", 'r', encoding='utf8') as f:
    for line in f:
        splits = line.split(' ')
        feat = splits[:-1]    # all tokens except the last are words
        label = splits[-1]    # the last token is the label
        if label.strip() in categories_label:
            sentences.append(feat)
            labels.append(categories_label[label.strip()])  # collect each label id
            total_examples += 1
print(sentences[:2])
print(labels[:2])
print('total_examples = ', total_examples)
[['另一边', '舞王', '韩庚', '跟随', '欢乐', '起舞', '八十年代', '迪斯科', '舞步', '轮番上阵', '场面', '精彩', '歌之夜', '敬请期待', '浙江', '卫视', '2017', '周五', '00', '畅意', '100%', '乳酸菌', '饮品', '独家', '冠名', '二十四', '小时', '第二季', '水手', '欢乐', '出发'], ['三是', '改变', '割裂', '状况', '建立', '一体化', '防御', '体系']]
[0, 1]
total_examples =  109477
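
Before training, it is worth sanity-checking the class balance. A quick count (a small sketch using collections.Counter over the labels list built above):

from collections import Counter

# Map label ids back to category names and count examples per class.
id_to_name = {v: k for k, v in categories_label.items()}
for label_id, count in Counter(labels).most_common():
    print(id_to_name[label_id], count)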

word2vec model training

Training the word2vec model

# size: vector dimensionality; window: context size; min_count: drop rare words
model = Word2Vec(sentences, size=200, window=5, min_count=5, workers=multiprocessing.cpu_count())
model.vector_size
200
# Save the model (the full model for further training, plus the vectors in plain text)
model.save('../model/news_word2vec_200.w2v')
model.wv.save_word2vec_format('../model/news_word2vec_200.bin', binary=False)
# Load the exported vectors back as KeyedVectors (query-only, cannot be trained further)
model_ = gensim.models.KeyedVectors.load_word2vec_format('../model/news_word2vec_200.bin', binary=False)

Finding words most similar to a given word

model_.most_similar('马云')
[('董事局', 0.8884702920913696),
 ('联席', 0.8768831491470337),
 ('致辞', 0.8764935731887817),
 ('会上', 0.8717509508132935),
 ('揭牌', 0.8496181964874268),
 ('马化腾', 0.8490221500396729),
 ('内塔尼亚胡', 0.8426049947738647),
 ('清华大学', 0.8364063501358032),
 ('萧泓', 0.8296477794647217),
 ('全国政协', 0.8289929032325745)]
model_.most_similar('人工智能')
[('AI', 0.9426432251930237),
 ('派别', 0.7992518544197083),
 ('科学家', 0.7718576192855835),
 ('Argo', 0.7710253596305847),
 ('人类', 0.7707951664924622),
 ('李彦宏', 0.7664409875869751),
 ('吴恩达', 0.7654942870140076),
 ('机器', 0.7499843835830688),
 ('李开复', 0.7363733649253845),
 ('围棋', 0.7349368333816528)]

Incremental training of the word2vec model

# Load the pre-trained word2vec model
pre_trained = '../model/news_word2vec_200.w2v'  # path to the external pretrained vectors

clf = Word2Vec.load(pre_trained)
clf.most_similar('人工智能')
[('AI', 0.9426432847976685),
 ('派别', 0.7992516160011292),
 ('科学家', 0.7718576192855835),
 ('Argo', 0.7710260152816772),
 ('人类', 0.7707951664924622),
 ('李彦宏', 0.7664410471916199),
 ('吴恩达', 0.7654943466186523),
 ('机器', 0.7499844431877136),
 ('李开复', 0.736373245716095),
 ('围棋', 0.7349367737770081)]
# Incremental training
clf.build_vocab(sentences, update=True)  # update the vocabulary with the new corpus
# epochs (= clf.iter) is the number of passes over the corpus (default 5); total_examples is the sentence count.
clf.train(sentences, total_examples=clf.corpus_count, epochs=clf.iter)
16234895
clf.vector_size
200
clf.most_similar('人工智能')
[('AI', 0.8695564866065979),
 ('人类', 0.6560510396957397),
 ('计算技术', 0.651934802532196),
 ('自然语言', 0.6484375),
 ('AlphaGo', 0.6366993188858032),
 ('李彦宏', 0.6268306970596313),
 ('算法', 0.6251788139343262),
 ('工业界', 0.624036431312561),
 ('围棋', 0.6237311363220215),
 ('科学家', 0.6062170267105103)]

Feature engineering

# Element-wise addition of two lists via np.array(), then averaging --
# the same trick get_words_vec below uses to average word vectors
w1 = [1, 2, 4]
w2 = [3, 2, 7]
import numpy as np
[w_vec / 2 for w_vec in list(np.array(w1) + np.array(w2))]
[2.0, 2.0, 5.5]
# A try/except swallows errors such as out-of-vocabulary lookups,
# which get_words_vec below relies on:
try:
    print(2 / 0)
except Exception as e:
    print('ok1')
ok1
clf['中国']
array([ 0.96740896, -1.6865271 ,  1.2900462 , -1.3705163 , -1.915153  ,
       -1.1933858 ,  0.8096983 ,  0.48648018, -0.9326265 , -1.4908819 ,
        0.56121856,  2.015987  , -0.5252904 ,  0.27580634,  2.238747  ,
        0.89014125,  0.4018909 ,  2.0528162 , -0.32533732,  0.8737432 ,
        1.6745373 , -0.38107952, -0.86439705, -0.40904617,  0.9813328 ,
       -1.145166  ,  0.39486435, -0.21861482, -0.7214359 , -1.7244024 ,
       -1.8221827 , -0.29753602,  1.004355  ,  0.7812221 ,  1.2393422 ,
       -0.15155244,  1.5179119 ,  1.6139843 , -0.9574509 ,  0.8910288 ,
        0.82247436, -1.7296616 ,  0.36545   , -0.2216244 ,  1.2839543 ,
        0.1663613 ,  1.7028248 ,  0.91052604, -1.9714876 ,  1.7771893 ,
       -0.4010989 , -1.3458189 ,  0.07643531,  0.9614052 , -0.12865292,
        0.18776867,  1.0256236 ,  0.2441487 ,  1.7315062 , -0.94587404,
        1.8154999 , -0.68458503,  3.1634037 ,  1.3197032 , -0.42643756,
        0.56886035,  1.71931   ,  0.8004137 , -0.0793217 , -0.35664517,
        1.336351  , -0.96460015,  0.55825627,  1.1372244 ,  0.3129831 ,
        1.4270234 , -0.05844976, -1.2512728 ,  1.6184946 , -0.10997374,
       -0.54428136, -0.27977905, -0.25261492, -0.00664744,  0.5957666 ,
       -2.2385325 ,  0.2673022 , -1.6742384 , -1.7813787 ,  0.3536596 ,
       -0.46775097,  0.5429149 ,  0.03727146, -0.16933414,  1.3233528 ,
        0.5205234 , -1.8943106 ,  0.31667498, -0.50464225,  1.316418  ,
        0.47614253, -0.50109404,  1.2957754 , -0.678785  , -0.4088726 ,
        1.0855002 , -0.01045603,  0.10198852,  2.016549  ,  2.0824132 ,
       -1.0874176 ,  0.38534838,  0.6253132 , -1.1764988 , -1.0011297 ,
        1.5049071 ,  0.12407725,  0.9971452 , -0.32189086,  0.18175083,
        0.43934143,  0.10077734,  0.11211097,  1.0505838 , -0.5502816 ,
       -1.0500243 ,  1.8056464 , -1.3403738 ,  0.01678614,  0.48729068,
        1.0449888 , -0.13342255,  0.2285603 , -1.5818268 ,  1.3460234 ,
       -1.9437159 ,  2.231045  , -0.15514365,  0.2713421 , -0.06743097,
        0.32509503,  3.378762  ,  0.6910759 , -0.37092376, -0.9336253 ,
        0.58801264, -0.098266  , -0.06897759, -0.22015525,  0.4771631 ,
        0.9061742 ,  0.976404  ,  0.43377542, -1.0387566 , -0.8317068 ,
       -0.9258896 , -1.8652458 ,  0.7119291 , -1.0494678 , -0.57748735,
        0.07536792, -1.8017843 ,  0.8706263 ,  0.9734037 , -0.08668977,
        0.10478871,  0.61765945, -0.10407893, -1.1222907 ,  0.7849265 ,
       -1.1379696 , -0.46689153,  0.84639096, -2.6367438 ,  0.96062094,
        0.1998642 , -1.1485023 ,  0.19304404, -1.5591619 , -1.2127926 ,
       -0.6445032 ,  1.4753993 , -0.622278  ,  0.56400865, -0.38170522,
        0.40417033, -0.0315093 ,  1.2562262 ,  0.38456756,  0.3380754 ,
        0.346896  , -0.3114489 ,  1.1451244 , -1.4440751 ,  1.4981574 ,
       -0.7149278 ,  1.6307881 , -1.0674877 ,  0.14921786,  0.711246  ],
      dtype=float32)
def get_words_vec(words):
    '''
        Sentence vector: the average of all in-vocabulary word vectors.
        words format: ['中国2', '中国1']
    '''
    w2v_sum = []
    total = 0
    for word in words:
        try:
            w2v_tmp = clf[word]
            total = total + 1
            if len(w2v_sum) > 0:
                w2v_sum = list(np.array(w2v_tmp) + np.array(w2v_sum))
            else:
                w2v_sum = list(np.array(w2v_tmp))
        except Exception as e:
            # out-of-vocabulary word: skip it
            pass

    if len(w2v_sum) == clf.vector_size:  # at least one in-vocabulary word was found
        w2v_documents = [float("{:.6f}".format(w2v / total)) for w2v in w2v_sum]
        return w2v_documents
    else:
        return []  # no known words -> empty vector
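
For reference, the same averaging can be written more compactly with np.mean (a sketch; the helper name get_words_vec_mean is ours, and it assumes the clf model and np import from above):

def get_words_vec_mean(words):
    '''Same sentence vector, computed as np.mean over in-vocabulary word vectors.'''
    vecs = [clf[w] for w in words if w in clf.wv.vocab]
    if not vecs:
        return []
    return [float("{:.6f}".format(v)) for v in np.mean(vecs, axis=0)]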

words = ['杨凯']  # out-of-vocabulary word -> empty vector
words_vec = get_words_vec(words)
print(len(words_vec))
0
X = []  # doc2vec-style features (averaged word vectors)
y = []  # label for each line of text
total_examples = 0

with open("../data/news.csv", 'r', encoding='utf8') as f:
    for line in f:
        splits = line.split(' ')
        words_list = splits[:-1]
        label = splits[-1]
        if label.strip() in categories_label:
            # x_train: extract the averaged word2vec features for this document
            words_list_doc2vec = get_words_vec(words_list)
            if len(words_list_doc2vec) > 0:
                X.append(words_list_doc2vec)
                # y_train
                y.append(categories_label[label.strip()])
                total_examples += 1

print('total_examples = ', total_examples)
print('X size = ',len(X))
print('y size = ',len(y))
total_examples =  109472
X size =  109472
y size =  109472
print(X[0])
[-0.186026, -0.471656, -0.403095, -0.16298, -0.045337, -0.062552, -0.156995, 0.267478, 0.225991, 0.369533, -0.335204, 0.179532, -0.074346, -0.293882, 0.19305, 0.079494, 0.270315, -0.626937, 0.560467, -0.266492, 0.02279, 0.037634, 0.538005, -0.285794, 0.341812, -0.400007, 0.339306, -0.191563, -0.56323, -0.405104, -0.190517, 0.273558, -0.244152, 0.354516, 0.032482, 0.455198, 0.549414, -0.140646, 0.136085, -0.299392, 0.104765, 0.190382, -0.05006, -0.124277, 0.285367, 0.234464, 0.240897, 0.072967, 0.478128, -0.193547, -0.192899, 0.033353, -0.326732, 0.101749, 0.197146, -0.240307, 0.063487, 0.198056, -0.254162, -0.270419, 0.195258, 0.415513, 0.673777, 0.518012, -0.108143, 0.261305, 0.377806, -0.203772, 0.272787, 0.282682, -0.552932, 0.072241, -0.091994, -0.100113, -0.400602, 0.118488, 0.317893, 0.434966, 0.113871, -0.081931, -0.133279, -0.500441, 0.07054, 0.160294, 0.110109, 0.363486, 0.073197, -0.074275, -0.366876, 0.17156, 0.262335, -0.469497, 0.26584, -0.516757, 0.339812, 0.515125, -0.272341, -0.039567, 0.38705, 0.163687, -0.588046, -0.363793, -0.091529, 0.13896, 0.104893, -0.071686, 0.3933, -0.686791, 0.213298, -0.558621, 0.010017, -0.0913, -0.447628, 0.103621, 0.312466, -0.197491, -0.079608, 0.496554, -0.41856, -0.286083, -0.043195, -0.730198, 0.182539, -0.082423, -0.115626, 0.188141, -0.419775, 0.179798, 0.066709, 0.088201, 0.59903, -0.132807, 0.302099, 0.10871, -0.380457, -0.220652, -0.65525, -0.439337, -0.417222, 0.244528, 0.876633, 0.471831, 0.326137, -0.370162, -0.228948, 0.376438, -0.406891, -0.098517, 0.360538, -0.290551, -0.192426, 0.080742, 0.244631, -0.371981, 0.591938, -0.193152, 0.390026, 0.810614, -0.076158, -0.001152, 0.631901, 0.324699, 0.393992, -0.095888, 0.1158, -0.191288, 0.08339, -0.582966, 0.182854, 0.440838, 0.276029, 0.325805, 0.046083, 0.197688, -0.512184, -0.155932, 0.258983, 0.255194, 0.204922, -0.10756, 0.174984, 0.236015, -0.025575, -0.315343, -0.292815, 0.202672, -0.119458, -0.251092, 0.127497, 0.066032, 0.209923, -0.28848, -0.366751, 0.290316, -0.07157, -0.038532, -0.194492, 0.400465, 0.155835, 0.396084]
print(y[9])
1

Saving the word2vec features

'''
Output line format:
label feats
e.g.:
0 f1 f2 .... f200

where f1 ... f200 are the numeric doc2vec feature values
'''

with open('../data/news_feat_doc2vec.txt','w') as f:
    for i in range(total_examples):
        x_str = [str(x) for x in list(X[i])]
        results = "{} {}".format(y[i]," ".join(x_str))
        f.write('{}\n'.format(results))

Splitting into training and test sets

X = []
y = []
with open('../data/news_feat_doc2vec.txt', 'r') as fr:
    for line in fr:
        line = line.strip()
        feat = [float(f) for f in line.split()[1:]]  # feature columns, e.g. [0.2501368, 0.29046363, -0.1135941]
        X.append(feat)
        y.append(int(line.split()[0]))  # label column
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('X_train total_examples= ', len(X_train))
print('X_test  total_examples= ', len(X_test))
X_train total_examples=  76630
X_test  total_examples=  32842

Training the classifiers

# category names, in label-id order
category = list(categories_label.keys())
category
['entertainment', 'technology', 'sports', 'military', 'car']
# Build a precision / recall / f1-score / support report table
import pandas as pd  # imported here so eval_model can be called before the xgboost section

def eval_model(y_true, y_pred, labels):
    # Per-class precision, recall, f1, support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Support-weighted overall precision, recall, f1, and total support
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': ['Overall'],
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]

Classification with LR

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
lr.score(X_test,y_test)
0.9021375068509835
y_pred = lr.predict(X_test)
eval_model(y_test,y_pred, category)
    Label          Precision  Recall    F1        Support
0   entertainment  0.910312   0.919852  0.915057  10019
1   technology     0.843728   0.862625  0.853072  7010
2   sports         0.943003   0.933942  0.938450  8326
3   military       0.903895   0.898511  0.901195  4365
4   car            0.899191   0.854260  0.876150  3122
999 Overall        0.902478   0.902138  0.902216  32842

Classification with xgboost

import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 5,                # number of classes, used together with multi:softmax
    'gamma': 0.1,                  # controls post-pruning; larger is more conservative (typically 0.1 or 0.2)
    'max_depth': 12,               # tree depth; deeper trees overfit more easily
    'lambda': 2,                   # L2 regularization on weights; larger means less overfitting
    'subsample': 0.7,              # row subsampling of the training examples
    'colsample_bytree': 0.7,       # column subsampling when building each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 suppresses training output; set to 0 to see it
    'eta': 0.007,                  # acts like a learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}
watchlist  = [(dtrain,'train'),(dval,'val')]
xgb_model = xgb.train(params,dtrain,num_boost_round=50,evals = watchlist,early_stopping_rounds=20)
[0]	train-merror:0.098839	val-merror:0.157847
Multiple eval metrics have been passed: 'val-merror' will be used for early stopping.

Will train until val-merror hasn't improved in 20 rounds.
[1]	train-merror:0.07698	val-merror:0.137233
[2]	train-merror:0.067898	val-merror:0.12685
[3]	train-merror:0.063904	val-merror:0.122404
[4]	train-merror:0.06102	val-merror:0.11872
[5]	train-merror:0.059311	val-merror:0.116467
[6]	train-merror:0.058202	val-merror:0.116589
[7]	train-merror:0.057249	val-merror:0.115523
[8]	train-merror:0.056244	val-merror:0.113726
[9]	train-merror:0.056571	val-merror:0.112843
[10]	train-merror:0.056114	val-merror:0.111595
[11]	train-merror:0.055696	val-merror:0.111473
[12]	train-merror:0.055253	val-merror:0.11123
[13]	train-merror:0.054952	val-merror:0.110773
[14]	train-merror:0.054913	val-merror:0.110286
[15]	train-merror:0.054691	val-merror:0.109524
[16]	train-merror:0.054	val-merror:0.110103
[17]	train-merror:0.053713	val-merror:0.109738
[18]	train-merror:0.05336	val-merror:0.109403
[19]	train-merror:0.052812	val-merror:0.109311
[20]	train-merror:0.052421	val-merror:0.109159
[21]	train-merror:0.052225	val-merror:0.108794
[22]	train-merror:0.051833	val-merror:0.108124
[23]	train-merror:0.051507	val-merror:0.108154
[24]	train-merror:0.051546	val-merror:0.108093
[25]	train-merror:0.051024	val-merror:0.108306
[26]	train-merror:0.050711	val-merror:0.108246
[27]	train-merror:0.050672	val-merror:0.107728
[28]	train-merror:0.05032	val-merror:0.107971
[29]	train-merror:0.050515	val-merror:0.107911
[30]	train-merror:0.050254	val-merror:0.10788
[31]	train-merror:0.050176	val-merror:0.107606
[32]	train-merror:0.049915	val-merror:0.107667
[33]	train-merror:0.050033	val-merror:0.107606
[34]	train-merror:0.049889	val-merror:0.107667
[35]	train-merror:0.049902	val-merror:0.107484
[36]	train-merror:0.049641	val-merror:0.107576
[37]	train-merror:0.049563	val-merror:0.107941
[38]	train-merror:0.049498	val-merror:0.107758
[39]	train-merror:0.049302	val-merror:0.10785
[40]	train-merror:0.049237	val-merror:0.107576
[41]	train-merror:0.04908	val-merror:0.107545
[42]	train-merror:0.049119	val-merror:0.107332
[43]	train-merror:0.049171	val-merror:0.10718
[44]	train-merror:0.049002	val-merror:0.107302
[45]	train-merror:0.049028	val-merror:0.107119
[46]	train-merror:0.048741	val-merror:0.106723
[47]	train-merror:0.048767	val-merror:0.106936
[48]	train-merror:0.048545	val-merror:0.10651
[49]	train-merror:0.04861	val-merror:0.106723
y_pred_xgb = xgb_model.predict(xgb.DMatrix(X_test))
eval_model(y_test,y_pred_xgb, category)
    Label          Precision  Recall    F1        Support
0   entertainment  0.899384   0.918056  0.908624  10019
1   technology     0.820596   0.868474  0.843856  7010
2   sports         0.934944   0.925174  0.930033  8326
3   military       0.911779   0.883162  0.897242  4365
4   car            0.913521   0.798527  0.852162  3122
999 Overall        0.894573   0.893277  0.893347  32842
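
Because early stopping was enabled, one could also predict with only the trees up to the best validation round rather than all 50 (a sketch; best_ntree_limit is set by xgb.train when early stopping is used in this xgboost version):

# Predict using only the trees up to the best validation round.
y_pred_best = xgb_model.predict(xgb.DMatrix(X_test), ntree_limit=xgb_model.best_ntree_limit)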

Once the models above are trained, they should be saved to disk; we skip that step here.
The files that need to be saved (see the sketch after this list):

  1. the LR model
  2. the xgboost model
  3. the label <-> category name mapping
  4. the word2vec word vectors
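
A minimal persistence sketch, assuming the lr, xgb_model, categories_label, and clf objects built above (all file names are placeholders):

import json
import pickle

# 1. LR model
with open('../model/lr_model.pkl', 'wb') as f:
    pickle.dump(lr, f)
# 2. xgboost model
xgb_model.save_model('../model/xgb_model.bin')
# 3. label <-> category name mapping
with open('../model/categories_label.json', 'w') as f:
    json.dump(categories_label, f)
# 4. word2vec vectors (saving the full model keeps it trainable)
clf.save('../model/news_word2vec_200_finetuned.w2v')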

Online prediction

In a real deployment, the models above are stored on disk and loaded when the server starts.
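
At startup, the service reverses the save step above (a sketch using the same placeholder file names):

import json
import pickle
import xgboost as xgb
from gensim.models import Word2Vec

with open('../model/lr_model.pkl', 'rb') as f:
    lr = pickle.load(f)
xgb_model = xgb.Booster(model_file='../model/xgb_model.bin')
with open('../model/categories_label.json') as f:
    categories_label = json.load(f)
clf = Word2Vec.load('../model/news_word2vec_200_finetuned.w2v')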

Text -> word vector

# reverse mapping: label id -> category name
label_category_name = dict((v, k) for k, v in categories_label.items())
text = '世界杯 扩军 残酷 国足 这股 东风'
text_vec = get_words_vec(text.split())
print('text_vec length = ', len(text_vec))
text_vec length =  200

LR model prediction

lr_y_pred_results = lr.predict([text_vec])[0]
print("lr predict id:{}->name:{}".format(lr_y_pred_results,label_category_name[lr_y_pred_results]))
lr predict id:2->name:sports

xgboost prediction

y_pred_xgb_results = int(xgb_model.predict(xgb.DMatrix([text_vec]))[0])
print("xgboost predict id:{}->name:{}".format(y_pred_xgb_results, label_category_name[y_pred_xgb_results]))
xgboost predict id:2->name:sports

