Sentiment Analysis Based on BiLSTM


Preface

Data source:

The 2019 CCF Internet News Sentiment Analysis Competition

www.datafountain.cn/competition…

Dataset attachment:

Link: pan.baidu.com/s/1ePKyHyE8…

Extraction code: 2021

Tools:

Deep learning framework: TensorFlow 2.4.0
NLP library: gensim 3.8.3
Word segmentation: jieba 0.42.1

If any part of the code is unclear, look it up; the Runoob Python 3 tutorial works well as a reference manual:

www.runoob.com/python3/pyt…

Workflow

1. Data preprocessing

1.1 Reading the data

The datasets are read with read_csv. Because train_data and train_label are stored in separate files, they are merged with pd.merge().

  • pd.merge(x, y, how="left", on="id"): left join of x and y on the id column (a toy example follows the code block below)
  • notnull(): tests whether each value is non-null and returns the corresponding booleans
  • fillna(value): finds NaN values and replaces them with value
import pandas as pd
import numpy as np

# Read the training data
train_data = pd.read_csv('Train_DataSet.csv')
train_label = pd.read_csv('Train_DataSet_Label.csv')
test = pd.read_csv('Test_DataSet.csv')
# Merge the training data with the training labels
train = pd.merge(train_data, train_label, how='left', on='id')
# Use boolean masks to drop rows whose label or content is null
train = train[(train.label.notnull()) & (train.content.notnull())]
# Replace remaining NaN values with empty strings
train['title'] = train['title'].fillna('')
train['content'] = train['content'].fillna('')
test['title'] = test['title'].fillna('')
test['content'] = test['content'].fillna('')
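To make the merge-and-filter behaviour above concrete, here is a minimal sketch on made-up toy frames (the ids and labels are hypothetical):

# Hypothetical toy data, only to illustrate the left join and null filtering
left = pd.DataFrame({'id': [1, 2, 3], 'title': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [1, 2], 'label': [0, 2]})
merged = pd.merge(left, right, how='left', on='id')  # id 3 has no label, so its label becomes NaN
merged = merged[merged.label.notnull()]              # the row with id 3 is dropped
print(merged)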

1.2 Filtering invalid characters and HTML tags

The text_filter(text) function can be kept in a personal utility library for reuse.

  • re.sub(pattern, repl, string): replaces every match of pattern in string with repl
  • str.strip(): removes leading and trailing whitespace from a string
import re
# Text-filtering function
def text_filter(text):
    # re.sub(pattern, repl, string)
    text = re.sub(r"[A-Za-z0-9!=?%\[\],()><:/#. _-]", "", text)  # strip ASCII letters, digits and common symbols
    text = text.replace('图片', '')   # remove the literal word '图片' left behind by image placeholders
    text = text.replace('\xa0', '')  # remove non-breaking spaces (&nbsp;)
    # Strip HTML tags
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, ' ', text)
    # Remove other punctuation characters
    r1 =  "\【.*?】+|\《.*?》+|\#.*?#+|[.!/_,$&%^*()<>+""'?@|:~{}#]+|[——!\\,。=?、:“”‘’¥……()《》【】]"
    text = re.sub(r1,'',text)
    # Strip leading and trailing whitespace
    text = text.strip()
    return text
# Text-cleaning wrapper
def clean_text(data):
    # Title text
    data['title'] = data['title'].apply(lambda x: text_filter(x))
    # Body text
    data['content'] = data['content'].apply(lambda x: text_filter(x))
    return data

# run clean_text
train = clean_text(train)
test = clean_text(test)
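A quick sanity check of text_filter on one made-up string; the exact output depends on the regular expressions above, but HTML tags, ASCII characters and the listed symbols are removed:

# Hypothetical sample string, not taken from the competition data
sample = '<p>今天股市大涨100点!【推荐】图片</p>'
print(text_filter(sample))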

1.3 Tokenization and stop words

  • str.maketrans(x, y, z): takes three arguments; the third argument z must be a string, and every character in it is mapped to None, i.e. deleted. Even if a character in z also appears in x, it is still deleted from the final result; in short, whenever z is given, all of its characters are removed.

  • string.punctuation: all ASCII punctuation characters

  • [token for token in tokens if token not in stop_words]: a list comprehension that keeps a token only if it does not appear in stop_words

import jieba
import string

# Load the stop-word list
stop_words = pd.read_table('stop.txt', header=None)[0].tolist()
# Build a translation table, used later to strip English punctuation
table = str.maketrans("","",string.punctuation)
def cut_text(sentence):
    tokens = list(jieba.cut(sentence))
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
#     # Strip English punctuation (translate is used together with maketrans)
#     tokens = [w.translate(table) for w in tokens]
    return tokens

# Tokenize the titles and bodies of the training and test sets
train_title = [cut_text(sent) for sent in train.title.values]
train_content = [cut_text(sent) for sent in train.content.values]
test_title = [cut_text(sent) for sent in test.title.values]
test_content = [cut_text(sent) for sent in test.content.values]
# Concatenate all tokenized documents to prepare for word-vector training
all_doc = train_title + train_content + test_title + test_content
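A minimal sketch of what cut_text returns for one made-up sentence; the actual tokens depend on jieba's dictionary and on the contents of stop.txt:

# Hypothetical sentence; the result is a list of tokens with stop words removed,
# e.g. something like ['今天', '股市', '大涨', '投资者', '信心', '增强']
print(cut_text('今天股市大涨,投资者信心增强'))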

1.4 Training word vectors with gensim

This code can be reused as-is to train your own word vectors; in my tests, the larger the sample, the better the result. After tokenization, the vocabulary size (vocab_size) for this competition is roughly 29,244.

import gensim
import time
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    '''Callback that saves the model and prints the training loss.'''
    def __init__(self, save_path):
        self.save_path = save_path # path where the model is saved
        self.epoch = 0 # epoch counter
        self.pre_loss = 0 # cumulative loss at the end of the previous epoch
        self.best_loss = 999999999.9 # best (lowest) epoch loss so far
        self.since = time.time() # start time of the current epoch

    def on_epoch_end(self, model):
        self.epoch += 1
        cum_loss = model.get_latest_training_loss() # cumulative loss since the first epoch
        epoch_loss = cum_loss - self.pre_loss # this epoch's loss = cumulative loss - previous cumulative loss
        time_taken = time.time() - self.since # duration of this epoch
        print("Epoch %d, loss: %.2f, time: %dmin %ds" %
                    (self.epoch, epoch_loss, time_taken//60, time_taken%60)) # report the epoch result in minutes and seconds
        # Track the best loss; only an improving model is saved
        if self.best_loss > epoch_loss:
            self.best_loss = epoch_loss
            print("Better model. Best loss: %.2f" % self.best_loss) # print the best loss
            model.save(self.save_path) # save the model
            print("Model %s save done!" % self.save_path)

        self.pre_loss = cum_loss
        self.since = time.time()

# The line below loads previously trained word vectors
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')

Create the word2vec model and use build_vocab to load the tokens into its vocabulary.

model_word2vec = gensim.models.Word2Vec(min_count=1, 
                                        window=5, 
                                        size=256,
                                        workers=4,
                                        batch_words=1000)
since = time.time()
model_word2vec.build_vocab(all_doc, progress_per=2000)
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
  • Train the word vectors and save the model
since = time.time()
model_word2vec.train(all_doc, total_examples=model_word2vec.corpus_count, 
                        epochs=20, compute_loss=True, report_delay=60*10,
                        callbacks=[EpochSaver('./final_word2vec_model')])
time_elapsed = time.time() - since
print('Time to train: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
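Once training has finished, the vectors can be queried directly from the model; a small sketch, assuming the word below actually occurs in all_doc:

# Hypothetical checks on the trained vectors (gensim 3.x API); '股市' must exist in the corpus
print(model_word2vec.wv['股市'].shape)                 # (256,), one vector per word
print(model_word2vec.wv.most_similar('股市', topn=5))  # nearest words in the embedding space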

1.5 Encoding with the tf.keras Tokenizer

Tokenizer is the tf.keras tokenizer; it encapsulates the complete word-to-index dictionary and related utilities.

Reference blog: dengbocong.blog.csdn.net/article/det…

# Convert to a Tokenizer vocabulary
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_title + test_title)
# tokenizer.fit_on_texts(train_content + test_content)
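Because our inputs are already segmented, fit_on_texts is given lists of tokens rather than raw strings. A toy sketch of the dictionary it builds (the token lists here are made up):

# Hypothetical token lists; each inner list is one already-segmented document
toy = Tokenizer()
toy.fit_on_texts([['股市', '大涨'], ['股市', '下跌']])
print(toy.word_index)                             # e.g. {'股市': 1, '大涨': 2, '下跌': 3}
print(toy.texts_to_sequences([['股市', '大涨']]))  # [[1, 2]]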

1.6 Building the embedding matrix

from tqdm import tqdm

# Build the embedding matrix from the newly trained word2vec model
vocab_size = len(tokenizer.word_index) # vocabulary size
error_count = 0
embedding_matrix = np.zeros((vocab_size + 1, 256))
for word, i in tqdm(tokenizer.word_index.items()):
    if word in model_word2vec.wv:
        embedding_matrix[i] = model_word2vec.wv[word]
    else:
        error_count += 1
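It is worth printing how many words did not get a pretrained vector; those rows of embedding_matrix stay all zero:

# Words without a word2vec vector keep an all-zero embedding row
print('words without a pretrained vector: %d / %d' % (error_count, vocab_size))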

1.7 Padding

  • Padding pads shorter texts with zeros and truncates longer ones to a fixed length (a toy example follows the code below)
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequence = tokenizer.texts_to_sequences(train_title)
traintitle = pad_sequences(sequence, maxlen=30)
sequence = tokenizer.texts_to_sequences(test_title)
testtitle = pad_sequences(sequence, maxlen=30)

# sequence = tokenizer.texts_to_sequences(train_content)
# traincontent = pad_sequences(sequence, maxlen=512)
# sequence = tokenizer.texts_to_sequences(test_content)
# testcontent = pad_sequences(sequence, maxlen=512)
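A toy example of the padding and truncation behaviour, using made-up index sequences and maxlen=5 for brevity (by default pad_sequences pads and truncates at the front):

# Hypothetical index sequences: the short one is left-padded with 0, the long one keeps its last 5 indices
print(pad_sequences([[3, 7], [1, 2, 3, 4, 5, 6, 7]], maxlen=5))
# [[0 0 0 3 7]
#  [3 4 5 6 7]]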

2. Model construction

2.1 BiLSTM

from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers

model = Sequential([
    layers.Embedding(input_dim=len(tokenizer.word_index) + 1, 
                    output_dim=256, 
                    input_length=30, 
                    weights=[embedding_matrix]),
    layers.Bidirectional(LSTM(32, return_sequences = True)),
    layers.GlobalMaxPool1D(),
    layers.Dense(20, activation="relu"),
    layers.Dropout(0.05),
    layers.Dense(3, activation="softmax"),
])

model.compile(loss='categorical_crossentropy',
               optimizer='adam',
               metrics=['categorical_accuracy'])
model.summary()

2.2 TextCNN

Attention layer (code adapted from another author)

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Input, Model, backend as K
from tensorflow.keras.layers import Embedding, Dense, Attention, Bidirectional, LSTM
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer


class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            # 1
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention())
            # next add a Dense layer (for classification/regression) or whatever...
            # 2
            hidden = LSTM(64, return_sequences=True)(words)
            sentence = Attention()(hidden)
            # next add a Dense layer (for classification/regression) or whatever...
        """
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0

        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
        if self.bias:
            e += self.b
        e = K.tanh(e)

        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)

        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim

TextCNN

# from keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout
class TextCNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=1,
                 last_activation='sigmoid'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

    def get_model(self):
        input = Input((self.maxlen,))

        # Embedding part can try multichannel as same as origin paper
        embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen,
                              weights=[embedding_matrix])(input)
        convs = []
        for kernel_size in [3, 4, 5]:
            c = Conv1D(128, kernel_size, activation='relu')(embedding)
            c = GlobalMaxPooling1D()(c)
            convs.append(c)
        x = Concatenate()(convs)

        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=input, outputs=output)
        return model
    
model = TextCNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
                    embedding_dims=256, class_num=3, last_activation='softmax').get_model()
# metric_F1score is defined below in section 3.1
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy',metric_F1score])
model.summary()

2.3 Attention-BiLSTM

Attention-BiLSTM

class TextAttBiRNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=1,
                 last_activation='sigmoid'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

    def get_model(self):
        input = Input((self.maxlen,))

        embedding = Embedding(self.max_features, self.embedding_dims,
                              input_length=self.maxlen, weights=[embedding_matrix])(input)
        x = Bidirectional(LSTM(128,return_sequences=True))(embedding)  # LSTM or GRU
        x = Attention(self.maxlen)(x)

        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=input, outputs=output)
        return model
    pass

model = TextAttBiRNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
                    embedding_dims=256, class_num=3, last_activation='softmax').get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()

3. Model training

3.1 Evaluation metric

import tensorflow as tf

# F1-score metric (computed from rounded predictions)
def metric_F1score(y_true,y_pred):    
    TP=tf.reduce_sum(y_true*tf.round(y_pred))
    TN=tf.reduce_sum((1-y_true)*(1-tf.round(y_pred)))
    FP=tf.reduce_sum((1-y_true)*tf.round(y_pred))
    FN=tf.reduce_sum(y_true*(1-tf.round(y_pred)))
    precision=TP/(TP+FP)
    recall=TP/(TP+FN)
    F1score=2*precision*recall/(precision+recall)
    return F1score
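A quick hand check of the metric on a tiny batch; the values are chosen so that rounding yields clean 0/1 predictions:

# Two one-hot labels; the rounded predictions match them exactly, so the F1 value is 1.0
y_true = tf.constant([[1., 0., 0.], [0., 1., 0.]])
y_pred = tf.constant([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]])
print(float(metric_F1score(y_true, y_pred)))  # 1.0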

3.2 Splitting the training set

  • Input: traintitle, the padded sequences produced above
  • Output: label, taken from the original CSV file
  • Split ratio: training set : validation set = 4 : 1
import tensorflow as tf
from sklearn.model_selection import train_test_split

label = train['label'].astype(int)

train_X, val_X, train_Y, val_Y = train_test_split(traintitle, label, shuffle=True, test_size=0.2,random_state=42)
# to_categorical is tf's one-hot encoder; it is needed because the loss is categorical_crossentropy
# (with sparse_categorical_crossentropy no conversion would be needed)
train_Y = tf.keras.utils.to_categorical(train_Y)
# val_Y is left as integer labels, which is what sklearn's f1_score in section 3.4 expects

3.3 Training the model

  • The remaining hyperparameters can be set as you see fit (an early-stopping sketch follows the code below)
# Train the model
history = model.fit(train_X,train_Y,
                    batch_size=128,
                    epochs=10,
                    validation_split=0.1,
                    validation_freq=1,
                   )
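If you prefer to stop training once the validation loss stops improving, here is a minimal sketch with a standard Keras callback (the patience value of 3 is an assumption, not from the original run):

# Optional early stopping on the validation loss; patience=3 is an assumed value
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                              restore_best_weights=True)
history = model.fit(train_X, train_Y,
                    batch_size=128,
                    epochs=10,
                    validation_split=0.1,
                    callbacks=[early_stop])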

3.4 Checking model performance on the validation set

from sklearn.metrics import f1_score

pred_val = model.predict(val_X)
print(f1_score(val_Y, np.argmax(pred_val, axis=1), average='macro'))
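For a per-class breakdown in addition to the macro F1, sklearn's classification_report can be run on the same predictions (val_Y holds the integer labels 0/1/2):

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 on the validation split
print(classification_report(val_Y, np.argmax(pred_val, axis=1), digits=4))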

3.5 Visualizing the loss and accuracy

import matplotlib.pyplot as plt
# Plot the loss and accuracy curves
def show_loss_acc_img(history):
    # Loss
    plt.plot(history.history['loss'], label="$Loss$")
    plt.plot(history.history['val_loss'], label='$val_loss$')
    plt.title('Loss')
    plt.xlabel('epoch')
    plt.ylabel('num')
    plt.legend()
    plt.show()
    # Accuracy
    plt.plot(history.history['categorical_accuracy'], label="$categorical_accuracy$")
    plt.plot(history.history['val_categorical_accuracy'], label='$val_categorical_accuracy$')
    plt.title('Accuracy')
    plt.xlabel('epoch')
    plt.ylabel('num')
    plt.legend()
    plt.show()
    pass
show_loss_acc_img(history)

3.6 Predicting the sentiment polarity of the test set

# Predict the sentiment polarity of the test set
pred_val = model.predict(testtitle)

# Save the submission file
submission = pd.DataFrame(test.id.values, columns=["id"])
submission["label"] = np.argmax(pred_val, axis=1)
submission.to_csv("submission.csv", index=False)

Ready-to-use snippets

1. Removing HTML tags and other symbols with regular expressions

import re
# Text-filtering function
def text_filter(text):
    # re.sub(pattern, repl, string)
    text = re.sub(r"[A-Za-z0-9!=?%\[\],()><:/#. _-]", "", text)  # strip ASCII letters, digits and common symbols
    text = text.replace('图片', '')   # remove the literal word '图片' left behind by image placeholders
    text = text.replace('\xa0', '')  # remove non-breaking spaces (&nbsp;)
    # Strip HTML tags
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, ' ', text)
    # Remove other punctuation characters
    r1 =  "\【.*?】+|\《.*?》+|\#.*?#+|[.!/_,$&%^*()<>+""'?@|:~{}#]+|[——!\\,。=?、:“”‘’¥……()《》【】]"
    text = re.sub(r1,'',text)
    # Strip leading and trailing whitespace
    text = text.strip()
    return text

2. Training your own word vectors with gensim

Reference blogs:

[1] www.jianshu.com/p/5f04e97d1… plotting word vectors after TSNE dimensionality reduction

[2] www.cnblogs.com/johnnyzen/p… detailed explanation of the gensim.models.Word2Vec parameters

import gensim
import time
from sklearn.manifold import TSNE
from matplotlib.font_manager import *
import matplotlib.pyplot as plt

class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    '''Callback that saves the model and prints the training loss.'''

    def __init__(self, save_path):
        self.save_path = save_path  # path where the model is saved
        self.epoch = 0  # epoch counter
        self.pre_loss = 0  # cumulative loss at the end of the previous epoch
        self.best_loss = 999999999.9  # best (lowest) epoch loss so far
        self.since = time.time()  # start time of the current epoch

    def on_epoch_end(self, model):
        self.epoch += 1
        cum_loss = model.get_latest_training_loss()  # cumulative loss since the first epoch
        epoch_loss = cum_loss - self.pre_loss  # this epoch's loss = cumulative loss - previous cumulative loss
        time_taken = time.time() - self.since  # duration of this epoch
        print("Epoch %d, loss: %.2f, time: %dmin %ds" %
              (self.epoch, epoch_loss, time_taken // 60, time_taken % 60))  # report the epoch result in minutes and seconds
        # Track the best loss; only an improving model is saved
        if self.best_loss > epoch_loss:
            self.best_loss = epoch_loss
            print("Better model. Best loss: %.2f" % self.best_loss)  # print the best loss
            model.save(self.save_path)  # save the model
            print("Model %s save done!" % self.save_path)

        self.pre_loss = cum_loss
        self.since = time.time()
        pass


# Load previously trained word vectors
def load_model_word2vec(save_path):
    model_word2vec = gensim.models.Word2Vec.load(save_path)
    # For example:
    # model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
    return model_word2vec

def print_since_time(since):
    time_elapsed = time.time() - since
    print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

def show_word2vec_2D(model_word2vec,random_word):

    # Run TSNE dimensionality reduction; note that random_word in model_word2vec.wv[random_word] must be a list of strings
    X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(model_word2vec.wv[random_word])

    plt.figure(figsize=(14, 8))
    myfont = FontProperties(fname=r'C:\Windows\Fonts\simsun.ttc') # load a Chinese font so the labels render correctly

    plt.scatter(X_tsne[:, 0], X_tsne[:, 1]) # create the scatter plot

    for i in range(len(X_tsne)):
        x = X_tsne[i][0]
        y = X_tsne[i][1]
        plt.text(x, y, random_word[i], fontproperties=myfont, size=16) # annotate each point with its word

    plt.show()
    pass

if __name__=="__main__":
    # Prepare the input yourself; note that input_doc must not be one-dimensional, it has to be a list of token lists: input_doc = [[ ... ]]
    input_doc = [['`', '广告', '联系', '微', '信号', '花都区', '租房', '满', '一年', '有望', '确保', '学位', '信息', '时报讯', '记者', '崔小远', '近日',
     '发布', '下称', '了解', '今年', '花都区', '公办', '小学', '计划', '招收', '班级', '公办', '初中', '计划', '招收', '班级', ';', '民办小学', '花都区',
     '班级', '民办', '初中', '计划', '招收', '班级', '对比', '年', '招生', '细则', '今年', '招生', '规模', '总体', '变化', '不', '大', '计划', '招收',
     '了解', '花都区', '招生', '时间', '安排', '月', '日', '~', '月', '日', '花都区', '公办', '小学', '网上', '报名', ';','教育局',
     '~', '积分', '入学', '网上', '报名', ';', '月', '日', '~', '月', '日', '花都区', '民办小学', '网上', '报名', ';',
     '~', '花都区', '小区', '配套', '业主', '非', '广州市', '户籍', '适龄', '子女', '报名', '保障', '区内', '明确', '未来', '十年',
     '承租人', '子女', '入学', '方面', '提出', '具有', '广州市', '户籍', '含', '政策性', '照顾', '生', '广州市', '无', '自有', '产权', '住房', '含', '城乡',
     '自建房', '租赁', '房屋', '所在地', '唯一', '居住地', '房屋', '租赁', '合同', '登记', '备案', '连续', '满', '一年', '截止', '日期', '申请', '入学', '内在',
     '年月日', '以上', '申请', '时', '租赁', '合同', '有效', '状态', '承租人', '适龄', '子女', '花都区', '教育局', '确保', '学位', '供给', '年', '当中', '已经',
     '增城', '花都', '从化', '张', '病床', '条件', '建立', '专业', '精神病', '医院', '来源', '花都', '早晨', '区卫计局', '广州', '花都', '发布', '今日', '花都',
     '花都', '求职', '招聘', '群', '添加', '时', '注明', '招定', '求职', '小编', '工资', '大拇指', '挂钩', '点', '一下', '一分钱', '求', '打赏', '记得', '加']]
    # Train the model -----------------------------------------------
    model_word2vec = gensim.models.Word2Vec(min_count=1,
                                            window=5,
                                            size=256,
                                            workers=4,
                                            batch_words=1000)

    since = time.time() # start timing
    model_word2vec.build_vocab(input_doc, progress_per=2000) # build the vocabulary from the sentences; progress_per controls how often progress is reported
    print_since_time(since) # stop timing and print the elapsed time

    since = time.time()
    model_word2vec.train(input_doc,
                         total_examples=model_word2vec.corpus_count,
                         epochs=20,
                         compute_loss=True,
                         report_delay=60 * 10,
                         callbacks=[EpochSaver('./final_word2vec_model')]) # EpochSaver stores the model to this path
    print_since_time(since) # stop timing and print the elapsed time

    # Plot the word-vector figure
    show_word2vec_2D(model_word2vec, input_doc[0])

    # model_word2vec = load_model_word2vec('./final_word2vec_model')
    # print(model_word2vec)
    # # Compute the similarity between two words
    # y2 = model_word2vec.wv.similarity(u"租赁", u"承租人")
    # print(y2)
    # # Print the words most similar to a given word
    # for i in model_word2vec.wv.most_similar(u"建立"):
    #     print(i[0], i[1])

The resulting TSNE visualization of the word vectors:


Reprinted from: juejin.im/post/7031526862774304782