gensim: word2vec in practice

1. Corpus processing

import jieba

jieba.suggest_freq('沙瑞金', True)
# keep special terms such as character names from being split into separate tokens
...

with open("./in_the_name_of_people.txt", encoding="utf-8") as file:
    doc = file.read()
    doc_cut = jieba.cut(doc)
    res = " ".join(doc_cut)
    with open("./cutcut.txt", 'w', encoding="utf-8") as wr:
        wr.write(res)
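
If there are many such special terms, a custom user dictionary can replace repeated suggest_freq calls. A small sketch, where userdict.txt is a hypothetical file with one entry per line (word, optional frequency, optional POS tag):

import jieba

# hypothetical dictionary file, e.g. lines like: 沙瑞金 3 nr
jieba.load_userdict("./userdict.txt")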

2. Model training

sentences is best supplied as a nested list, e.g. [['a', 'b', 'c', ...], ['c', 'd', ...], ...], where each inner list is one tokenized sentence. Avoid feeding punctuation marks in as training tokens.
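
As a minimal sketch of that format (the punctuation set below is only an illustrative assumption), the segmented file produced above can be turned into such a nested list with punctuation filtered out:

PUNCT = set("，。！？；：、“”‘’（）….,!?;:()\"'")

sentences = []
with open("./cutcut.txt", encoding="utf-8") as f:
    for line in f:
        # keep only non-punctuation tokens; each line becomes one tokenized sentence
        tokens = [t for t in line.split() if t not in PUNCT]
        if tokens:
            sentences.append(tokens)
# sentences now has the nested-list form and can be passed to Word2Vec directly
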
The constructor of the Word2Vec class is documented in the gensim API reference; the parameters you usually set yourself are sg (CBOW vs. Skip-Gram), hs (hierarchical softmax vs. negative sampling), min_count (filter out low-frequency words), window (context window size), and size (dimensionality of the word vectors; note that gensim 4.x renames this parameter to vector_size).

import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# LineSentence treats each line of the file as one sentence, with tokens separated by spaces
sentences = word2vec.LineSentence('./cutcut.txt')
# hs=1: hierarchical softmax; size=100: 100-dimensional vectors (vector_size in gensim 4.x)
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)

# saving and reloading the model
model.save("./people.model")
model = word2vec.Word2Vec.load("./people.model")

3. Testing the model

With the trained word vectors you can look up individual vectors, compute pairwise similarity, and list the most similar words:

print(model.wv['沙瑞金'])
# print the raw word vector directly
print(model.wv.similarity('沙瑞金', '高育良'))     # cosine similarity between two words
print(model.wv.similar_by_word('高育良', topn=2))  # the two words most similar to 高育良
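
KeyedVectors also supports combined positive/negative queries; a small sketch (query words must exist in the vocabulary, otherwise a KeyError is raised):

# words close to 沙瑞金 but far from 高育良 (analogy-style query)
print(model.wv.most_similar(positive=['沙瑞金'], negative=['高育良'], topn=3))

# membership test before querying, to avoid KeyError on out-of-vocabulary words
print('沙瑞金' in model.wv)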

Reposted from blog.csdn.net/weixin_42231070/article/details/86652623