Training of word vectors

Overview

Word vectors are a core component of probabilistic language models: each word in the language is represented by a fixed-length vector. This article uses gensim to train word2vec word vectors.

gensim covers three kinds of work: training large-scale semantic models, representing text as semantic vectors, and finding semantically related documents.

First, before training, build the vocabulary dictionary from the corpus after jieba word segmentation. The model is then trained with one of word2vec's two methods, CBOW or Skip-gram, using backpropagation through a shallow neural network; the word vector for each word is obtained as part of the trained parameters. Pay particular attention to the use of hierarchical softmax here.
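A minimal sketch of how these choices map onto gensim (assuming gensim 4.x; the parameter values are illustrative, not taken from the original post):

from gensim.models import Word2Vec

# toy token lists; in practice pass the jieba output produced below
sentences = [['你', '就', '不要', '送', '我']]

# sg=0 selects CBOW, sg=1 selects Skip-gram;
# hs=1 enables hierarchical softmax (negative=0 disables negative sampling)
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, sg=1, hs=1, negative=0)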

Code:

import os 
import jieba
import logging as log
from gensim import corpora
from gensim import models

# Set the logging output level (the default is INFO; DEBUG is used here to show intermediate results)
log.basicConfig(level=log.DEBUG)

# Absolute path of the corpus file to load ('文本绝对路径.txt' is a placeholder)
corpus_file = os.path.join(os.path.dirname(__file__), 'lyrics', '文本绝对路径.txt')

def read_lyrics(corpus_file):
    '''Read the corpus and segment each line into words.'''
    lyrics_words = []
    for line in open(corpus_file, encoding='utf-8'):
        line = line.strip()
        if line == '':                          # skip empty lines
            continue
        lyrics_words.append(jieba.lcut(line))   # one token list per lyric line
    # return the segmented corpus
    return lyrics_words

if __name__ == '__main__':
    lyrics = read_lyrics(corpus_file)        
    log.debug(lyrics)     

   
    # Build the vocabulary dictionary
    dictionary = corpora.Dictionary(lyrics)
    log.debug(dictionary.token2id)       # the mapping from each word to its integer index

    # Convert a segmented lyric line to indexes and counts (bag of words)
    log.debug(dictionary.doc2bow('你 就 不要 送 我'.split()))

    # Convert the whole corpus to its mathematical (vector) representation
    bowcorpus = [dictionary.doc2bow(words) for words in lyrics]
    log.debug(bowcorpus)
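The models import above is not used in the original script; one natural use for it is weighting the bag-of-words counts, for example with TF-IDF. A sketch that could be appended to the __main__ block (the TfidfModel step is an addition, not part of the original post):

    # Weight the bag-of-words corpus with TF-IDF
    tfidf = models.TfidfModel(bowcorpus)     # fit IDF statistics on the corpus
    log.debug(tfidf[bowcorpus[0]])           # (token_id, tf-idf weight) pairs for the first song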

Second, use the trained model parameters to measure how similar two words (or sentences) are: the similarity of their word vectors is computed as the cosine of the angle between them.
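Under the hood, the wv.similarity call in the script below returns exactly this cosine; a hand-rolled equivalent over two raw vectors (a numpy sketch, not part of the original script):

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g. cosine_similarity(wv['词A'], wv['词B']) for two words in the vocabulary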

import os
from gensim.models import Word2Vec,word2vec
from gensim.models.keyedvectors import KeyedVectors

# Corpus file ('文件绝对路径.txt' is a placeholder)
corpus_file = os.path.join(os.path.dirname(__file__), r'文件绝对路径.txt')
# Binary save file for the trained model
save_bin_file = os.path.join(os.path.dirname(__file__), 'embeddings', 'word2vec.model')
# Text-format save file for the trained model
save_txt_file = os.path.join(os.path.dirname(__file__), 'embeddings', 'word2vec.txt')

def train_word2vec():
    # Text8Corpus streams the whitespace-separated corpus file
    sentences = word2vec.Text8Corpus(corpus_file)
    model = Word2Vec(sentences=sentences, vector_size=200)
    # Save the trained vectors in binary and text formats
    model.wv.save_word2vec_format(save_bin_file, binary=True)
    model.wv.save_word2vec_format(save_txt_file)

def load_vectors():
    # Load the saved word-vector model
    # word_vectors = KeyedVectors.load_word2vec_format(save_bin_file, binary=True)
    word_vectors = KeyedVectors.load_word2vec_format(save_txt_file)
    return word_vectors


# Train the word vectors, then load and query them
if __name__ == '__main__':
    train_word2vec()

    wv = load_vectors()

    # Vocabulary size
    print('Vocabulary size:', len(wv.key_to_index))
    # Similarity between two words ('词汇' is a placeholder; pass two words from the corpus)
    print(wv.similarity('词汇', '词汇'))
    # The 10 words most similar to a given word
    print(wv.most_similar('词汇', topn=10))


Finally, the trained word vectors are saved to file so that the correlation between any two words can be looked up later: the greater the cosine similarity, the stronger the correlation.
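For reference, the text-format file written by save_word2vec_format follows the standard word2vec layout and can be inspected directly (a sketch reusing the save_txt_file path defined above):

# the first line of word2vec.txt is a header: "<vocab_size> <vector_size>";
# every following line is "<word> <v1> <v2> ... <v200>"
with open(save_txt_file, encoding='utf-8') as f:
    vocab_size, vector_size = f.readline().split()
    print('vocabulary:', vocab_size, 'dimensions:', vector_size)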


Source: blog.csdn.net/m0_71145378/article/details/126909067