Use gensim framework and Word2Vec word vector model to obtain similar words

Prerequisite knowledge

gensim framework

Gensim is a Python framework for topic modeling and vector-space text processing. Besides Word2Vec, it also provides implementations of LSA, LDA, and HDP.

Word2Vec

Word2Vec is a probabilistic language model built on a shallow neural network architecture; it learns dense word vectors (embeddings) from raw text.


Word2Vec has two model architectures:
- CBOW model: the most important model in Word2Vec. Its input is the word vectors of the surrounding context words, and its output is the word vector of the current word; that is, it predicts the current word from its context.
- Skip-Gram model: the opposite of CBOW. It uses the current word to predict the surrounding context words.


Optimization method
Negative Sampling: normally, every training sample causes the weights of all output units to be adjusted so that the network's prediction becomes more accurate. Negative sampling instead updates only a small subset of the weights for each training sample (those for the target word plus a few randomly drawn "negative" words), thereby reducing the amount of computation in each gradient descent step.

Hierarchical Softmax: a traditional word vector model generally has an input layer (word vectors), a hidden layer, and an output layer (softmax). The softmax layer is by far the most time-consuming, since it must be computed over the entire vocabulary. Word2Vec improves this model in two ways. First, the mapping from the input layer to the hidden layer does not use the usual linear transformation plus activation function; instead it simply sums all the input word vectors and takes their average. Second, to avoid computing the softmax probability of every word, word2vec replaces the mapping from the hidden layer to the output softmax layer with a binary Huffman tree, reducing the per-word cost from linear to logarithmic in the vocabulary size.


Word2Vec model download

Word vector model download address

Save to local after downloading

Load word vector model

# Load the pre-trained word vector model (plain-text word2vec format)
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(read_path1, binary=False)

# Get the 10 words most similar to term;
# the return value has the form [('xx', 0.89123), ('xx', 0.88123), ...]
result_list = model.most_similar(term, topn=10)





Origin blog.csdn.net/qq_43965708/article/details/111241754