Training word2vec word vectors on a 2 GB corpus with Python + gensim

0 Introduction

The article "Mathematical Principles and Source Code of Word2Vec" gives a good explanation of the principles behind Word2Vec and walks through parts of its source code. There are two common ways to train Word2Vec word vectors: Google's own word2vec implementation, and the gensim library. Since most people work in Python, many use gensim. This article describes the word2vec model in gensim in detail; the other models included in gensim are not covered.

1 gensim library

Gensim (http://pypi.python.org/pypi/gensim) is an open-source, third-party Python toolkit for learning vector-space topic representations of raw, unstructured text in an unsupervised way. It is mainly used for topic modeling and document similarity, and it supports a variety of topic-model algorithms including TF-IDF, LSA, LDA, and word2vec. Gensim is handy for tasks such as obtaining word vectors.

1.1 gensim.models.word2vec API Overview

First install the gensim library with pip install gensim.
Then import the word2vec module with from gensim.models import word2vec. Note that the lowercase word2vec refers to the word2vec.py file, while the capitalized Word2Vec is the model class implemented in that file, so the model is created with word2vec.Word2Vec(). A short usage sketch follows the parameter list below.

class Word2Vec(utils.SaveLoad):
    def __init__(
            self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
            max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
            sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
            trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH):
  • sentences : can be a list; for a large corpus it is recommended to build it with LineSentence or PathLineSentence. If the model is created without this parameter (i.e. None is passed), the training corpus can be supplied later in a separate training call.
  • size : the dimensionality of the word vectors, default 100. This value is generally related to the size of the corpus: for a small corpus, e.g. less than 100 MB of text, the default is usually fine; for a larger corpus it is recommended to increase the dimensionality. A larger size needs more training data but gives better results. Recommended values are 100-300.
  • window : the window size, i.e. the maximum distance between the center word and a context word; the window includes window words before and window words after the center word. Note that word2vec actually samples a random value in [1, window] for each position, so the window is not fixed. The larger the window, the more distant words count as context. The default is 5. In practice the window size can be adjusted to the task; for a small corpus it can be set smaller. For a typical corpus the recommended value is between [5, 10]. My own understanding is that a center word may be related to several words before and after it, while some words are only related to a few nearby words (in short texts a word may only be related to its immediate neighbors).
  • min_count : the minimum frequency a word needs in order to get a word vector. This removes very rare, low-frequency words; the default is 5. For a small corpus this value can be lowered. It effectively prunes the dictionary: words with frequency lower than min_count are discarded.
  • negative : the number of negative samples used by Negative Sampling, default 5. Recommended values are between [3, 10].
  • cbow_mean : only used by CBOW when projecting the context. If 0, the context word vectors are summed; if 1, they are averaged. The default is 1; changing the default is not recommended.
  • iter : the number of iterations (epochs) of stochastic gradient descent, default 5. For a large corpus this value can be increased. (On the corpus below I used 5 epochs and the training results were not very good.)
  • alpha : the initial learning rate, which decays linearly to min_alpha during training. It is the initial step size of the stochastic gradient descent iterations; the default is 0.025.
  • min_alpha : since the algorithm gradually decreases the learning rate during the iterations, min_alpha gives the minimum learning rate. The step size of each SGD round is derived from iter, alpha, and min_alpha together. For a large corpus, alpha, min_alpha, and iter need to be tuned together to choose appropriate values.
  • max_vocab_size : a RAM limit on the vocabulary while building the word vectors; set it to None for no limit.
  • sample : the threshold for randomly downsampling high-frequency words, default 1e-3; the useful range is (0, 1e-5).
  • seed : the seed for the random number generator, which affects word vector initialization.
  • workers : the number of parallel training threads.
  • sg : selects which of the two word2vec models to use. If 0, the model is CBOW; if 1, the model is Skip-Gram. The default is 0, i.e. CBOW.
  • hs : selects which of the two word2vec training objectives to use. If 0 and negative is greater than 0, Negative Sampling is used; if 1, Hierarchical Softmax is used. The default is 0, i.e. Negative Sampling.
  • negative : if greater than 0, Negative Sampling is used, and this sets the number of noise words (usually 5-20).
  • hashfxn : the hash function used to initialize the weights; the default is Python's built-in hash function.
  • batch_words : the number of words passed to worker threads in each batch, default 10000.
  • trim_rule : a rule for trimming the vocabulary, specifying which words to keep and which to discard. Can be set to None (min_count is then used).
  • sorted_vocab : if 1 (the default), the vocabulary is sorted in descending order of word frequency before word indices are assigned.
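As a quick illustration of how these parameters fit together, here is a minimal sketch. It assumes a pre-4.0 gensim release matching the signature above (in gensim 4.x, size and iter were renamed vector_size and epochs); the toy sentences are made up for the example.

from gensim.models import word2vec

# a tiny toy corpus: each sentence is already a list of tokens
sentences = [['machine', 'learning', 'is', 'fun'],
             ['word2vec', 'learns', 'word', 'vectors']]

# CBOW (sg=0) with Negative Sampling (hs=0, negative=5), 100-dimensional vectors
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=1,
                          sg=0, hs=0, negative=5, iter=5, workers=3)

print(model.wv['word2vec'].shape)  # -> (100,)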

1.2 gensim training word2vec word vector steps

Training word2vec with gensim is very convenient. Training proceeds in the following steps (a sketch follows this list):

  • 1) Preprocess the corpus: one document or sentence per line, with the document or sentence segmented into words separated by spaces. English needs no segmentation, since English words are already separated by spaces; Chinese corpora need a segmentation tool. Common segmentation tools include StandNLP, ICTCLAS, Ansj, FudanNLP, HanLP, jieba, etc.;
  • 2) Convert the preprocessed corpus into an iterator over sentences, where each iteration returns one sentence as a list of words (in utf-8). The LineSentence() class in gensim's word2vec.py can be used for this;
  • 3) Pass the object built above into gensim's Word2Vec for training.
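A sketch of these three steps for a file-based corpus (the file name is a placeholder; the file is assumed to contain one tokenized, space-separated sentence per line in utf-8, and an older gensim version matching the signature above is assumed):

from gensim.models import word2vec

# step 2: LineSentence turns the segmented file into an iterator of token lists
sentences = word2vec.LineSentence('./corpus_segment.txt')

# step 3: pass the iterator to Word2Vec for training (step 1 produced the file)
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)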

2 training Sogou corpus

Corpus: news data from Sogou Labs is used as the training corpus. Sogou Labs: http://www.sogou.com/labs/resource/ca.php
I downloaded the full version, more than 600 MB in total and more than 1 GB after extraction; the data format is described on the download page.
Note that registration is required before downloading.
The decompressed data is news_sohusite_xml.dat, which is GBK-encoded and needs to be converted to UTF-8. Only the text inside the content tags is needed; the rest can be discarded. Linux commands are needed for this, and since my system is Windows 10, I had to install WSL to run them. For details see my blog post:
Using Linux commands under Windows
To extract the content, run the following command in Linux:
cat news_sohusite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
This produces a 1.71 GB corpus.txt file, which is too large for Notepad to open.
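If you would rather stay in Python instead of using WSL, a rough equivalent of the iconv/grep pipeline above is sketched below (file names are the same as above; errors='ignore' plays the role of iconv's -c flag):

# convert the GBK-encoded dump to UTF-8 and keep only the <content> lines
with open('news_sohusite_xml.dat', 'r', encoding='gbk', errors='ignore') as fin, \
     open('corpus.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        if '<content>' in line:
            fout.write(line)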
Note that all of the programs below are run in a Jupyter notebook; Jupyter notebook can be installed through Anaconda, see:
Installing and using Anaconda under Windows 10

2.1 Word segmentation

The documents must be segmented into words before they can be given to word2vec. jieba can be used for segmentation; install it with pip install jieba, then segment the original text:

file_path = './corpus.txt'
file_segment_path = './corpus_segment.txt'

# read the raw corpus, one <content> line per news article
train_file_read = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f.readlines():
        train_file_read.append(line)
train_file_read  # display the list in the notebook

Output: a list of the raw lines (shown as a screenshot in the original post).
print(len(train_file_read)) shows how many lines were read; the output is 1,411,996.

import jieba
# segment each line with jieba and store the space-joined result in a list
train_file_seg = []
for i in range(len(train_file_read)):
    # [9:-11] strips the surrounding <content> and </content> tags
    train_file_seg.append(' '.join(jieba.cut(train_file_read[i][9:-11], cut_all=False)))
    if i % 100 == 0:
        print(i)  # progress indicator
train_file_seg

The segmentation takes more than an hour.

# save the segmentation results to a file, one segmented article per line
with open(file_segment_path, 'w', encoding='utf-8') as f:
    for i in range(len(train_file_seg)):
        f.write(train_file_seg[i])
        f.write('\n')
# load the segmentation results back
seg_sentences = []
with open(file_segment_path, 'r', encoding='utf-8') as f:
    seg_sentences = f.readlines()
seg_sentences  # display the loaded lines in the notebook

Looking at the loaded results (shown as a screenshot in the original post), some lines are blank, i.e. articles with empty content. After removing the blank lines, 1,298,156 segmented lines remain.
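The removal code in the original post is a screenshot; a minimal sketch of that step, assuming the seg_sentences list loaded above, might look like this:

# drop lines that contain only whitespace (articles with empty content)
seg_sentences = [line for line in seg_sentences if line.strip()]
print(len(seg_sentences))  # 1,298,156 lines remained in the author's run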

2.2 Training the word vectors

The training code in the original post is shown as a screenshot. Training takes more than an hour.
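A minimal sketch of that training step, assuming the segmented file produced above and an older gensim version matching the signature in section 1.1 (the exact parameter values the author used are not recoverable from the screenshot):

from gensim.models import word2vec

# build the sentence iterator from the segmented corpus and train a CBOW model
sentences = word2vec.LineSentence(file_segment_path)
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5,
                          workers=4, sg=0, iter=5)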

2.3 Saving and loading the model

The save/load code in the original post is also a screenshot.
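A sketch using gensim's standard save/load methods (the file name is an assumption):

# save the trained model to disk and load it back later
model.save('./sogou_word2vec.model')
model = word2vec.Word2Vec.load('./sogou_word2vec.model')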

2.4 Using the word vectors

The query code in the original post is a screenshot. Note that model.similarity could be used in older versions; newer versions require model.wv.similarity.
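A sketch of typical queries against the trained vectors (the query words are arbitrary examples and must exist in the vocabulary):

print(model.wv.similarity('中国', '北京'))    # cosine similarity between two words
print(model.wv.most_similar('新闻', topn=5))  # the 5 most similar words
print(model.wv['新闻'])                        # the raw 100-dimensional vector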

3 Displaying the word vectors in 3-D space

See my GitHub.


Original article: blog.csdn.net/Elenstone/article/details/105284890