How to derive word vectors with gensim's word2vec (Python)

Gensim must first be installed as a package, and then a corpus is needed for training. The methods used here are skip-gram and CBOW; for details you can consult the relevant literature. Roughly speaking, both methods map words with similar meanings to nearby positions in a vector space.
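If gensim is not installed yet, a minimal sanity check (assuming it was installed with pip install gensim) looks like this:

# verify that gensim is importable and see which version is present
import gensim
print(gensim.__version__)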

Download the text8 corpus:

http://mattmahoney.net/dc/text8.zip

This corpus was found via this article: http://blog.csdn.net/m0_37681914/article/details/73861441
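If you prefer to fetch the corpus from a script rather than a browser, here is a minimal sketch using only the Python standard library (the local file names and paths are assumptions; adjust them to your setup, and note the archive is assumed to contain a single file named text8):

import urllib.request
import zipfile

# download the zipped corpus into the current directory
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip', 'text8.zip')

# extract the 'text8' file next to the zip
with zipfile.ZipFile('text8.zip') as archive:
    archive.extract('text8')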

Check whether the corpus needs preprocessing:
After downloading and extracting the data, we first need to understand how it is stored, to determine whether it meets the input requirements of the word2vec function in the gensim package. word2vec prefers its input to be one continuous text, with no line breaks and no punctuation. So we should check whether text8 meets these conditions. However, double-clicking text8 to open it does not work, because the file is too large, so we need to open it with a program. The code is as follows:

with open('/text8', 'r', encoding='utf-8') as file:
    for line in file.readlines():
        print(line)
The program reports that there is not enough memory to print it out, evidently because a single line contains too many characters. This can be verified as follows:

with open('/text8', 'r', encoding='utf-8') as file:
    for line in file.readlines():
        print(len(line))


Only one value is printed, meaning the data has only one line, and the displayed length of this line is 100,000,000 characters. Since the structure of the data in the file is consistent, we do not need to print all of it; a part is enough to learn its structure. So modify the code as follows:

a = 0
b = 0
with open('/text8', 'r', encoding='utf-8') as file:
    line = file.read()
for char in line:
    b += 1
    print(char, end='')
    if b - a == 100:
        a = b
        print('\n')
    if a == 5000:
        break
This lets us look at the first 5000 characters of the output, broken into lines of 100 characters each.
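Incidentally, the same inspection can be done more simply by reading only the first 5000 characters instead of the whole 100-million-character string; a minimal alternative sketch:

# read at most the first 5000 characters, then print them 100 per line
with open('/text8', 'r', encoding='utf-8') as file:
    text = file.read(5000)
for i in range(0, len(text), 100):
    print(text[i:i+100])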

[Figure: the first lines of the printed output]

This is just a part of the beginning, but you can see that the data contains no punctuation, and we already verified that all of it is on a single line, i.e. there are no line breaks. So we do not need to preprocess the data. Next, the data processing section.

Data processing section:

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('/text8')
model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
model.save('/text8.model')
print(model['man'])
Then the line

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)

means that the program will output log information in the format date (asctime) : information level (levelname) : log message (message), where the level is set to normal information (logging.INFO). To learn more about the logging module, see https://www.cnblogs.com/bjdxy/archive/2013/04/12/3016820.html
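As a quick illustration of that format string, the following standalone sketch (not part of the word2vec program; the message text is made up) produces one timestamped line per log call:

import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)
logging.info('training started')
# prints something like: 2018-03-15 10:00:00,000:INFO:training started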

[Figure: the log output]

The log information output is shown in the figure above. In practical work we could leave the logging out, but only if we are certain the program is correct and nothing can go wrong, because once a mistake happens we need the log information to infer the possible cause of the error.

The corpus is loaded into sentences:

sentences = word2vec.Text8Corpus('/text8')
Generate the word-vector space model:

model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
Here is the class definition:

class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)
Parameters:
1. sentences: can be a list; for a large corpus, it is recommended to build it with BrownCorpus, Text8Corpus, or LineSentence.
2. sg: sets the training algorithm. The default is 0, corresponding to CBOW; sg=1 uses the skip-gram algorithm.
3. size: the dimensionality of the output word vectors; the default is 100. Larger sizes need more training data but give better results. Recommended values are in the tens to hundreds.
4. window: the training window size; 8 means considering the 8 words before and the 8 words after each word (the actual code uses a randomly chosen window size <= window). The default is 5.
5. alpha: the learning rate.
6. seed: seed for the random number generator, used in initializing the word vectors.
7. min_count: prunes the dictionary; words whose frequency is below min_count are discarded. The default is 5.
8. max_vocab_size: limits the RAM used while building the vocabulary; if there are more unique words than this, the least frequent ones are pruned. Roughly every 10 million words need about 1GB of RAM. Set to None for no limit.
9. sample: the downsampling threshold; the more often a word appears in the training samples, the more strongly it is downsampled. The default is 1e-3, with a useful range of (0, 1e-5).
10. workers: controls the number of parallel training threads.
11. hs: whether to use the hierarchical softmax method; 0 means no, 1 means yes. The default is 0.
12. negative: if > 0, negative sampling is used, and the value sets the number of noise words.
13. cbow_mean: if 0, use the sum of the context word vectors; if 1 (the default), use their mean. Only applies when CBOW is used.
14. hashfxn: the hash function used to initialize the weights; the default is Python's built-in hash function.
15. iter: the number of iterations (epochs); the default is 5.
16. trim_rule: sets the vocabulary trimming rule, specifying which words to keep and which to discard. It can be None (min_count is used) or a function that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT (see the sketch after this list).
17. sorted_vocab: if 1 (the default), sort the vocabulary in descending order of frequency before assigning word indexes.
18. batch_words: the number of words passed to each worker thread per batch; the default is 10,000.
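To make parameter 16 concrete, here is a minimal sketch of a custom trim_rule (the chosen word is an arbitrary example, not something the original article uses): it keeps one word regardless of frequency and otherwise falls back to the normal min_count behavior.

from gensim import utils
from gensim.models import word2vec

def my_trim_rule(word, count, min_count):
    # always keep 'anarchism' (an arbitrary example word), even if it is rare
    if word == 'anarchism':
        return utils.RULE_KEEP
    # fall back to the normal min_count check for every other word
    return utils.RULE_DEFAULT

sentences = word2vec.Text8Corpus('/text8')
model = word2vec.Word2Vec(sentences, min_count=5, trim_rule=my_trim_rule)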
The generated model is then saved for later use:

model.save('/text8.model')

Next time we do not need to load the corpus and generate the model again; we only need to:

'''
sentences = word2vec.Text8Corpus('/text8')
model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
model.save('/text8.model')
'''
model = word2vec.Word2Vec.load('/text8.model')
Finally, look at a word vector:

print(model['man'])


Of course, the model can do much more, such as computing the similarity of two words, and so on; you can look up the details yourself.
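For instance, here is a minimal sketch using the same pre-4.0 gensim API as the rest of this article (in gensim 4.x these methods live on model.wv instead):

# cosine similarity between two words
print(model.similarity('man', 'woman'))

# the words closest to 'man' in the vector space
print(model.most_similar('man'))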
---------------------
Author: lwn556u5ut
Source: CSDN
Original: https://blog.csdn.net/weixin_40292043/article/details/79571346
Copyright: This is the blogger's original article; please include a link to the post when reproducing it.
