gensim word2vec | study notes

 

  I've been running models for a paper recently and needed word2vec, but I couldn't make sense of the posts I found online (or maybe I'm just slow), so here is my first blog post!!! I plan to write a whole series about word2vec tooling, but today's article only covers:

  • How to load a word2vec model
  • How to use a word2vec model to get word vectors and word similarities
  • How to save a word2vec model

 

1. word2vec overview

  In 2013, Google open-sourced word2vec, a tool for computing word vectors, and it drew a lot of attention from both industry and academia. First, word2vec trains efficiently even on corpora with dictionaries in the millions of words; second, the word vectors (word embeddings) it produces measure the similarity between words quite well. I don't know much about word2vec's internal algorithms myself; I've only read a few blog posts. One is linked below for anyone interested:

  https://www.cnblogs.com/guoyaohua/p/9240336.html

  

2. Using word2vec

  

 # import the word2vec module
 from gensim.models import word2vec

 # corpus: a list of tokenized sentences (each sentence is a list of words)
 sentences = [['shape', 'look', 'looking', 'screen', 'sound', 'big', 'waiting time', 'long', 'camera', 'effect', 'special'],
              ['phone', 'good-looking', 'period of time'],
              ['phone', 'received', 'very beautiful', 'follow-up', 'evaluation']]

 # build the model
 model = word2vec.Word2Vec(sentences, size=4, window=5, min_count=1)

 # query word similarities and the word vector itself
 model.most_similar(u'shape')
 model['shape']

 

# output
# words most similar to 'shape' (most_similar returns the top 10 by default, as a list)

[('big', 0.7367413640022278), ('sound', 0.657544732093811), ('follow-up', 0.5379071235656738), ('long', 0.5151427984237671), ('period of time', 0.4361593723297119), ('phone', 0.33148619532585144), ('special', 0.19552142918109894), ('evaluation', 0.09857006371021271), ('waiting time', 0.08498627692461014), ('received', -0.01799720525741577)]

# the word vector for 'shape'

array([-0.03313196,  0.04037894, -0.11632963, -0.08618639], dtype=float32)
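  A quick follow-up on this toy model: besides most_similar, you can also ask for the similarity between two specific words. This is just a minimal sketch; the word choices come from the toy corpus above, and it assumes a pre-4.0 gensim where model['word'] still works, as in the rest of this article:

 # cosine similarity between two words from the toy corpus above
 print(model.similarity(u'shape', u'sound'))      # a single float in [-1, 1]
 # the same queries are also available on the KeyedVectors object model.wv
 print(model.wv.most_similar(u'shape', topn=3))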

 

  After the example above, doesn't it look super simple? Of course, that was just a toy example. Next you'll see how to turn txt files into input the model can use, but first, a detailed look at the Word2Vec parameters:

  

# Parameter description

word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5,
                  min_count=5, max_vocab_size=None, sample=0.001, seed=1,
                  workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5,
                  cbow_mean=1, hashfxn=<built-in function hash>, iter=5,
                  null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)

--- sentences: can be a plain list; for large corpora it is recommended to build it with BrownCorpus, Text8Corpus or LineSentence.
--- size: the dimensionality of the word vectors, default 100.
--- window: the maximum distance between the current word and the predicted word within a sentence; when computing a word's vector, up to this many words before and after it are considered.
--- sg: the training algorithm; the default 0 uses CBOW, sg=1 uses skip-gram.
--- alpha: the initial learning rate.
--- seed: seed for the random number generator, used when initializing the word vectors.
--- min_count: prunes the dictionary; words with frequency lower than min_count are discarded, default 5.
--- max_vocab_size: RAM limit while building the vocabulary. If the number of unique words exceeds it, the least frequent ones are pruned. Roughly every 10 million word types need about 1GB of RAM. Set to None for no limit.
--- sample: threshold for randomly downsampling high-frequency words, default 1e-3, useful range (0, 1e-5).
--- workers: number of worker threads used for parallel training.
--- hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
--- negative: if > 0, negative sampling is used, and this sets the number of noise words.
--- cbow_mean: if 0, use the sum of the context word vectors; if 1 (default), use their mean. Only applies when CBOW is used.
--- hashfxn: hash function used to initialize the weights; defaults to Python's built-in hash function.
--- iter: number of iterations (epochs) over the corpus, default 5.
--- trim_rule: rule for trimming the vocabulary, specifying which words to keep and which to discard. Can be None (min_count is used) or a callable that returns utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
--- sorted_vocab: if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
--- batch_words: number of words per batch passed to the worker threads, default 10000.

 

  Here's an example of using word2vec to compute word vectors from txt files:

 

  

 # use three text files as the corpus (news articles I crawled myself)
 import re
 import jieba
 from zhon.hanzi import punctuation
 from gensim.models import word2vec

 path1 = 'C:/Users/Administrator/Desktop/data/News/Health News/2019 "World Influenza Day" science activities and academic conference held in Beijing.txt'
 path2 = 'C:/Users/Administrator/Desktop/data/News/Health News/67-year-old mother says she conceived naturally, experts puzzled.txt'
 path3 = 'C:/Users/Administrator/Desktop/data/News/Health News/post-90s health: health care "can\'t afford to eat" health anxiety.txt'

 def get_load(path):
     f = open(path, 'r', encoding='utf-8')
     data = f.read()
     new_s = re.sub(r'[%s,\t,\\]+' % punctuation, " ", data)  # strip Chinese punctuation
     cut_s = jieba.lcut(new_s)                                 # segment with jieba
     sentences = []
     for word in cut_s:
         if word != '\n' and word != ' ':
             sentences.append(word)
     return sentences

 data1 = get_load(path1)
 data2 = get_load(path2)
 data3 = get_load(path3)
 final_data = [data1, data2, data3]

 # build the model
 model = word2vec.Word2Vec(final_data, size=50, window=4)

 model['健康']
# output
>>> model['健康']
array([ 8.6532356e-03,  2.1515305e-03,  3.4037780e-03, -4.4254097e-03,
       -8.4194457e-03, -1.5364622e-03,  1.0745996e-02,  5.3538852e-03,
       -1.1601291e-03,  6.8697990e-03,  8.7537011e-03,  8.6077927e-03,
        1.4498243e-03,  2.6482970e-03, -3.4553630e-03,  8.2870452e-03,
        3.5420412e-03,  8.8039534e-03, -3.6633634e-03,  5.4932209e-03,
       -7.5302450e-03,  9.6533290e-04, -1.9622964e-03,  6.5719029e-03,
       -3.7521331e-04, -9.1459788e-04, -8.3307233e-03,  2.9766238e-03,
        7.6092435e-03, -8.3235843e-04, -9.2809896e-05, -6.7277048e-03,
        1.5067700e-03, -8.0193384e-03, -1.0153291e-02,  5.9706415e-03,
        4.3323904e-04, -9.5779281e-03, -9.3199704e-03,  3.5575093e-03,
        3.0641828e-03,  4.4296687e-03,  2.8934417e-04, -1.8675557e-03,
       -4.8446902e-03, -3.5805893e-03, -1.1002035e-03, -1.0306393e-02,
        4.5978278e-03,  6.8134381e-03], dtype=float32)
>>> model.most_similar('健康')
[('67', 0.37046998739242554), ('will', 0.363727331161499), ('the', 0.30487531423568726), ('country', 0.2739967703819275), ('social', 0.26224130392074585), ('news', 0.19897636771202087), ('mothers', 0.19829007983207703), ('-', 0.19742634892463684), ('the age', 0.16749148070812225), ('after', 0.15823742747306824)]
# these results aren't great; stop words were never removed

 

 

For the text processing above I only did word segmentation; normally you would also remove stop words and low-frequency words. I skipped that here out of laziness.
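If you do want the stop-word step, here is a minimal sketch of how get_load could be extended. The stop-word file stopwords.txt (one word per line) is a hypothetical path, not something from the original post:

 import re
 import jieba
 from zhon.hanzi import punctuation

 def get_load_with_stopwords(path, stopword_path='stopwords.txt'):
     # load the stop-word list (hypothetical file, one word per line)
     with open(stopword_path, 'r', encoding='utf-8') as f:
         stopwords = set(f.read().split())
     with open(path, 'r', encoding='utf-8') as f:
         data = f.read()
     new_s = re.sub(r'[%s,\t,\\]+' % punctuation, " ", data)  # strip punctuation, same as get_load
     cut_s = jieba.lcut(new_s)                                 # segment with jieba
     # drop whitespace tokens and stop words
     return [w for w in cut_s if w.strip() and w not in stopwords]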

Next, here's how to use word2vec's file-based corpus tools to process the text:

# this file is a corpus of already word-segmented Chinese text --- JD.com comments on an OPPO phone
from gensim.models import word2vec
import logging
 
# The main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

path2 = 'C:/Users/Administrator/Desktop/data/comment/cut_comment.txt'


sentences = word2vec.Text8Corpus(path2)   # Text8Corpus reads a file of whitespace-separated tokens
model = word2vec.Word2Vec(sentences, size=20)

model['oppo']
model.most_similar('praise')
model.similarity(u'praise', 'oppo')
>>> model.most_similar('oppo')
[('First', 0.9777179956436157), ('trust', 0.9736739993095398), ('hope', 0.9670151472091675), ('price', 0.9577376842498779), ('content', 0.9538010954856873), ('not filled', 0.9495989084243774), ('packaging', 0.9487740993499756), ('this', 0.9475699663162231), ('broken', 0.9475245475769043), ('evaluation', 0.9470676779747009)]
>>> model.most_similar('praise')
[('Quality', 0.9727074503898621), ('shopping', 0.9600175619125366), ('genuine', 0.9578911066055298), ('fighter', 0.9555199146270752), ('like', 0.9444591999053955), ('my wife', 0.9358581304550171), ('phone', 0.9266927242279053), ('recommended', 0.9224187731742859), ('the goods have been', 0.9196405410766602), ('friend', 0.917504072189331)]
>>> model.most_similar('phone')
[('Baby', 0.9600850343704224), ('days', 0.9596285820007324), ('shopping', 0.9558006525039673), ('some time', 0.9556002020835876), ('quality', 0.9525821208953857), ('genuine', 0.9524366855621338), ('arrival', 0.9513840079307556), ('true', 0.9481478929519653), ('received', 0.9459341764450073), ('next', 0.9382076263427734)]
>>> model.similarity(u'praise', 'oppo')
0.81516

Once your files are in the format above, you can process all of them and aggregate the results into a single file, then train on that file directly.
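For example, once all the segmented comments live in one file (one sentence per line, words separated by spaces), LineSentence can stream it, and the trained model can be saved and loaded again, which also covers the "how to save a word2vec model" item from the list at the top. This is only a sketch, assuming a hypothetical aggregated file all_comments.txt:

 from gensim.models import word2vec

 corpus_path = 'all_comments.txt'                  # hypothetical aggregated corpus file

 sentences = word2vec.LineSentence(corpus_path)    # streams the file one line (= one sentence) at a time
 model = word2vec.Word2Vec(sentences, size=20, min_count=5)

 model.save('comment_w2v.model')                       # save the full model (training can be resumed)
 model = word2vec.Word2Vec.load('comment_w2v.model')  # load it back later

 # save only the word vectors, in the classic word2vec text format
 model.wv.save_word2vec_format('comment_vectors.txt', binary=False)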

  

 

 

  That's it for now. These are my study notes, and I hope they help whoever reads this. If anything here is wrong, please point it out in the comments below!!!

  


Origin www.cnblogs.com/learn-ruijiali/p/12091136.html