GloVe模型的理解及实践(2)

一、运行环境

Ubuntu16.04 + python 3.5

二、安装gensim

两种安装方式

1)打开终端

sudo easy_install --upgrade gensim

2)打开终端

pip install gensim

三、Git官方GitHub代码

https://github.com/stanfordnlp/GloVe

四、生成词向量

1.在glove文件下打开终端进行编译:

make

编译后生成 bin 文件夹,文件夹内有四个文件:

Readme中有关于四个文件的介绍。

1)vocab_count:计算原文本的单词统计(生成vocab.txt文件)

格式为“单词 词频”如下图:

2)cooccur:用于统计词与词的共现(生成二进制文件 cooccurrence.bin )

3)shuffle:生成二进制文件 cooccurrence.shuf.bin

4)glove:Glove算法的训练模型,生成vectors.txt和vectors.bin

2.执行 sh demo.sh

sh demo.sh  

如下图,下载默认语料库并训练模型:

最后得到 vectors.txt

五、词向量生成模型并加载

1.在目录下建一个 load_model.py 文件,代码如下

#!usr/bin/python
# -*- coding: utf-8 -*-
 
import shutil
import gensim  
def getFileLineNums(filename):  
    f = open(filename,'r')  
    count = 0  
  
    for line in f:  
          
        count += 1  
    return count
 
def prepend_line(infile, outfile, line):  
    """ 
    Function use to prepend lines using bash utilities in Linux. 
    (source: http://stackoverflow.com/a/10850588/610569) 
    """  
    with open(infile, 'r') as old:  
        with open(outfile, 'w') as new:  
            new.write(str(line) + "\n")  
            shutil.copyfileobj(old, new) 
      
def prepend_slow(infile, outfile, line):  
    """ 
    Slower way to prepend the line by re-creating the inputfile. 
    """  
    with open(infile, 'r') as fin:  
        with open(outfile, 'w') as fout:  
            fout.write(line + "\n")  
            for line in fin:  
                fout.write(line) 
 
def load(filename):  
      
    # Input: GloVe Model File  
    # More models can be downloaded from http://nlp.stanford.edu/projects/glove/  
    # glove_file="glove.840B.300d.txt"  
    glove_file = filename  
      
    dimensions = 50  
      
    num_lines = getFileLineNums(filename)  
    # num_lines = check_num_lines_in_glove(glove_file)  
    # dims = int(dimensions[:-1])  
    dims = 50  
      
    print num_lines  
        #  
        # # Output: Gensim Model text format.  
    gensim_file='glove_model.txt'  
    gensim_first_line = "{} {}".format(num_lines, dims)  
        #  
        # # Prepends the line.  
    #if platform == "linux" or platform == "linux2":  
    prepend_line(glove_file, gensim_file, gensim_first_line)  
    #else:  
    #    prepend_slow(glove_file, gensim_file, gensim_first_line)  
      
        # Demo: Loads the newly created glove_model.txt into gensim API.  
    model=gensim.models.KeyedVectors.load_word2vec_format(gensim_file,binary=False) #GloVe Model  
      
    model_name = gensim_file[6:-4]  
          
    model.save('./' + model_name)  
      
    return model  
 
 #load(glove.6B.300d.txt)#生成模型
 
 
if __name__ == '__main__':  
    myfile=open('vectors.txt')
    myfile.read()
    
    ####################################
    model_name='model\.6B.300d'
    model = gensim.models.KeyedVectors.load('./'+model_name)  
 
    print len(model.vocab)  
 
    word_list = [u'girl',u'dog']  
   
    for word in word_list:  
        print word,'--'  
        for i in model.most_similar(word, topn=10):  
            print i[0],i[1]  
        print '' 

在目录下终端运行

python load_model.py 

以上输出结果为:

词汇行数:400000

以及与 word_list = [u'girl',u'dog'] 最相近的Top 10 单词

六、测试

测试代码如下:

import shutil
import gensim  
model_name='model\.6B.300d'
model = gensim.models.KeyedVectors.load('./'+model_name)  
print len(model.vocab)  
word_list = [u'person',u'pet']
for word in word_list:  
    print word,'--'  
    for i in model.most_similar(word, topn=10):       
        print i[0],i[1]  
        print '' 

结果:

person --
someone 0.690635979176

man 0.64434415102

anyone 0.627825558186

woman 0.617089629173

one 0.591174006462

actually 0.579971313477

persons 0.577681422234

people 0.571225821972

else 0.562521100044

somebody 0.560000300407

pet --
pets 0.686407566071

dog 0.629159927368

cat 0.58703649044

dogs 0.545046746731

cats 0.526196360588

animal 0.516855597496

animals 0.507143497467

puppy 0.486273795366

toy 0.430860459805

rabbits 0.420677244663

参考:https://blog.csdn.net/sscssz/article/details/53333225 

猜你喜欢

转载自blog.csdn.net/qq_33373858/article/details/83684502
今日推荐