GloVe Notes

Paper:

First, testing the Python implementation:

Install:
sudo pip install glove_python
Download the source code:

Run the bundled examples:
ipython -i -- examples/example.py -c my_corpus.txt -t 10
Error: gensim is not installed.

sudo pip install --upgrade gensim

Install smart_open (used for opening files):
sudo pip install smart_open

Install ...... (too many dependencies to chase down; giving up here and trying the C version instead)

Now testing the C version.
Download the source and run the demo:
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
$ ./demo.sh
After it finishes, the following files have been generated:
cooccurrence.bin
cooccurrence.shuf.bin
vocab.txt  # vocabulary words and their counts
vectors.bin  # word vectors in binary form
vectors.txt  # one vocabulary word per line, followed by its vector
vectors.bin looks similar to the binary word-vector file that word2vec produces, so can word2vec read vectors.bin directly? Trying it:
model = word2vec.load("vectors.bin")
This raises the following error:
ValueError: invalid literal for int() with base 10: '\x95\x87\xf2\x9aD\xa7\xf2?;\xbe6\x9dg\x05\xe8?\xf6\xf8\x93?\xfb*\xdb?%M}\xb9\xa1\x92\xd1?.\x85\xba\x19\xd7\xb4\xdb\xbf\x17\x16'
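The failure makes sense if GloVe's vectors.bin is a raw array of doubles with no header: the word order comes from vocab.txt, and each row holds the vector plus a bias term (with -binary 2, context vectors follow the word vectors). Under those assumptions, a minimal sketch of reading it directly (read_glove_bin is a hypothetical helper, not part of either library):

```python
import numpy as np

def read_glove_bin(bin_path, vocab_path, vector_size=50):
    # Assumption: vectors.bin is a headerless array of float64 values;
    # row i corresponds to line i of vocab.txt, and each row holds
    # vector_size weights plus one bias term. Any context vectors
    # (saved with -binary 2) follow after the first vocab_size rows.
    words = [line.split()[0] for line in open(vocab_path, encoding="utf-8")]
    rows = np.fromfile(bin_path, dtype=np.float64).reshape(-1, vector_size + 1)
    return {w: rows[i, :vector_size] for i, w in enumerate(words)}
```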
Looking further into the load method, it can also read txt files:
def load(fname, kind='auto', *args, **kwargs):
    if kind == 'auto':
        if fname.endswith('.bin'):
            kind = 'bin'
        elif fname.endswith('.txt'):
            kind = 'txt'
        else:
            raise Exception('Could not identify kind')
    if kind == 'bin':
        return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
    elif kind == 'txt':
        return word2vec.WordVectors.from_text(fname, *args, **kwargs)
    elif kind == 'mmap':
        return word2vec.WordVectors.from_mmap(fname, *args, **kwargs)
    else:
        raise Exception('Unknown kind')
The from_text method contains the following code:
with open(fname, 'rb') as fin:
    header = fin.readline()
    vocab_size, vector_size = list(map(int, header.split()))
So from_text reads vocab_size and vector_size from the first (header) line, while in vectors.txt the first line is already a word and its vector.
Could we simply insert vocab_size and vector_size as a new first line of vectors.txt and then load it directly? (still to be tested......)
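That idea is easy to sketch: count the lines of vectors.txt, infer the dimensionality from the first line, and write a copy with the header prepended (add_word2vec_header and the output filename are my own hypothetical names; this is untested against word2vec itself):

```python
def add_word2vec_header(src="vectors.txt", dst="vectors_w2v.txt"):
    # Prepend the "vocab_size vector_size" header line that
    # word2vec's from_text expects onto a GloVe vectors.txt.
    with open(src, encoding="utf-8") as fin:
        lines = fin.readlines()
    vector_size = len(lines[0].split()) - 1  # first token on each line is the word
    with open(dst, "w", encoding="utf-8") as fout:
        fout.write("%d %d\n" % (len(lines), vector_size))
        fout.writelines(lines)
```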
The approach I used here follows this blog post: http://blog.csdn.net/adooadoo/article/details/38505497
The test code is at https://github.com/eclipse-du/glove_py_model_load/blob/master/glove_dist.py and runs as expected.
The test above used the files that demo.sh generates from text8; how do we train on our own corpus?
Looking at demo.sh, a small change is enough to point the input file at a custom corpus. The modified script:
#!/bin/bash
set -e
make
CORPUS=data_fenci.txt  # path to the corpus, already word-segmented
VOCAB_FILE=vocab.txt  # output vocabulary
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
Running the script prints the commands it actually executes, shown below; many parameters can be tuned here, such as the window size and vector size:
$make
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
$ build/vocab_count -min-count 5 -verbose 2 < /home/zhangwj/Applications/Scrapy/baike/files/data_fenci.txt > vocab.txt
BUILDING VOCABULARY
Processed 19899975 tokens.
Counted 388318 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 107254.

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 15 < /home/zhangwj/Applications/Scrapy/baike/files/data_fenci.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 107254 words.
Building lookup table...table contains 106188469 elements.
Processed 19899975 tokens.
Writing cooccurrences to disk.........3 files in total.
Merging cooccurrence files: processed 63589490 lines.

$ build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 63589490 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 63589490 lines.

$ build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin -x-max 10 -iter 15 -vector-size 50 -binary 2 -vocab-file vocab.txt -verbose 2
TRAINING MODEL
Read 63589490 lines.
Initializing parameters...done.
vector size: 50
vocab size: 107254
x_max: 10.000000
alpha: 0.750000
07/22/16 - 01:27.37PM, iter: 001, cost: 0.083864

Testing the generated vectors.txt with https://github.com/zhangweijiqn/testPython/blob/master/src/Word2vec/test_glove.py shows that similar-word lookup works.
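The similar-word lookup itself is just a cosine-similarity ranking over the loaded vectors. A minimal sketch, assuming the vectors sit in a {word: np.ndarray} dict (most_similar is a hypothetical helper, not the API of that script):

```python
import numpy as np

def most_similar(word, vectors, topn=5):
    # Rank every other word by cosine similarity to the query word.
    v = vectors[word] / np.linalg.norm(vectors[word])
    scores = [(w, float(np.dot(v, u / np.linalg.norm(u))))
              for w, u in vectors.items() if w != word]
    return sorted(scores, key=lambda kv: -kv[1])[:topn]
```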
Reposted from blog.csdn.net/zhangweijiqn/article/details/53214512