GloVe Notes

Paper:

First, testing the Python implementation:

Install:
sudo pip install glove_python
Download the source code:

Run the bundled examples:
ipython -i -- examples/example.py -c my_corpus.txt -t 10
Error: gensim is not installed.

sudo pip install --upgrade gensim

Install smart_open (used for opening files):
sudo pip install smart_open

Install ...... (too many dependencies to chase down; giving up here and trying the C version instead)

Now testing the C version.
Download the source and run the demo:
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
$ ./demo.sh
After it finishes, the following files have been generated:
cooccurrence.bin
cooccurrence.shuf.bin
vocab.txt  # vocabulary words and their counts
vectors.bin  # word vectors in binary form
vectors.txt  # one vocabulary word per line, followed by its vector
vectors.bin looks similar to the binary word-vector file that word2vec produces, so can word2vec read vectors.bin directly? Trying it:
model = word2vec.load("vectors.bin")
This raises the following error:
ValueError: invalid literal for int() with base 10: '\x95\x87\xf2\x9aD\xa7\xf2?;\xbe6\x9dg\x05\xe8?\xf6\xf8\x93?\xfb*\xdb?%M}\xb9\xa1\x92\xd1?.\x85\xba\x19\xd7\xb4\xdb\xbf\x17\x16'
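The failure makes sense if GloVe's vectors.bin is a raw array of doubles with no header: the word order comes from vocab.txt, and each row holds the vector plus a bias term (with -binary 2, context vectors follow the word vectors). Under those assumptions, a minimal sketch of reading it directly (read_glove_bin is a hypothetical helper, not part of either library):

```python
import numpy as np

def read_glove_bin(bin_path, vocab_path, vector_size=50):
    # Assumption: vectors.bin is a headerless array of float64 values;
    # row i corresponds to line i of vocab.txt, and each row holds
    # vector_size weights plus one bias term. Any context vectors
    # (saved with -binary 2) follow after the first vocab_size rows.
    words = [line.split()[0] for line in open(vocab_path, encoding="utf-8")]
    rows = np.fromfile(bin_path, dtype=np.float64).reshape(-1, vector_size + 1)
    return {w: rows[i, :vector_size] for i, w in enumerate(words)}
```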
Looking further into the load method, it can also read txt files:
def load(fname, kind='auto', *args, **kwargs):
    if kind == 'auto':
        if fname.endswith('.bin'):
            kind = 'bin'
        elif fname.endswith('.txt'):
            kind = 'txt'
        else:
            raise Exception('Could not identify kind')
    if kind == 'bin':
        return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
    elif kind == 'txt':
        return word2vec.WordVectors.from_text(fname, *args, **kwargs)
    elif kind == 'mmap':
        return word2vec.WordVectors.from_mmap(fname, *args, **kwargs)
    else:
        raise Exception('Unknown kind')
The from_text method contains the following code:
with open(fname, 'rb') as fin:
    header = fin.readline()
    vocab_size, vector_size = list(map(int, header.split()))
So from_text reads vocab_size and vector_size from the first (header) line, while in vectors.txt the first line is already a word and its vector.
Could we simply insert vocab_size and vector_size as a new first line of vectors.txt and then load it directly? (still to be tested......)
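That idea is easy to sketch: count the lines of vectors.txt, infer the dimensionality from the first line, and write a copy with the header prepended (add_word2vec_header and the output filename are my own hypothetical names; this is untested against word2vec itself):

```python
def add_word2vec_header(src="vectors.txt", dst="vectors_w2v.txt"):
    # Prepend the "vocab_size vector_size" header line that
    # word2vec's from_text expects onto a GloVe vectors.txt.
    with open(src, encoding="utf-8") as fin:
        lines = fin.readlines()
    vector_size = len(lines[0].split()) - 1  # first token on each line is the word
    with open(dst, "w", encoding="utf-8") as fout:
        fout.write("%d %d\n" % (len(lines), vector_size))
        fout.writelines(lines)
```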
The approach I used here follows this blog post: http://blog.csdn.net/adooadoo/article/details/38505497
The test code is at https://github.com/eclipse-du/glove_py_model_load/blob/master/glove_dist.py and runs as expected.
The test above used the files that demo.sh generates from text8; how do we train on our own corpus?
Looking at demo.sh, a small change is enough to point the input file at a custom corpus. The modified script:
#!/bin/bash
set -e
make
CORPUS=data_fenci.txt  # path to the corpus, already word-segmented
VOCAB_FILE=vocab.txt  # output vocabulary
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
Running the script prints the commands it actually executes, shown below; many parameters can be tuned here, such as the window size and vector size:
$make
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
$ build/vocab_count -min-count 5 -verbose 2 < /home/zhangwj/Applications/Scrapy/baike/files/data_fenci.txt > vocab.txt
BUILDING VOCABULARY
Processed 19899975 tokens.
Counted 388318 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 107254.

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 15 < /home/zhangwj/Applications/Scrapy/baike/files/data_fenci.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 107254 words.
Building lookup table...table contains 106188469 elements.
Processed 19899975 tokens.
Writing cooccurrences to disk.........3 files in total.
Merging cooccurrence files: processed 63589490 lines.

$ build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 63589490 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 63589490 lines.

$ build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin -x-max 10 -iter 15 -vector-size 50 -binary 2 -vocab-file vocab.txt -verbose 2
TRAINING MODEL
Read 63589490 lines.
Initializing parameters...done.
vector size: 50
vocab size: 107254
x_max: 10.000000
alpha: 0.750000
07/22/16 - 01:27.37PM, iter: 001, cost: 0.083864

Testing the generated vectors.txt with https://github.com/zhangweijiqn/testPython/blob/master/src/Word2vec/test_glove.py shows that similar-word lookup works.
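The similar-word lookup itself is just a cosine-similarity ranking over the loaded vectors. A minimal sketch, assuming the vectors sit in a {word: np.ndarray} dict (most_similar is a hypothetical helper, not the API of that script):

```python
import numpy as np

def most_similar(word, vectors, topn=5):
    # Rank every other word by cosine similarity to the query word.
    v = vectors[word] / np.linalg.norm(vectors[word])
    scores = [(w, float(np.dot(v, u / np.linalg.norm(u))))
              for w, u in vectors.items() if w != word]
    return sorted(scores, key=lambda kv: -kv[1])[:topn]
```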
Reposted from blog.csdn.net/zhangweijiqn/article/details/53214512