In the Word2vec model, an unsupervised algorithm computes a d-dimensional vector for each word, that is, it maps each word to a point in d-dimensional space. The distance between these points (i.e., between the d-dimensional vectors of the words) reflects the similarity between the words.
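As a rough illustration of how closeness between two word vectors can be measured, the sketch below computes cosine similarity, which is the measure the distance tool bundled with word2vec reports. The function name cosine and the tiny 2- and 3-dimensional vectors are made up for this example; real word vectors have hundreds of dimensions.

```shell
#!/bin/bash
# cosine "<vector a>" "<vector b>": cosine similarity of two
# space-separated vectors. The name and the sample vectors below
# are illustrative assumptions, not part of word2vec.
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " ")
    m = split(b, y, " ")
    if (n != m || n == 0) exit 1          # vectors must have equal, non-zero length
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]                  # dot product
      na  += x[i] * x[i]                  # squared norm of a
      nb  += y[i] * y[i]                  # squared norm of b
    }
    if (na == 0 || nb == 0) exit 1        # cosine is undefined for a zero vector
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine "1 2 3" "2 4 6"   # same direction: prints 1.0000
cosine "1 0" "0 1"       # orthogonal: prints 0.0000
```

A value close to 1 means the two vectors point in nearly the same direction, which Word2vec treats as the words being similar.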
dav/word2vec is a classic multi-threaded implementation for training word vectors; it is easy to use and very efficient. However, the code is written in C for Linux and is driven by shell scripts, so on Windows you need to install Cygwin to run it.
1. Download Cygwin. The installer can be downloaded from the official website, or from the group files of QQ group 426491390.
2. Install the gcc, make, wget, and unzip packages in Cygwin. For example, the make package is installed as shown in the following figure:
3. Download and unzip the dav/word2vec code. The code can also be downloaded from the group files of QQ group 426491390.
4. Open Cygwin, change to the folder where dav/word2vec is located (Windows drives are mounted under the /cygdrive directory in Cygwin), and run:
cd scripts
sh demo-word.sh
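Before running the demo, it can help to confirm that the tools from step 2 are actually on the PATH. The following is a minimal sketch; the helper name missing_tools is made up for this example, and the path in the trailing comment is a hypothetical location for the unpacked code.

```shell
#!/bin/bash
# Print the name of every requested tool that is not found on the PATH.
# (missing_tools is a hypothetical helper, not part of word2vec or Cygwin.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools gcc make wget unzip)
if [ -n "$missing" ]; then
  echo "Not installed yet:" $missing
else
  echo "All required tools found."
fi

# Hypothetical example: code unpacked to D:\word2vec on Windows
# would be reached inside Cygwin with
#   cd /cygdrive/d/word2vec
```

If any tool is listed as missing, rerun the Cygwin installer and select the corresponding package.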
The script fails at this point. The reason is that src/makefile compiles word2vec.c into word2vec rather than word2vec.exe, and likewise scripts/demo-word.sh runs bin/word2vec rather than bin/word2vec.exe, so both src/makefile and scripts/demo-word.sh need small changes. The copy of dav/word2vec in the group files of QQ group 426491390 already includes these two changes.
Modify src/makefile to:
SCRIPTS_DIR=../scripts
BIN_DIR=../bin

CC = gcc
#The -Ofast might not work with older versions of gcc; in that case, use -O2
CFLAGS = -lm -pthread -O2 -Wall -funroll-loops -Wno-unused-result

all: word2vec word2phrase distance word-analogy compute-accuracy

word2vec : word2vec.c
	$(CC) word2vec.c -o ${BIN_DIR}/word2vec.exe $(CFLAGS)
word2phrase : word2phrase.c
	$(CC) word2phrase.c -o ${BIN_DIR}/word2phrase.exe $(CFLAGS)
distance : distance.c
	$(CC) distance.c -o ${BIN_DIR}/distance.exe $(CFLAGS)
word-analogy : word-analogy.c
	$(CC) word-analogy.c -o ${BIN_DIR}/word-analogy.exe $(CFLAGS)
compute-accuracy : compute-accuracy.c
	$(CC) compute-accuracy.c -o ${BIN_DIR}/compute-accuracy.exe $(CFLAGS)
	chmod +x ${SCRIPTS_DIR}/*.sh

clean:
	pushd ${BIN_DIR} && rm -rf word2vec.exe word2phrase.exe distance.exe word-analogy.exe compute-accuracy.exe; popd
Modify scripts/demo-word.sh to:
#!/bin/bash

DATA_DIR=../data
BIN_DIR=../bin
SRC_DIR=../src

TEXT_DATA=$DATA_DIR/text8
ZIPPED_TEXT_DATA="${TEXT_DATA}.zip"
VECTOR_DATA=$DATA_DIR/text8-vector.bin

pushd ${SRC_DIR} && make; popd

if [ ! -e $VECTOR_DATA ]; then
  if [ ! -e $TEXT_DATA ]; then
    if [ ! -e $ZIPPED_TEXT_DATA ]; then
      wget http://mattmahoney.net/dc/text8.zip -O $ZIPPED_TEXT_DATA
    fi
    unzip $ZIPPED_TEXT_DATA
    mv text8 $TEXT_DATA
  fi
  echo -----------------------------------------------------------------------------------------------------
  echo -- Training vectors...
  time $BIN_DIR/word2vec.exe -train $TEXT_DATA -output $VECTOR_DATA -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
fi

echo -----------------------------------------------------------------------------------------------------
echo -- distance...

$BIN_DIR/distance.exe $DATA_DIR/$VECTOR_DATA
Enter the scripts folder and run the script again:
cd scripts
sh demo-word.sh
Training starts once the command is entered:
After training completes, the similar-word query program (distance.exe) runs automatically:
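Why is large the word most similar to small in the example? Because Word2vec learns its vectors from the contexts in which words appear, the nearest words it finds tend to be words that are used in similar ways rather than words that are truly similar in meaning. Antonyms such as small and large occur in very similar contexts, so their vectors end up close together.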
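To check the trained model file itself, note that the file written by word2vec starts with a plain-text header line holding the vocabulary size and the vector dimension, even when -binary 1 is used. The sketch below reads that line; the helper name vector_header is made up for this example, and the path assumes the script is run from the scripts folder as above.

```shell
#!/bin/bash
# Print the header of a word2vec vectors file: "<vocab_size> <dimensions>".
# (vector_header is a hypothetical helper, not part of word2vec.)
vector_header() {
  local file=$1 vocab dims
  # Only the first line is read; the vector data after it is left untouched.
  read -r vocab dims < "$file" || return 1
  echo "words=$vocab dimensions=$dims"
}

# Path taken from demo-word.sh, relative to the scripts folder.
VECTORS=../data/text8-vector.bin
if [ -e "$VECTORS" ]; then
  vector_header "$VECTORS"
fi
```

With the -size 200 option used in demo-word.sh, the reported dimension should be 200.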
Further questions about Word2vec can be discussed in QQ group 426491390.