Run the C language version of Word2Vec to train word vectors under Windows

The Word2vec model learns a d-dimensional vector for each word in an unsupervised way. In other words, it maps each word to a point in d-dimensional space, and the distance between two points (that is, between the d-dimensional vectors of the two words) reflects how similar the words are.
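As a rough illustration of how a distance between vectors can reflect similarity, the sketch below computes cosine similarity, one common measure for comparing word vectors. The `cosine` function and the three tiny vectors are hypothetical and made up for this example; they do not come from a trained model.

```shell
#!/bin/sh
# cosine "v1" "v2": each argument is a space-separated list of numbers.
# Prints the cosine similarity of the two vectors with 4 decimal places.
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " ")
    m = split(b, y, " ")
    if (n != m || n == 0) { print "error: vectors need equal, nonzero length"; exit 1 }
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]    # dot product
      na  += x[i] * x[i]    # squared length of the first vector
      nb  += y[i] * y[i]    # squared length of the second vector
    }
    if (na == 0 || nb == 0) { print "error: zero vector"; exit 1 }
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

# Made-up 3-dimensional vectors, for illustration only.
small="0.9 0.1 0.3"
large="0.8 0.2 0.3"
banana="0.1 0.9 0.0"

echo "small vs large:  $(cosine "$small" "$large")"
echo "small vs banana: $(cosine "$small" "$banana")"
```

A value close to 1 means the two vectors point in nearly the same direction, while a value near 0 means they are unrelated under this measure.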

dav/word2vec is a classic implementation that trains word vectors with multiple threads; it is easy to use and efficient. However, the code is written in C for Linux and is driven by shell scripts, so running it on Windows requires installing Cygwin.

1. Download Cygwin. The installer is available from the official website, or from the group files of QQ group 426491390.
2. Install the gcc, make, wget, and unzip commands in Cygwin. For example, the make command is installed as shown in the following figure:

[Figure: installing the make package in Cygwin]

3. Download and unzip the dav/word2vec code; it is also available from the group files of QQ group 426491390.
4. Open Cygwin, change to the folder containing dav/word2vec (Windows drives are mounted under /cygdrive in Cygwin, so C:\ is /cygdrive/c), and run:

cd scripts
sh demo-word.sh
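Before running the commands above, it can help to confirm that the tools from step 2 are on the PATH and to work out where a Windows folder lives under /cygdrive. The following is a minimal sketch: `missing_tools` and `win_to_cygdrive` are hypothetical helper names, and the example Windows path is an assumption. Inside Cygwin itself, the bundled `cygpath -u` command performs the path conversion for you.

```shell
#!/bin/sh
# Print the name of every listed tool that is NOT found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Convert a Windows path such as C:\Users\me\word2vec
# into the matching Cygwin path under /cygdrive.
win_to_cygdrive() {
  drive=${1%%:*}                                   # text before the first colon
  rest=${1#*:}                                     # text after the first colon
  drive=$(printf '%s' "$drive" | tr 'A-Z' 'a-z')   # drive letter in lower case
  rest=$(printf '%s' "$rest" | tr '\\' '/')        # backslashes to slashes
  printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

missing=$(missing_tools gcc make wget unzip)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi

# Hypothetical location of the unzipped code; adjust to your own folder.
win_to_cygdrive 'C:\Users\me\word2vec'
```

If any tool is reported missing, rerun the Cygwin installer and select the corresponding package.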

Running the script at this point fails with an error. The cause is that src/makefile in word2vec compiles word2vec.c into word2vec rather than word2vec.exe, and likewise scripts/demo-word.sh executes bin/word2vec instead of bin/word2vec.exe, so both src/makefile and scripts/demo-word.sh need to be modified. The dav/word2vec version in the group files of QQ group 426491390 already includes these two modifications.

Modify src/makefile to:

SCRIPTS_DIR=../scripts
BIN_DIR=../bin

CC = gcc
#The -Ofast might not work with older versions of gcc; in that case, use -O2
CFLAGS = -lm -pthread -O2 -Wall -funroll-loops -Wno-unused-result

all: word2vec word2phrase distance word-analogy compute-accuracy

word2vec : word2vec.c
    $(CC) word2vec.c -o ${BIN_DIR}/word2vec.exe $(CFLAGS)
word2phrase : word2phrase.c
    $(CC) word2phrase.c -o ${BIN_DIR}/word2phrase.exe $(CFLAGS)
distance : distance.c
    $(CC) distance.c -o ${BIN_DIR}/distance.exe $(CFLAGS)
word-analogy : word-analogy.c
    $(CC) word-analogy.c -o ${BIN_DIR}/word-analogy.exe $(CFLAGS)
compute-accuracy : compute-accuracy.c
    $(CC) compute-accuracy.c -o ${BIN_DIR}/compute-accuracy.exe $(CFLAGS)
    chmod +x ${SCRIPTS_DIR}/*.sh

clean:
    pushd ${BIN_DIR} && rm -rf word2vec.exe word2phrase.exe distance.exe word-analogy.exe compute-accuracy.exe; popd

Modify scripts/demo-word.sh to:

#!/bin/bash

DATA_DIR=../data
BIN_DIR=../bin
SRC_DIR=../src

TEXT_DATA=$DATA_DIR/text8
ZIPPED_TEXT_DATA="${TEXT_DATA}.zip"
VECTOR_DATA=$DATA_DIR/text8-vector.bin

pushd ${SRC_DIR} && make; popd

if [ ! -e $VECTOR_DATA ]; then

  if [ ! -e $TEXT_DATA ]; then
    if [ ! -e $ZIPPED_TEXT_DATA ]; then
        wget http://mattmahoney.net/dc/text8.zip -O $ZIPPED_TEXT_DATA
    fi
    unzip $ZIPPED_TEXT_DATA
    mv text8 $TEXT_DATA
  fi
  echo -----------------------------------------------------------------------------------------------------
  echo -- Training vectors...
  time $BIN_DIR/word2vec.exe -train $TEXT_DATA -output $VECTOR_DATA -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1

fi

echo -----------------------------------------------------------------------------------------------------
echo -- distance...

$BIN_DIR/distance.exe $VECTOR_DATA
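The training command in the script above hard-codes -threads 12, which may not match the number of cores on your machine. As a hedged sketch, the snippet below detects the core count with `nproc` (falling back to 1 if that command is unavailable) and prints the same training command with that value. `detect_threads` and `train_command` are hypothetical helper names; all word2vec flags are copied from the script, and only the thread count changes.

```shell
#!/bin/sh
# Number of available CPU cores, or 1 if nproc is not installed.
detect_threads() {
  nproc 2>/dev/null || echo 1
}

# Print the training command from demo-word.sh with a chosen thread count.
# $1: number of threads to pass to -threads
train_command() {
  echo "../bin/word2vec.exe -train ../data/text8 -output ../data/text8-vector.bin" \
       "-cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3" \
       "-threads $1 -binary 1"
}

threads=$(detect_threads)
echo "Detected $threads core(s); suggested command:"
train_command "$threads"
```

The snippet only prints the command, so it can be inspected before being copied into demo-word.sh.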

Enter the scripts folder with cd scripts and run the script again:

sh demo-word.sh

Training starts once the script is run:

[Figure: training progress output]

After training completes, the similar-word query program (distance.exe) runs automatically:

[Figure: similar-word query results]

Why is "large" the word most similar to "small" in the example? Because the similar words that Word2vec finds tend to be words used in similar ways, not necessarily words with the same meaning.

Further Word2vec topics can be discussed in QQ group 426491390.
