Technical application: word2vec experiment

Data Sources

Wikipedia dump downloads (from dumps.wikimedia.org):
zhwiki-latest-pages-articles.xml.bz2 (Chinese Wikipedia)
enwiki-latest-pages-articles.xml.bz2 (English Wikipedia)
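
A minimal download sketch in Python, assuming the standard dumps.wikimedia.org URL layout and enough free disk space (the dump is several GB; adjust the URL for enwiki):

import urllib.request

# Latest Chinese Wikipedia article dump.
URL = ("https://dumps.wikimedia.org/zhwiki/latest/"
       "zhwiki-latest-pages-articles.xml.bz2")

# Stream the file to disk under the same name used in the commands below.
urllib.request.urlretrieve(URL, "zhwiki-latest-pages-articles.xml.bz2")
print("download finished")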

Extract the article text

  • Ubuntu 18
  • Python 3
sudo apt-get install python3-pip
pip3 install setuptools
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python3 setup.py install
python3 wikiextractor/WikiExtractor.py -b 1000M -o result.txt zhwiki-latest-pages-articles.xml
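
WikiExtractor treats -o as an output directory and wraps each article in <doc ...> ... </doc> markers, so the extracted files usually need one more cleanup pass before segmentation. A minimal sketch, assuming the extracted files end up under result.txt/ in the usual AA/wiki_00 layout (adjust the glob pattern to your actual output):

import glob

# Collect every file WikiExtractor produced (paths are an assumption).
paths = glob.glob("result.txt/*/wiki_*")

with open("corpus.txt", "w", encoding="utf-8") as out:
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Drop the <doc ...> / </doc> wrapper lines and blank lines.
                if not line or line.startswith("<doc") or line.startswith("</doc"):
                    continue
                out.write(line + "\n")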

Word segmentation

Here we use HanLP for word segmentation

Note: You need to convert traditional Chinese to simplified Chinese, and then perform word segmentation
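A minimal segmentation sketch, assuming the pyhanlp package (a Python wrapper for HanLP) is installed, the cleaned corpus is in corpus.txt, and the segmented output is written to the tt.txt file used in the training command below:

from pyhanlp import HanLP

with open("corpus.txt", encoding="utf-8") as src, \
        open("tt.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # Convert traditional characters to simplified first ...
        simplified = HanLP.convertToSimplifiedChinese(line.strip())
        # ... then segment and keep the surface words, space-separated,
        # which is the input format the word2vec tool expects.
        words = [term.word for term in HanLP.segment(simplified)]
        dst.write(" ".join(words) + "\n")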

Train the model

Note: you need to download and compile the word2vec C tool first.

./word2vec -train tt.txt -output vectors.bin -cbow 1 -size 80 -window 5 -negative 80 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

  • tt.txt is the segmented corpus produced in the previous step.
  • vectors.bin is the file the trained vectors are written to.
  • -cbow 1 uses the continuous bag-of-words (CBOW) model; -cbow 0 would use the Skip-Gram model instead.
  • -size 80 sets the dimensionality of each word vector to 80.
  • -window 5 sets the context window to 5, i.e. up to five words before and five words after the target word are considered (the code actually picks a random window size of at most 5 for each word).
  • -negative 80 enables negative sampling (NEG) with 80 negative samples; -negative 0 would turn it off.
  • -hs 0 disables hierarchical softmax (HS); -hs 1 would enable it instead of negative sampling.
  • -sample 1e-4 is the sub-sampling threshold: the more frequent a word is, the more likely its occurrences are to be discarded during training.
  • -threads 20 trains with 20 threads, and -iter 15 makes 15 passes over the corpus.
  • -binary 1 stores the vectors in binary format; -binary 0 stores them as plain text, so the words and their vectors can be read directly.
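
If you prefer to stay in Python, roughly the same configuration can be expressed with gensim (a sketch, assuming gensim 4.x is installed and tt.txt holds one space-separated sentence per line; parameter names differ slightly from the C tool):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams the segmented corpus line by line.
model = Word2Vec(
    LineSentence("tt.txt"),
    vector_size=80,   # -size 80
    window=5,         # -window 5
    sg=0,             # -cbow 1 (CBOW); sg=1 would be Skip-Gram
    negative=80,      # -negative 80
    hs=0,             # -hs 0
    sample=1e-4,      # -sample 1e-4
    workers=20,       # -threads 20
    epochs=15,        # -iter 15
)
model.wv.save_word2vec_format("vectors.bin", binary=True)  # -binary 1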

Verify the model

After training is complete, execute the command:
./distance vectors.bin
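
distance is an interactive tool that prints the nearest neighbours of any word you type. The same check can be done from Python with gensim (a sketch; the query word is just an example):

from gensim.models import KeyedVectors

# Load the binary vectors produced by the training step.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Print the ten words whose vectors are closest to the query word.
for word, similarity in wv.most_similar("北京", topn=10):
    print(word, similarity)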
