"Vector word" training with Word2Vec Chinese word vector (a) - the use of search dogs news dataset

I trained Chinese word vectors with Word2Vec on the Sogou news dataset and stepped into plenty of pits along the way. I'm sharing my notes here in the hope of saving you some detours.

After finishing this walkthrough, you can move on to training word vectors on Wikipedia to further improve your word vector model!

Reference articles: Getting word vectors from the Sogou corpus with word2vec
Natural Language Processing Primer (1): Sogou news corpus processing and word2vec word vector training
word2vec Usage Summary

Download the dataset

This experiment uses the Sogou news dataset (download page).

There is a mini version and a full version to choose from. I downloaded the full version, in tar.gz format.

Process the dataset

(A) Extract the archive

Open the Windows command prompt (cmd), change to the directory containing the downloaded file, and enter:

tar -zvxf news_sohusite_xml.full.tar.gz

This decompresses news_sohusite_xml.full.tar.gz into news_sohusite_xml.dat.

(B) Extract the content

Let's take a look at the extracted data (the file is too big for most editors, so I opened it with PyCharm).
There are two key points: (1) the file encoding is a problem, so we need to transcode it; (2) the file is stored as XML-like records, where url is the page link, contenttitle is the page title, and content is the page body, so you can pull out whichever fields you need.
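
For reference, each record in the .dat file is laid out roughly like this (a sketch of the Sogou format; the exact fields may vary between dataset versions):

<doc>
<url>page URL</url>
<docno>...</docno>
<contenttitle>page title</contenttitle>
<content>page body</content>
</doc>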

Using cmd, extract the content fields while transcoding, all in one step:

type news_sohusite_xml.dat | iconv -f gbk -t utf-8 -c | findstr "<content>"  > corpus.txt 

This may fail because iconv.exe is missing. Download win_iconv (an encoding conversion tool), unzip it, and copy iconv.exe to C:\Windows\System32; after that the command works.

The extracted content is stored in corpus.txt.
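
If you would rather not install win_iconv, a few lines of Python can do the same transcoding and filtering (a minimal sketch, not from the original post; it assumes the file is GBK-encoded, as the iconv command above does):

# Read the GBK-encoded .dat file, keep only the <content> lines, write them out as UTF-8
with open('news_sohusite_xml.dat', encoding='gbk', errors='ignore') as fin, \
        open('corpus.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        if '<content>' in line:
            fout.write(line)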

(C) Segment the words

Create a corpusSegDone.txt file to hold the segmentation results. The following code segments the corpus with jieba, printing the line count every 100 lines so you can watch the progress.

import jieba
import re

filePath = 'corpus.txt'
fileSegWordDonePath = 'corpusSegDone.txt'

# Read each line of the corpus into a list
fileTrainRead = []
with open(filePath, encoding='utf-8') as fileTrainRaw:
    for line in fileTrainRaw:
        fileTrainRead.append(line)

# Remove punctuation
fileTrainClean = []
remove_chars = '[·’!"#$%&\'()*+,-./:;<=>?@,。?★、…【】《》?“”‘’![\\]^_`{|}~]+'
for i in range(len(fileTrainRead)):
    string = re.sub(remove_chars, "", fileTrainRead[i])
    fileTrainClean.append(string)

# Segment words with jieba
fileTrainSeg = []
file_userDict = 'dict.txt'  # custom user dictionary
jieba.load_userdict(file_userDict)
for i in range(len(fileTrainClean)):
    # [7:-7] strips the leftover "content" text from the <content> tags (the angle
    # brackets were already removed with the punctuation); the trailing newline also
    # counts as a character, so adjust the slice to fit your own data
    fileTrainSeg.append([' '.join(jieba.cut(fileTrainClean[i][7:-7], cut_all=False))])
    if i % 100 == 0:  # print progress every 100 lines
        print(i)

with open(fileSegWordDonePath, 'wb') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0].encode('utf-8'))
        fW.write('\n'.encode("utf-8"))


The segmentation run prints its progress as it goes; my corpus had about 1.4 million lines in total and took about 45 minutes.

Train word vectors with gensim

Most of the explanation is in the comments, so here is the code directly:

import logging
import sys
import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger


def train_word2vec(dataset_path, out_vector):
    # Set up logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # Turn the corpus into a collection of sentences
    sentences = LineSentence(dataset_path)
    # Train the word2vec model (size: vector dimensionality; window: maximum context
    # distance; min_count: minimum word frequency for a word to be kept)
    model = word2vec.Word2Vec(sentences, size=100, sg=1, window=5, min_count=5, workers=4, iter=5)
    # (iter: maximum number of training epochs; sg=1 selects the Skip-Gram model)
    # Save the word2vec model (keeps the full state for later incremental training)
    model.save("word2vec.model")
    model.wv.save_word2vec_format(out_vector, binary=False)


# Load the vectors saved with save_word2vec_format
def load_word2vec_model(w2v_path):
    # binary=False because the vectors were saved in text format above
    model = word2vec.KeyedVectors.load_word2vec_format(w2v_path, binary=False)
    return model


# Find the most similar words to a given word
def calculate_most_similar(model, word):
    similar_words = model.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])


if __name__ == '__main__':
    dataset_path = "corpusSegDone.txt"
    out_vector = 'corpusSegDone.vector'
    train_word2vec(dataset_path, out_vector)
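
As the comment above notes, model.save keeps the full training state, so you can continue training later on new text. A minimal sketch, assuming a hypothetical extra segmented file more_corpus.txt:

import gensim.models as word2vec
from gensim.models.word2vec import LineSentence

model = word2vec.Word2Vec.load("word2vec.model")
more_sentences = LineSentence("more_corpus.txt")  # hypothetical extra corpus, one segmented sentence per line
model.build_vocab(more_sentences, update=True)  # add new words to the existing vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=5)
model.save("word2vec.model")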


But I knew it wouldn't be that simple. I got my first warning:

UserWarning: C extension not loaded, training will be slow.

I ignored it at first, but it turned out an epoch could only get through about 160 words/s, when the usual rate is in the hundreds of thousands. I trained for a full 10 hours with nothing to show for it!!! Really, truly SLOW ::>_<::

After looking into it, I learned the C extension was missing. There are three solutions:

  • Install it yourself; this requires Visual Studio (see this blogger's post)

  • Install with conda, which automatically bundles a C compiler (see this blogger's post)

  • Uninstall the newer gensim package and install version 3.7.1. This is what I did, and it is faster.

# First, open cmd and uninstall gensim

pip uninstall gensim

# Then install version 3.7.1

pip install gensim==3.7.1
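
To confirm the C extension is actually active, you can check gensim's FAST_VERSION flag (a quick sanity check; in gensim 3.x a value of -1 means the slow pure-Python path is in use):

from gensim.models.word2vec import FAST_VERSION
print(FAST_VERSION)  # >= 0: C extension loaded; -1: pure-Python fallback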

Running it again, that warning no longer appears, but it has been swapped for a new one! Impressive O__O"

UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: (link)

Following the link, I found smart_open, a Python library for streaming large files. Setting it aside for a moment, look at the current processing rate: it now reaches about 300,000 words/s. As one programmer abroad put it, "Now the program is running within seconds, which made a drastic change in the execution time."
Back to smart_open: it can do everything the built-in open() can, while needing less code and producing fewer errors. Just pip install smart_open and use smart_open.open() when opening large files.
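
For example, a minimal sketch of reading the segmented corpus with it (smart_open.open is a drop-in replacement for the built-in open):

import smart_open

with smart_open.open('corpusSegDone.txt', encoding='utf-8') as f:
    print(f.readline()[:50])  # peek at the first segmented line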

After about 1h 30min, I had the trained "word2vec.model".
Comment out the training call and run the code that loads the model instead:

# Load the model
def load_word2vec_model(w2v_path):
    model = word2vec.Word2Vec.load(w2v_path)
    return model

model = load_word2vec_model("word2vec.model")  # load the model

Use the following code to find the words closest to "中国" (China):

def calculate_most_similar(model, word):
    similar_words = model.wv.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])

model = load_word2vec_model("word2vec.model")
calculate_most_similar(model, "中国")

The results are as follows (they look reasonable). And here are the words most similar to "男人" (man) O_o

You can also print out a word's vector:

print(model.wv['男人'])  # go through .wv; indexing the model directly is deprecated in gensim 3.x

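Since the model was trained with size=100, each vector should be 100-dimensional. A quick check (a minimal sketch):

vec = model.wv['男人']
print(vec.shape)  # expected: (100,), matching size=100 above
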
Find the word that doesn't fit with the others:

# Find the word that doesn't match the others
def find_word_dismatch(model, word_list):
    print(model.wv.doesnt_match(word_list))

word_list = ["早饭", "吃饭", "恰饭", "嘻哈"]
find_word_dismatch(model, word_list)

Complete code:

import logging
import sys
import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger
# import smart_open


def train_word2vec(dataset_path, out_vector):
    # Set up logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # Turn the corpus into a collection of sentences
    sentences = LineSentence(dataset_path)
    # sentences = LineSentence(smart_open.open(dataset_path, encoding='utf-8'))  # or open with smart_open
    # Train the word2vec model (size: vector dimensionality; window: maximum context
    # distance; min_count: minimum word frequency for a word to be kept)
    model = word2vec.Word2Vec(sentences, size=100, sg=1, window=5, min_count=5, workers=4, iter=5)
    # (iter: maximum number of training epochs; sg=1 selects the Skip-Gram model)
    # Save the word2vec model
    model.save("word2vec.model")
    model.wv.save_word2vec_format(out_vector, binary=False)


# Load the model
def load_word2vec_model(w2v_path):
    model = word2vec.Word2Vec.load(w2v_path)
    return model


# Find the most similar words to a given word
def calculate_most_similar(model, word):
    similar_words = model.wv.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])


# Compute the similarity between two words
def calculate_words_similar(model, word1, word2):
    print(model.wv.similarity(word1, word2))


# Find the word that doesn't match the others
def find_word_dismatch(model, word_list):
    print(model.wv.doesnt_match(word_list))


if __name__ == '__main__':
    dataset_path = "corpusSegDone.txt"
    out_vector = 'corpusSegDone.vector'
    train_word2vec(dataset_path, out_vector)  # train the model
    model = load_word2vec_model("word2vec.model")  # load the model

    # calculate_most_similar(model, "吃饭")  # find similar words

    # calculate_words_similar(model, "男人", "女人")  # similarity between two words

    # print(model.wv['男人'])  # word vector

    # word_list = ["早饭", "吃饭", "恰饭", "嘻哈"]

    # find_word_dismatch(model, word_list)

\^o^/ Finally finished! If you run into problems, leave a comment and I'll do my best to answer! └(^o^)┘ And if anything here is wrong, corrections are very welcome!

Come back for the next post: training Chinese word vectors with the Wikipedia corpus.

Reference articles

Once again, my thanks to these three articles for saving this newbie!

Getting word vectors from the Sogou corpus with word2vec
Natural Language Processing Primer (1): Sogou news corpus processing and word2vec word vector training
word2vec Usage Summary
