"Vector word" training with Word2Vec Chinese word vectors (b) - the use of wiki encyclopedia Corpus

This article builds on "Word Vectors: Training Chinese Word Vectors with Word2Vec (Part 1) - Using the Sogou News Dataset": the two corpora are merged, and a better word vector model is then trained.

References: Training Chinese word vectors with word2vec on the Chinese Wikipedia corpus
Small projects (Gensim library): processing Chinese Wikipedia data

Download dataset

Click here to download: the Wiki corpus (choose the dump date you want).

The downloaded file is zhwiki-latest-pages-articles.xml.bz2, about 1.75 GB (the download is painfully slow).
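If the browser download keeps stalling, the dump can also be fetched with a small script. The URL below is the standard location of the latest zhwiki dump and is my assumption, not from the original post; adjust it if you picked a dated dump:

# Hedged sketch: stream the Wikipedia dump to disk in 1 MB chunks.
import requests

url = 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('zhwiki-latest-pages-articles.xml.bz2', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)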

Dataset processing

(A) Text extraction

There are two ways to extract text from a compressed file:

  • Use WikiCorpus from gensim.corpora to process the Wikipedia dump directly, saving the result to wikiCorpus/wikiCorpus.txt:
from gensim.corpora import WikiCorpus

if __name__ == '__main__':

    print('Main program started...')

    input_file_name = 'zhwiki-latest-pages-articles.xml.bz2'
    output_file_name = 'wikiCorpus/wikiCorpus.txt'
    print('Reading the wiki dump...')
    input_file = WikiCorpus(input_file_name, lemmatize=False, dictionary={})
    print('Finished reading the wiki dump!')

    print('Processing started...')
    count = 0
    with open(output_file_name, 'wb') as output_file:
        for text in input_file.get_texts():  # each item is one article as a list of tokens
            output_file.write(' '.join(text).encode("utf-8"))
            output_file.write('\n'.encode("utf-8"))
            count = count + 1
            if count % 10000 == 0:
                print('%d articles processed so far' % count)
    print('Processing finished!')

    # the with-block already closes the output file, so no explicit close() is needed
    print('Main program finished!')
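One caveat: the lemmatize argument only exists in gensim 3.x; gensim 4.0 removed lemmatization support, so on a newer install the constructor call above raises a TypeError. A minimal version-tolerant variant (my assumption, not part of the original post):

# Hedged sketch: construct WikiCorpus on either an old or a new gensim release.
try:
    input_file = WikiCorpus(input_file_name, lemmatize=False, dictionary={})  # gensim < 4.0
except TypeError:
    input_file = WikiCorpus(input_file_name, dictionary={})  # gensim >= 4.0 dropped lemmatize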

  • Download Wikipedia Extractor and use it to extract the text (a sketch for merging its output follows this list).
    After processing, the dataset contains 350,000+ articles; the result is shown below. Note that punctuation has already been stripped, which is exactly what we want!
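If you take the Wikipedia Extractor route, its output is split across many small files, so they still need to be merged into one corpus file. A minimal sketch, assuming WikiExtractor was run as something like python WikiExtractor.py -b 500M -o extracted zhwiki-latest-pages-articles.xml.bz2 and produced the usual extracted/AA/wiki_00-style files (the invocation and paths are assumptions, not from the original post):

# Hedged sketch: merge WikiExtractor output into a single text file,
# stripping the <doc ...> / </doc> wrappers it adds around each article.
import glob
import re

doc_tag = re.compile(r'</?doc.*?>')  # matches the opening and closing <doc> tags

with open('wikiCorpus/wikiCorpus.txt', 'w', encoding='utf-8') as out:
    for name in sorted(glob.glob('extracted/*/wiki_*')):  # assumed output layout
        with open(name, encoding='utf-8') as f:
            for line in f:
                line = doc_tag.sub('', line).strip()
                if line:  # skip blank lines and the stripped tag lines
                    out.write(line + '\n')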

(B) Conversion to Simplified Chinese

Download OpenCC to convert the Traditional Chinese characters to Simplified Chinese.

Copy wikiCorpus.txt into the OpenCC folder, cd into that folder at the command prompt, and run opencc -i wikiCorpus.txt -o wikiCorpusSim.txt -c t2s.json to perform the conversion. Then move the converted wikiCorpusSim.txt back into the Python project; the result is shown below:
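Alternatively, the conversion can be done inside Python with the opencc package (for example opencc-python-reimplemented). This is an extra dependency not used in the original post, so treat the snippet as a sketch:

# Hedged sketch: Traditional-to-Simplified conversion with the opencc Python package.
from opencc import OpenCC

cc = OpenCC('t2s')  # t2s = Traditional Chinese to Simplified Chinese
with open('wikiCorpus/wikiCorpus.txt', encoding='utf-8') as fin, \
        open('wikiCorpus/wikiCorpusSim.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))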

(C) Word segmentation

If you already worked through the Sogou dataset, this step will feel very familiar.

import jieba

filePath = 'wikiCorpus/wikiCorpusSim.txt'  # simplified text
fileSegWordDonePath = 'wikiCorpus/wikiCorpusSegDone.txt'  # text after word segmentation

# read every line of the corpus into a list
fileTrainRead = []
with open(filePath, encoding='utf-8') as fileTrainRaw:
    for line in fileTrainRaw:
        fileTrainRead.append(line)

# segment each line with jieba
fileTrainSeg = []
file_userDict = 'dict.txt'  # custom user dictionary
jieba.load_userdict(file_userDict)
for i in range(len(fileTrainRead)):
    fileTrainSeg.append([' '.join(jieba.cut(fileTrainRead[i], cut_all=False))])
    if i % 100 == 0:  # print progress every 100 lines
        print(i)

# write the segmented lines to the output file
with open(fileSegWordDonePath, 'wb') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0].encode('utf-8'))
        fW.write('\n'.encode("utf-8"))
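Since the simplified corpus is already several gigabytes, holding every line in a Python list can be heavy on memory. A line-by-line variant of the same segmentation step (a sketch, not the original code) avoids that:

# Hedged sketch: segment the corpus line by line without keeping it all in memory.
import jieba

jieba.load_userdict('dict.txt')  # same custom dictionary as above
with open('wikiCorpus/wikiCorpusSim.txt', encoding='utf-8') as fin, \
        open('wikiCorpus/wikiCorpusSegDone.txt', 'w', encoding='utf-8') as fout:
    for i, line in enumerate(fin):
        fout.write(' '.join(jieba.cut(line.strip(), cut_all=False)) + '\n')
        if i % 10000 == 0:
            print(i)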
 


(D) Merging the datasets

Merge the previously processed Sogou dataset with the wiki dataset to produce an absolutely unbeatable Chinese corpus.


filePath1 = 'wikiCorpus/wikiCorpusSegDone.txt'  # segmented wiki corpus
filePath2 = 'sougouCorpus/sougouCorpusSegDone.txt'  # segmented Sogou corpus
filePath3 = 'corpusFinal.txt'  # final merged corpus


fileFinal = []
countS = 0  # Sogou line counter
countW = 0  # wiki line counter

# read the Sogou corpus
with open(filePath2, encoding='utf-8') as ff1:
    print("--- Sogou corpus loaded ---")
    for line in ff1:
        fileFinal.append(line)
        countS = countS + 1
        if countS % 10000 == 0:
            print("--------- %d Sogou articles read ---------" % countS)

# read the wiki corpus
with open(filePath1, encoding='utf-8') as ff2:
    print("--- wiki corpus loaded ---")
    for line in ff2:
        fileFinal.append(line)
        countW = countW + 1
        if countW % 10000 == 0:
            print("--------- %d wiki articles read ---------" % countW)

# write the merged corpus line by line
with open(filePath3, 'wb') as ff:
    for i in range(len(fileFinal)):
        ff.write(fileFinal[i].encode('utf-8'))
        if i % 10000 == 0:
            print("--------- %d articles merged ---------" % i)

print("--------- %d lines read in total ---------" % len(fileFinal))
print("--------- merge finished ---------")

In total there are 2.12 million+ lines of data (the wiki corpus contributes 710,000+ lines and Sogou 1.41 million lines), and the merged file is 3.35 GB.

Training the model

We reuse the same code as before:

import logging
import sys
import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger
# import smart_open


def train_word2vec(dataset_path, out_vector):
    # configure logging output
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # turn the corpus into an iterable of sentences (one sentence per line)
    sentences = LineSentence(dataset_path)
    # sentences = LineSentence(smart_open.open(dataset_path, encoding='utf-8'))  # or open with smart_open
    # train the word2vec model (size: vector dimension, window: max context distance, min_count: minimum word frequency)
    model = word2vec.Word2Vec(sentences, size=128, sg=1, window=6, min_count=5, workers=4, iter=6)
    # (iter: maximum number of SGD epochs; sg=1 selects the Skip-Gram model)
    # save the word2vec model
    model.save("word2vec.model")
    model.wv.save_word2vec_format(out_vector, binary=False)


# load a trained model
def load_word2vec_model(w2v_path):
    model = word2vec.Word2Vec.load(w2v_path)
    return model


# print the words most similar to a given word
def calculate_most_similar(model, word):
    similar_words = model.wv.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])


# print the similarity between two words
def calculate_words_similar(model, word1, word2):
    print(model.wv.similarity(word1, word2))


# find the word that does not belong with the others
def find_word_dismatch(model, word_list):
    print(model.wv.doesnt_match(word_list))


if __name__ == '__main__':
    dataset_path = 'corpusFinal.txt'
    out_vector = 'corpusFinal.vector'
    train_word2vec(dataset_path, out_vector)  # train the model

    # model = load_word2vec_model("word2vec.model")  # load the trained model

    # calculate_most_similar(model, "病毒")  # words similar to "病毒" (virus)

    # calculate_words_similar(model, "法律", "制度")  # similarity of "法律" (law) and "制度" (system)

    # print(model.wv.__getitem__('男人'))  # raw vector for "男人" (man)

    # word_list = ["早饭", "吃饭", "恰饭", "嘻哈"]

    # find_word_dismatch(model, word_list)
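The hyperparameter names above follow the gensim 3.x API. On gensim 4.x, size and iter were renamed, so the training call would become the following (my assumption about your installed version; the settings are otherwise unchanged):

# Hedged sketch for gensim >= 4.0: `size` became `vector_size`, `iter` became `epochs`.
model = word2vec.Word2Vec(sentences, vector_size=128, sg=1, window=6,
                          min_count=5, workers=4, epochs=6)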



Associated with the "virus" of words:

Reference articles

Training Chinese word vectors with word2vec on the Chinese Wikipedia corpus
Small projects (Gensim library): processing Chinese Wikipedia data



Source: blog.csdn.net/qq_42491242/article/details/104854322